I recently found and fixed a bug in a site I was working on that resulted in millions of duplicate rows of data in a table that will be quite large even without them (still in the millions). I can easily find these duplicate rows and can run a single delete query to kill them all. The problem is that trying to delete this many rows in one shot locks up the table for a long time, which I would like to avoid if possible. The only ways I can see to get rid of these rows, without taking down the site (by locking up the table) are:
- Write a script that will execute thousands of smaller delete queries in a loop. This will theoretically get around the locked table issue because other queries will be able to make it into the queue and run in between the deletes. But it will still spike the load on the database quite a bit and will take a long time to run.
- Rename the table and recreate the existing table (it'll now be empty). Then do my cleanup on the renamed table. Rename the new table, rename the old one back, and merge the new rows into the renamed table. This way takes considerably more steps, but should get the job done with minimal interruption. The only tricky part here is that the table in question is a reporting table, so once it's renamed out of the way and the empty one put in its place, all historic reports go away until I put it back in place. Plus the merging process could be a bit of a pain because of the type of data being stored. Overall this is my likely choice right now.
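The rename-and-swap option can be sketched end to end; this is a minimal sqlite3 sketch of one way to sequence it, with hypothetical table and column names. The MySQL statements in the comments are the real mechanism: `RENAME TABLE` renames both tables in one atomic operation, so readers never see a missing table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (id INTEGER PRIMARY KEY, metric TEXT)")
conn.executemany("INSERT INTO reports (metric) VALUES (?)",
                 [("dup",)] * 6 + [("keep",)] * 4)

# Swap in an empty table so the site keeps working. MySQL equivalent:
#   CREATE TABLE reports_new LIKE reports;
#   RENAME TABLE reports TO reports_old, reports_new TO reports;  -- atomic
conn.execute("CREATE TABLE reports_new (id INTEGER PRIMARY KEY, metric TEXT)")
conn.execute("ALTER TABLE reports RENAME TO reports_old")
conn.execute("ALTER TABLE reports_new RENAME TO reports")

# Clean up the renamed copy at leisure, then merge the surviving rows back.
conn.execute("DELETE FROM reports_old WHERE metric = 'dup'")
conn.execute("INSERT INTO reports SELECT * FROM reports_old")

rows = conn.execute("SELECT COUNT(*) FROM reports").fetchone()[0]
print(rows)  # the 4 non-duplicate rows survive
```

The demo leaves out the hard part the question mentions: merging rows written to the live table while the cleanup ran, which depends on the data.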
I was just wondering if anyone else has had this problem before and, if so, how you dealt with it without taking down the site and, hopefully, with minimal, if any, interruption to the users. If I go with number 2, or a different but similar approach, I can schedule the stuff to run late at night, do the merge early the next morning, and just let the users know ahead of time, so that's not a huge deal. I'm just looking to see if anyone has ideas for a better, or easier, way to do the cleanup.
8 Answers
#1
100
DELETE FROM `table`
WHERE (whatever criteria)
ORDER BY `id`
LIMIT 1000
Wash, rinse, repeat until zero rows affected. Maybe in a script that sleeps for a second or three between iterations.
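The wash-rinse-repeat loop can be sketched like this, assuming a Python driver and using sqlite3 as a stand-in for MySQL (stock sqlite3 builds lack `DELETE ... LIMIT`, so the batch is selected through a rowid subquery; against MySQL you would issue the `DELETE ... ORDER BY ... LIMIT` statement shown above directly):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
# Seed 5000 "duplicate" rows plus 100 rows we want to keep.
conn.executemany("INSERT INTO t (val) VALUES (?)",
                 [("dup",)] * 5000 + [("keep",)] * 100)
conn.commit()

BATCH = 1000
batches = 0
while True:
    # Delete one small batch. MySQL equivalent:
    #   DELETE FROM t WHERE val = 'dup' ORDER BY id LIMIT 1000
    cur = conn.execute(
        "DELETE FROM t WHERE rowid IN "
        "(SELECT rowid FROM t WHERE val = 'dup' LIMIT ?)", (BATCH,))
    conn.commit()
    if cur.rowcount == 0:
        break  # zero rows affected: done
    batches += 1
    time.sleep(0.01)  # in production, sleep a second or three here

remaining = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(batches, remaining)  # 5 batches run, 100 rows left
```

Each short transaction releases its locks before the next batch starts, which is what lets other queries slip in between the deletes.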
#2
7
I'd also recommend adding some constraints to your table to make sure that this doesn't happen to you again. A million rows, at 1000 per shot, will take 1000 repetitions of a script to complete. If the script runs once every 3.6 seconds you'll be done in an hour. No worries. Your clients are unlikely to notice.
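One concrete form of that constraint is a unique index over the columns that define a duplicate. The demo below uses sqlite3 and hypothetical column names to show the effect; in MySQL the equivalent guard would be `ALTER TABLE ... ADD UNIQUE KEY`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports "
             "(id INTEGER PRIMARY KEY, user_id INT, day TEXT, views INT)")
# MySQL equivalent:
#   ALTER TABLE reports ADD UNIQUE KEY uniq_report (user_id, day);
conn.execute("CREATE UNIQUE INDEX uniq_report ON reports (user_id, day)")

conn.execute("INSERT INTO reports (user_id, day, views) "
             "VALUES (1, '2023-01-01', 10)")
try:
    # A second row for the same (user_id, day) is now rejected.
    conn.execute("INSERT INTO reports (user_id, day, views) "
                 "VALUES (1, '2023-01-01', 99)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True
```

With the key in place, the writing code can use MySQL's `INSERT ... ON DUPLICATE KEY UPDATE` to upsert instead of erroring.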
#3
7
The following deletes 1,000,000 records, one at a time.
for i in `seq 1 1000`; do
mysql -e "select id from table_name where (condition) order by id desc limit 1000 " | sed 's;/|;;g' | awk '{if(NR>1)print "delete from table_name where id = ",$1,";" }' | mysql;
done
You could also group them together and do DELETE FROM table_name WHERE id IN (id1, id2, ..., idN), I'm sure, without much difficulty.
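That grouping idea, batching the fetched ids into multi-row IN (...) deletes instead of one statement per id, can be sketched as a small generator (the table name is a placeholder):

```python
def chunked_deletes(ids, table, chunk_size=1000):
    """Yield DELETE statements, each covering up to chunk_size ids."""
    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i + chunk_size]
        placeholders = ",".join(str(x) for x in chunk)
        yield f"DELETE FROM {table} WHERE id IN ({placeholders});"

# 2500 ids at a chunk size of 1000 -> three statements (1000 + 1000 + 500).
stmts = list(chunked_deletes(list(range(1, 2501)), "table_name", 1000))
print(len(stmts))
```

Each statement is one round trip instead of a thousand, at the cost of a slightly longer row-lock hold per statement.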
#4
5
I had a use case of deleting 1M+ rows from a 25M+ row table in MySQL. I tried different approaches, like the batch deletes described above.
I found that the fastest way was to copy the required records to a new table:
- Create Temporary Table that holds just ids.
CREATE TABLE id_temp_table ( temp_id int);
- Insert ids that should be removed:
insert into id_temp_table (temp_id) select.....
- Create a new table table_new
- Insert all records from table into table_new, skipping the unwanted rows that are in id_temp_table:
insert into table_new .... where table_id NOT IN (select distinct(temp_id) from id_temp_table);
- Rename tables
The whole process took ~1 hr. In my use case, a simple batched delete of 100 records at a time took 10 mins.
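The five steps above can be mirrored end to end in a sqlite3 sketch (names follow the answer; in MySQL the final step would be a single atomic RENAME TABLE swap rather than two ALTERs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
# Ten rows: odd ids are the unwanted "dup" rows, even ids are keepers.
conn.executemany("INSERT INTO t (id, val) VALUES (?, ?)",
                 [(i, "dup" if i % 2 else "keep") for i in range(1, 11)])

# Steps 1-2: a temp table holding just the ids to remove.
conn.execute("CREATE TABLE id_temp_table (temp_id INT)")
conn.execute("INSERT INTO id_temp_table (temp_id) "
             "SELECT id FROM t WHERE val = 'dup'")

# Steps 3-4: a new table, filled with everything except the unwanted ids.
conn.execute("CREATE TABLE t_new (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("INSERT INTO t_new SELECT * FROM t WHERE id NOT IN "
             "(SELECT DISTINCT temp_id FROM id_temp_table)")

# Step 5: rename. MySQL: RENAME TABLE t TO t_old, t_new TO t;  -- atomic
conn.execute("ALTER TABLE t RENAME TO t_old")
conn.execute("ALTER TABLE t_new RENAME TO t")

kept = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(kept)  # 5 keeper rows survive
```

Note that rows written to the original table after the id snapshot is taken would be lost in the swap, so this works best during a quiet window.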
#5
2
I'd use mk-archiver from the excellent Maatkit utilities package (a bunch of Perl scripts for MySQL management). Maatkit is from Baron Schwartz, the author of the O'Reilly "High Performance MySQL" book.
The goal is a low-impact, forward-only job to nibble old data out of the table without impacting OLTP queries much. You can insert the data into another table, which need not be on the same server. You can also write it to a file in a format suitable for LOAD DATA INFILE. Or you can do neither, in which case it's just an incremental DELETE.
It's already built for archiving your unwanted rows in small batches and as a bonus, it can save the deleted rows to a file in case you screw up the query that selects the rows to remove.
No installation required, just grab http://www.maatkit.org/get/mk-archiver and run perldoc on it (or read the web site) for documentation.
#6
2
According to the MySQL documentation, TRUNCATE TABLE is a fast alternative to DELETE FROM. Try this:
TRUNCATE TABLE table_name
I tried this on 50M rows and it was done within two mins.
Note: truncate operations are not transaction-safe; an error occurs when attempting one in the course of an active transaction or an active table lock. Also note that TRUNCATE removes every row, so it only applies if you intend to empty the table completely.
#7
1
Do it in batches of, let's say, 2000 rows at a time, and commit in between. A million rows isn't that much, and this will be fast, unless you have many indexes on the table.
#8
1
For us, the DELETE WHERE %s ORDER BY %s LIMIT %d answer was not an option, because the WHERE criteria was slow (a non-indexed column) and would hit the master.
SELECT from a read replica the list of primary keys that you wish to delete. Export in this kind of format:
00669163-4514-4B50-B6E9-50BA232CA5EB
00679DE5-7659-4CD4-A919-6426A2831F35
Use the following bash script to grab this input and chunk it into DELETE statements (requires bash ≥ 4 because of the mapfile built-in):
sql-chunker.sh (remember to chmod +x it, and change the shebang to point to your bash 4 executable):
#!/usr/local/Cellar/bash/4.4.12/bin/bash
# Expected input format:
: <<!
00669163-4514-4B50-B6E9-50BA232CA5EB
00669DE5-7659-4CD4-A919-6426A2831F35
!
if [ -z "$1" ]
then
echo "No chunk size supplied. Invoke: ./sql-chunker.sh 1000 ids.txt"
exit 1
fi
if [ -z "$2" ]
then
echo "No file supplied. Invoke: ./sql-chunker.sh 1000 ids.txt"
exit 1
fi
function join_by {
local d=$1
shift
echo -n "$1"
shift
printf "%s" "${@/#/$d}"
}
while mapfile -t -n "$1" ary && ((${#ary[@]})); do
printf "DELETE FROM my_cool_table WHERE id IN ('%s');\n" `join_by "','" "${ary[@]}"`
done < "$2"
Invoke like so:
./sql-chunker.sh 1000 ids.txt > batch_1000.sql
This will give you a file with output formatted like so (I've used a batch size of 2):
DELETE FROM my_cool_table WHERE id IN ('006CC671-655A-432E-9164-D3C64191EDCE','006CD163-794A-4C3E-8206-D05D1A5EE01E');
DELETE FROM my_cool_table WHERE id IN ('006CD837-F1AD-4CCA-82A4-74356580CEBC','006CDA35-F132-4F2C-8054-0F1D6709388A');
Then execute the statements like so:
mysql --login-path=master billing < batch_1000.sql
For those unfamiliar with --login-path, it's just a shortcut to log in without typing a password on the command line.