How to quickly delete matching rows?

Date: 2020-11-27 18:06:11

I'm a relative novice when it comes to databases. We are using MySQL and I'm currently trying to speed up a SQL statement that seems to take a while to run. I looked around on SO for a similar question but didn't find one.

The goal is to remove all the rows in table A that have a matching id in table B.

I'm currently doing the following:

DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE b.id = a.id);

There are approximately 100K rows in table a and about 22K rows in table b. The column 'id' is the PK for both tables.

This statement takes about 3 minutes to run on my test box - Pentium D, XP SP3, 2GB ram, MySQL 5.0.67. This seems slow to me. Maybe it isn't, but I was hoping to speed things up. Is there a better/faster way to accomplish this?
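
The scenario is easy to reproduce in miniature. Here is a sketch using Python's sqlite3 module (sqlite stands in for MySQL purely to illustrate the logic; table sizes are scaled down from 100K/22K):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO a (id) VALUES (?)", [(i,) for i in range(1000)])
# every 4th id also goes into b, giving 250 matching rows
cur.executemany("INSERT INTO b (id) VALUES (?)", [(i,) for i in range(0, 1000, 4)])

# the statement from the question: delete rows of a whose id exists in b
cur.execute("DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE b.id = a.id)")
conn.commit()

remaining = cur.execute("SELECT COUNT(*) FROM a").fetchone()[0]
print(remaining)  # 750 rows left: 1000 - 250 matches
```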


EDIT:

Some additional information that might be helpful. Tables A and B have the same structure, since I created table B with:

CREATE TABLE b LIKE a;

Table a (and thus table b) has a few indexes to help speed up queries that are made against it. Again, I'm a relative novice at DB work and still learning. I don't know how much of an effect, if any, this has on things. I assume that it does have an effect as the indexes have to be cleaned up too, right? I was also wondering if there were any other DB settings that might affect the speed.

Also, I'm using InnoDB.


Here is some additional info that might be helpful to you.

Table A has a structure similar to this (I've sanitized this a bit):

DROP TABLE IF EXISTS `frobozz`.`a`;
CREATE TABLE  `frobozz`.`a` (
  `id` bigint(20) unsigned NOT NULL auto_increment,
  `fk_g` varchar(30) NOT NULL,
  `h` int(10) unsigned default NULL,
  `i` longtext,
  `j` bigint(20) NOT NULL,
  `k` bigint(20) default NULL,
  `l` varchar(45) NOT NULL,
  `m` int(10) unsigned default NULL,
  `n` varchar(20) default NULL,
  `o` bigint(20) NOT NULL,
  `p` tinyint(1) NOT NULL,
  PRIMARY KEY  USING BTREE (`id`),
  KEY `idx_l` (`l`),
  KEY `idx_h` USING BTREE (`h`),
  KEY `idx_m` USING BTREE (`m`),
  KEY `idx_fk_g` USING BTREE (`fk_g`),
  KEY `fk_g_frobozz` (`id`,`fk_g`),
  CONSTRAINT `fk_g_frobozz` FOREIGN KEY (`fk_g`) REFERENCES `frotz` (`g`)
) ENGINE=InnoDB AUTO_INCREMENT=179369 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC;

I suspect that part of the issue is the number of indexes on this table. Table B looks similar to table A, though it only contains the columns id and h.

Also, the profiling results are as follows:

starting 0.000018
checking query cache for query 0.000044
checking permissions 0.000005
Opening tables 0.000009
init 0.000019
optimizing 0.000004
executing 0.000043
end 0.000005
end 0.000002
query end 0.000003
freeing items 0.000007
logging slow query 0.000002
cleaning up 0.000002

SOLVED

Thanks to all the responses and comments. They certainly got me to think about the problem. Kudos to dotjoe for getting me to step away from the problem by asking the simple question "Do any other tables reference a.id?"

The problem was that there was a DELETE TRIGGER on table A which called a stored procedure to update two other tables, C and D. Table C had an FK back to a.id, and after doing some work related to that id, the stored procedure executed the statement,

DELETE FROM c WHERE c.id = theId;

I looked into the EXPLAIN statement and rewrote this as,

EXPLAIN SELECT * FROM c WHERE c.other_id = 12345;

So, I could see what this was doing and it gave me the following info:

id            1
select_type   SIMPLE
table         c
type          ALL
possible_keys NULL
key           NULL
key_len       NULL
ref           NULL
rows          2633
Extra         using where

This told me that it was a painful operation to make and since it was going to get called 22500 times (for the given set of data being deleted), that was the problem. Once I created an INDEX on that other_id column and reran the EXPLAIN, I got:

id            1
select_type   SIMPLE
table         c
type          ref
possible_keys Index_1
key           Index_1
key_len       8
ref           const
rows          1
Extra         

Much better, in fact really great.
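
The same before/after effect can be reproduced in a few lines with Python's sqlite3, where EXPLAIN QUERY PLAN plays the role of MySQL's EXPLAIN (table, column, and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE c (id INTEGER PRIMARY KEY, other_id INTEGER)")
cur.executemany("INSERT INTO c (other_id) VALUES (?)",
                [(i % 100,) for i in range(2633)])

# Without an index on other_id the planner must scan the whole table,
# the analogue of the 'type: ALL' row in the MySQL EXPLAIN output above.
before = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM c WHERE other_id = 42").fetchall()

cur.execute("CREATE INDEX Index_1 ON c (other_id)")

# With the index, the same query becomes an index lookup ('type: ref' in MySQL).
after = cur.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM c WHERE other_id = 42").fetchall()

print(before[-1][-1])  # e.g. 'SCAN c' or 'SCAN TABLE c'
print(after[-1][-1])   # e.g. 'SEARCH c USING INDEX Index_1 (other_id=?)'
```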

I added that Index_1 and my delete times are in line with the times reported by mattkemp. This was a really subtle error on my part due to shoe-horning some additional functionality at the last minute. It turned out that most of the suggested alternative DELETE/SELECT statements, as Daniel stated, ended up taking essentially the same amount of time and as soulmerge mentioned, the statement was pretty much the best I was going to be able to construct based on what I needed to do. Once I provided an index for this other table C, my DELETEs were fast.

Postmortem:
Two lessons learned came out of this exercise. First, it is clear that I didn't leverage the power of the EXPLAIN statement to get a better idea of the impact of my SQL queries. That's a rookie mistake, so I'm not going to beat myself up about that one. I'll learn from that mistake. Second, the offending code was the result of a 'get it done quick' mentality and inadequate design/testing led to this problem not showing up sooner. Had I generated several sizable test data sets to use as test input for this new functionality, I'd have not wasted my time nor yours. My testing on the DB side was lacking the depth that my application side has in place. Now I've got the opportunity to improve that.

Reference: EXPLAIN Statement

14 solutions

#1


70  

Deleting data from InnoDB is the most expensive operation you can request of it. As you already discovered, the query itself is not the problem - most of them will be optimized to the same execution plan anyway.

While it may be hard to understand why DELETEs, of all operations, are the slowest, there is a rather simple explanation. InnoDB is a transactional storage engine. That means that if your query was aborted halfway through, all records would still be in place as if nothing happened. Once it is complete, all will be gone in the same instant. During the DELETE, other clients connecting to the server will see the records until your DELETE completes.

To achieve this, InnoDB uses a technique called MVCC (Multi Version Concurrency Control). What it basically does is to give each connection a snapshot view of the whole database as it was when the first statement of the transaction started. To achieve this, every record in InnoDB internally can have multiple values - one for each snapshot. This is also why COUNTing on InnoDB takes some time - it depends on the snapshot state you see at that time.

For your DELETE transaction, each and every record that is identified according to your query conditions, gets marked for deletion. As other clients might be accessing the data at the same time, it cannot immediately remove them from the table, because they have to see their respective snapshot to guarantee the atomicity of the deletion.

Once all records have been marked for deletion, the transaction is successfully committed. And even then they cannot be immediately removed from the actual data pages, before all other transactions that worked with a snapshot value before your DELETE transaction, have ended as well.

So in fact your 3 minutes are not really that slow, considering the fact that all records have to be modified in order to prepare them for removal in a transaction safe way. Probably you will "hear" your hard disk working while the statement runs. This is caused by accessing all the rows. To improve performance you can try to increase InnoDB buffer pool size for your server and try to limit other access to the database while you DELETE, thereby also reducing the number of historic versions InnoDB has to maintain per record. With the additional memory InnoDB might be able to read your table (mostly) into memory and avoid some disk seeking time.

#2


9  

Your time of three minutes seems really slow. My guess is that the id column is not being indexed properly. If you could provide the exact table definition you're using that would be helpful.

I created a simple python script to produce test data and ran multiple different versions of the delete query against the same data set. Here are my table definitions:

drop table if exists a;
create table a
 (id bigint unsigned  not null primary key,
  data varchar(255) not null) engine=InnoDB;

drop table if exists b;
create table b like a;
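
The original generator script isn't shown; a minimal sketch of what it might look like (pure Python, sampling overlapping ids and emitting INSERT statements rather than connecting to a server; all names are illustrative):

```python
import random

random.seed(1)

# ids 1..100000 go into a; b gets 25000 ids, 22500 of which also exist in a
a_ids = list(range(1, 100001))
b_ids = random.sample(a_ids, 22500) + list(range(100001, 102501))

def insert_statements(table, ids):
    """Yield one INSERT per row; 'data' is filler for the varchar column."""
    for i in ids:
        yield "insert into %s (id, data) values (%d, 'row-%d');" % (table, i, i)

statements = list(insert_statements("a", a_ids)) + list(insert_statements("b", b_ids))
print(len(statements))  # 100000 + 25000 statements
```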

I then inserted 100k rows into a and 25k rows into b (22.5k of which were also in a). Here are the results of the various delete commands. By the way, I dropped and repopulated the table between runs.

mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
Query OK, 22500 rows affected (1.14 sec)

mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
Query OK, 22500 rows affected (0.81 sec)

mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
Query OK, 22500 rows affected (0.97 sec)

mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
Query OK, 22500 rows affected (0.81 sec)

All the tests were run on an Intel Core2 quad-core 2.5GHz with 2GB RAM, Ubuntu 8.10 and MySQL 5.0. Note that the execution of one SQL statement is still single threaded.


Update:

I updated my tests to use itsmatt's schema. I slightly modified it by removing the auto increment (I'm generating synthetic data) and the character set encoding (it wasn't working - I didn't dig into it).

Here's my new table definitions:

drop table if exists a;
drop table if exists b;
drop table if exists c;

create table c (id varchar(30) not null primary key) engine=InnoDB;

create table a (
  id bigint(20) unsigned not null primary key,
  c_id varchar(30) not null,
  h int(10) unsigned default null,
  i longtext,
  j bigint(20) not null,
  k bigint(20) default null,
  l varchar(45) not null,
  m int(10) unsigned default null,
  n varchar(20) default null,
  o bigint(20) not null,
  p tinyint(1) not null,
  key l_idx (l),
  key h_idx (h),
  key m_idx (m),
  key c_id_idx (id, c_id),
  key c_id_fk (c_id),
  constraint c_id_fk foreign key (c_id) references c(id)
) engine=InnoDB row_format=dynamic;

create table b like a;

I then reran the same tests with 100k rows in a and 25k rows in b (and repopulating between runs).

mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
Query OK, 22500 rows affected (11.90 sec)

mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
Query OK, 22500 rows affected (11.48 sec)

mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
Query OK, 22500 rows affected (12.21 sec)

mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
Query OK, 22500 rows affected (12.33 sec)

As you can see, this is quite a bit slower than before, probably due to the multiple indexes. However, it is nowhere near the three-minute mark.

Something else you might want to look at is moving the longtext field to the end of the schema. I seem to remember that MySQL performs better if all the size-restricted fields come first and the text, blob, etc. columns are at the end.

#3


7  

Try this:

DELETE a
FROM a
INNER JOIN b
 on a.id = b.id

Subqueries tend to be slower than joins, as they are run for each record in the outer query.

#4


4  

This is what I always do when I have to operate on very large data sets (here: a sample test table with 150,000 rows):

drop table if exists employees_bak;
create table employees_bak like employees;
insert into employees_bak 
    select * from employees
    where emp_no > 100000;

rename table employees to employees_todelete;
rename table employees_bak to employees;

In this case the SQL copies the 50,000 rows to keep into the backup table. The whole sequence of statements runs in 5 seconds on my slow machine. You can replace the INSERT INTO ... SELECT with your own filter query.

That is the trick to perform mass deletion on big databases!;=)
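
The same copy-and-swap idea, sketched with Python's sqlite3 (ALTER TABLE ... RENAME plays the part of MySQL's RENAME TABLE; table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE employees (emp_no INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO employees VALUES (?)", [(i,) for i in range(1, 151)])

# copy only the rows we want to KEEP into the backup table...
cur.execute("CREATE TABLE employees_bak (emp_no INTEGER PRIMARY KEY)")
cur.execute("INSERT INTO employees_bak SELECT * FROM employees WHERE emp_no > 100")

# ...then swap the tables instead of deleting row by row
cur.execute("ALTER TABLE employees RENAME TO employees_todelete")
cur.execute("ALTER TABLE employees_bak RENAME TO employees")

kept = cur.execute("SELECT COUNT(*) FROM employees").fetchone()[0]
print(kept)  # 50 rows survive the swap (emp_no 101..150)
```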

#5


3  

You're doing your subquery on 'b' for every row in 'a'.

Try:

DELETE FROM a USING a LEFT JOIN b ON a.id = b.id WHERE b.id IS NOT NULL;

#6


3  

Try this out:

DELETE QUICK A.* FROM A,B WHERE A.ID=B.ID

It is much faster than normal queries.

Syntax reference: http://dev.mysql.com/doc/refman/5.0/en/delete.html

#7


3  

I know this question has been pretty much solved due to OP's indexing omissions but I would like to offer this additional advice, which is valid for a more generic case of this problem.

I have personally dealt with having to delete many rows from one table that exist in another, and in my experience it's best to do the following, especially if you expect lots of rows to be deleted. Most importantly, this technique reduces replication slave lag: the longer each single mutator query runs, the worse the lag gets (replication is single threaded).

So, here it is: do a SELECT first, as a separate query, remembering the IDs returned in your script/application, then continue on deleting in batches (say, 50,000 rows at a time). This will achieve the following:

  • each one of the delete statements will not lock the table for too long, thus not letting replication lag get out of control. It is especially important if you rely on your replication to provide relatively up-to-date data. The benefit of using batches is that if you find that each DELETE query still takes too long, you can adjust the batch size to be smaller without touching any DB structures.
  • another benefit of using a separate SELECT is that the SELECT itself might take a long time to run, especially if for whatever reason it can't use the best DB indexes. If the SELECT is embedded in the DELETE, then when the whole statement replicates to the slaves, each slave has to run that long SELECT all over again, and slave lag suffers badly. If you use a separate SELECT query, this problem goes away, as all you're passing is a list of IDs.
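
The select-then-batch-delete pattern above can be sketched as follows, again using Python's sqlite3 for a self-contained example (batch size scaled down from the 50,000 suggested above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(10000)])
cur.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(0, 10000, 2)])

# step 1: one SELECT collects the ids to delete in the application
ids = [row[0] for row in cur.execute("SELECT a.id FROM a JOIN b ON a.id = b.id")]

# step 2: delete in small batches so no single statement holds locks
# (or lags a replication slave) for long
BATCH = 500
for start in range(0, len(ids), BATCH):
    batch = ids[start:start + BATCH]
    placeholders = ",".join("?" * len(batch))
    cur.execute("DELETE FROM a WHERE id IN (%s)" % placeholders, batch)
    conn.commit()  # keep each transaction short

remaining = cur.execute("SELECT COUNT(*) FROM a").fetchone()[0]
print(remaining)  # 5000: the even ids were deleted, the odd ids remain
```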

Let me know if there's a fault in my logic somewhere.

For more discussion on replication lag and ways to fight it, similar to this one, see MySQL Slave Lag (Delay) Explained And 7 Ways To Battle It

P.S. One thing to be careful about is, of course, potential edits to the table between the times the SELECT finishes and DELETEs start. I will let you handle such details by using transactions and/or logic pertinent to your application.

#8


2  

DELETE FROM a WHERE id IN (SELECT id FROM b)

#9


2  

Maybe you should rebuild the indices before running such a huge query. Well, you should rebuild them periodically anyway.

REPAIR TABLE a QUICK;
REPAIR TABLE b QUICK;

and then run any of the above queries, e.g.

DELETE FROM a WHERE id IN (SELECT id FROM b)

#10


2  

The query itself is already in an optimal form; it's updating the indexes that causes the whole operation to take that long. You could disable the keys on that table before the operation; that should speed things up. You can turn them back on later if you don't need them immediately.

Another approach would be adding a 'deleted' flag column to your table and adjusting your other queries so they take that value into account. The fastest boolean type in MySQL is CHAR(0) NULL (true = '', false = NULL). That would be a fast operation, and you can delete the flagged rows afterwards.

The same thoughts expressed in sql statements:

ALTER TABLE a ADD COLUMN deleted CHAR(0) NULL DEFAULT NULL;

-- The following query should be faster than the delete statement:
UPDATE a INNER JOIN b ON a.id = b.id SET a.deleted = '';

-- This is the catch, you need to alter the rest
-- of your queries to take the new column into account:
SELECT * FROM a WHERE deleted IS NULL;

-- You can then issue the following queries in a cronjob
-- to clean up the tables:
DELETE FROM a WHERE deleted IS NOT NULL;

If that, too, is not what you want, you can have a look at what the mysql docs have to say about the speed of delete statements.

#11


2  

BTW, after posting the above on my blog, Baron Schwartz from Percona brought to my attention that his maatkit already has a tool just for this purpose - mk-archiver. http://www.maatkit.org/doc/mk-archiver.html.

It is most likely your best tool for the job.

#12


1  

Obviously the SELECT query that builds the foundation of your DELETE operation is quite fast so I'd think that either the foreign key constraint or the indexes are the reasons for your extremely slow query.

Try

SET foreign_key_checks = 0;
/* ... your query ... */
SET foreign_key_checks = 1;

This would disable the checks on the foreign key. Unfortunately, you cannot disable (at least I don't know how) the key updates with an InnoDB table. With a MyISAM table you could do something like

ALTER TABLE a DISABLE KEYS
/* ... your query ... */
ALTER TABLE a ENABLE KEYS 

I actually did not test if these settings would affect the query duration. But it's worth a try.

#13


0  

Connect to the database from a terminal and execute the commands below, looking at the execution time of each. You'll find that the times for deleting 10, 100, 1,000, 10,000 and 100,000 records do not scale linearly.

  DELETE FROM #{$table_name} WHERE id < 10;
  DELETE FROM #{$table_name} WHERE id < 100;
  DELETE FROM #{$table_name} WHERE id < 1000;
  DELETE FROM #{$table_name} WHERE id < 10000;
  DELETE FROM #{$table_name} WHERE id < 100000;

Deleting 100,000 records does not take ten times as long as deleting 10,000. So, besides finding a way to delete records faster, there are some indirect methods.

1. We can rename table_name to table_name_bak, and then select the records we want to keep from table_name_bak back into a fresh table_name.

2. To delete 10,000 records, we can delete 1,000 records 10 times. Here is an example Ruby script that does it.

#!/usr/bin/env ruby
require 'mysql2'


$client = Mysql2::Client.new(
  :as => :array,
  :host => '10.0.0.250',
  :username => 'mysql',
  :password => '123456',
  :database => 'test'
)


$ids = (1..1000000).to_a
$table_name = "test"

until $ids.empty?
  ids = $ids.shift(1000).join(", ")
  puts "delete =================="
  $client.query("
                DELETE FROM #{$table_name}
                WHERE id IN ( #{ids} )
                ")
end

#14


-2  

The basic technique for deleting multiple rows from a single MySQL table by the id field:

DELETE FROM tbl_name WHERE id >= 100 AND id <= 200; This query deletes the rows whose id falls between 100 and 200 from the given table.

#1


70  

Deleting data from InnoDB is the most expensive operation you can request of it. As you already discovered the query itself is not the problem - most of them will be optimized to the same execution plan anyway.

从InnoDB中删除数据是最昂贵的操作。正如您已经发现的那样,查询本身并不是问题所在——无论如何,大多数查询都将优化为相同的执行计划。

While it may be hard to understand why DELETEs of all cases are the slowest, there is a rather simple explanation. InnoDB is a transactional storage engine. That means that if your query was aborted halfway-through, all records would still be in place as if nothing happened. Once it is complete, all will be gone in the same instant. During the DELETE other clients connecting to the server will see the records until your DELETE completes.

虽然很难理解为什么删除所有的情况是最慢的,但是有一个相当简单的解释。InnoDB是一个事务存储引擎。这意味着,如果查询中途被中止,那么所有记录仍然保持原样,就好像什么都没有发生一样。一旦它完成了,所有的一切都会在瞬间消失。在删除期间,连接到服务器的其他客户端将看到记录,直到删除完成。

To achieve this, InnoDB uses a technique called MVCC (Multi Version Concurrency Control). What it basically does is to give each connection a snapshot view of the whole database as it was when the first statement of the transaction started. To achieve this, every record in InnoDB internally can have multiple values - one for each snapshot. This is also why COUNTing on InnoDB takes some time - it depends on the snapshot state you see at that time.

为了实现这一点,InnoDB使用一种名为MVCC的技术(多版本并发控制)。它的基本功能是向每个连接提供整个数据库的快照视图,就像事务的第一个语句启动时一样。为了实现这一点,InnoDB内部的每个记录都可以有多个值——每个快照对应一个值。这也是为什么依赖InnoDB需要一些时间——这取决于您当时看到的快照状态。

For your DELETE transaction, each and every record that is identified according to your query conditions, gets marked for deletion. As other clients might be accessing the data at the same time, it cannot immediately remove them from the table, because they have to see their respective snapshot to guarantee the atomicity of the deletion.

对于删除事务,根据查询条件标识的每条记录都会被标记为删除。由于其他客户机可能同时访问数据,因此不能立即从表中删除它们,因为它们必须看到各自的快照,以保证删除的原子性。

Once all records have been marked for deletion, the transaction is successfully committed. And even then they cannot be immediately removed from the actual data pages, before all other transactions that worked with a snapshot value before your DELETE transaction, have ended as well.

一旦所有记录被标记为删除,事务将被成功提交。而且,即使这样,在删除事务之前使用快照值的所有其他事务结束之前,也不能立即从实际数据页面中删除它们。

So in fact your 3 minutes are not really that slow, considering the fact that all records have to be modified in order to prepare them for removal in a transaction safe way. Probably you will "hear" your hard disk working while the statement runs. This is caused by accessing all the rows. To improve performance you can try to increase InnoDB buffer pool size for your server and try to limit other access to the database while you DELETE, thereby also reducing the number of historic versions InnoDB has to maintain per record. With the additional memory InnoDB might be able to read your table (mostly) into memory and avoid some disk seeking time.

所以实际上你的3分钟并没有那么慢,考虑到所有的记录都必须被修改以使它们能够以安全的方式被删除。当语句运行时,您可能会“听到”您的硬盘工作。这是由访问所有行引起的。为了提高性能,您可以尝试为您的服务器增加InnoDB缓冲池大小,并尝试在删除时限制对数据库的其他访问,从而减少InnoDB必须维护每个记录的历史版本的数量。使用额外的内存,InnoDB可能能够将您的表(大部分)读入内存,并避免一些磁盘查找时间。

#2


9  

Your time of three minutes seems really slow. My guess is that the id column is not being indexed properly. If you could provide the exact table definition you're using that would be helpful.

你三分钟的时间似乎过得很慢。我的猜测是id列没有被正确地索引。如果您可以提供您正在使用的表定义,那将会很有帮助。

I created a simple python script to produce test data and ran multiple different versions of the delete query against the same data set. Here's my table definitions:

我创建了一个简单的python脚本,用于生成测试数据,并针对相同的数据集运行多个不同版本的delete查询。

drop table if exists a;
create table a
 (id bigint unsigned  not null primary key,
  data varchar(255) not null) engine=InnoDB;

drop table if exists b;
create table b like a;

I then inserted 100k rows into a and 25k rows into b (22.5k of which were also in a). Here's the results of the various delete commands. I dropped and repopulated the table between runs by the way.

然后我将100k行插入到a中,25k行插入到b中(22.5k行也在a中)。顺便说一下,我在运行之间删除并重新填充了这个表。

mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
Query OK, 22500 rows affected (1.14 sec)

mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
Query OK, 22500 rows affected (0.81 sec)

mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
Query OK, 22500 rows affected (0.97 sec)

mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
Query OK, 22500 rows affected (0.81 sec)

All the tests were run on an Intel Core2 quad-core 2.5GHz, 2GB RAM with Ubuntu 8.10 and MySQL 5.0. Note, that the execution of one sql statement is still single threaded.

所有的测试都在Intel Core2四核2.5GHz上运行,2GB的RAM和Ubuntu 8.10和MySQL 5.0上运行。注意,一条sql语句的执行仍然是单线程的。


Update:

更新:

I updated my tests to use itsmatt's schema. I slightly modified it by remove auto increment (I'm generating synthetic data) and character set encoding (wasn't working - didn't dig into it).

我更新了我的测试以使用它的马特模式。我通过删除自动增量(生成合成数据)和字符集编码(不工作——不深入)对它进行了一些修改。

Here's my new table definitions:

下面是我的新表定义:

drop table if exists a;
drop table if exists b;
drop table if exists c;

create table c (id varchar(30) not null primary key) engine=InnoDB;

create table a (
  id bigint(20) unsigned not null primary key,
  c_id varchar(30) not null,
  h int(10) unsigned default null,
  i longtext,
  j bigint(20) not null,
  k bigint(20) default null,
  l varchar(45) not null,
  m int(10) unsigned default null,
  n varchar(20) default null,
  o bigint(20) not null,
  p tinyint(1) not null,
  key l_idx (l),
  key h_idx (h),
  key m_idx (m),
  key c_id_idx (id, c_id),
  key c_id_fk (c_id),
  constraint c_id_fk foreign key (c_id) references c(id)
) engine=InnoDB row_format=dynamic;

create table b like a;

I then reran the same tests with 100k rows in a and 25k rows in b (and repopulating between runs).

然后,我重新运行相同的测试,在a中有100k行,在b中有25k行(并在运行之间重新填充)。

mysql> DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE a.id=b.id);
Query OK, 22500 rows affected (11.90 sec)

mysql> DELETE FROM a USING a LEFT JOIN b ON a.id=b.id WHERE b.id IS NOT NULL;
Query OK, 22500 rows affected (11.48 sec)

mysql> DELETE a FROM a INNER JOIN b on a.id=b.id;
Query OK, 22500 rows affected (12.21 sec)

mysql> DELETE QUICK a.* FROM a,b WHERE a.id=b.id;
Query OK, 22500 rows affected (12.33 sec)

As you can see this is quite a bit slower than before, probably due to the multiple indexes. However, it is nowhere near the three minute mark.

正如您所看到的,这比以前要慢一些,可能是由于多个索引。然而,离三分钟的成绩还差得很远。

Something else that you might want to look at is moving the longtext field to the end of the schema. I seem to remember that mySQL performs better if all the size restricted fields are first and text, blob, etc are at the end.

您可能想要查看的其他内容是将longtext字段移动到模式的末尾。我似乎记得,如果所有大小受限的字段都是first,而文本、blob等都在末尾,那么mySQL的性能会更好。

#3


7  

Try this:

试试这个:

DELETE a
FROM a
INNER JOIN b
 on a.id = b.id

Using subqueries tend to be slower then joins as they are run for each record in the outer query.

使用子查询比使用连接要慢,因为它们是为外部查询中的每个记录运行的。

#4


4  

This is what I always do, when I have to operate with super large data (here: a sample test table with 150000 rows):

这是我经常做的,当我需要处理超大数据时(这里是一个有150000行的测试表):

drop table if exists employees_bak;
create table employees_bak like employees;
insert into employees_bak 
    select * from employees
    where emp_no > 100000;

rename table employees to employees_todelete;
rename table employees_bak to employees;

In this case the sql filters 50000 rows into the backup table. The query cascade performs on my slow machine in 5 seconds. You can replace the insert into select by your own filter query.

在这种情况下,sql将50000行过滤到备份表中。查询级联在我的慢速机器上执行5秒。您可以使用自己的筛选器查询将插入替换为select。

That is the trick to perform mass deletion on big databases!;=)

这就是在大型数据库上执行大规模删除的技巧!

#5


3  

You're doing your subquery on 'b' for every row in 'a'.

对于a中的每一行,你都在b上进行子查询。

Try:

试一试:

DELETE FROM a USING a LEFT JOIN b ON a.id = b.id WHERE b.id IS NOT NULL;

#6


3  

Try this out:

试试这个:

DELETE QUICK A.* FROM A,B WHERE A.ID=B.ID

It is much faster than normal queries.

它比普通查询快得多。

Refer for Syntax : http://dev.mysql.com/doc/refman/5.0/en/delete.html

请参阅语法:http://dev.mysql.com/doc/refman/5.0/en/delete.html

#7


3  

I know this question has been pretty much solved due to OP's indexing omissions but I would like to offer this additional advice, which is valid for a more generic case of this problem.

我知道由于OP的省略,这个问题已经得到了很大的解决,但是我想提供这个额外的建议,这对于这个问题的更一般的情况是有效的。

I have personally dealt with having to delete many rows from one table that exist in another and in my experience it's best to do the following, especially if you expect lots of rows to be deleted. This technique most importantly will improve replication slave lag, as the longer each single mutator query runs, the worse the lag would be (replication is single threaded).

我个人已经处理过必须从一个表中删除存在于另一个表中的许多行,根据我的经验,最好执行以下操作,特别是如果希望删除大量行。最重要的是,这种技术将改进复制从延迟,因为每个单独的mutator查询运行的时间越长,延迟就越糟糕(复制是单线程的)。

So, here it is: do a SELECT first, as a separate query, remembering the IDs returned in your script/application, then continue on deleting in batches (say, 50,000 rows at a time). This will achieve the following:

它是这样的:首先执行SELECT,作为单独的查询,记住在脚本/应用程序中返回的id,然后继续批量删除(比如每次删除50,000行)。这将实现以下目标:

  • each one of the delete statements will not lock the table for too long, thus not letting replication lag to get out of control. It is especially important if you rely on your replication to provide you relatively up-to-date data. The benefit of using batches is that if you find that each DELETE query still takes too long, you can adjust it to be smaller without touching any DB structures.
  • another benefit of using a separate SELECT is that the SELECT itself might take a long time to run, especially if it can't, for whatever reason, use the best DB indexes. If the SELECT is inner to a DELETE, then when the whole statement replicates to the slaves, each slave has to run that long SELECT all over again, and slave lag suffers badly once more. If you use a separate SELECT query, this problem goes away, as all you're passing is a list of IDs.
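As a sketch of the select-then-batch-delete pattern described above (illustrative only: it uses Python's built-in sqlite3 module instead of MySQL, and the table names, row counts, and batch size are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (id INTEGER PRIMARY KEY)")
cur.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO a VALUES (?)", [(i,) for i in range(1, 101)])
cur.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(1, 23)])

# Step 1: a separate SELECT collects the ids to delete.
ids = [row[0] for row in cur.execute("SELECT a.id FROM a JOIN b ON a.id = b.id")]

# Step 2: delete in small batches so no single statement holds locks for long.
BATCH = 10
for i in range(0, len(ids), BATCH):
    chunk = ids[i:i + BATCH]
    placeholders = ",".join("?" * len(chunk))
    cur.execute(f"DELETE FROM a WHERE id IN ({placeholders})", chunk)
    conn.commit()  # commit each batch; lock time and replication pressure stay bounded

remaining = cur.execute("SELECT COUNT(*) FROM a").fetchone()[0]
print(remaining)  # 78 rows left: 100 minus the 22 matching ids
```

On a real MySQL server you would run the same two steps through your driver of choice, committing between batches and picking a batch size (say 50,000) that keeps each DELETE comfortably short.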

Let me know if there's a fault in my logic somewhere.


For more discussion on replication lag and ways to fight it, similar to this one, see MySQL Slave Lag (Delay) Explained And 7 Ways To Battle It


P.S. One thing to be careful about is, of course, potential edits to the table between the times the SELECT finishes and DELETEs start. I will let you handle such details by using transactions and/or logic pertinent to your application.


#8


2  

DELETE FROM a WHERE id IN (SELECT id FROM b)

#9


2  

Maybe you should rebuild the indices before running such a huge query. Well, you should rebuild them periodically.


REPAIR TABLE a QUICK;
REPAIR TABLE b QUICK;

and then run any of the above queries, e.g.


DELETE FROM a WHERE id IN (SELECT id FROM b)

#10


2  

The query itself is already in an optimal form; updating the indexes is what causes the whole operation to take that long. You could disable the keys on that table before the operation; that should speed things up. You can turn them back on at a later time, if you don't need them immediately.


Another approach would be adding a `deleted` flag column to your table and adjusting your other queries so they take that value into account. The fastest boolean type in MySQL is CHAR(0) NULL (true = '', false = NULL). Setting the flag is a fast operation, and you can physically delete the rows afterwards.


The same thoughts expressed in sql statements:


ALTER TABLE a ADD COLUMN deleted CHAR(0) NULL DEFAULT NULL;

-- The following query should be faster than the delete statement:
UPDATE a INNER JOIN b ON a.id = b.id SET a.deleted = '';

-- This is the catch, you need to alter the rest
-- of your queries to take the new column into account:
SELECT * FROM a WHERE deleted IS NULL;

-- You can then issue the following queries in a cronjob
-- to clean up the tables:
DELETE FROM a WHERE deleted IS NOT NULL;
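A minimal runnable sketch of this flag technique, using Python's built-in sqlite3 purely for illustration (SQLite has no multi-table UPDATE, so the MySQL `UPDATE a INNER JOIN b ...` is expressed here as an `IN` subquery; the table contents are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, deleted CHAR(0) DEFAULT NULL)")
cur.execute("CREATE TABLE b (id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO a (id) VALUES (?)", [(i,) for i in range(1, 11)])
cur.executemany("INSERT INTO b VALUES (?)", [(i,) for i in range(1, 4)])

# Fast "delete": flag the matching rows instead of removing them.
cur.execute("UPDATE a SET deleted = '' WHERE id IN (SELECT id FROM b)")

# Other queries must now filter on the flag.
live = cur.execute("SELECT COUNT(*) FROM a WHERE deleted IS NULL").fetchone()[0]
print(live)  # 7: ids 1-3 are flagged, 4-10 remain live

# Later, a cron job can physically remove the flagged rows.
cur.execute("DELETE FROM a WHERE deleted IS NOT NULL")
left = cur.execute("SELECT COUNT(*) FROM a").fetchone()[0]
print(left)  # 7
```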

If that, too, is not what you want, you can have a look at what the mysql docs have to say about the speed of delete statements.


#11


2  

BTW, after posting the above on my blog, Baron Schwartz from Percona brought to my attention that his maatkit already has a tool just for this purpose - mk-archiver. http://www.maatkit.org/doc/mk-archiver.html.


It is most likely your best tool for the job.


#12


1  

Obviously the SELECT query that builds the foundation of your DELETE operation is quite fast so I'd think that either the foreign key constraint or the indexes are the reasons for your extremely slow query.


Try

试一试

SET foreign_key_checks = 0;
/* ... your query ... */
SET foreign_key_checks = 1;

This would disable the checks on the foreign key. Unfortunately, you cannot disable the key updates for an InnoDB table (at least I don't know how). With a MyISAM table you could do something like


ALTER TABLE a DISABLE KEYS;
/* ... your query ... */
ALTER TABLE a ENABLE KEYS;

I actually did not test if these settings would affect the query duration. But it's worth a try.


#13


0  

Connect to the database from a terminal and execute the commands below, comparing the execution time of each; you'll find that the time to delete 10, 100, 1,000, 10,000, and 100,000 records does not grow proportionally.


  DELETE FROM #{$table_name} WHERE id < 10;
  DELETE FROM #{$table_name} WHERE id < 100;
  DELETE FROM #{$table_name} WHERE id < 1000;
  DELETE FROM #{$table_name} WHERE id < 10000;
  DELETE FROM #{$table_name} WHERE id < 100000;

The time to delete 100,000 records is not ten times the time to delete 10,000 records. So, besides finding a way to delete records faster, there are some indirect methods.


1. We can rename table_name to table_name_bak, and then insert the records we want to keep from table_name_bak back into table_name.

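A small sketch of this rename-and-copy-back idea, using Python's built-in sqlite3 just to show the pattern (in MySQL you would use `RENAME TABLE` and `INSERT ... SELECT` directly; the table name and the keep-filter `id >= 6` are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE test (id INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO test VALUES (?)", [(i,) for i in range(1, 11)])

# Rename the original table out of the way ...
cur.execute("ALTER TABLE test RENAME TO test_bak")
# ... recreate it empty ...
cur.execute("CREATE TABLE test (id INTEGER PRIMARY KEY)")
# ... and copy back only the rows we want to keep.
cur.execute("INSERT INTO test SELECT id FROM test_bak WHERE id >= 6")

kept = cur.execute("SELECT COUNT(*) FROM test").fetchone()[0]
print(kept)  # 5: ids 6 through 10 survive
cur.execute("DROP TABLE test_bak")
```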

2. To delete 10,000 records, we can delete 1,000 records 10 times. Here is an example Ruby script that does it.


#!/usr/bin/env ruby
require 'mysql2'

# Connection settings are examples; adjust them for your environment.
$client = Mysql2::Client.new(
  :as => :array,
  :host => '10.0.0.250',
  :username => 'mysql',
  :password => '123456',
  :database => 'test'
)

$ids = (1..1000000).to_a
$table_name = "test"

# Delete in batches of 1,000 ids until the list is exhausted.
until $ids.empty?
  ids = $ids.shift(1000).join(", ")
  puts "delete =================="
  $client.query("
                DELETE FROM #{$table_name}
                WHERE id IN ( #{ids} )
                ")
end

#14


-2  

The basic technique for deleting multiple rows from a single MySQL table through the id field:


DELETE FROM tbl_name WHERE id >= 100 AND id <= 200;

This query deletes the rows whose id lies between 100 and 200 from the given table.