Deleting a large amount of data from a huge table

Date: 2021-04-08 10:08:10

I have two tables. Let's call them KEY and VALUE.
KEY is small, somewhere around 1,000,000 records.
VALUE is huge, say 1,000,000,000 records.


Between them there is a connection such that each KEY might have many VALUES. It's not a foreign key but basically the same meaning.


The DDL looks like this


create table KEY (
 key_id int,
 primary key (key_id)
);

create table VALUE (
 key_id int,
 value_id int,
 primary key (key_id, value_id)
);

Now, my problem. About half of all key_ids in VALUE have been deleted from KEY, and I need to delete them in an orderly fashion while both tables are still under high load.


It would be easy to do


delete v 
  from VALUE v
  left join KEY k using (key_id)
 where k.key_id is null;

However, since a LIMIT is not allowed on a multi-table DELETE, I don't like this approach. Such a delete would take hours to run, and that makes it impossible to throttle the deletes.


Another approach is to create a cursor to find all missing key_ids and delete them one by one with a limit. That seems very slow and kind of backwards.


Are there any other options? Some nice tricks that could help?


Thanks.


12 Answers

#1


4  

What about this for having a limit?


delete x 
  from `VALUE` x
  join (select key_id, value_id
          from `VALUE` v
          left join `KEY` k using (key_id)
         where k.key_id is null
         limit 1000) y
    on x.key_id = y.key_id AND x.value_id = y.value_id;
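
If you want this to run unattended until every orphan is gone, one way (a sketch, not part of the original answer; the procedure name, batch size, and sleep interval are arbitrary) is to loop it in a stored procedure and stop when a pass deletes nothing:

DELIMITER //
CREATE PROCEDURE purge_orphan_values()
BEGIN
  DECLARE affected INT DEFAULT 1;
  REPEAT
    DELETE x
      FROM `VALUE` x
      JOIN (SELECT key_id, value_id
              FROM `VALUE` v
              LEFT JOIN `KEY` k USING (key_id)
             WHERE k.key_id IS NULL
             LIMIT 1000) y
        ON x.key_id = y.key_id AND x.value_id = y.value_id;
    SET affected = ROW_COUNT();   -- rows removed by this pass
    DO SLEEP(1);                  -- throttle between batches
  UNTIL affected = 0 END REPEAT;
END//
DELIMITER ;

CALL purge_orphan_values();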

#2


21  

Any solution that tries to delete so much data in one transaction is going to overwhelm the rollback segment and cause a lot of performance problems.


A good tool to help is pt-archiver. It performs incremental operations on moderate-sized batches of rows, as efficiently as possible. pt-archiver can copy, move, or delete rows depending on options.


The documentation includes an example of deleting orphaned rows, which is exactly your scenario:


pt-archiver --source h=host,D=db,t=VALUE --purge \
  --where 'NOT EXISTS(SELECT * FROM `KEY` WHERE key_id=`VALUE`.key_id)' \
  --limit 1000 --commit-each

Executing this will take significantly longer to delete the data, but it won't use too many resources and it won't interrupt service on your existing database. I have used it successfully to purge hundreds of millions of rows of outdated data.


pt-archiver is part of the Percona Toolkit for MySQL, a free (GPL) set of scripts that help common tasks with MySQL and compatible databases.


#3


5  

Directly from the MySQL documentation:


If you are deleting many rows from a large table, you may exceed the lock table size for an InnoDB table. To avoid this problem, or simply to minimize the time that the table remains locked, the following strategy (which does not use DELETE at all) might be helpful:


Select the rows not to be deleted into an empty table that has the same structure as the original table:


INSERT INTO t_copy SELECT * FROM t WHERE ... ;

Use RENAME TABLE to atomically move the original table out of the way and rename the copy to the original name:


RENAME TABLE t TO t_old, t_copy TO t;

Drop the original table:


DROP TABLE t_old;

No other sessions can access the tables involved while RENAME TABLE executes, so the rename operation is not subject to concurrency problems. See Section 12.1.9, “RENAME TABLE Syntax”.


So in your case you may do:


INSERT INTO value_copy SELECT * FROM VALUE WHERE key_id IN
    (SELECT key_id FROM `KEY`);

RENAME TABLE value TO value_old, value_copy TO value;

DROP TABLE value_old;

And according to what they wrote here, the RENAME operation is quick and the number of records doesn't affect it.
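
One detail the quote glosses over: value_copy has to exist before the INSERT, and any rows written to VALUE between the INSERT and the RENAME are silently lost. A minimal sketch of the full sequence (same table names as above):

-- the copy must exist first; LIKE clones the structure, keys and indexes
CREATE TABLE value_copy LIKE `VALUE`;

-- keep only rows whose key still exists (writes after this point are not copied)
INSERT INTO value_copy
SELECT * FROM `VALUE` WHERE key_id IN (SELECT key_id FROM `KEY`);

RENAME TABLE `VALUE` TO value_old, value_copy TO `VALUE`;
DROP TABLE value_old;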


#4


2  

First, examine your data. Find the keys which have too many values to be deleted "fast". Then find out at which times of day the load on the system is smallest. Perform the deletion of the "bad" keys during that time. For the rest, start deleting them one by one with some downtime between deletes, so that you don't put too much pressure on the database while you do it.
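
A rough sketch of that throttled, per-key cleanup (the key_id, chunk size, and pause below are placeholders, not taken from the answer):

-- chip away at one orphaned key in small chunks, pausing between chunks;
-- repeat this pair until the DELETE reports 0 affected rows
DELETE FROM `VALUE` WHERE key_id = 12345 LIMIT 5000;
DO SLEEP(1);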


#5


1  

Maybe instead of a limit, divide the whole set of rows into small parts by key_id:


delete v 
  from VALUE v
  left join KEY k using (key_id)
 where k.key_id is null and v.key_id > 0 and v.key_id < 100000;

then delete rows with key_id in 100000..200000 and so on.


#6


1  

You can try to delete in separate transaction batches. This is for MSSQL, but it should be similar.


declare @i INT
declare @step INT
set @i = 0
set @step = 100000

while (@i< (select max(VALUE.key_id) from VALUE))
BEGIN
  BEGIN TRANSACTION
  delete from VALUE where
    VALUE.key_id between @i and @i+@step and
    not exists(select 1 from KEY where KEY.key_id = VALUE.key_id and KEY.key_id between @i and @i+@step)

  set @i = (@i+@step)
  COMMIT TRANSACTION
END

#7


1  

Create a temporary table!


drop table if exists batch_to_delete;
create temporary table batch_to_delete as
select v.* from `VALUE` v
left join `KEY` k on k.key_id = v.key_id
where k.key_id is null
limit 10000; -- tailor batch size to your taste

-- optional but may help for large batch size
create index batch_to_delete_ix_key on batch_to_delete(key_id); 
create index batch_to_delete_ix_value on batch_to_delete(value_id);

-- do the actual delete
delete v from `VALUE` v
join batch_to_delete d on d.key_id = v.key_id and d.value_id = v.value_id;

#8


1  

To me this is the kind of task whose progress I would want to see in a log file. And I would avoid solving this in pure SQL; I would use some scripting in Python or another similar language. Another thing that would bother me is that lots of LEFT JOINs with WHERE ... IS NULL between the tables might cause unwanted locks, so I would avoid JOINs as well.


Here is some pseudo code:


max_key = select_db('SELECT MAX(key_id) FROM VALUE')
while max_key > 0:
    cur_range = range(max_key, max_key - 100, -1)
    in_list = ','.join(str(k) for k in cur_range)
    good_keys = select_db('SELECT key_id FROM KEY WHERE key_id IN (%s)' % in_list)
    keys_to_del = set(cur_range) - set(good_keys)
    while keys_to_del:
        del_list = ','.join(str(k) for k in keys_to_del)
        deleted_count = update_db('DELETE FROM VALUE WHERE key_id IN (%s) LIMIT 1000' % del_list)
        db_commit()
        log_something()
        if not deleted_count:
            break
    max_key -= 100

This should not bother the rest of the system very much, but it may take long. Another issue is optimizing the table after you have deleted all those rows, but that is another story.


#9


1  

If the target columns are properly indexed, this should go fast:


DELETE FROM `VALUE`
WHERE NOT EXISTS(SELECT 1 FROM `key` k WHERE k.key_id = `VALUE`.key_id)
-- ORDER BY key_id, value_id -- order by PK is good idea, but check the performance first.
LIMIT 1000

Tune the limit anywhere from 10 to 10000 to get acceptable performance, and rerun it several times.


Also bear in mind that such a mass delete takes locks and keeps rollback copies for each row, which multiplies the execution time per row several times over.


There are some advanced methods to prevent this, but the easiest workaround is just to put a transaction around this query.
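
Read literally, that workaround is just the same statement wrapped in its own transaction (a sketch; the batch size is arbitrary, and you re-run it until it deletes nothing):

START TRANSACTION;
DELETE FROM `VALUE`
 WHERE NOT EXISTS (SELECT 1 FROM `KEY` k WHERE k.key_id = `VALUE`.key_id)
 LIMIT 1000;
COMMIT;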


#10


0  

Do you have a SLAVE or a Dev/Test environment with the same data?


The first step is to find out your data distribution, in case you are worried about a particular key having 1 million value_ids:


SELECT v.key_id, COUNT(IFNULL(k.key_id,1)) AS cnt 
FROM `value` v  LEFT JOIN `key` k USING (key_id) 
WHERE k.key_id IS NULL 
GROUP BY v.key_id ;

The EXPLAIN plan for the above query is much better than it is after adding


ORDER BY COUNT(IFNULL(k.key_id,1)) DESC ;

Since you don't have partitioning on key_id (there would be too many partitions in your case) and want to keep the database running during your delete process, the option is to delete in chunks, with SLEEP() between the deletes for different key_ids, to avoid overwhelming the server. Don't forget to keep an eye on your binary logs to avoid filling the disk.
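
For keeping an eye on the binary logs, two statements worth having at hand (the retention interval is only an example; make sure replicas no longer need the logs you purge):

SHOW BINARY LOGS;                                   -- list logs and their sizes
PURGE BINARY LOGS BEFORE NOW() - INTERVAL 3 DAY;    -- drop logs older than the cutoff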


The quickest way is:


  1. Stop the application so the data is not changed.
  2. Dump key_id and value_id from the VALUE table, keeping only rows whose key_id still exists in the KEY table:

     mysqldump YOUR_DATABASE_NAME value --where="key_id in (select key_id from YOUR_DATABASE_NAME.key)" --lock-all-tables --opt --quick --quote-names --skip-extended-insert > VALUE_DATA.txt

  3. Truncate the VALUE table.
  4. Load the data exported in step 2.
  5. Start the application.

As always, try this in a Dev/Test environment with Prod data and the same infrastructure so you can calculate the downtime.


Hope this helps.


#11


0  

I am just curious what the effect would be of adding a non-unique index on key_id in table VALUE. Selectivity is not high at all (~0.001) but I am curious how that would affect the join performance.


#12


0  

Why don't you split your VALUE table into several ones according to some rule, like key_id modulo some power of 2 (256, for example)?
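
An illustrative sketch for one of the 256 buckets (the table name and bucket count are made up; the application would then have to route every query by key_id % 256):

-- one bucket table with the same structure as VALUE
CREATE TABLE VALUE_000 LIKE `VALUE`;

-- rows that hash to bucket 0
INSERT INTO VALUE_000
SELECT * FROM `VALUE` WHERE key_id % 256 = 0;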

