Best way to delete millions of rows by ID

Date: 2023-01-15 13:44:51

I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete. However, any way I try to do this is taking days.

I tried putting them in a table and deleting in batches of 100. Four days later, this was still running with only 297268 rows deleted. (I had to select 100 IDs from the ID table, delete WHERE id IN that list, then delete those 100 from the IDs table.)
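
For reference, one batch of that loop can be sketched as a single statement using a data-modifying CTE (a rough sketch; the staging-table name `ids` and target table `tbl` are assumptions):

```sql
-- Delete one batch of 100 IDs from the staging table and from the target
-- table in the same statement (PostgreSQL 9.1+ data-modifying CTE).
WITH batch AS (
    DELETE FROM ids
    WHERE  id IN (SELECT id FROM ids LIMIT 100)
    RETURNING id
)
DELETE FROM tbl
WHERE  id IN (SELECT id FROM batch);
-- repeat until the ids table is empty
```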

I tried:

DELETE FROM tbl WHERE id IN (select * from ids)

That's taking forever, too. It's hard to gauge how long, since I can't see its progress until it's done, but the query was still running after 2 days.

I'm just looking for the most effective way to delete from a table when I know the specific IDs to delete, and there are millions of them.

7 Solutions

#1


60  

It all depends ...

  • Delete all indexes (except the one on the ID which you need for the delete)
    Recreate them afterwards (= much faster than incremental updates to indexes)

  • Check if you have triggers that can safely be deleted / disabled temporarily

  • Do foreign keys reference your table? Can they be deleted? Temporarily deleted?

  • Depending on your autovacuum settings it may help to run VACUUM ANALYZE before the operation.

  • This assumes no concurrent write access to the involved tables; otherwise you may have to lock the tables exclusively, or this route may not be for you at all.

  • Some of the points listed in the related chapter of the manual Populating a Database may also be of use, depending on your setup.

  • If you delete large portions of the table and the rest fits into RAM, the fastest and easiest way would be this:

SET temp_buffers = '1000MB'; -- or whatever you can spare temporarily

CREATE TEMP TABLE tmp AS
SELECT t.*
FROM   tbl t
LEFT   JOIN del_list d USING (id)
WHERE  d.id IS NULL;      -- copy surviving rows into temporary table

TRUNCATE tbl;             -- empty table - truncate is very fast for big tables

INSERT INTO tbl
SELECT * FROM tmp;        -- insert back surviving rows.

This way you don't have to recreate views, foreign keys or other dependent objects. Read about the temp_buffers setting in the manual. This method is fast as long as the table fits into memory, or at least most of it. Be aware that you can lose data if your server crashes in the middle of this operation. You can wrap all of it into a transaction to make it safer.
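
The transaction-wrapped variant might look like this (a sketch; note that temp_buffers can only be changed before the first use of temporary tables in the session):

```sql
BEGIN;
SET LOCAL temp_buffers = '1000MB';  -- only effective if no temp table
                                    -- was used in this session yet

CREATE TEMP TABLE tmp AS
SELECT t.*
FROM   tbl t
LEFT   JOIN del_list d USING (id)
WHERE  d.id IS NULL;      -- copy surviving rows

TRUNCATE tbl;
INSERT INTO tbl SELECT * FROM tmp;

COMMIT;  -- if anything fails before this point, ROLLBACK leaves tbl intact
```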

Run ANALYZE afterwards. Or VACUUM ANALYZE if you did not go the truncate route, or VACUUM FULL ANALYZE if you want to bring the table to its minimum size. For big tables, consider the alternatives CLUSTER / pg_repack.
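
For reference, the post-operation maintenance just mentioned, one per route taken:

```sql
ANALYZE tbl;                 -- after the TRUNCATE route (table was rewritten)
-- VACUUM ANALYZE tbl;       -- if you deleted in place instead
-- VACUUM FULL ANALYZE tbl;  -- rewrites the table to minimum size,
                             -- but takes an exclusive lock while it runs
```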

For small tables, a simple DELETE instead of TRUNCATE is often faster:

DELETE FROM tbl t
USING  del_list d
WHERE  t.id = d.id;

Read the Notes section for TRUNCATE in the manual. In particular (as Pedro also pointed out in his comment):

TRUNCATE cannot be used on a table that has foreign-key references from other tables, unless all such tables are also truncated in the same command. [...]

And:

TRUNCATE will not fire any ON DELETE triggers that might exist for the tables.

#2


3  

We know the update/delete performance of PostgreSQL is not as powerful as Oracle's. When we need to delete millions or tens of millions of rows, it's really difficult and takes a long time.

However, we can still do this on production databases. Here is my idea:

First, create a log table with two columns: id and flag (id refers to the ID you want to delete; flag can be Y or null, with Y signifying the record was successfully deleted).

Then, create a function that performs the delete 10,000 rows at a time. You can see more details on my blog. Though it's in Chinese, you can still get the info you want from the SQL code there.

Make sure the id columns of both tables are indexed, as that will make it run faster.
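
The function itself is only on the blog, but the approach might be sketched roughly like this (the log-table name `del_log` and target table `tbl` are assumptions, and the exit condition assumes every logged ID exists in `tbl`):

```sql
CREATE OR REPLACE FUNCTION delete_in_batches() RETURNS bigint AS
$$
DECLARE
    n     bigint;
    total bigint := 0;
BEGIN
    LOOP
        -- Flag 10,000 unprocessed IDs and delete the matching rows.
        WITH batch AS (
            UPDATE del_log
            SET    flag = 'Y'
            WHERE  id IN (SELECT id FROM del_log
                          WHERE  flag IS NULL LIMIT 10000)
            RETURNING id
        )
        DELETE FROM tbl WHERE id IN (SELECT id FROM batch);

        GET DIAGNOSTICS n = ROW_COUNT;  -- rows deleted by the last statement
        EXIT WHEN n = 0;
        total := total + n;
    END LOOP;
    RETURN total;
END;
$$ LANGUAGE plpgsql;
```

Note that a plain function runs inside a single transaction; on PostgreSQL 11+ a PROCEDURE with COMMIT between batches would release the work incrementally.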

#3


2  

You may try copying all the data in the table, except the IDs you want to delete, into a new table, then renaming and swapping the tables (provided you have enough resources to do it).
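
A sketch of that copy-and-swap, under assumed names (`tbl`, `tbl_new`, and a `del_list` table holding the IDs to drop):

```sql
-- Build the replacement table with the same columns, defaults and indexes.
CREATE TABLE tbl_new (LIKE tbl INCLUDING ALL);

INSERT INTO tbl_new
SELECT t.*
FROM   tbl t
LEFT   JOIN del_list d USING (id)
WHERE  d.id IS NULL;          -- keep only the surviving rows

BEGIN;
ALTER TABLE tbl     RENAME TO tbl_old;
ALTER TABLE tbl_new RENAME TO tbl;
COMMIT;
-- DROP TABLE tbl_old;        -- once you have verified the swap
```

Beware that dependent objects (views, foreign keys) track the original table through the rename, so they would follow tbl_old and need recreating against the new table.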

This is not expert advice.

#4


2  

Two possible answers:

  1. Your table may have many constraints or triggers attached to it, so each deleted record incurs extra processor cycles and checks against other tables.

  2. You may need to put this statement inside a transaction.

#5


1  

The easiest way to do this would be to drop all your constraints and then do the delete.
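
For example, a referencing foreign key could be dropped and restored around the delete (a sketch; the constraint, table, and column names are assumptions):

```sql
-- Drop the constraint so each delete skips the referencing-row check.
ALTER TABLE child_table DROP CONSTRAINT child_table_tbl_id_fkey;

DELETE FROM tbl
USING  del_list d
WHERE  tbl.id = d.id;

-- Restore the constraint; this re-validates all existing rows.
ALTER TABLE child_table
  ADD CONSTRAINT child_table_tbl_id_fkey
  FOREIGN KEY (tbl_id) REFERENCES tbl (id);
```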

#6


1  

First make sure you have an index on the ID fields, both in the table you want to delete from and the table you are using for deletion IDs.

100 at a time seems too small. Try 1000 or 10000.

There's no need to delete anything from the deletion-ID table. Add a new column for a batch number, assign batch 1 to the first 1000 IDs, batch 2 to the next 1000, and so on, and make sure the deletion query filters on the batch number.
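
That batch-number scheme might be set up like this (a sketch; table names and the batch size of 1000 are assumptions):

```sql
ALTER TABLE ids ADD COLUMN batch integer;

-- Number the IDs 1..N and derive a batch number per 1000 rows.
UPDATE ids
SET    batch = (x.rn - 1) / 1000 + 1
FROM  (SELECT id, row_number() OVER (ORDER BY id) AS rn FROM ids) x
WHERE  ids.id = x.id;

DELETE FROM tbl
USING  ids
WHERE  tbl.id   = ids.id
AND    ids.batch = 1;     -- then batch = 2, 3, ...
```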

#7


0  

If the table you're deleting from is referenced by some_other_table (and you don't want to drop the foreign keys even temporarily), make sure you have an index on the referencing column in some_other_table!

I had a similar problem and used auto_explain with auto_explain.log_nested_statements = true, which revealed that the delete was actually doing seq_scans on some_other_table:

    Query Text: SELECT 1 FROM ONLY "public"."some_other_table" x WHERE $1 OPERATOR(pg_catalog.=) "id" FOR KEY SHARE OF x    
    LockRows  (cost=[...])  
      ->  Seq Scan on some_other_table x  (cost=[...])  
            Filter: ($1 = id)

Apparently it's trying to lock the referencing rows in the other table (which shouldn't exist, or the delete will fail). After I created indexes on the referencing tables, the delete was orders of magnitude faster.
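
The fix amounts to indexing the referencing column, for example (the index and column names here are assumptions):

```sql
-- CONCURRENTLY avoids blocking writes, but cannot run inside a transaction.
CREATE INDEX CONCURRENTLY some_other_table_fk_idx
    ON some_other_table (referencing_id);
```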
