How do I delete expired data from a huge table without the log file growing out of control?

Date: 2020-12-25 09:19:17

I have a huge table (3 billion rows), which unfortunately contains mostly expired data. I want to simply delete all of these expired rows, and keep the rest.

I can execute a statement like this:

delete from giganticTable where exp_date < getDate()

The execution plan somehow estimates that about 400 million rows will be deleted.

When executed, not only does this not finish after an hour, but the database transaction log file is also growing from 6 GB to 90 GB. Note that the database was in bulk-logged recovery model while this is happening. I eventually canceled this query, since I'm sure there must be a better way to do this.

I have several tables that I need to perform a similar operation to. What's the fastest and most space-efficient way to just delete these rows if I have absolutely no desire to ever recover them?

Note that I'm using Microsoft SQL Server 2005.

3 Answers

#1


9  

I've found that when deleting from a table with a large number of rows, it helps to delete in batches of, say, 5000 or so (I usually test to see which batch size works fastest; sometimes it's 5,000 rows, sometimes 10,000, etc.). This lets each delete operation complete quickly, rather than waiting a long time for one statement to knock out 400 million records.

In SQL Server 2005, something like this should work (please test first, of course):

WHILE EXISTS ( SELECT * FROM giganticTable WHERE exp_date < getDate())
BEGIN
  DELETE TOP(5000) FROM giganticTable WHERE exp_date < getDate()
END

I would see what deleting in batches does to the log file size. If it is still blowing up the log, then you could try changing the recovery model to Simple, deleting the records, and then switching back to Bulk Logged, but only if the system can tolerate the loss of some recent data. I would definitely take a full backup before attempting that procedure. This thread also suggests that you could set up a job to back up the log with TRUNCATE_ONLY specified, so that could be another option. Hopefully you have an instance you can test with, but I would start with the batched deletes to see how that affects performance and the log file size.

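A variant of the loop above avoids re-running the `EXISTS` scan on every iteration by driving the loop off `@@ROWCOUNT` instead. This is only a sketch (note that `DECLARE` with an inline initializer is not available until SQL Server 2008, so the variable is set separately here); the commented-out log backup uses placeholder database and path names:

```sql
-- Batched delete driven by @@ROWCOUNT: the loop exits as soon as a
-- batch deletes zero rows, so no separate EXISTS scan is needed.
DECLARE @rows INT;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (5000) FROM giganticTable
    WHERE exp_date < GETDATE();

    SET @rows = @@ROWCOUNT;

    -- In FULL or BULK_LOGGED recovery, back up the log periodically
    -- between batches so log space can be reused and the file stops
    -- growing (placeholder names):
    -- BACKUP LOG MyDatabase TO DISK = 'D:\Backups\MyDatabase.trn';
END
```

Because each batch is its own transaction, the log only has to hold about 5,000 rows' worth of changes at a time instead of 400 million.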

#2


3  

You really don't want to try anything silly like turning off logging when you're doing a lot of work on a table, since any problem during the long-running task could easily lead to database corruption and other issues. However, there is a way around your problem.

Create a temp table that matches the schema of your real table. Populate it with the data you want to KEEP. Then, truncate the original table (extremely fast and easy on the log files). Finally, move the data out of the temp table and into your original (and now empty) table.

If you use auto-incrementing primary keys, you will need to force the field to take your original keys (so you don't have issues later).

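As a sketch of those steps, assuming (hypothetically) that giganticTable has an identity column `id` and data columns `exp_date` and `payload`, it might look like:

```sql
-- Copy only the rows to KEEP into a temp table.
SELECT id, exp_date, payload
INTO #keepRows
FROM giganticTable
WHERE exp_date >= GETDATE();

-- Truncate the original table: minimally logged and very fast.
TRUNCATE TABLE giganticTable;

-- Preserve the original identity values when reloading.
SET IDENTITY_INSERT giganticTable ON;

INSERT INTO giganticTable (id, exp_date, payload)
SELECT id, exp_date, payload
FROM #keepRows;

SET IDENTITY_INSERT giganticTable OFF;

DROP TABLE #keepRows;
```

Note that `SET IDENTITY_INSERT ... ON` requires an explicit column list in the `INSERT`, and `TRUNCATE TABLE` will fail if the table is referenced by foreign keys, so check for those before trying this.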

#3


1  

You should have been doing this daily, so you wouldn't face such a huge job all at once.
Since you're already in this situation, here are my suggestions:

  1. Split the job like rsbarro says. You probably don't need the WHILE statement; you can spread the work over several days.

  2. Write the date explicitly, for example:

     delete from giganticTable where exp_date < '2013-08-07'

  3. I don't have a good idea about the huge log; there doesn't seem to be a really good way around it.
