The fastest way to do a large-scale UPDATE.

Date: 2021-04-14 01:34:56

Let’s say you have a table with about 5 million records and an nvarchar(max) column populated with large text data. You want to set this column to NULL where SomeOtherColumn = 1, in the fastest possible way.

The brute-force UPDATE does not work well here because it creates one large implicit transaction and takes forever.

Doing the updates in small batches of 50K records at a time works, but it still takes 47 hours to complete on a beefy 32-core/64GB server.
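For reference, that batching approach can be sketched as a loop like the following (the table and column names, MyTable and BigColumn, are placeholders; SomeOtherColumn and the 50K batch size come from the question):

```sql
-- Batched UPDATE: each iteration commits its own small implicit transaction,
-- keeping the log footprint bounded instead of one giant transaction.
DECLARE @rows INT = 1;

WHILE @rows > 0
BEGIN
    UPDATE TOP (50000) MyTable
    SET BigColumn = NULL
    WHERE SomeOtherColumn = 1
      AND BigColumn IS NOT NULL;  -- skip already-processed rows so the loop terminates

    SET @rows = @@ROWCOUNT;       -- 0 once nothing is left to update
END
```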

Is there any way to do this update faster? Are there any magic query hints / table options that sacrifice something else (such as concurrency) in exchange for speed?

NOTE: Creating a temp table or temp column is not an option because this nvarchar(max) column holds a lot of data and so consumes a lot of space!

PS: Yes, SomeOtherColumn is already indexed.

7 Answers

#1 (7 votes)

From everything I can see it does not look like your problems are related to indexes.

The key seems to be in the fact that your nvarchar(max) field contains "lots" of data. Think about what SQL has to do in order to perform this update.

Since the value you are updating is likely more than 8,000 bytes, it is stored off-row, which implies additional effort in reading this column when it is not NULL.
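If you want to confirm how much of the column actually lives off-row, a DMV query along these lines shows the LOB allocation (a sketch; dbo.MyTable is a placeholder name):

```sql
-- Rough check of in-row vs. off-row storage for the table:
-- LOB_DATA pages hold the large nvarchar(max) values stored off-row.
SELECT alloc_unit_type_desc,  -- IN_ROW_DATA, LOB_DATA, or ROW_OVERFLOW_DATA
       page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'DETAILED');
```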

When you run a batch of 50,000 updates, SQL has to place it in an implicit transaction in order to make it possible to roll back in case of any problems. In order to roll back, it has to store the original value of the column in the transaction log.

Assuming (for simplicity sake) that each column contains on average 10,000 bytes of data, that means 50,000 rows will contain around 500MB of data, which has to be stored temporarily (in simple recovery mode) or permanently (in full recovery mode).

There is no way to disable logging, as that would compromise database integrity.

I ran a quick test on my dog-slow desktop: running batches of even 10,000 becomes prohibitively slow, but bringing the batch size down to 1,000 rows, which implies a temporary log size of around 10MB, worked just nicely.

I loaded a table with 350,000 rows and marked 50,000 of them for update. This completed in around 4 minutes, and since it scales linearly, you should be able to update all 5 million rows in around 6 hours on my 1-processor/2GB desktop, so I would expect something much better on your beefy server backed by a SAN or similar.

You may want to run your UPDATE statement as a SELECT first, selecting only the primary key and the large nvarchar column, to verify that it runs as fast as you expect.
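That dry run might look like this (Id, BigColumn, and MyTable are assumed names standing in for the real primary key, nvarchar(max) column, and table):

```sql
-- Dry run: read the same rows the UPDATE would touch,
-- to time the read side of the operation in isolation.
SELECT Id, BigColumn
FROM MyTable
WHERE SomeOtherColumn = 1;
```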

Of course the bottleneck may be other users locking things or contention on your storage or memory on the server, but since you did not mention other users I will assume you have the DB in single user mode for this.

As an optimization you should ensure that the transaction logs are on a different physical disk /disk group than the data to minimize seek times.

#2 (3 votes)

You could set the database recovery mode to Simple to reduce logging, BUT do not do this without considering the full implications for a production environment.
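Switching the recovery model is a one-line ALTER DATABASE (sketched below with MyDb as a placeholder; note that switching to SIMPLE breaks the log backup chain, so take a full backup after switching back):

```sql
-- Reduce logging for the bulk operation (understand the implications first!).
ALTER DATABASE MyDb SET RECOVERY SIMPLE;

-- ... run the batched update here ...

ALTER DATABASE MyDb SET RECOVERY FULL;
-- Then take a full backup to restart the log backup chain.
```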

What indexes are in place on the table? Given that batch updates of approximately 50,000 rows take so long, I would say you need an index.

#3 (3 votes)

Hopefully you have already dropped any indexes on the column you are setting to NULL, including full-text indexes. As said before, turning off transactions and the log file temporarily would do the trick. Backing up your data will usually truncate your log files too.

#4 (1 vote)

Have you tried placing an index or statistics on someOtherColumn?

#5 (1 vote)

This really helped me. I went from 2 hours down to 20 minutes with this.

/* I'm using database recovery mode Simple */
/* Update table statistics first */

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

/* Your 50K update here, just to get a measure of the time it will take */

SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

In my experience with MSSQL 2005, moving 4 million 46-byte records every day (automatically, though with no nvarchar(max) column) from a table in one database to a table in a different database takes around 20 minutes on a quad-core, 8GB, 2GHz server, and it doesn't hurt application performance. By moving I mean INSERT INTO ... SELECT and then DELETE. CPU usage never goes over 30%, even when the table being deleted from has 28M records and constantly receives around 4K inserts per minute but no updates. Well, that's my case; it may vary depending on your server load.
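The move pattern described above is roughly the following (the database, table, and column names are hypothetical):

```sql
-- Move rows between databases: copy them to the target, then delete the originals.
-- Running this inside one transaction keeps copy and delete consistent.
INSERT INTO TargetDb.dbo.Archive (Id, Payload)
SELECT Id, Payload
FROM SourceDb.dbo.Live
WHERE MovedFlag = 1;

DELETE FROM SourceDb.dbo.Live
WHERE MovedFlag = 1;
```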

READ UNCOMMITTED

"Specifies that statements (your updates) can read rows that have been modified by other transactions but not yet committed." In my case, the records are readonly.

I don't know what rg-tsql means, but here you'll find info about transaction isolation levels in MSSQL.

#6 (0 votes)

Try indexing 'SomeOtherColumn'... 50K records should update in a snap. If there is already an index in place, see whether the index needs to be reorganized and whether statistics have been collected for it.
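A plain or filtered index plus maintenance might look like the following (index and table names assumed; a filtered index covers only the rows the UPDATE needs to find, keeping it small):

```sql
-- Filtered index: indexes only the rows that still need updating.
CREATE NONCLUSTERED INDEX IX_MyTable_SomeOtherColumn
ON MyTable (SomeOtherColumn)
WHERE SomeOtherColumn = 1;

-- If the index already exists: defragment it and refresh statistics.
ALTER INDEX IX_MyTable_SomeOtherColumn ON MyTable REORGANIZE;
UPDATE STATISTICS MyTable;
```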

#7 (0 votes)

If you are running a production environment with not enough space to duplicate all your tables, I believe that you are looking for trouble sooner or later.

If you provide some info about the number of rows with SomeOtherColumn=1, perhaps we can think another way, but I suggest:

0) Back up your table
1) Index the flag column
2) Set the table option to "no log transactions" ... if possible
3) Write a stored procedure to run the updates
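Step 3 could be a small stored procedure wrapping a batched update, sketched here with assumed names (dbo.NullOutBigColumn, MyTable, BigColumn):

```sql
-- Sketch of a stored procedure that NULLs the big column in batches.
CREATE PROCEDURE dbo.NullOutBigColumn
    @BatchSize INT = 50000
AS
BEGIN
    SET NOCOUNT ON;
    DECLARE @rows INT = 1;

    WHILE @rows > 0
    BEGIN
        UPDATE TOP (@BatchSize) MyTable
        SET BigColumn = NULL
        WHERE SomeOtherColumn = 1
          AND BigColumn IS NOT NULL;

        SET @rows = @@ROWCOUNT;
    END
END
```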
