SQL Server - Merging large tables without locking the data

Time: 2021-08-18 00:13:10

I have a very large set of data (~3 million records) which needs to be merged with updates and new records on a daily schedule. I have a stored procedure that actually breaks up the record set into 1000 record chunks and uses the MERGE command with temp tables in an attempt to avoid locking the live table while the data is updating. The problem is that it doesn't exactly help. The table still "locks up" and our website that uses the data receives timeouts when attempting to access the data. I even tried splitting it up into 100 record chunks and even tried a WAITFOR DELAY '000:00:5' to see if it would help to pause between merging the chunks. It's still rather sluggish.

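For reference, the chunking pattern described above looks roughly like this (a sketch with made-up table and column names rather than my actual procedure):

    -- Rough sketch: process incoming data in ~1000-row chunks via a temp table,
    -- MERGE each chunk into the live table, then pause before the next chunk.
    DECLARE @BatchStart INT = 1, @BatchSize INT = 1000, @MaxRow INT;

    SELECT @MaxRow = MAX(RowNum) FROM stage.IncomingData;  -- hypothetical staging table with a row number

    WHILE @BatchStart <= @MaxRow
    BEGIN
        SELECT Id, Name, Value
        INTO #Chunk
        FROM stage.IncomingData
        WHERE RowNum BETWEEN @BatchStart AND @BatchStart + @BatchSize - 1;

        MERGE dbo.Target AS t               -- dbo.Target stands in for the live table
        USING #Chunk AS s
            ON t.Id = s.Id
        WHEN MATCHED THEN
            UPDATE SET t.Name = s.Name, t.Value = s.Value
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (Id, Name, Value) VALUES (s.Id, s.Name, s.Value);

        DROP TABLE #Chunk;

        SET @BatchStart += @BatchSize;
        WAITFOR DELAY '00:00:05';           -- the pause between chunks mentioned above
    END;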

I'm looking for any suggestions, best practices, or examples on how to merge large sets of data without locking the tables.


Thanks


3 Answers

#1 (score: 6)

Change your front end to use NOLOCK or READ UNCOMMITTED when doing the selects.


You can't NOLOCK a MERGE, INSERT, or UPDATE, as the records must be locked in order to perform the update. However, you can NOLOCK the SELECTs.

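A minimal sketch of the read side (dbo.LiveData and its columns are made-up names for illustration):

    -- Per-table hint
    SELECT Id, Name, UpdatedAt
    FROM dbo.LiveData WITH (NOLOCK)
    WHERE UpdatedAt >= DATEADD(DAY, -1, SYSUTCDATETIME());

    -- Or set the isolation level for the whole connection/batch
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;

    SELECT Id, Name, UpdatedAt
    FROM dbo.LiveData
    WHERE UpdatedAt >= DATEADD(DAY, -1, SYSUTCDATETIME());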

Note that you should use this with caution. If dirty reads are okay, then go ahead. However, if the reads require the updated data then you need to go down a different path and figure out exactly why merging 3M records is causing an issue.


I'd be willing to bet that most of the time is spent reading data from disk during the MERGE command and/or working around low-memory conditions. You might be better off simply putting more RAM into your database server.


Ideally, you'd have enough RAM to pull the whole database into memory as needed. For example, if you have a 4 GB database, make sure you have 8 GB of RAM (on an x64 server, of course).

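If you want to sanity-check how much of the database is actually cached versus its size and the instance's memory cap, something along these lines works with the standard DMVs (requires VIEW SERVER STATE):

    -- Pages of the current database held in the buffer pool, in MB
    SELECT COUNT_BIG(*) * 8 / 1024 AS cached_mb
    FROM sys.dm_os_buffer_descriptors
    WHERE database_id = DB_ID();

    EXEC sp_spaceused;                              -- total database size

    SELECT value_in_use AS max_server_memory_mb     -- instance memory cap
    FROM sys.configurations
    WHERE name = 'max server memory (MB)';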

#2 (score: 5)

I'm afraid I've had quite the opposite experience. We were performing updates and insertions where the source table had only a fraction of the number of rows of the target table, which was in the millions.


When we combined the source table records across the entire operational window and then performed the MERGE just once, we saw a 500% increase in performance. My explanation for this is that you are paying for the up front analysis of the MERGE command just once instead of over and over again in a tight loop.


Furthermore, I am certain that merging 1.6 million rows (source) into 7 million rows (target), as opposed to 400 rows into 7 million rows over 4000 distinct operations (in our case), leverages the capabilities of the SQL Server engine much better. Again, a fair amount of the work is in the analysis of the two data sets, and this is done only once.

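As a rough sketch of that approach (all table and column names here are made up): load the whole window's worth of changes into a staging table, then run one MERGE:

    -- dbo.Target stands in for the live table; stage.DailyChanges holds the
    -- full operational window's updates and new records.
    MERGE dbo.Target AS t
    USING stage.DailyChanges AS s
        ON t.Id = s.Id
    WHEN MATCHED THEN
        UPDATE SET t.Name  = s.Name,
                   t.Value = s.Value
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Id, Name, Value)
        VALUES (s.Id, s.Name, s.Value);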

Another question I have to ask as well is whether you are aware that the MERGE command performs much better with indexes on both the source and target tables. I would like to refer you to the following link:


http://msdn.microsoft.com/en-us/library/cc879317(v=SQL.100).aspx

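Following that guidance, a minimal sketch (same made-up names as above) is to index both sides on the key used in the MERGE's ON clause:

    -- Clustered index on the staging (source) table's join key
    CREATE CLUSTERED INDEX IX_DailyChanges_Id ON stage.DailyChanges (Id);

    -- The target's join key is usually already covered by its primary key;
    -- if not, something like this would be needed:
    -- CREATE UNIQUE CLUSTERED INDEX IX_Target_Id ON dbo.Target (Id);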

#3 (score: 0)

From personal experience, the main problem with MERGE is that since it takes page locks, it precludes any concurrency in INSERTs directed at the table. So if you go down this road, it is fundamental that you batch all updates that will hit a table through a single writer.


For example: we had a table on which an INSERT took a crazy 0.2 seconds per entry, with most of that time seemingly wasted on transaction latching, so we switched it over to MERGE. Some quick tests showed that it allowed us to insert 256 entries in 0.4 seconds, or even 512 in 0.5 seconds. We tested this with load generators and all seemed fine, until it hit production and everything blocked to hell on the page locks, resulting in much lower total throughput than with the individual INSERTs.


The solution was not only to batch the entries from a single producer into one MERGE operation, but also to batch the batches from all producers going to an individual DB into a single MERGE operation, through an additional level of queueing (previously also a single connection per DB, but using MARS to interleave all the producers' calls to the stored procedure doing the actual MERGE transaction). This way we were then able to handle many thousands of INSERTs per second without problem.

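A stripped-down sketch of that kind of single-writer procedure, using a table-valued parameter for the consolidated batch (all names are made up, not our actual schema):

    -- TVP type matching the live table's key and payload columns
    CREATE TYPE dbo.EntryBatch AS TABLE
    (
        Id    INT PRIMARY KEY,   -- keyed so the MERGE source is already indexed
        Name  NVARCHAR(100),
        Value INT
    );
    GO

    CREATE PROCEDURE dbo.MergeEntryBatch
        @Batch dbo.EntryBatch READONLY   -- the single writer passes one consolidated batch
    AS
    BEGIN
        SET NOCOUNT ON;

        MERGE dbo.Target AS t            -- dbo.Target stands in for the live table
        USING @Batch AS s
            ON t.Id = s.Id
        WHEN MATCHED THEN
            UPDATE SET t.Name = s.Name, t.Value = s.Value
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (Id, Name, Value)
            VALUES (s.Id, s.Name, s.Value);
    END;
    GO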

Having the NOLOCK hints on all of your front-end reads is an absolute must, always.

