同一表上的SQL Server死锁

We have problems with deadlock situations in our application. I have read a lot about blocking, locking and deadlocks the last few days to try to get an understanding about the problem in order to solve it.

在我们的应用程序中存在死锁问题。我最近几天读了很多关于阻塞、锁定和死锁的文章，试图了解这个问题，以便解决它。

Now when I read the error log information about the deadlocks I can't understand how this situation can exist. Look at this (I have renamed the table names but the important one is the one called OurTable in the log message):

现在，当我阅读关于死锁的错误日志信息时，我无法理解这种情况是如何存在的。看看这个(我已经重命名了表名，但重要的是日志消息中名为OurTable的表):

deadlock-list
deadlock victim=process1e2ac02c8
process-list
    process id=process1e2ac02c8 taskpriority=0 logused=0 waitresource=OBJECT: 11:290100074:0  waittime=704 ownerId=3144354890 transactionname=SELECT lasttranstarted=2011-12-01T14:43:20.577 XDES=0x80017920 lockMode=S schedulerid=6 kpid=7508 status=suspended spid=155 sbid=0 ecid=0 priority=0 trancount=0 lastbatchstarted=2011-12-01T14:43:20.577 lastbatchcompleted=2011-12-01T14:43:20.577 clientapp=.Net SqlClient Data Provider hostname=DE-1809 hostpid=5856 loginname=2Ezy isolationlevel=read committed (2) xactid=3144354890 currentdb=11 lockTimeout=4294967295 clientoption1=673185824 clientoption2=128056
     executionStack
      frame procname=.dbo.RetrieveSomething line=23 stmtstart=1398 stmtend=3724 sqlhandle=0x03000b0030d42d645a63e6006a9f00000100000000000000
         select
            Col1
            ,Col2
            ,(
                SELECT TOP(1)
                    Col1
                FROM
                    OurTable2 AS C
                        JOIN OurTable AS ETC ON C.Id = ETC.FKId
                            AND E.Id = C.FKId
                ORDER BY ETC.Col2
            ) AS Col3
        from OurTable3 AS E
    process id=process2df4894c8 taskpriority=0 logused=0 waitresource=OBJECT: 11:290100074:0  waittime=9713 ownerId=3144330250 transactionname=INSERT EXEC lasttranstarted=2011-12-01T14:43:11.573 XDES=0x370764930 lockMode=S schedulerid=13 kpid=4408 status=suspended spid=153 sbid=0 ecid=0 priority=0 trancount=1 lastbatchstarted=2011-12-01T14:43:11.573 lastbatchcompleted=2011-12-01T14:43:11.573 clientapp=.Net SqlClient Data Provider hostname=DE-1809 hostpid=5856 loginname=2Ezy isolationlevel=read committed (2) xactid=3144330250 currentdb=11 lockTimeout=4294967295 clientoption1=673185824 clientoption2=128056
     executionStack
      frame procname=adhoc line=1 sqlhandle=0x02000000ba6cb42612240bdb19f7303e279a714276c04344
         select
            Col1
            , Col2
            , Col3
            , ISNULL(
                (select top(1)
                    E_SUB.Col1 + ' ' + E_SUB.Col2
                    from OurTable3 as E_SUB 
                        inner join OurTable2 as C on E_SUB.Id = C.FKId
                        inner join OurTable as ETC on C.Id = ETC.FKId
                as Col3
        from OurTable4
            inner join dbo.OurTable as ETC on Id = ETC.FKId  
    process id=process8674c8 taskpriority=0 logused=0 waitresource=OBJECT: 11:290100074:5  waittime=338 ownerId=3143936820 transactionname=INSERT lasttranstarted=2011-12-01T14:38:24.423 XDES=0x1ecd229f0 lockMode=X schedulerid=7 kpid=12092 status=suspended spid=124 sbid=0 ecid=0 priority=0 trancount=2 lastbatchstarted=2011-12-01T14:38:23.027 lastbatchcompleted=2011-12-01T14:38:23.013 clientapp=.Net SqlClient Data Provider hostname=DE-1809 hostpid=5856 loginname=2Ezy isolationlevel=read committed (2) xactid=3143936820 currentdb=11 lockTimeout=4294967295 clientoption1=673185824 clientoption2=128056
     executionStack
      frame procname=.dbo.UpsertSomething line=332 stmtstart=27712 stmtend=31692 sqlhandle=0x03000b00bbf2a93c0f63a700759f00000100000000000000
            insert into dbo.OurTable
            (
                Col1
                ,Col2
                ,Col3
            )
            values
            (
                @Col1
                ,@Col2
                ,@Col3
            )
       resource-list
        objectlock lockPartition=0 objid=290100074 subresource=FULL dbid=11 objectname=dbo.OurTable id=lock16a1fde80 mode=X associatedObjectId=290100074
         owner-list
         waiter-list
          waiter id=process1e2ac02c8 mode=S requestType=wait
        objectlock lockPartition=0 objid=290100074 subresource=FULL dbid=11 objectname=dbo.OurTable id=lock16a1fde80 mode=X associatedObjectId=290100074
         owner-list
          owner id=process8674c8 mode=X
         waiter-list
          waiter id=process2df4894c8 mode=S requestType=wait
        objectlock lockPartition=5 objid=290100074 subresource=FULL dbid=11 objectname=dbo.OurTable id=lock212f0f300 mode=IS associatedObjectId=290100074
         owner-list
          owner id=process1e2ac02c8 mode=IS
         waiter-list
          waiter id=process8674c8 mode=X requestType=wait

The way I read this is:

我的解读是:

spid 155 is waiting for a Shared table lock on OurTable (spid 124 holds a conflicting X lock)

spid 155正在等待我们表上的共享表锁(spid 124持有冲突的X锁)

spid 153 is waiting for a Shared table lock on OurTable (spid 124 holds a conflicting X lock)

spid 153正在等待我们表上的共享表锁(spid 124持有冲突的X锁)

spid 124 is waiting for an Exclusive table lock on OurTable (spid 155 holds a conflicting IS lock)

spid 124正在等待我们表上的独占表锁(spid 155持有一个冲突的is锁)

My question is how this can happen. Two sessions holds one lock on the whole table at the same time. I thought that a usual deadlock is when two ore more sessions hold locks on different resources and wait for each other. But here the lock is on the same resource. It is not a lock on an index but on the table. This error is frequent in our application and some lock has to be the first one to be requested and why is the second lock accepted if there already is a lock on the entire table?

我的问题是这是怎么发生的。两个会话同时在整个表上保存一个锁。我认为一个通常的死锁是当两个更多的会话持有不同资源的锁并相互等待的时候。但是这里的锁在同一个资源上。它不是索引上的锁，而是表上的锁。这个错误在我们的应用程序中很常见，必须首先请求一个锁，如果整个表上已经有一个锁，为什么还要接受第二个锁呢?

Anyone who can give a hint of what can be wrong or anyone experienced a similar deadlock?

有谁能指出哪里出了问题，或者有谁经历过类似的僵局?

3 个解决方案

#1

After a little bit more searching and testing I am pretty confident I can give the correct answer to my own question.

在进行了一些搜索和测试之后，我非常自信能够给出我自己问题的正确答案。

I have to thank Martin Smith who put me in the right direction by pointing out that the wait resources were different.

我要感谢马丁·史密斯，他让我找到了正确的方向，指出等待资源是不同的。

As Martin wrote in his comment the wait resources are: 11:290100074:0 and 11:290100074:5. After searching this it turns out that if you run Sql Server R2 on a machine with 16 CPUs or more Sql Server is able to use a feature called lock partitioning.

正如马丁在评论中所写的，等待资源是:11:290100074:0和11:290100074:5。搜索之后发现，如果在一台有16个cpu或更多Sql服务器的机器上运行Sql Server R2，那么它就能够使用一种称为锁分区的特性。

This article says among other things:

这篇文章除其他内容外还说:

Only NL, SCH-S, IS, IU, and IX lock modes are acquired on a single partition.

只有NL、sb - s、IS、IU和IX锁模式是在单个分区上获得的。

What happens in my case is that spid 155 puts a shared lock on a row or page and therefor puts an intended shared lock on the object and with the lock partition feature this happens to be on partition id 5.

在我的例子中，spid 155将一个共享锁放置在行或页面上，因此，将一个预期的共享锁放置在对象上，并且具有锁分区特性，这恰好位于分区id 5上。

At the same time spid 124 needs to lock the full object with an exclusive lock and therefor needs to put X lock on all partitions.

与此同时，spid 124需要用独占锁锁定整个对象，因此需要在所有分区上设置X锁。

Shared (S), exclusive (X), and other locks in modes other than NL, SCH-S, IS, IU, and IX must be acquired on all partitions starting with partition ID 0 and following in partition ID order.

除NL、sb -S、IS、IU和IX之外的模式下的共享(S)、独占(X)和其他锁必须在以分区ID 0开头的所有分区上按分区ID顺序获取。

When it arrives at partition id 5 it is told that spid 155 holds an IS lock and it needs to wait until that lock is released.

当它到达分区id 5时，它被告知spid 155持有一个is锁，它需要等待锁释放。

Now when spid 124 is waiting on the IS lock to be released lock escalation occurs on spid 155 and it requests a shared lock on the table. This means it needs to put S lock on all partitions starting at id 0. But immediately on id 0 it hits the wall because spid 124 already holds an exclusive lock on that partition. And there you have the cause of the deadlock.

现在spid 124正在等待的是在spid 155上释放锁升级，它请求表上的共享锁。这意味着它需要在id为0的所有分区上设置S锁。但是在id 0上，它会立即撞到墙上，因为spid 124已经在那个分区上持有一个独占锁。这就是死锁的原因。

I can not guarantee 100% this is the exact answer but I am pretty sure I am, if not 100% right, at least close to the answer.

我不能保证100%这是准确的答案，但我很确定我是，如果不是100%正确的话，至少接近答案。

The solution? Well. The lock partition feature can not be turned off but on the other hand you can control lock escalation with different transaction levels and also different options in the alter table statement.

解决方案?好。锁分区特性不能被关闭，但是另一方面，您可以控制不同事务级别和alter table语句中不同选项的锁升级。

I will continue to investigate why the query forces lock escalation because I believe the solution in my particular case is to tune the query somehow to not escalate. At least I will try this before using the tools mentioned above.

我将继续研究为什么查询强制锁升级，因为我相信在我的特定情况下，解决方案是以某种方式将查询调优为不升级。至少我会在使用上面提到的工具之前尝试一下。

Hope this answer helps other with similar problems.

希望这个答案能帮助其他有类似问题的人。

#2

It is not always true that "usual deadlock is when two ore more sessions hold locks on different resources and wait for each other" - also there are conversion deadlocks. Even if two processes compete on only one resource, they still can embrace in a conversion deadlock which I described here.

“通常的死锁是当两个以上的会话在不同的资源上持有锁并相互等待时”并不总是正确的——还有转换死锁。即使两个进程只在一个资源上竞争，它们仍然可以在我在这里描述的转换死锁中拥抱。

Also, although the best known deadlock scenario involves two connections modifying two tables in different order, there are also other deadlock scenarios involving only one table. Besides, in some scenarios each connection needs to issue only one statement, and it is enough to get a deadlock. Also in some scenarios only one connection needs to modify or acquire exclusive locks – the other one may only read data and only acquire shared locks and still embrace in a deadlock.

另外，尽管最著名的死锁场景涉及到两个连接以不同的顺序修改两个表，但也存在其他只涉及一个表的死锁场景。此外，在某些情况下，每个连接只需要发出一条语句，这足以导致死锁。另外，在某些情况下，只有一个连接需要修改或获取独占锁——另一个连接可能只读取数据，只获取共享锁，并且仍然处于死锁中。

One more thing: answering this "none of the queries runs in a transaction" comment - every DML statement always runs in a transaction, and DML means selects too. All the commands involved in your deadlock run in the context of a transaction. Follow the second link, and run the repro scripts - you will see for yourself.

还有一件事:回答这个“事务中不运行任何查询”的注释——每个DML语句总是在事务中运行，DML意味着选择。死锁中涉及的所有命令都在事务的上下文中运行。遵循第二个链接，并运行repro脚本——您将看到自己。

Anyway, I would just run the select under snapshot isolation - that would prevent this particular deadlock (when one connection only reads) from happening.

无论如何，我将在快照隔离下运行select—这将防止这种特殊的死锁(当一个连接仅读取时)发生。

#3

It happens because your lock strategy is simplistic and you run into a case where this comes and hurts you.

之所以会出现这种情况，是因为您的锁策略过于简单，您会遇到这种情况，并且会对您造成伤害。

As in: if you select with a bad enough lock level you get a read lock and can not update to a write lock while another read lock is in place, if two apps do that... you can have exactly thagt behavior (process 1 gets read ock, 2 gets read lock, 1 wants to update to write lock (waits), 2 wants to update to write lock - deadlock.

如:如果您选择了一个足够坏的锁级别，那么您将获得一个读锁，并且在另一个读锁存在的情况下无法更新写入锁，如果两个应用程序这样做……您可以有确切的thagt行为(进程1获取读锁，进程2获取读锁，进程1希望更新写入锁(等待)，进程2希望更新写入锁死锁。

In your particular case, the reads seem to put up read locks, while the upsert does make one to decide to update, but blows then on the lock necessary for the insert (and yes, you can have a lock that blocks an insert).

在您的特定情况下，读操作似乎会挂起读锁，而upsert确实会让用户决定要更新，但是会在插入所需的锁上击出读锁(是的，您可以使用锁来阻止插入)。

This error is frequent in our application and some lock has to be the first one to be requested and why is the second lock accepted if there already is a lock on the entire table?

这个错误在我们的应用程序中很常见，必须首先请求一个锁，如果整个表上已经有一个锁，为什么还要接受第二个锁呢?

Beginner design issue. The issue happens because some locks are shared (read locks mostly) allowing other read locks to be etablished. IF you allow that. I would suggest making sure the deadlock does not occur. Either have the reads not leave a lock (with NOLOCK) or get proper write locks a lot earlier.

初学者设计问题。之所以会出现这个问题，是因为有些锁是共享的(主要是读锁)，允许其他读锁被etablished。如果你允许。我建议确保不会发生死锁。要么让读操作不留下锁(使用NOLOCK)，要么更早地获得正确的写锁。

#1