INSERT

Posted: 2021-11-04 01:56:38

While browsing SO I found the following question/discussion about the "best" approach for inserting records that don't exist yet. One of the statements that struck me was one by [Remus Rusanu]:

Both variants are incorrect. You will insert pairs of duplicate @value1, @value2, guaranteed.

Although I do agree with this for the syntax where the check is 'separated' from the INSERT (and no explicit locking/transaction management is present), I'm having a hard time understanding why and when this would be true for the other proposed syntax, which looks like this:

INSERT INTO mytable (x)
SELECT @x WHERE NOT EXISTS (SELECT * FROM mytable WHERE x = @x);

I do NOT want to start (another) what's-best/fastest discussion, nor do I think this syntax can 'replace' a unique index/constraint (or PK), but I really need to know in what situations this construction could cause duplicates, as I've been using this syntax in the past and wonder if it is unsafe to continue doing so in the future.

What I think happens is that the INSERT & SELECT are both in the same (implicit) transaction. The query will take an IX lock on the related record (key) and not release it until the entire query has finished, thus only AFTER the record has been inserted. This lock blocks all other connections from making the same INSERT, as they can't get a lock themselves until after our insert has finished; only then do they get the lock and start verifying for themselves whether the record already exists.

As IMHO the best way to find out is by testing, I've been running the following code for a while on my laptop:

Create table

CREATE TABLE t_test (x int NOT NULL PRIMARY KEY);

Run the code below on many, many connections in parallel:

SET NOCOUNT ON

WHILE 1 = 1
    BEGIN
        INSERT t_test (x)
        SELECT x = DatePart(ms, CURRENT_TIMESTAMP)
         WHERE NOT EXISTS ( SELECT *
                              FROM t_test old
                             WHERE old.x = DatePart(ms, CURRENT_TIMESTAMP) )
    END

So far the only things to note are:

  • No errors encountered (yet)
  • CPU is running quite hot =)
  • the table quickly held 300 records (due to the ~3 ms 'precision' of datetime); after that no actual inserts happen any more, as expected.
UPDATE:

Turns out my example above is not doing what I intended it to do. Instead of multiple connections trying to insert the same record simultaneously, I simply had it not inserting already-existing records after the first second. As it probably took about a second to copy-paste & execute the query on the next connection, there was never any danger of duplicates. I'll be wearing my donkey ears for the remainder of the day...

Anyway, I've adapted the test to be more in line with the matter at hand (using the same table):

SET NOCOUNT ON

DECLARE @midnight datetime
SELECT @midnight = Convert(datetime, Convert(varchar, CURRENT_TIMESTAMP, 106), 106)

WHILE 1 = 1
    BEGIN
        INSERT t_test (x)
        SELECT x = DateDiff(ms, @midnight, CURRENT_TIMESTAMP)
         WHERE NOT EXISTS ( SELECT *
                              FROM t_test old
                             WHERE old.x = DateDiff(ms, @midnight, CURRENT_TIMESTAMP))
    END

And lo and behold, the output window now shows plenty of errors along the lines of:

Msg 2627, Level 14, State 1, Line 8 Violation of PRIMARY KEY constraint 'PK__t_test__3BD019E521C3B7EE'. Cannot insert duplicate key in object 'dbo.t_test'. The duplicate key value is (57581873).

FYI: As pointed out by Andomar, adding a HOLDLOCK and/or SERIALIZABLE hint indeed 'solves' the problem but then turns out to be causing lots of deadlocks... which isn't great but not unexpected either when I think it through.

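For completeness, a sketch of the hinted variant of the original statement (the UPDLOCK + HOLDLOCK combination is the one usually suggested; table and variable names are taken from the question's example):

```sql
-- Sketch: make the existence check and the insert atomic.
-- HOLDLOCK (= SERIALIZABLE) holds a key-range lock until the
-- transaction ends; UPDLOCK makes two concurrent checkers
-- incompatible, so they cannot both pass the NOT EXISTS test.
-- The price, as noted above, is blocking and possible deadlocks.
INSERT INTO mytable (x)
SELECT @x
 WHERE NOT EXISTS ( SELECT *
                      FROM mytable WITH (UPDLOCK, HOLDLOCK)
                     WHERE x = @x );
```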
Guess I have quite a bit of code review to do...

2 Answers

#1


Thanks for posting a separate question. You have several misconceptions:

The query will take an IX lock on the related record (key) and not release it until the entire query has finished

The INSERT will lock the rows inserted with an X lock (intent locks like IX can only be requested on parent entities in the lock hierarchy, never on records). This lock must be held until the transaction commits (strict two-phase locking requires X locks to be released only at the end of the transaction).

Note that the locks acquired by the INSERT will not block more inserts, even of the same key. The only way to prevent duplicates is a unique index, and the mechanism that enforces uniqueness is not lock based. Yes, on a primary key, due to its uniqueness, duplicates will be prevented, but the forces at play are different, even if locking does play a role.

In your example what will happen is that the operations will serialize because the SELECT blocks on the INSERT, due to the X vs. S lock conflict on the newly inserted row. Another thing to consider is that 300 records of type INT will fit on a single page, so a lot of optimizations will kick in (e.g. a scan instead of multiple seeks) and will alter the test results. Remember, a hypothesis with many positives and no proof is still only a conjecture...

To test the problem you need to ensure that the INSERT does not block concurrent SELECTs. Running under RCSI or under snapshot isolation is one way to achieve this (and production may 'achieve' it involuntarily and break an app that made all the assumptions above...). A WHERE clause is another way. A significantly big table with secondary indexes is yet another.

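For reference, a minimal sketch of putting a (throwaway) database under RCSI, which is the scenario mentioned above where readers stop blocking on the writer's X locks:

```sql
-- Under READ_COMMITTED_SNAPSHOT, the NOT EXISTS check reads a
-- row-version snapshot instead of waiting on the inserter's X lock,
-- so it can miss an in-flight insert of the same key.
ALTER DATABASE test
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;
```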
So here is how I tested it:

set nocount on;
go

drop database test;
go

create database test;
go

use test;
go

create table test (id int primary key, filler char(200));
go

-- seed 10000 values, fill some pages
declare @i int = 0;
begin transaction
while @i < 10000
begin
    insert into test (id) values (@i);
    set @i += 1;
end
commit;

Now run this from several parallel connections (I used 3):

use test;
go

set nocount on;
go

declare @i int;
while (1=1)
begin
    -- This is not cheating. This ensures that many concurrent SELECT attempt 
    -- to insert the same values, and all of them believe the values are 'free'
    select @i = max(id) from test with (readpast);
    insert into test (id)
    select id
        from (values (@i), (@i+1), (@i+2), (@i+3), (@i+4), (@i+5)) as t(id)
        where t.id not in (select id from test);
end

Here are some results:

Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130076).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130096).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130106).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130121).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130141).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130151).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130176).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6

Even with locking, no snapshot isolation, no RCSI. As each SELECT attempts to insert @i ... @i+5, they'll all discover the values do not exist and then they'll all proceed to INSERT. One lucky winner will succeed; all the rest will cause PK violations. Frequently. I used @i = MAX(id) intentionally to dramatically increase the chances of conflict, but that is not required. I'll leave the problem of figuring out why all violations occur on values %5+1 as an exercise.

#2


You are testing from a single connection, so you are not testing concurrency at all. Run the script twice from different windows and you will start to see conflicts.

There are multiple reasons for the conflicts:

  • By default, a lock is not held until the end of an (implicit) transaction. Use the with (holdlock) query hint to change this behavior.
  • The concurrency problem with your query is called a "phantom read". The default transaction isolation level is "read committed", which does not protect against phantom reads. Use the with (serializable) query hint to increase the isolation level. (Try to avoid the set transaction isolation level command, because the isolation level is not cleared when a connection is returned to the connection pool.)
The primary key constraint is always enforced. So your query will try to insert a duplicate row and fail by throwing a duplicate key error.

A good approach is to use your query (which will work 99% of the time) and make the client handle the occasional duplicate key exception in a graceful manner.

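A sketch of that pattern in T-SQL, using the question's mytable/@x names (2627 is the duplicate-key error shown above; 2601 is the variant raised by a unique index):

```sql
BEGIN TRY
    INSERT INTO mytable (x)
    SELECT @x
     WHERE NOT EXISTS ( SELECT * FROM mytable WHERE x = @x );
END TRY
BEGIN CATCH
    -- Swallow only duplicate-key errors; rethrow everything else.
    IF ERROR_NUMBER() NOT IN (2627, 2601)
        THROW;
END CATCH
```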
Wikipedia has a great explanation of isolation levels.
