Is SQL Server DRI (ON DELETE CASCADE) slow?

I've been analyzing a recurring "bug report" (perf issue) in one of our systems related to a particularly slow delete operation. Long story short: It seems that the CASCADE DELETE keys were largely responsible, and I'd like to know (a) if this makes sense, and (b) why it's the case.

We have a schema of, let's say, widgets, those being at the root of a large graph of related tables and related-to-related tables and so on. To be perfectly clear, deleting from this table is actively discouraged; it is the "nuclear option" and users are under no illusions to the contrary. Nevertheless, it sometimes just has to be done.

The schema looks something like this:

Widgets
   |
   +--- Anvils (1:1)
   |    |
   |    +--- AnvilTestData (1:N)
   |
   +--- WidgetHistory (1:N)
        |
        +--- WidgetHistoryDetails (1:N)

Column definitions look like the following:

Widgets (WidgetID int PK, WidgetName varchar(50))
Anvils (AnvilID int PK, WidgetID int FK/IX/UNIQUE, ...)
AnvilTestData (AnvilID int FK/IX, TestID int, ...Test Data...)
WidgetHistory (HistoryID int PK, WidgetID int FK/IX, HistoryDate datetime, ...)
WidgetHistoryDetails (HistoryID int FK/IX, DetailType smallint, ...)

Nothing too scary, really. A Widget can be different types, an Anvil is a special type, so that relationship is 1:1 (or more accurately 1:0..1). Then there's a large amount of data - perhaps thousands of rows of AnvilTestData per Anvil collected over time, dealing with hardness, corrosion, exact weight, hammer compatibility, usability issues, and impact tests with cartoon heads.

Then every Widget has a long, boring history of various types of transactions - production, inventory moves, sales, defect investigations, RMAs, repairs, customer complaints, etc. There might be 10-20k details for a single widget, or none at all, depending on its age.

So, unsurprisingly, there's a CASCADE DELETE relationship at every level here. If a Widget needs to be deleted, it means something's gone terribly wrong and we need to erase any records of that widget ever existing, including its history, test data, etc. Again, nuclear option.
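
To confirm which keys in a database actually cascade, the catalog views can be queried. A minimal sketch, assuming SQL Server 2005 or later and that it is run in the database holding the widget tables; the column aliases are made up for readability:

```sql
-- List every foreign key whose delete action is CASCADE.
SELECT
    fk.name                               AS ForeignKeyName,
    OBJECT_NAME(fk.parent_object_id)      AS ReferencingTable,
    OBJECT_NAME(fk.referenced_object_id)  AS ReferencedTable,
    fk.delete_referential_action_desc     AS DeleteAction
FROM sys.foreign_keys AS fk
WHERE fk.delete_referential_action = 1   -- 1 = CASCADE
ORDER BY ReferencedTable, ReferencingTable;
```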

Relations are all indexed, statistics are up to date. Normal queries are fast. The system tends to hum along pretty smoothly for everything except deletes.

Getting to the point here, finally, for various reasons we only allow deleting one widget at a time, so a delete statement would look like this:

DELETE FROM Widgets
WHERE WidgetID = @WidgetID

Pretty simple, innocuous looking delete... that takes over 2 minutes to run, for a widget with no data!

After slogging through execution plans I was finally able to pick out the AnvilTestData and WidgetHistoryDetails deletes as the sub-operations with the highest cost. So I experimented with turning off the CASCADE (but keeping the actual FK, just setting it to NO ACTION) and rewriting the script as something very much like the following:

DECLARE @AnvilID int
SELECT @AnvilID = AnvilID FROM Anvils WHERE WidgetID = @WidgetID

DELETE FROM AnvilTestData
WHERE AnvilID = @AnvilID

DELETE FROM WidgetHistory
WHERE HistoryID IN (
    SELECT HistoryID
    FROM WidgetHistory
    WHERE WidgetID = @WidgetID)

DELETE FROM Widgets WHERE WidgetID = @WidgetID

Both of these "optimizations" resulted in significant speedups, each one shaving nearly a full minute off the execution time, so that the original 2-minute deletion now takes about 5-10 seconds - at least for new widgets, without much history or test data.

Just to be absolutely clear, there is still a CASCADE from WidgetHistory to WidgetHistoryDetails, where the fanout is highest; I only removed the one originating from Widgets.
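
For reference, switching a key from CASCADE to NO ACTION while keeping the constraint itself means dropping and recreating it. A sketch only: FK_WidgetHistory_Widgets is a hypothetical constraint name, since the real names are not shown here.

```sql
-- Hypothetical constraint name; substitute the real one from sys.foreign_keys.
ALTER TABLE WidgetHistory
    DROP CONSTRAINT FK_WidgetHistory_Widgets;

-- Recreate the same key without the cascade. WITH CHECK validates existing rows
-- so the constraint stays trusted.
ALTER TABLE WidgetHistory WITH CHECK
    ADD CONSTRAINT FK_WidgetHistory_Widgets
    FOREIGN KEY (WidgetID) REFERENCES Widgets (WidgetID)
    ON DELETE NO ACTION;
```

With the cascade gone, a delete from Widgets fails with a constraint violation unless the WidgetHistory rows are removed first, which is what the explicit script does.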

Further "flattening" of the cascade relationships resulted in progressively less dramatic but still noticeable speedups, to the point where deleting a new widget was almost instantaneous once all of the cascade deletes to larger tables were removed and replaced with explicit deletes.
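
A fully flattened version might look roughly like the following. This is a sketch under stated assumptions, not the exact script used: it assumes every foreign key in the graph has been switched to NO ACTION, uses only the table and column names from the definitions above, and relies on SQL Server 2012 or later for THROW. The @WidgetID value is a placeholder.

```sql
DECLARE @WidgetID int = 42;  -- placeholder value for illustration

BEGIN TRY
    BEGIN TRANSACTION;

    -- Leaf tables first, located by joining up to the widget.
    DELETE d
    FROM WidgetHistoryDetails AS d
    INNER JOIN WidgetHistory AS h ON h.HistoryID = d.HistoryID
    WHERE h.WidgetID = @WidgetID;

    DELETE t
    FROM AnvilTestData AS t
    INNER JOIN Anvils AS a ON a.AnvilID = t.AnvilID
    WHERE a.WidgetID = @WidgetID;

    -- Then the intermediate tables.
    DELETE FROM WidgetHistory WHERE WidgetID = @WidgetID;
    DELETE FROM Anvils        WHERE WidgetID = @WidgetID;

    -- Finally the root row.
    DELETE FROM Widgets WHERE WidgetID = @WidgetID;

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Undo the partial delete so the widget is never left half-removed.
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
    THROW;
END CATCH;
```

The transaction restores the all-or-nothing behaviour that the single cascading DELETE provided implicitly.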

I'm using DBCC DROPCLEANBUFFERS and DBCC FREEPROCCACHE before each test. I've disabled all triggers that might be causing further slowdowns (although those would show up in the execution plan anyway). And I'm testing against older widgets, too, and noticing a significant speedup there as well; deletes that used to take 5 minutes now take 20-40 seconds.
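
A cold-cache timing run along those lines could be sketched as follows. It assumes a test server (both DBCC commands affect the whole instance), adds a CHECKPOINT so that dirty pages are written out before the buffers are dropped, and wraps the delete in a transaction that is rolled back so the same widget can be timed repeatedly. The @WidgetID value is a placeholder.

```sql
DECLARE @WidgetID int = 42;  -- placeholder value for illustration

CHECKPOINT;               -- flush dirty pages so they become clean
DBCC DROPCLEANBUFFERS;    -- empty the buffer pool (cold data cache)
DBCC FREEPROCCACHE;       -- discard cached plans (forces recompilation)

SET STATISTICS TIME ON;
SET STATISTICS IO ON;

BEGIN TRANSACTION;

DELETE FROM Widgets
WHERE WidgetID = @WidgetID;

ROLLBACK TRANSACTION;     -- keep the data for the next run

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```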

Now I'm an ardent supporter of the "SELECT ain't broken" philosophy, but there just doesn't seem to be any logical explanation for this behaviour other than crushing, mind-boggling inefficiency of the CASCADE DELETE relationships.

So, my questions are:

Is this a known issue with DRI in SQL Server? (I couldn't seem to find any references to this sort of thing on Google or here in SO; I suspect the answer is no.)

If not, is there another explanation for the behaviour I'm seeing?

If it is a known issue, why is it an issue, and are there better workarounds I could be using?

1 Answer

#1

SQL Server is best at set-based operations, while CASCADE deletions are, by their nature, record-based.

Unlike other servers, SQL Server tries to optimize the immediate cascade into a set-based operation, but this works only one level deep: it needs the records in the upper-level table to be deleted before it can delete those in the lower-level tables.

In other words, cascading operations work top-down, while your solution works bottom-up, which is more set-based and more efficient.

Here's a sample schema:

CREATE TABLE t_g (id INT NOT NULL PRIMARY KEY)

CREATE TABLE t_p (id INT NOT NULL PRIMARY KEY, g INT NOT NULL, CONSTRAINT fk_p_g FOREIGN KEY (g) REFERENCES t_g ON DELETE CASCADE)

CREATE TABLE t_c (id INT NOT NULL PRIMARY KEY, p INT NOT NULL, CONSTRAINT fk_c_p FOREIGN KEY (p) REFERENCES t_p ON DELETE CASCADE)

CREATE INDEX ix_p_g ON t_p (g)

CREATE INDEX ix_c_p ON t_c (p)

this query:

DELETE
FROM    t_g
WHERE   id > 50000

and its plan:

  |--Sequence
       |--Table Spool
       |    |--Clustered Index Delete(OBJECT:([test].[dbo].[t_g].[PK__t_g__176E4C6B]), WHERE:([test].[dbo].[t_g].[id] > (50000)))
       |--Index Delete(OBJECT:([test].[dbo].[t_p].[ix_p_g]) WITH ORDERED PREFETCH)
       |    |--Sort(ORDER BY:([test].[dbo].[t_p].[g] ASC, [test].[dbo].[t_p].[id] ASC))
       |         |--Table Spool
       |              |--Clustered Index Delete(OBJECT:([test].[dbo].[t_p].[PK__t_p__195694DD]) WITH ORDERED PREFETCH)
       |                   |--Sort(ORDER BY:([test].[dbo].[t_p].[id] ASC))
       |                        |--Merge Join(Inner Join, MERGE:([test].[dbo].[t_g].[id])=([test].[dbo].[t_p].[g]), RESIDUAL:([test].[dbo].[t_p].[g]=[test].[dbo].[t_g].[id]))
       |                             |--Table Spool
       |                             |--Index Scan(OBJECT:([test].[dbo].[t_p].[ix_p_g]), ORDERED FORWARD)
       |--Index Delete(OBJECT:([test].[dbo].[t_c].[ix_c_p]) WITH ORDERED PREFETCH)
            |--Sort(ORDER BY:([test].[dbo].[t_c].[p] ASC, [test].[dbo].[t_c].[id] ASC))
                 |--Clustered Index Delete(OBJECT:([test].[dbo].[t_c].[PK__t_c__1C330188]) WITH ORDERED PREFETCH)
                      |--Table Spool
                           |--Sort(ORDER BY:([test].[dbo].[t_c].[id] ASC))
                                |--Hash Match(Inner Join, HASH:([test].[dbo].[t_p].[id])=([test].[dbo].[t_c].[p]))
                                     |--Table Spool
                                     |--Index Scan(OBJECT:([test].[dbo].[t_c].[ix_c_p]), ORDERED FORWARD)

First, SQL Server deletes records from t_g. It then joins the deleted records with t_p and deletes the matches from t_p. Finally, it joins the records deleted from t_p with t_c and deletes the matches from t_c.

A single three-table join would be much more efficient in this case, and this is what you do with your workaround.
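
On the sample schema, the bottom-up equivalent could be sketched like this. It is an illustration rather than a measured result, and it assumes the two constraints have been recreated without ON DELETE CASCADE (with the cascades still in place it also runs, but each delete still carries the cascade branches in its plan).

```sql
-- Lowest level first: find the t_c rows through one three-table join.
-- The join to t_g is spelled out for clarity; because p.g = g.id,
-- filtering on p.g > 50000 alone would select the same rows.
DELETE c
FROM t_c AS c
INNER JOIN t_p AS p ON p.id = c.p
INNER JOIN t_g AS g ON g.id = p.g
WHERE g.id > 50000;

-- Middle level: a plain range delete on the indexed referencing column.
DELETE FROM t_p
WHERE g > 50000;

-- Top level last, now that nothing references these rows.
DELETE FROM t_g
WHERE id > 50000;
```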

If it makes you feel better, Oracle does not optimize cascade operations in any way: they are always NESTED LOOPS, and God help you if you forgot to create an index on the referencing column.
