Which of the following would be more efficient?
I've always been a bit cautious about using IN because I believe SQL Server turns the result set into a big IF statement. For a large result set, this could result in poor performance. For small result sets, I'm not sure either is preferable. For large result sets, wouldn't EXISTS be more efficient?
WHERE EXISTS (SELECT * FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)
vs.
WHERE bx.BoxID IN (SELECT BoxID FROM Base WHERE [Rank] = 2)
8 Answers
#1
120
EXISTS will be faster because once the engine has found a hit, it will quit looking, as the condition has been proved true.
With IN, it will collect all the results from the sub-query before further processing.
#2
33
I've done some testing on SQL Server 2005 and 2008, and on both, the EXISTS and the IN come back with the exact same actual execution plan, as others have stated. The Optimizer is optimal. :)
Something to be aware of, though: EXISTS, IN, and JOIN can sometimes return different results if you don't phrase your query just right: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
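Here is a minimal sketch of how the three forms can diverge. It uses Python's built-in sqlite3 module rather than SQL Server, purely so it can be run anywhere; the Box table and the sample rows are made up for illustration, while Base, BoxID and [Rank] come from the question. The two effects it shows (a JOIN repeating outer rows, and NOT IN returning nothing once the subquery yields a NULL) follow from standard SQL semantics, so they should carry over, but check on your own server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Box  (BoxID INTEGER);                  -- hypothetical outer table
    CREATE TABLE Base (BoxID INTEGER, [Rank] INTEGER);
    INSERT INTO Box  VALUES (1), (2), (3);
    -- BoxID 1 matches twice, and one Base row has a NULL BoxID.
    INSERT INTO Base VALUES (1, 2), (1, 2), (NULL, 2);
""")


def ids(sql):
    """Run a query and return the first column of every row as a list."""
    return [row[0] for row in conn.execute(sql)]


in_rows = ids("""
    SELECT bx.BoxID FROM Box bx
    WHERE bx.BoxID IN (SELECT BoxID FROM Base WHERE [Rank] = 2)
    ORDER BY bx.BoxID""")

exists_rows = ids("""
    SELECT bx.BoxID FROM Box bx
    WHERE EXISTS (SELECT * FROM Base
                  WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)
    ORDER BY bx.BoxID""")

# A plain JOIN returns one row per match, so BoxID 1 shows up twice.
join_rows = ids("""
    SELECT bx.BoxID FROM Box bx
    JOIN Base ON bx.BoxID = Base.BoxID
    WHERE Base.[Rank] = 2
    ORDER BY bx.BoxID""")

# NOT IN against a set containing NULL is never true, so no rows come back.
not_in_rows = ids("""
    SELECT bx.BoxID FROM Box bx
    WHERE bx.BoxID NOT IN (SELECT BoxID FROM Base WHERE [Rank] = 2)
    ORDER BY bx.BoxID""")

not_exists_rows = ids("""
    SELECT bx.BoxID FROM Box bx
    WHERE NOT EXISTS (SELECT * FROM Base
                      WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)
    ORDER BY bx.BoxID""")

print(in_rows, exists_rows)          # [1] [1]
print(join_rows)                     # [1, 1]
print(not_in_rows, not_exists_rows)  # [] [2, 3]
```

So the positive forms of IN and EXISTS agree here; it is the JOIN and the negated forms where the phrasing matters.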
#3
31
The accepted answer is shortsighted and the question a bit loose in that:
1) Neither explicitly mentions whether a covering index is present on the left, right, or both sides.
2) Neither takes into account the size of the left-side input set and the right-side input set. (The question just mentions an overall large result set.)
I believe the optimizer is smart enough to convert between "in" and "exists" when there is a significant cost difference due to (1) and (2); otherwise the form you write may just be used as a hint (e.g. exists to encourage use of a seekable index on the right side).
Both forms can be converted to join forms internally, have the join order reversed, and run as loop, hash or merge--based on the estimated row counts (left and right) and index existence in left, right, or both sides.
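To make the "converted to join forms" point concrete, here is a rough sketch of the logical rewrite: IN, EXISTS, and a join against the de-duplicated key set all select the same outer rows. It runs on SQLite through Python's sqlite3 module, not on SQL Server, and says nothing about which physical join (loop, hash, merge) any optimizer would pick; the Box table and the data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Box  (BoxID INTEGER PRIMARY KEY);      -- hypothetical outer table
    CREATE TABLE Base (BoxID INTEGER, [Rank] INTEGER);
    INSERT INTO Box  VALUES (1), (2), (3), (4);
    INSERT INTO Base VALUES (1, 2), (1, 2), (2, 1), (3, 2);
""")

queries = {
    "in": """
        SELECT bx.BoxID FROM Box bx
        WHERE bx.BoxID IN (SELECT BoxID FROM Base WHERE [Rank] = 2)""",
    "exists": """
        SELECT bx.BoxID FROM Box bx
        WHERE EXISTS (SELECT * FROM Base
                      WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)""",
    # Join form: de-duplicate the inner keys first so each outer row
    # appears at most once, which is what a semi-join guarantees.
    "join": """
        SELECT bx.BoxID FROM Box bx
        JOIN (SELECT DISTINCT BoxID FROM Base WHERE [Rank] = 2) AS b
          ON b.BoxID = bx.BoxID""",
}

results = {
    name: sorted(row[0] for row in conn.execute(sql))
    for name, sql in queries.items()
}

for name, rows in results.items():
    print(name, rows)  # each form prints [1, 3]
```

Because the three are logically interchangeable, an optimizer is free to cost them the same way; whether it actually does is something only the execution plan on your data will tell you.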
#4
3
I'd go with EXISTS over IN; see the link below:
SQL Server: JOIN vs IN vs EXISTS - the logical difference
#5
3
The execution plans are typically going to be identical in these cases, but until you see how the optimizer factors in all the other aspects of indexes etc., you really will never know.
#6
2
So, IN is not the same as EXISTS, nor will it produce the same execution plan.
Usually EXISTS is used in a correlated subquery, which means you will JOIN the EXISTS inner query with your outer query. That adds more steps to produce a result, as you need to resolve the outer query joins and the inner query joins, then match their WHERE clauses to join both.
Usually IN is used without correlating the inner query with the outer query, and that can be solved in only one step (in the best case scenario).
Consider this:
- If you use IN and the inner query result is millions of rows of distinct values, it will probably perform slower than EXISTS, provided that the EXISTS query is performant (has the right indexes to join with the outer query).
- If you use EXISTS and the join with your outer query is complex (takes more time to perform, no suitable indexes), it will slow the query in proportion to the number of rows in the outer table; sometimes the estimated time to complete can be in days. If the number of rows is acceptable for your given hardware, or the cardinality of the data is right (for example, fewer DISTINCT values in a large data set), IN can perform faster than EXISTS.
- All of the above becomes noticeable when you have a fair number of rows in each table (by fair I mean something that exceeds your CPU processing and/or RAM thresholds for caching).
So the ANSWER is: it DEPENDS. You can write a complex query inside IN or EXISTS, but as a rule of thumb, you should try to use IN with a limited set of distinct values, and EXISTS when you have a lot of rows with a lot of distinct values.
The trick is to limit the number of rows to be scanned.
Regards,
MarianoC
#7
1
To optimize the EXISTS, be very literal; something just has to be there, but you don't actually need any data returned from the correlated sub-query. You're just evaluating a Boolean condition.
So:
WHERE EXISTS (SELECT TOP 1 1 FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)
Because the correlated sub-query is RBAR (row by agonizing row), the first result hit makes the condition true, and it is processed no further.
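A quick sketch of the "you're just evaluating a Boolean condition" point: the select list inside EXISTS does not change which outer rows qualify. This runs on SQLite via Python's sqlite3 module, which has no TOP clause, so the variants compared are SELECT *, SELECT 1 and SELECT NULL; the Box table and the rows are made up for illustration. It shows only that the results match, not anything about relative speed on SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Box  (BoxID INTEGER);                  -- hypothetical outer table
    CREATE TABLE Base (BoxID INTEGER, [Rank] INTEGER);
    INSERT INTO Box  VALUES (1), (2), (3);
    INSERT INTO Base VALUES (1, 2), (1, 2), (2, 1);
""")

# Only the existence of a matching row matters, so every select list
# below should give the same outer rows.
select_lists = ["*", "1", "NULL"]

results = []
for cols in select_lists:
    sql = f"""
        SELECT bx.BoxID FROM Box bx
        WHERE EXISTS (SELECT {cols} FROM Base
                      WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)
        ORDER BY bx.BoxID"""
    results.append([row[0] for row in conn.execute(sql)])

for cols, rows in zip(select_lists, results):
    print(f"EXISTS (SELECT {cols} ...) ->", rows)  # [1] for every variant
```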
#8
-1
Off the top of my head and not guaranteed to be correct: I believe the second will be faster in this case.
- In the first, the correlated subquery will likely cause the subquery to be run for each row.
- In the second example, the subquery should only run once, since it is not correlated.
- In the second example, the IN will short-circuit as soon as it finds a match.