Why is SELECT COUNT(*), MIN(col), MAX(col) faster than SELECT MIN(col), MAX(col) in SQL?

Date: 2022-12-31 22:47:28

We're seeing a huge difference between these queries.

The slow query

SELECT MIN(col) AS Firstdate, MAX(col) AS Lastdate 
FROM table WHERE status = 'OK' AND fk = 4193

Table 'table'. Scan count 2, logical reads 2458969, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 1966 ms, elapsed time = 1955 ms.

The fast query

SELECT count(*), MIN(col) AS Firstdate, MAX(col) AS Lastdate 
FROM table WHERE status = 'OK' AND fk = 4193

Table 'table'. Scan count 1, logical reads 5803, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 9 ms.

Question

What is the reason for the huge performance difference between the queries?

Update: a small update based on questions asked in the comments:

The order of execution or repeated execution changes nothing performance-wise. No extra parameters are used, and the (test) database is not doing anything else during execution.

Slow query

|--Nested Loops(Inner Join)
   |--Stream Aggregate(DEFINE:([Expr1003]=MIN([DBTest].[dbo].[table].[startdate])))
   |    |--Top(TOP EXPRESSION:((1)))
   |         |--Nested Loops(Inner Join, OUTER REFERENCES:([DBTest].[dbo].[table].[id], [Expr1008]) WITH ORDERED PREFETCH)
   |              |--Index Scan(OBJECT:([DBTest].[dbo].[table].[startdate]), ORDERED FORWARD)
   |              |--Clustered Index Seek(OBJECT:([DBTest].[dbo].[table].[PK_table]), SEEK:([DBTest].[dbo].[table].[id]=[DBTest].[dbo].[table].[id]),  WHERE:([DBTest].[dbo].[table].[FK]=(5806) AND [DBTest].[dbo].[table].[status]<>'A') LOOKUP ORDERED FORWARD)
   |--Stream Aggregate(DEFINE:([Expr1004]=MAX([DBTest].[dbo].[table].[startdate])))
        |--Top(TOP EXPRESSION:((1)))
             |--Nested Loops(Inner Join, OUTER REFERENCES:([DBTest].[dbo].[table].[id], [Expr1009]) WITH ORDERED PREFETCH)
                  |--Index Scan(OBJECT:([DBTest].[dbo].[table].[startdate]), ORDERED BACKWARD)
                  |--Clustered Index Seek(OBJECT:([DBTest].[dbo].[table].[PK_table]), SEEK:([DBTest].[dbo].[table].[id]=[DBTest].[dbo].[table].[id]),  WHERE:([DBTest].[dbo].[table].[FK]=(5806) AND [DBTest].[dbo].[table].[status]<>'A') LOOKUP ORDERED FORWARD)

Fast query

 |--Compute Scalar(DEFINE:([Expr1003]=CONVERT_IMPLICIT(int,[Expr1012],0)))
   |--Stream Aggregate(DEFINE:([Expr1012]=Count(*), [Expr1004]=MIN([DBTest].[dbo].[table].[startdate]), [Expr1005]=MAX([DBTest].[dbo].[table].[startdate])))
        |--Nested Loops(Inner Join, OUTER REFERENCES:([DBTest].[dbo].[table].[id], [Expr1011]) WITH UNORDERED PREFETCH)
             |--Index Seek(OBJECT:([DBTest].[dbo].[table].[FK]), SEEK:([DBTest].[dbo].[table].[FK]=(5806)) ORDERED FORWARD)
             |--Clustered Index Seek(OBJECT:([DBTest].[dbo].[table].[PK_table]), SEEK:([DBTest].[dbo].[table].[id]=[DBTest].[dbo].[table].[id]),  WHERE:([DBTest].[dbo].[table].[status]<'A' OR [DBTest].[dbo].[table].[status]>'A') LOOKUP ORDERED FORWARD)

Answer

The answer given below by Martin Smith seems to explain the problem. The very short version is that the SQL Server query optimizer picks a plan for the slow query that ends up scanning almost the whole table.

Adding COUNT(*), using the hint WITH (FORCESCAN), or creating a combined index on the startdate, FK and status columns fixes the performance issue.

1 answer

#1


The SQL Server cardinality estimator makes various modelling assumptions such as

  • Independence: Data distributions on different columns are independent unless correlation information is available.
  • Uniformity: Within each statistics object histogram step, distinct values are evenly spread and each value has the same frequency.

Source

There are 810,064 rows in the table.

You have the query

SELECT COUNT(*),
       MIN(startdate) AS Firstdate,
       MAX(startdate) AS Lastdate
FROM   table
WHERE  status <> 'A'
       AND fk = 4193 

1,893 (0.23%) rows meet the fk = 4193 predicate, and of those, two fail the status <> 'A' part, so overall 1,891 match and need to be aggregated.

You also have two indexes, neither of which covers the whole query.

The fast query uses an index on fk to find the rows where fk = 4193 directly. It then needs 1,893 key lookups into the clustered index, one per row, to check the status predicate and retrieve the startdate for aggregation.

When you remove the COUNT(*) from the SELECT list, SQL Server no longer has to process every qualifying row. As a result, it considers another option.

You have an index on startdate, so it could scan that from the beginning, doing key lookups back to the base table, and stop as soon as it finds the first matching row, because that row holds the MIN(startdate). Similarly, the MAX can be found with another scan that starts at the other end of the index and works backwards.
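
The early-stopping idea can be sketched by writing each half of the slow plan as an explicit TOP (1) query. This is only an illustration, assuming the table and column names that appear in the plan text (dbo.[table], startdate, fk, status). It is logically similar to the MIN/MAX query when matching rows exist, but it is not necessarily what the optimizer generates internally.

```sql
-- Sketch only: table and column names are taken from the plan text
-- and are assumptions about the real schema.

-- Counterpart of MIN(startdate): walk startdate ascending, stop at the first match.
SELECT TOP (1) startdate AS Firstdate
FROM   dbo.[table]
WHERE  status <> 'A'
       AND fk = 4193
       AND startdate IS NOT NULL   -- MIN ignores NULLs, so exclude them here too
ORDER  BY startdate ASC;

-- Counterpart of MAX(startdate): walk startdate descending, stop at the first match.
SELECT TOP (1) startdate AS Lastdate
FROM   dbo.[table]
WHERE  status <> 'A'
       AND fk = 4193
       AND startdate IS NOT NULL
ORDER  BY startdate DESC;
```

How early either query can stop depends entirely on where the matching rows sit in startdate order.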

SQL Server estimates that each of these scans will process 590 rows before hitting one that matches the predicate. That gives 1,180 lookups in total versus 1,893, so it chooses this plan.

The 590 figure is just table_size / estimated_number_of_rows_that_match, i.e. the cardinality estimator assumes that the matching rows are evenly distributed throughout the table.
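
The comparison can be worked through with the figures quoted in this answer. This is plain arithmetic rather than anything read from the optimizer. The implied row estimate is derived here by inverting the formula and is not stated in the answer itself, and the column aliases are made up for illustration.

```sql
-- Figures quoted in the answer; aliases are illustrative only.
SELECT 810064 / 590.0 AS implied_estimated_matching_rows,  -- roughly 1,373
       2 * 590        AS estimated_lookups_two_scans,      -- 1,180
       1893           AS lookups_via_fk_index;             -- one per row with fk = 4193
```

On these estimates the two ordered scans look cheaper, which is consistent with the plan choice described.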

Unfortunately, the 1,891 rows that meet the predicate are not randomly distributed with respect to startdate. In fact, they are all condensed into a single 8,205-row segment towards the end of the index, which means the scan for the MIN(startdate) ends up doing 801,859 key lookups before it can stop.
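
One way to check how far the ascending scan has to travel is to count the rows that sort before the first match. A hedged sketch, again assuming the names from the plan text (dbo.[table], startdate, fk, status) and a datetime startdate column; the variable names are hypothetical.

```sql
DECLARE @first datetime, @matches int;

-- Including COUNT(*) means every qualifying row has to be read,
-- as in the fast query from the question.
SELECT @matches = COUNT(*),
       @first   = MIN(startdate)
FROM   dbo.[table]
WHERE  status <> 'A'
       AND fk = 4193;

-- Compare the position of the first match with the size of the table.
SELECT @matches AS matching_rows,
       (SELECT COUNT(*) FROM dbo.[table] WHERE startdate < @first) AS rows_sorted_before_first_match,
       (SELECT COUNT(*) FROM dbo.[table]) AS total_rows;
```

Rows that share the first match's startdate are not counted, so the middle figure is a lower bound on the rows the scan visits.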

This can be reproduced with the script below.

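-- Repro table: id is the clustered primary key; Filler pads each row.
CREATE TABLE T
(
id int identity(1,1) primary key,
startdate datetime,
fk int,
[status] char(1),
Filler char(2000)
);

-- The only nonclustered index is on startdate.
CREATE NONCLUSTERED INDEX ix ON T(startdate);

-- Load 810,064 rows, all with fk = 4192 and yesterday's date.
INSERT INTO T (startdate, fk, [status], Filler)
SELECT TOP 810064 Getdate() - 1,
                  4192,
                  'B',
                  ''
FROM   sys.all_columns c1,
       sys.all_columns c2;

-- Move 1,891 rows to fk = 4193 with a later startdate.
UPDATE T
SET fk = 4193, startdate = GETDATE()
WHERE id BETWEEN 801859 and 803748 or id = 810064;

-- With ids running from 1 to 810064, this statement is expected to change no rows.
UPDATE T
SET  startdate = GETDATE() + 1
WHERE id > 810064;


/*Both queries give the same plan.
UPDATE STATISTICS T WITH FULLSCAN
makes no difference*/

SELECT MIN(startdate) AS Firstdate,
       MAX(startdate) AS Lastdate
FROM T
WHERE status <> 'A' AND fk = 4192;


SELECT MIN(startdate) AS Firstdate,
       MAX(startdate) AS Lastdate
FROM T
WHERE status <> 'A' AND fk = 4193;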
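
To compare the two query shapes on the repro table, I/O and timing figures like those in the question can be captured with SET STATISTICS. A minimal sketch, assuming the table T from the script; the alias matching_rows is made up, and the actual numbers will vary by machine and SQL Server version.

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Without COUNT(*): eligible for the two ordered scans of the startdate index.
SELECT MIN(startdate) AS Firstdate,
       MAX(startdate) AS Lastdate
FROM   T
WHERE  status <> 'A' AND fk = 4193;

-- With COUNT(*): every qualifying row has to be read.
SELECT COUNT(*)       AS matching_rows,
       MIN(startdate) AS Firstdate,
       MAX(startdate) AS Lastdate
FROM   T
WHERE  status <> 'A' AND fk = 4193;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

The logical reads reported for each statement are the figure to compare.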

To avoid this issue, you could consider using query hints to force the plan to use the index on fk rather than the one on startdate, or add the missing index suggested in the execution plan on (fk, status) INCLUDE (startdate).
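
The two suggestions could look roughly like this on the repro table T. The index name ix_fk_status is made up for illustration; the key and included columns follow the missing-index suggestion quoted here.

```sql
-- Option 1: a covering index, so no key lookups are needed.
-- ix_fk_status is a hypothetical name.
CREATE NONCLUSTERED INDEX ix_fk_status
    ON T (fk, [status])
    INCLUDE (startdate);

-- Option 2: a table hint naming the index to use instead of the one on startdate.
-- Shown with the hypothetical index from option 1; on the question's table
-- it would name the existing index on fk.
SELECT MIN(startdate) AS Firstdate,
       MAX(startdate) AS Lastdate
FROM   T WITH (INDEX(ix_fk_status))
WHERE  status <> 'A' AND fk = 4193;
```

An index hint overrides the optimizer's choice for that query, so it is probably worth checking whether the index alone changes the plan before adding the hint.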
