SQL MAX() function with a WHERE clause and GROUP BY does not use the index efficiently

I have a table MYTABLE that has approximately 25 columns, with two of them being USERID (integer) and USERDATETIME (dateTime).

I have an index over this table on these two columns, with USERID being the first column followed by USERDATETIME.

I would like to get the maximum USERDATETIME for each USERID. So:

select USERID,MAX(USERDATETIME) 
from MYTABLE WHERE USERDATETIME < '2015-10-11'
GROUP BY USERID

I would have expected the optimizer to be able to find each unique USERID and its maximum USERDATETIME with a number of seeks equal to the number of unique USERIDs, and I would expect this to be reasonably fast. I have 2000 userids and 6 million rows in MYTABLE. However, the actual plan shows 6 million rows coming out of an index scan. If I use an index on USERDATETIME/USERID instead, the plan changes to an index seek, but it still reads 6 million rows.

Why does SQL not use the index in a way that would reduce the number of rows processed?

2 Answers

#1

If you are using SQL Server this is not an optimisation generally carried out by the product (except in limited cases where the table is partitioned by that value).

However you can do it manually using the technique from here

CREATE TABLE YourTable
  (
     USERID       INT,
     USERDATETIME DATETIME,
     OtherColumns CHAR(10)
  )

CREATE CLUSTERED INDEX IX
  ON YourTable(USERID ASC, USERDATETIME ASC);

WITH R
     AS (SELECT TOP 1 USERID,
                      USERDATETIME
         FROM   YourTable
         ORDER  BY USERID DESC,
                   USERDATETIME DESC
         UNION ALL
         SELECT SubQuery.USERID,
                SubQuery.USERDATETIME
         FROM   (SELECT T.USERID,
                        T.USERDATETIME,
                        rn = ROW_NUMBER()
                               OVER (
                                 ORDER BY T.USERID DESC, T.USERDATETIME DESC)
                 FROM   R
                        JOIN YourTable T
                          ON T.USERID < R.USERID) AS SubQuery
         WHERE  SubQuery.rn = 1)
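SELECT *
FROM   R
OPTION (MAXRECURSION 0) -- one recursion level per USERID; the default limit of 100 is too low for ~2000 users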

If you have another table with the UserIds it is possible to get an efficient plan more easily with

SELECT U.USERID,
       CA.USERDATETIME
FROM   Users U
       CROSS APPLY (SELECT TOP 1 USERDATETIME
                    FROM   YourTable Y
                    WHERE  Y.USERID = U.USERID
                    ORDER  BY USERDATETIME DESC) CA
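The query above leaves out the date filter from the question. A minimal sketch of the same approach with the filter added might look like the following; it assumes the same `Users` table (one row per USERID) and the `YourTable` index on (USERID, USERDATETIME) defined earlier, and the `@Cutoff` variable is only an illustrative name.

```sql
-- Assumption: Users holds one row per USERID, as in the query above.
-- @Cutoff is a hypothetical variable standing in for the question's literal date.
DECLARE @Cutoff DATETIME = '20151011';

SELECT U.USERID,
       CA.USERDATETIME
FROM   Users U
       CROSS APPLY (SELECT TOP (1) Y.USERDATETIME
                    FROM   YourTable Y
                    WHERE  Y.USERID = U.USERID
                           AND Y.USERDATETIME < @Cutoff -- filter from the question
                    ORDER  BY Y.USERDATETIME DESC) CA;
```

Because both predicates line up with the index key order, each user should be resolvable with a single seek that reads one row. As with the GROUP BY version, CROSS APPLY returns no row for a user who has nothing before the cutoff.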

#2

The WHERE clause is the limiting factor on your query using the index.

With a standard SQL Server query, indexes are used either to select records quickly (which this index would allow) or to limit the records returned (which this index would not allow). So why won't this index allow for quick limitation?

When the query optimizer considers optimizations based on a WHERE clause, it looks for an index that either starts with the item(s) in the WHERE clause, or one that can be used to efficiently identify the records that are allowed (or not allowed) to be in the result set.

With this index, the server first can find the distinct userIDs involved. It then would want to limit the rows considered based on the WHERE clause. However, to do this, the optimizer will likely estimate that it will have to conduct the equivalent of a full index or table scan AFTER locating the userIDs.

An alternate strategy that might be possible is to scan the index, identifying userIDs and dates together. This is what the optimizer chose.

One possible solution to that is a different index - one by date, then userID - in addition to the one being used. This would limit the number of records being scanned to identify userID maximums, and thus be a bit faster.
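As a rough sketch, the additional index suggested here could be created as follows. The index name is made up for illustration; the table and column names come from the question.

```sql
-- Hypothetical index name; date first, then user ID, kept alongside the existing index.
CREATE NONCLUSTERED INDEX IX_MYTABLE_USERDATETIME_USERID
    ON MYTABLE (USERDATETIME, USERID);
```

Note that the question reports already trying an index in this column order: the plan switched to a seek but still read 6 million rows, presumably because nearly all rows fall before the cutoff date. The benefit depends on how selective the date filter is.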

Note that your index would be fast if you did not need the WHERE clause. But the WHERE clause requires the optimizer to consider the use case where the WHERE clause limits the items selected to the last row considered.

Also, an index in which the date column is in descending order might be more efficient as well.
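A sketch of such an index, keeping USERID first and storing the dates newest-first within each user; the index name is hypothetical:

```sql
-- Hypothetical index name; newest USERDATETIME comes first within each USERID.
CREATE NONCLUSTERED INDEX IX_MYTABLE_USERID_USERDATETIME_DESC
    ON MYTABLE (USERID ASC, USERDATETIME DESC);
```

SQL Server can also read an ascending index backward, so any gain from the descending key may be small and is worth checking against the actual plan.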
