为什么这个索引不能提高查询性能

时间:2021-12-21 22:50:21

Platform: SQL Server 2012

平台:SQL Server 2012

Background: I have two fairly large log tables - around 600k records each that are being joined using Pk/Fk. For the sake of argument, lets call them ReallyBigLog1 and ReallyBigLog2. The query (below) takes about 3.5-sec to run. The WHERE clause includes three different values. When asked to help improve this query, I immediately noticed that the items in the WHERE clause were not indexed. I smugly suggested adding indexes - assuming the increased performance would make me look like a hero. However, the additional index had no measurable affect.

背景:我有两个相当大的日志表——大约有600k条记录,每个记录都使用Pk/Fk连接。为了讨论起见,我们把它们叫做ReallyBigLog1和ReallyBigLog2。查询(如下)运行大约需要3.5秒。WHERE子句包含三个不同的值。当被要求帮助改进这个查询时,我立即注意到WHERE子句中的项没有被索引。我沾沾自喜地建议增加索引——假设增加的性能会让我看起来像个英雄。然而,附加指数没有可衡量的影响。

Question: Given the query below, why does indexing StartTime, EndTime, and DateStamp have no measurable affect on query time?

问:给定下面的查询,为什么索引开始时间、结束时间和DateStamp对查询时间没有可测量的影响?

Query

查询

SELECT 

    IrreleventField1,
    IrreleventField2,
    IrreleventField3....

    FROM  [dbo].[ReallyBigLog1] AS [T1]

    INNER JOIN [dbo].[ReallyBigLog2] AS [T2] ON [T1].[Id] = [T2].[Id]

    WHERE ([T1].[EndTime] IS NOT NULL) AND ([T1].[StartTime] IS NOT NULL) AND ([T2].[DateStamp] >= '2017-5-16 00:00:00')

Indexes

索引

CREATE NONCLUSTERED INDEX [ix_RecommendedIndex]
ON [dbo].[ReallyBigLog1]
([StartTime] , [EndTime])

CREATE NONCLUSTERED INDEX [IX_DateStamp]
ON [dbo].[ReallyBigLog2]
([DateStamp])

Execution Plan

执行计划

5 SELECT            
    4 Compute Scalar        
        3 Merge Join  / Inner Join Merge:([dbo].[ReallyBigLog1].[Id] [T2]=[dbo].[ReallyBigLog1].[Id] [T1]), Residual:([dbo].[ReallyBigLog2].[Id] as [T2].[Id]=[dbo].[ReallyBigLog1].[Id] as [T1].[Id])  
            1 Clustered Index Scan Predicate:([dbo].[ReallyBigLog1].[StartTime] as [T1].[StartTime] IS NOT NULL AND [dbo].[ReallyBigLog1].[EndTime] as [T1].[EndTime] IS NOT NULL), ORDERED FORWARD [dbo].[ReallyBigLog1].[PK_dbo.ReallyBigLog1] [T1]
            2 Clustered Index Scan Predicate:([dbo].[ReallyBigLog2].[DateStamp] as [T2].[DateStamp]>='2017-05-16 00:00:00.000'), ORDERED FORWARD [dbo].[ReallyBigLog2].[PK_dbo.ReallyBigLog2] [T2]

EDIT (Tables Composition)

编辑(表结构)

SELECT
  (SELECT COUNT(*) FROM ReallyBigLog1 WHERE StartTime IS NULL) as NullStartTime,
  (SELECT COUNT(*) FROM ReallyBigLog1 WHERE EndTime IS NULL) as NullEndTime,
  (SELECT COUNT(*) FROM ReallyBigLog1) as Log1Count,
  (SELECT COUNT(*) FROM ReallyBigLog2 WHERE DateStamp > '2017-5-16 00:00:00') AS DateStampUsage,
  (SELECT COUNT(*) FROM ReallyBigLog2) AS Log2Count

DateStampUsage  Log2Count   NullStartTime   NullEndTime  Log1Count
443038          651929      33748           34144        509545

3 个解决方案

#1


1  

Since you're fetching most of the rows in the tables, the indexes have to be covering (=contain every column you need in your query from that table) to help you at all -- and that improvement might not be much.

由于您正在获取表中的大多数行,因此索引必须覆盖(=包含查询中来自该表的每一列),以帮助您完成所有工作——这种改进可能不会有多大效果。

The reason the indexes don't really help is that you're reading most of the rows, and you have IrreleventFields in your query. Since the index contains only the index key + clustered key, the rest of the fields must be fetched from the table (=clustered index) using the clustered index key. That's called key lookup and can be very costly, because it has to be done for every single row found from the index that matches your search criteria.

索引没有真正帮助的原因是您正在读取大多数行,并且在查询中有不相关的字段。由于索引只包含索引键+集群键,所以必须使用聚集索引键从表(=聚集索引)中获取其余字段。这被称为键查找,而且代价非常高,因为必须对从匹配搜索条件的索引中找到的每一行执行此操作。

For the index being covered, you can add the "irrelevant" fields into include part of the index, if you want to try if it improves the situation.

对于被覆盖的索引,可以将“无关”字段添加到包含索引的部分,如果您想尝试它是否改善了情况。

#2


2  

ix_RecommendedIndex will be of very poor help, unless you have a lots of nulls.

ix_推荐者索引将非常没有帮助,除非您有大量的null。

Here, the indexes which really matters are Ids and IX_DateStamp. Since you seem to have a lots of matching data in the WHERE clause, the optimiser prefers a clustered table scan (to merge the Ids).

这里,真正重要的索引是id和IX_DateStamp。由于WHERE子句中似乎有很多匹配的数据,所以optimiser更喜欢集群表扫描(以合并id)。

One possibility to make it faster would be a CLUSTERED index on IX_DateStamp, but it will have performance side effects for other queries, and should be stressed on a test environment first.

加快速度的一种可能是使用IX_DateStamp的集群索引,但是它会对其他查询产生性能副作用,并且应该首先在测试环境中加以强调。

If you can provide the EXPLAIN with statistics, it may help for a better diagnostic.

如果你能提供统计的解释,它可能有助于更好的诊断。

edit: With the statistics provided, I don't see how you can make it faster just with indexes. There are way too many data to parse (more than half of the two tables). You are hitting the point where you may need to consolidate your data appart, in another table, or optimize the data at the binary level (smaller record size for faster scans).

编辑:有了提供的统计数据,我不知道如何使用索引使它更快。有太多的数据要解析(两个表的一半以上)。您可能需要在另一个表中合并数据分配,或者在二进制级别上优化数据(更小的记录大小以实现更快的扫描)。

#3


1  

Having an index on the date and time alone is not going to help as much. You should have an index that covers conditions to your joins as well.. Such as the ID columns. Since your query is primarily quantifying on the time-stamp of the T2 alias, I would offer the following indexes

仅仅有一个日期和时间的索引并不能起到多大的作用。您还应该有一个包含连接条件的索引。例如ID列。由于您的查询主要是对T2别名的时间戳进行量化,所以我将提供以下索引

table           index
ReallyBigLog2   (DateStamp, ID )
ReallyBigLog1   (id, endTime, StartTime )

And here is why. You are specifically looking for transactions in T2 > a given date. So the really big log 2 STARTS with that as the basis. Then ALSO include the "ID" column for the JOIN basis to log table 1. Both parts of the index here are covered and do not require going to the data pages for comparison to get the fields yet.

这是为什么。您正在寻找特定日期下T2 >中的事务。log 2以它为基底。然后还将“ID”列包含在连接基础上,以记录表1。这里介绍了索引的两个部分,并且还不需要到数据页进行比较以获得字段。

Now, the columns index for T1. Start with the ID as an immediate found or not to the T2 table. Having the endTime, StartTime as part of the index, again, it does not have to go to the raw data pages to qualify the WHERE / JOIN criteria.

现在,T1的列指数。从ID开始,作为立即找到的或不存在于T2表。使用endTime、StartTime作为索引的一部分,它也不需要访问原始数据页面来限定WHERE / JOIN标准。

Once that is all done, it has the set of records, goes to the data pages for those and pulls the rest of the details you need.

完成这些之后,它就有了一组记录,进入数据页面获取这些记录,并提取所需的其他细节。

from
   [dbo].[ReallyBigLog2] AS [T2]
      JOIN [dbo].[ReallyBigLog1] AS [T1]
         ON [T1].[Id] = [T2].[Id]
         AND ([T1].[EndTime] IS NOT NULL) 
         AND ([T1].[StartTime] IS NOT NULL) 
where
   [T2].[DateStamp] >= '2017-5-16 00:00:00'

#1


1  

Since you're fetching most of the rows in the tables, the indexes have to be covering (=contain every column you need in your query from that table) to help you at all -- and that improvement might not be much.

由于您正在获取表中的大多数行,因此索引必须覆盖(=包含查询中来自该表的每一列),以帮助您完成所有工作——这种改进可能不会有多大效果。

The reason the indexes don't really help is that you're reading most of the rows, and you have IrreleventFields in your query. Since the index contains only the index key + clustered key, the rest of the fields must be fetched from the table (=clustered index) using the clustered index key. That's called key lookup and can be very costly, because it has to be done for every single row found from the index that matches your search criteria.

索引没有真正帮助的原因是您正在读取大多数行,并且在查询中有不相关的字段。由于索引只包含索引键+集群键,所以必须使用聚集索引键从表(=聚集索引)中获取其余字段。这被称为键查找,而且代价非常高,因为必须对从匹配搜索条件的索引中找到的每一行执行此操作。

For the index being covered, you can add the "irrelevant" fields into include part of the index, if you want to try if it improves the situation.

对于被覆盖的索引,可以将“无关”字段添加到包含索引的部分,如果您想尝试它是否改善了情况。

#2


2  

ix_RecommendedIndex will be of very poor help, unless you have a lots of nulls.

ix_推荐者索引将非常没有帮助,除非您有大量的null。

Here, the indexes which really matters are Ids and IX_DateStamp. Since you seem to have a lots of matching data in the WHERE clause, the optimiser prefers a clustered table scan (to merge the Ids).

这里,真正重要的索引是id和IX_DateStamp。由于WHERE子句中似乎有很多匹配的数据,所以optimiser更喜欢集群表扫描(以合并id)。

One possibility to make it faster would be a CLUSTERED index on IX_DateStamp, but it will have performance side effects for other queries, and should be stressed on a test environment first.

加快速度的一种可能是使用IX_DateStamp的集群索引,但是它会对其他查询产生性能副作用,并且应该首先在测试环境中加以强调。

If you can provide the EXPLAIN with statistics, it may help for a better diagnostic.

如果你能提供统计的解释,它可能有助于更好的诊断。

edit: With the statistics provided, I don't see how you can make it faster just with indexes. There are way too many data to parse (more than half of the two tables). You are hitting the point where you may need to consolidate your data appart, in another table, or optimize the data at the binary level (smaller record size for faster scans).

编辑:有了提供的统计数据,我不知道如何使用索引使它更快。有太多的数据要解析(两个表的一半以上)。您可能需要在另一个表中合并数据分配,或者在二进制级别上优化数据(更小的记录大小以实现更快的扫描)。

#3


1  

Having an index on the date and time alone is not going to help as much. You should have an index that covers conditions to your joins as well.. Such as the ID columns. Since your query is primarily quantifying on the time-stamp of the T2 alias, I would offer the following indexes

仅仅有一个日期和时间的索引并不能起到多大的作用。您还应该有一个包含连接条件的索引。例如ID列。由于您的查询主要是对T2别名的时间戳进行量化,所以我将提供以下索引

table           index
ReallyBigLog2   (DateStamp, ID )
ReallyBigLog1   (id, endTime, StartTime )

And here is why. You are specifically looking for transactions in T2 > a given date. So the really big log 2 STARTS with that as the basis. Then ALSO include the "ID" column for the JOIN basis to log table 1. Both parts of the index here are covered and do not require going to the data pages for comparison to get the fields yet.

这是为什么。您正在寻找特定日期下T2 >中的事务。log 2以它为基底。然后还将“ID”列包含在连接基础上,以记录表1。这里介绍了索引的两个部分,并且还不需要到数据页进行比较以获得字段。

Now, the columns index for T1. Start with the ID as an immediate found or not to the T2 table. Having the endTime, StartTime as part of the index, again, it does not have to go to the raw data pages to qualify the WHERE / JOIN criteria.

现在,T1的列指数。从ID开始,作为立即找到的或不存在于T2表。使用endTime、StartTime作为索引的一部分,它也不需要访问原始数据页面来限定WHERE / JOIN标准。

Once that is all done, it has the set of records, goes to the data pages for those and pulls the rest of the details you need.

完成这些之后,它就有了一组记录,进入数据页面获取这些记录,并提取所需的其他细节。

from
   [dbo].[ReallyBigLog2] AS [T2]
      JOIN [dbo].[ReallyBigLog1] AS [T1]
         ON [T1].[Id] = [T2].[Id]
         AND ([T1].[EndTime] IS NOT NULL) 
         AND ([T1].[StartTime] IS NOT NULL) 
where
   [T2].[DateStamp] >= '2017-5-16 00:00:00'