The role of selectivity in index scan/seek

Time: 2021-01-01 04:17:34

I have read in many SQL books and articles that selectivity is an important factor when creating an index. If a column has low selectivity, an index seek does more harm than good. But none of the articles explain why. Can anybody explain why this is so, or provide a link to a relevant article?

2 answers

#1


7  

From the SimpleTalk article by Robert Sheldon, 14 SQL Server Indexing Questions You Were Too Shy To Ask:

The ratio of unique values within a key column is referred to as index selectivity. The more unique the values, the higher the selectivity, which means that a unique index has the highest possible selectivity. The query engine loves highly selective key columns, especially if those columns are referenced in the WHERE clause of your frequently run queries. The higher the selectivity, the faster the query engine can reduce the size of the result set. The flipside, of course, is that a column with relatively few unique values is seldom a good candidate to be indexed.
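
To make that ratio concrete, here is a minimal Python sketch. It assumes the simplest reading of the quote above (distinct values divided by total rows); the column names and data are invented for illustration, and this is not how SQL Server itself stores or computes its statistics.

```python
def selectivity(values):
    """Return distinct-value count divided by row count (0.0 for no rows)."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Hypothetical column contents, made up for this example.
customer_ids = list(range(1000))         # every value is unique
status_flags = ["open", "closed"] * 500  # only two distinct values in 1000 rows

print(selectivity(customer_ids))  # 1.0
print(selectivity(status_flags))  # 0.002
```

A unique column scores 1.0, the highest possible value, while the two-valued flag column scores close to zero.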

Also check these articles:

  • This post by Pinal Dave
  • Another one on SQL Serverpedia
  • This forum post on SqlServerCentral
  • This article on SqlServerCentral

From the SqlServerCentral article:

In general, a nonclustered index should be selective. That is, the values in the column should be fairly unique and queries that filter on it should return small portions of the table.

The reason for this is that key/RID lookups are expensive operations and if a nonclustered index is to be used to evaluate a query it needs to be covering or sufficiently selective that the costs of the lookups aren’t deemed to be too high.

If SQL considers the index (or the subset of the index keys that the query would be seeking on) insufficiently selective then it is very likely that the index will be ignored and the query executed as a clustered index (table) scan.

It is important to note that this does not just apply to the leading column. There are scenarios where a very unselective column can be used as the leading column, with the other columns in the index making it selective enough to be used.
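
The point about the leading column can be illustrated with a small Python sketch. The table below (a `region` column followed by an `order_no` column) is hypothetical, and the ratio used is the simple distinct-over-total measure, so treat it as arithmetic only, not as a description of the optimizer's internals.

```python
from itertools import product


def selectivity(keys):
    """Distinct keys divided by total keys (0.0 for an empty input)."""
    keys = list(keys)
    return len(set(keys)) / len(keys) if keys else 0.0


# Hypothetical table: 4 regions x 250 order numbers = 1000 rows.
rows = list(product(["N", "S", "E", "W"], range(250)))

leading_only = selectivity(region for region, _ in rows)
full_key = selectivity(rows)

print(leading_only)  # 0.004
print(full_key)      # 1.0
```

On its own the leading column is very unselective, but a query that filters on both key columns narrows down to a single row.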

#2


3  

I'll try to give a very simple explanation (based on my current knowledge of SQL Server):

If an index has low selectivity, it means that a single value matches a large percentage of the total rows (for example, 200 of the 500 rows share the same value in the indexed column).

Usually, if the index does not contain all the columns you need, each index entry holds a pointer to where the corresponding row is physically stored. In a second step, the engine then has to read that row.

So, as you can see, a search like this takes two steps. And this is where selectivity comes in:

The more rows that match because of low selectivity, the more of this double work the engine has to do. Because of this, there are cases where even a table scan is more efficient than an index seek with very low selectivity.
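
Here is a rough Python sketch of that two-step work, using the 200-of-500 example from above. It is a deliberately simplified model: it counts one page read per row lookup, ignores the cost of walking the index itself and any caching, and `ROWS_PER_PAGE` is an assumed value picked only for illustration. None of this is SQL Server's real cost formula.

```python
import math

ROWS_PER_PAGE = 50  # assumed page capacity, chosen only for illustration


def build_table(n_rows, n_matching):
    """Rows are (row_id, value); the first n_matching rows share value 'A'."""
    return [(i, "A" if i < n_matching else f"other-{i}") for i in range(n_rows)]


def build_index(table):
    """Index sketch: value -> list of row ids (the 'pointers')."""
    index = {}
    for row_id, value in table:
        index.setdefault(value, []).append(row_id)
    return index


def seek_with_lookups(index, value):
    """Step 1: find the pointers. Step 2: one page read per pointer."""
    pointers = index.get(value, [])
    return len(pointers), len(pointers)  # (rows found, page reads)


def table_scan(table, value):
    """Read every page once and filter the rows along the way."""
    page_reads = math.ceil(len(table) / ROWS_PER_PAGE)
    matches = sum(1 for _, v in table if v == value)
    return matches, page_reads


table = build_table(500, 200)
index = build_index(table)

print(seek_with_lookups(index, "A"))          # (200, 200)
print(table_scan(table, "A"))                 # (200, 10)
print(seek_with_lookups(index, "other-499"))  # (1, 1)
print(table_scan(table, "other-499"))         # (1, 10)
```

Under these assumptions the low-selectivity value costs 200 page reads through the index versus 10 for a full scan, while the value that matches a single row is far cheaper through the index.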
