Database index on a column with duplicate values

Suppose there is a table containing employee details, including a column Gender whose value can be either M or F. Would it make sense to create an index on this column, and would it make searches faster? Logically, if we run a SELECT statement whose WHERE clause filters on Gender, the index should cut the search time in half. But I have heard that this kind of index will not help and would actually be ignored by the database optimizer when executing the query. I don't understand why. Can somebody please explain?

2 Solutions

#1

In most cases, only one index can be used to optimize a database query. If a query needs to match several indexed columns, the query planner has to decide which of these indexes to use. Each index has a cardinality, which is roughly the number of distinct values in the indexed column across the table. An index with higher cardinality is more effective, because selecting the rows that match it leaves very few rows to scan when checking the other conditions.
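
To make the idea of cardinality concrete, here is a minimal sketch in Python. The sample rows and the helper names `cardinality` and `avg_rows_per_value` are made up for illustration; a real query planner relies on its own stored statistics rather than counting like this.

```python
# Hypothetical sample rows: (employee_id, gender, city)
rows = [
    (1, "M", "Austin"), (2, "F", "Boston"), (3, "M", "Chicago"),
    (4, "F", "Denver"), (5, "M", "Austin"), (6, "F", "Boston"),
    (7, "M", "Eugene"), (8, "F", "Fresno"),
]


def cardinality(rows, col):
    """Number of distinct values found in column position `col`."""
    return len({row[col] for row in rows})


def avg_rows_per_value(rows, col):
    """Average number of rows an equality match on this column returns."""
    return len(rows) / cardinality(rows, col)


for name, col in [("employee_id", 0), ("gender", 1), ("city", 2)]:
    print(name, cardinality(rows, col), avg_rows_per_value(rows, col))
```

The fewer rows an equality match leaves behind on average, the more work the index saves. In this toy data a match on gender still leaves half the table, while a match on employee_id leaves a single row.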

An index on a gender column will only cut the table in half. Almost any index on a column with more distinct values will be more effective.

As an analogy, think of phone books. If you had a single phone book for an entire country, it would be huge, and it would be hard to find the specific person you want. So phone books are usually made for just a city, or a few cities in an area, to keep them a reasonable size. But if you had a "Male phone book" instead of regional phone books, it would be nearly as unusable as a phone book for the entire country. The criterion for creating new phone books is that they should be much smaller than a book for the entire country. A factor of 2 reduction isn't very useful when you're starting with an enormous size.

#2

Presumably, gender takes on two values. In general, an index on gender would not be helpful. In fact, it might be harmful.

If you are selecting on gender without an index, the database does a full table scan of the data pages to satisfy the query. On a typical page, about half the entries would match the query, so you would start getting results from the very first page read.

In this phase of query execution, an index is typically used to reduce the number of pages being read. However, if every page has a record with "M" and a record with "F", then every page still has to be read. To make matters worse, using an index means that you read from one random page, and then another, and another, instead of just reading the pages sequentially. Jumping around pages takes a bit of extra time. If the pages do not all fit in memory, you have a situation called thrashing, and it could take a really, really long time.
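
The "every page still has to be read" point can be checked with a rough simulation. This is only a sketch: the page capacity of 50 rows, the table size of 200 pages, and the assumption that matching rows are scattered at random are all made up for illustration, and the function names are hypothetical.

```python
import random

ROWS_PER_PAGE = 50   # assumed page capacity, for illustration only
NUM_PAGES = 200      # assumed table size in pages


def expected_fraction_of_pages(match_probability, rows_per_page):
    """Probability that a page holds at least one matching row."""
    return 1 - (1 - match_probability) ** rows_per_page


def simulate_pages_touched(match_probability, seed=0):
    """Count pages with at least one matching row in a randomly ordered table."""
    rng = random.Random(seed)
    touched = 0
    for _ in range(NUM_PAGES):
        page = [rng.random() < match_probability for _ in range(ROWS_PER_PAGE)]
        if any(page):
            touched += 1
    return touched


# A value that matches half the rows (like gender) lands on every page.
print(simulate_pages_touched(0.5), "of", NUM_PAGES, "pages")
print(expected_fraction_of_pages(0.5, ROWS_PER_PAGE))

# A value that matches 0.1% of the rows lands on only a few pages.
print(expected_fraction_of_pages(0.001, ROWS_PER_PAGE))
```

Under these assumptions, a lookup through the gender index visits all the pages a full scan would visit, only in a less favorable order. The same arithmetic shows why an index on a rare value does pay off.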

The one exception to this is a clustered index, where the rows on the pages are physically ordered by the indexed values. In that case, a query using the index would take roughly half the time, because only half the pages need to be read. This can be particularly effective in an "archive" table, where a flag marks the active records that are frequently searched. This flag might be set on 10%, 1%, or 0.1% of the records, and the clustered index can be a significant speed improvement.
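
The effect of clustering can be sketched the same way, by counting which pages contain a matching row before and after the rows are sorted on the column. The page capacity, the row counts, and the `pages_touched` helper are assumptions for illustration, and sorting a Python list is only a stand-in for how a real engine orders rows on disk.

```python
ROWS_PER_PAGE = 50  # assumed page capacity, for illustration only


def pages_touched(values, target):
    """Return (pages containing `target`, total pages) for a list of column values."""
    pages = [values[i:i + ROWS_PER_PAGE] for i in range(0, len(values), ROWS_PER_PAGE)]
    return sum(1 for page in pages if target in page), len(pages)


# Gender alternates row by row, so every page holds both values.
genders = ["F" if i % 2 == 0 else "M" for i in range(10_000)]
print(pages_touched(genders, "F"))          # unclustered
print(pages_touched(sorted(genders), "F"))  # clustered on gender

# An "active" flag set on 1% of the rows, spread evenly through the table.
flags = ["active" if i % 100 == 0 else "archived" for i in range(10_000)]
print(pages_touched(flags, "active"))          # unclustered
print(pages_touched(sorted(flags), "active"))  # clustered on the flag
```

In this toy table, clustering on gender halves the pages read, while clustering on the 1% flag shrinks the read from half the table to a couple of pages.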

On a large table, it would be rare to run a query that returns half the records. Quite possibly, gender in combination with other columns would be a good candidate for inclusion in an index.
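
As a rough illustration of gender combined with another column in one index, here is a sketch using Python's built-in sqlite3 module. The table layout, the sample rows, the index name `idx_emp_dept_gender`, and the choice to put department before gender are all assumptions made for this example. Other database engines may plan the query differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "id INTEGER PRIMARY KEY, name TEXT, gender TEXT, department TEXT)"
)
conn.executemany(
    "INSERT INTO employees (name, gender, department) VALUES (?, ?, ?)",
    [
        ("Ana", "F", "Sales"), ("Ben", "M", "Sales"),
        ("Cara", "F", "Engineering"), ("Dev", "M", "Engineering"),
        ("Eve", "F", "Sales"), ("Finn", "M", "Support"),
    ],
)

# Composite index: the more selective column first (an assumption), gender second.
conn.execute("CREATE INDEX idx_emp_dept_gender ON employees (department, gender)")

query = "SELECT name FROM employees WHERE department = ? AND gender = ?"
params = ("Sales", "F")

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)

names = sorted(row[0] for row in conn.execute(query, params))
print(names)
```

Here gender narrows a search that the department column has already made selective, instead of carrying the whole lookup on its own.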