Why doesn't MySQL use an index on an int field that is used as a boolean?

Date: 2020-12-17 00:09:24
select * from myTable where myInt

will not show any possible_keys in the EXPLAIN output, even though there is an index on the myInt field.

Edit:
The index in question is not unique.

4 Answers

#1


For MySQL to use the index, you have to explicitly compare the int field to a value (e.g. true, 1).

select * from myTable where myInt = true
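One way to check this on your own server is to run EXPLAIN on both forms and compare the possible_keys and key columns. The following is only a sketch: the table definition, the id column and the index name idx_myInt are assumptions made up for illustration, and only myTable and myInt come from the question.

```sql
-- Hypothetical schema; only the names myTable and myInt are from the question.
CREATE TABLE myTable (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    myInt INT NOT NULL DEFAULT 0,
    INDEX idx_myInt (myInt)   -- non-unique index, as stated in the question
);

-- Bare column used as the condition (the form from the question).
EXPLAIN SELECT * FROM myTable WHERE myInt;

-- Explicit comparison; look at whether possible_keys now lists idx_myInt.
EXPLAIN SELECT * FROM myTable WHERE myInt = 1;
```

Note that in MySQL TRUE is simply the constant 1, so myInt = true matches only rows where myInt is 1, while the bare WHERE myInt matches every non-zero value. The two queries return the same rows only if the column really holds nothing but 0 and 1.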

#2


I'm not a database expert, but doesn't it defeat the purpose of having an index on the field if there are only two possible values for the field?

If all of the values in the indexed column are unique, then the database engine can use the index to find the relevant rows directly. If there are only two possible values, then I don't see the purpose of having that field indexed: the DB engine has to do much the same work as it would if the index did not exist.

Perhaps MySQL is not showing it as a possible key because the engine has discarded the idea of using the index in the execution plan?

#3


There are lots of factors to consider.

One factor that should not enter into it is the notation used in the question. When the column is a boolean, then these conditions should be treated by the optimizer as identical:

SELECT * FROM MyTable WHERE MyInt;

SELECT * FROM MyTable WHERE MyInt != 0;

SELECT * FROM MyTable WHERE MyInt IS TRUE;

SELECT * FROM MyTable WHERE MyInt = TRUE;

There may be other equivalent formulations. The first of these is not standard SQL (even if the type of MyInt is BOOLEAN); the others are standard. But the optimizer should simply transform the shorthand into the appropriate long form and then behave as if the long form had been written by the user. (If the optimizer does not do this, then there is arguably a problem with the optimizer; the query should be reduced to a canonical form before deciding how to process it. However, there are often blind spots in even the best optimizers. Learning how to avoid those is an art form, and inherently DBMS-specific.)

The optimizer uses an index when it believes the index will boost performance of the query. When the index won't boost performance, it is ignored (if the optimizer is any good). Sometimes, that depends on whether the statistics for the index are up to date.

In data warehousing systems, the system can be designed and configured to make sequential scans of the table very fast; in such systems, if using an index would fetch more than a certain fraction of the rows, which can be as low as 25%, it can actually be quicker to do the full table scan than to use the index.

Think about it. When reading via an index, the DBMS has to do at least two reads; it reads the information about the row from the index page, and then it has to read the row from the data page.

Some DBMS provide index-only tables. All the data is in the index. Other DBMS provide a mechanism such that you can say "index is unique on columns A, B, C; however, include columns D and E in the data too". Then if the query requires data from A, B, C, D or E (or any combination) and there's no filtering on other columns, the DBMS only has to scan the index, not the table pages too.
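As a rough MySQL illustration of an index that can answer a query on its own, consider the sketch below. The table t, its columns and the index name idx_cover are hypothetical. MySQL has no separate "include" clause, so this swaps in a plain (non-unique) composite index that simply lists all five columns, which is not the same as a unique index on A, B, C with D and E carried along.

```sql
-- Hypothetical table; column names mirror the A..E of the text above.
CREATE TABLE t (
    a INT NOT NULL,
    b INT NOT NULL,
    c INT NOT NULL,
    d INT,
    e INT,
    f VARCHAR(100),                    -- a column that is NOT in the index
    INDEX idx_cover (a, b, c, d, e)    -- composite index over a..e
);

-- Only indexed columns are referenced, so the index alone can supply the data.
EXPLAIN SELECT a, d, e FROM t WHERE a = 1 AND b = 2;

-- Referencing f forces a visit to the table rows as well.
EXPLAIN SELECT a, d, e, f FROM t WHERE a = 1 AND b = 2;
```

When MySQL can satisfy a query from the index alone, EXPLAIN reports "Using index" in the Extra column; comparing the two plans on your own data shows whether that applies.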

Typically, you get many index rows to a page. However, for some tables, reading an index may require reading more data than reading the rows. Consider the archetypal many-to-many mapping table containing two (4-byte) integer ID values. That requires 8 bytes per row in the data pages, but each index entry probably requires a further 4-8 bytes of overhead on top of those 8 bytes (because the index key entry stores the two ID values plus the information needed to locate the corresponding row on disk). So, an index scan there may involve up to twice as much disk I/O as the data scan, even if the index scan is done 'index only'.
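Taking the figures above at face value (they are rough estimates, not measurements of any particular DBMS), the arithmetic can be written out as follows, where h is a symbol introduced here for the per-entry overhead:

```latex
\begin{align*}
\text{data bytes per row}    &= 4 + 4 = 8 \\
\text{index bytes per entry} &= 8 + h, \qquad 4 \le h \le 8 \\
\frac{\text{index bytes}}{\text{data bytes}} &= \frac{8 + h}{8} \in [1.5,\ 2]
\end{align*}
```

So under these assumptions the index is between one and a half and two times the size of the data it points to, which is where the "twice as much disk I/O" upper bound comes from.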

This barely scratches the surface of the possible reasons for using or not using an index.

#4


Your question's SQL looks malformed to me. Are you looking for non-null values of the column? This should use the index:

select * from myTable where myInt is not null
