I'm trying to optimize some of the database queries in my Rails app, and I have several that have me stumped. They all use IN in the WHERE clause and all do a full table scan, even though an appropriate index appears to be in place.
For example:
SELECT `user_metrics`.* FROM `user_metrics` WHERE (`user_metrics`.user_id IN (N,N,N,N,N,N,N,N,N,N,N,N))
performs a full table scan and EXPLAIN says:
select_type: simple
type: all
extra: using where
possible_keys: index_user_metrics_on_user_id (which is an index on the user_id column)
key: (none)
key_length: (none)
ref: (none)
rows: 208
Are indexes not used when IN is used, or do I need to do something differently? The queries here are generated by Rails, so I could revisit how my relationships are defined, but I thought I'd start with potential fixes at the DB level first.
5 Answers
#1
39
Have a look at how MySQL uses indexes.
Also validate whether MySQL still performs a full table scan after you add an additional 2,000 or so rows to your user_metrics table. In small tables, access by index is actually more expensive (I/O-wise) than a table scan, and MySQL's optimizer may take this into account.
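For example, a quick way to test that theory (a sketch only: the column list and id values are placeholders, it assumes user_id is not unique, and it should be run against a throw-away copy of the database):

-- Pad the table by duplicating its rows a few times (adjust the column list
-- to the real schema); do this in a disposable copy, not in production.
INSERT INTO `user_metrics` (`user_id`)
SELECT `user_id` FROM `user_metrics`;

-- Re-check the plan once the table has a few thousand rows; if `key` now
-- shows index_user_metrics_on_user_id, the earlier full scan was just a
-- small-table cost decision.
EXPLAIN
SELECT `user_metrics`.*
FROM `user_metrics`
WHERE `user_metrics`.`user_id` IN (1, 2, 3);  -- substitute your real ids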
Contrary to my previous post, it turns out that MySQL also uses a cost-based optimizer, which is very good news; that is, provided you run ANALYZE at least once when you believe that the volume of data in your database is representative of future day-to-day usage.
When dealing with cost-based optimizers (Oracle, Postgres, etc.), you need to make sure to periodically run ANALYZE on your various tables as their size increases by more than 10-15%. (Postgres will do this automatically for you by default, whereas other RDBMSs leave this responsibility to a DBA, i.e. you.) Through statistical analysis, ANALYZE helps the optimizer get a better idea of how much I/O (and other associated resources, such as the CPU needed for sorting) will be involved when choosing between the various candidate execution plans. Failure to run ANALYZE may result in very poor, sometimes disastrous planning decisions (e.g. millisecond queries sometimes taking hours because of bad nested loops on JOINs).
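In MySQL that step is a one-liner (a minimal sketch against the table from the question):

-- Refresh the table's index statistics so the cost-based optimizer has
-- up-to-date cardinality estimates to plan with.
ANALYZE TABLE `user_metrics`;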
If performance is still unsatisfactory after running ANALYZE, then you will typically be able to work around the issue by using hints, e.g. FORCE INDEX, whereas in other cases you might have stumbled over a MySQL bug (e.g. this older one, which could have bitten you were you to use Rails' nested_set).
Now, since you are in a Rails app, it will be cumbersome (and defeat the purpose of ActiveRecord) to issue your own custom queries with hints instead of continuing to use the ActiveRecord-generated ones.
I had mentioned that in our Rails application all SELECT queries dropped below 100 ms after switching to Postgres, whereas some of the complex joins generated by ActiveRecord would occasionally take as much as 15 s or more with MySQL 5.1, because of nested loops with inner table scans even when indices were available. No optimizer is perfect, and you should be aware of the options. Another potential performance issue to be aware of, besides query-plan optimization, is locking, but that is outside the scope of your problem.
#2
9
Try forcing this index:
SELECT `user_metrics`.*
FROM `user_metrics` FORCE INDEX (index_user_metrics_on_user_id)
WHERE (`user_metrics`.user_id IN (N,N,N,N,N,N,N,N,N,N,N,N))
I just checked, and it does use an index on exactly the same kind of query:
EXPLAIN EXTENDED
SELECT * FROM tests WHERE (test IN ('test 1', 'test 2', 'test 3', 'test 4', 'test 5', 'test 6', 'test 7', 'test 8', 'test 9'))
(id, select_type, table, type, possible_keys, key, key_len, ref, rows, filtered, Extra)
1, 'SIMPLE', 'tests', 'range', 'ix_test', 'ix_test', '602', '', 9, 100.00, 'Using where'
#3
6
Sometimes MySQL does not use an index, even if one is available. One circumstance under which this occurs is when the optimizer estimates that using the index would require MySQL to access a very large percentage of the rows in the table. (In this case, a table scan is likely to be much faster because it requires fewer seeks.)
What percentage of rows match your IN clause?
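A rough way to check, using the table and column from the question (the id list is a placeholder for your real values):

-- Fraction of user_metrics rows matched by the IN list; if it is a large
-- share of the table, a full scan can genuinely be the cheaper plan.
SELECT COUNT(*) / (SELECT COUNT(*) FROM `user_metrics`) AS matched_fraction
FROM `user_metrics`
WHERE `user_id` IN (1, 2, 3);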
#4
3
I know I'm late to the party, but I hope I can help someone else with a similar problem.
Lately I had the same problem, and I decided to use a self-join to solve it. The problem is not MySQL; the problem is us: the value returned by the subquery has a different type from the column in our table, so we must cast the subquery's result to the type of the column we are selecting on. Below is example code:
-- join against a derived table of the wanted ids instead of filtering with IN directly
select um.*
from `user_metrics` um
join (select `user_id`
      from `user_metrics`
      where `user_id` in (N, N, N, N)) as temp
  on um.`user_id` = temp.`user_id`
Or my own code:
Old (does not use the index; ~4 s):
SELECT `jxm_character`.*
FROM jxm_character
WHERE information_date IN (SELECT DISTINCT information_date
                           FROM jxm_character
                           WHERE information_date >= DATE_SUB('2016-12-2', INTERVAL 7 DAY))
  AND `jxm_character`.`ranking_type` = 1
  AND `jxm_character`.`character_id` = 3146089;
New (uses the index; ~0.02 s):
SELECT *
FROM jxm_character jc
JOIN (SELECT DISTINCT information_date
      FROM jxm_character
      WHERE information_date >= DATE_SUB('2016-12-2', INTERVAL 7 DAY)) AS temp
  ON jc.information_date = STR_TO_DATE(temp.information_date, '%Y-%m-%d')
  AND jc.ranking_type = 1
  AND jc.character_id = 3146089;
jxm_character:
- Records: ~3.5M
- PK: jxm_character(information_date, ranking_type, character_id)
SHOW VARIABLES LIKE '%version%';
'protocol_version', '10'
'version', '5.1.69-log'
'version_comment', 'Source distribution'
Last note: make sure you understand MySQL's leftmost-prefix rule for composite indexes.
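For example, with the composite primary key above, a query has to filter on a leading prefix of (information_date, ranking_type, character_id) for that key to be usable; a sketch against the same table:

-- Can use the PK: the filter starts at the left-most column, information_date.
EXPLAIN
SELECT *
FROM jxm_character
WHERE information_date >= DATE_SUB('2016-12-2', INTERVAL 7 DAY);

-- Cannot use the PK for lookups: character_id alone skips the leading
-- columns, so without a separate index on character_id a full scan is likely.
EXPLAIN
SELECT *
FROM jxm_character
WHERE character_id = 3146089;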
P.S. Sorry for my bad English. I posted my code (production code, of course) to make my solution clear :D
#5
0
Does it get any better if you remove the redundant brackets around the where clause?
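That is, the same query from the question with the extra brackets dropped (the id values are still placeholders, as in the original):

SELECT `user_metrics`.*
FROM `user_metrics`
WHERE `user_metrics`.`user_id` IN (N, N, N, N);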
Although it could just be that, because you've only got 200 or so rows, it decided a table scan would be faster. Try it again with a table that has more records in it.