We have a table called 'events' with about 25,000,000 rows and the following schema:
TABLE events
- campaign_id : int(10)
- city : varchar(60)
- country_code : varchar(2)
The following query takes a VERY long time (> 2000 seconds):
SELECT COUNT(*) AS counted_events, country_code
FROM events
WHERE campaign_id IN (597)
GROUP BY city, country_code
ORDER BY counted_events
We found out that it's because of the GROUP BY part.
There is already an index idx_campaign_id_city_country_code on (campaign_id, city, country_code), and it is used.
Maybe someone can suggest a good solution to speed it up?
Update:
EXPLAIN shows that, out of many possible indexes, MySQL uses this one: 'idx_campaign_id_city_country_code'. For 'rows' it shows '471304', and for 'Extra' it shows 'Using where; Using temporary; Using filesort'.
Here is the whole result of EXPLAIN:
- id: '1'
- select_type: 'SIMPLE'
- table: 'events'
- type: 'ref'
- possible_keys: 'index_campaign,idx_campaignid_paid,idx_city_country_code,idx_city_country_code_campaign_id,idx_cid,idx_campaign_id_city_country_code'
- key: 'idx_campaign_id_city_country_code'
- key_len: '4'
- ref: 'const'
- rows: '471304'
- Extra: 'Using where; Using temporary; Using filesort'
UPDATE:
Ok, I think it has been solved:
Looking at the pasted query again, I realized that I forgot to mention that there was one more column in the SELECT, called 'country_name'. The query was very slow with country_name included, but I'll just leave it out, and now the performance of the query is absolutely OK. Sorry for that mistake!
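For anyone who wants to reproduce the difference, the two variants can be compared with EXPLAIN. This is only a sketch: the exact text of the slow query was never posted, so the first statement assumes it differed from the query above only by the extra country_name column.

```sql
-- Assumed slow variant: country_name is not part of
-- idx_campaign_id_city_country_code, so the table rows must be read.
-- (On servers with ONLY_FULL_GROUP_BY enabled this form may be rejected,
-- because country_name is neither grouped nor aggregated.)
EXPLAIN
SELECT COUNT(*) AS counted_events, country_code, country_name
FROM events
WHERE campaign_id IN (597)
GROUP BY city, country_code
ORDER BY counted_events;

-- Fast variant: every referenced column is in the index.
-- 'Extra' should now include 'Using index'; a temporary table and filesort
-- may still appear because of ORDER BY on the aggregated count.
EXPLAIN
SELECT COUNT(*) AS counted_events, country_code
FROM events
WHERE campaign_id IN (597)
GROUP BY city, country_code
ORDER BY counted_events;
```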
So thank you for all your helpful comments, I'll upvote all the good answers! There were some really helpful additions that I will probably also apply (like changing types etc.).
4 solutions
#1
3
Without seeing what EXPLAIN says it's a long shot, but anyway:
- make an index on (city, country_code)
- see if there's a way to use partitioning; your table is getting rather huge
- if country code is always 2 chars, change it to char
- change numeric indexes to unsigned int
Please post the entire EXPLAIN output.
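The type changes suggested above could look roughly like this. It is a sketch, not a tested migration: the NOT NULL attributes are assumptions (the question does not state nullability), and on a table of this size the ALTER is likely to rebuild the table, so it is worth trying on a copy first. Judging by the name idx_city_country_code in the posted possible_keys, an index on (city, country_code) probably exists already, so the CREATE INDEX is left commented out.

```sql
-- MODIFY replaces the whole column definition, so restate every attribute
-- (NULL/NOT NULL, DEFAULT, ...) that the real columns carry.
ALTER TABLE events
    MODIFY country_code CHAR(2) NOT NULL,      -- assumption: always exactly 2 chars, never NULL
    MODIFY campaign_id INT UNSIGNED NOT NULL;  -- assumption: ids are never negative or NULL

-- Only needed if no equivalent index exists yet; the index name here is hypothetical.
-- CREATE INDEX idx_city_country ON events (city, country_code);
```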
#2
0
Don't use IN(); better use:
WHERE campaign_id = 597
OR campaign_id = 231
OR ....
AFAIK IN() is very slow.
Update: as nik0lias commented, IN() is faster than concatenating OR conditions.
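For comparison, here is a sketch of the IN() form with two campaign ids (231 is simply the second id from the OR example above). Running EXPLAIN on both forms against your own data is the most reliable way to see whether the optimizer treats them differently.

```sql
-- IN() with a constant list
EXPLAIN
SELECT COUNT(*) AS counted_events, city, country_code
FROM events
WHERE campaign_id IN (597, 231)
GROUP BY city, country_code;

-- Equivalent chained OR conditions
EXPLAIN
SELECT COUNT(*) AS counted_events, city, country_code
FROM events
WHERE campaign_id = 597
   OR campaign_id = 231
GROUP BY city, country_code;
```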
#3
0
Some ideas:
- Given the nature and size of the table, it would be a great candidate for partitioning by country. This way the events of each country would be stored in a different physical table, even though the whole thing behaves as one big virtual table.
- Is country code a string? Maybe you have a country_id that could be easier to sort. (It may force you to create or change indexes.)
- Are you really using the city in the GROUP BY?
#4
0
- Partitioning, especially by country, will not help.
- column IN (const-list) is not slow; it is in fact a case with special optimization.
The problem is that MySQL doesn't use the index for sorting. I cannot say why, because it should. Could be a bug.
The best strategy to execute this query is to scan the sub-tree of the index where campaign_id=597. Since the index is then sorted by city, country_code, no extra sorting is needed and rows can be counted while scanning.
So the indexes are already optimal for this query. MySQL is just not using them correctly.
I got more information offline. It seems this is not a database problem at all, but:
- The schema is not normalized. The table contains not only country_code, but also country_name (this should be in an extra table).
- The real query contains country_name in the select list. But since that column is not part of the index, MySQL cannot use an index-only scan.
As soon as country_name is dropped from the select list, the query reverts to an index-only scan ("Using index" in EXPLAIN output) and is blazingly fast.
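If country_name is still needed in the result, one option is to aggregate first and look the name up afterwards. This is only a sketch: the `countries` lookup table and its columns are hypothetical (no such table appears in the question), and it assumes each country_code maps to exactly one country_name.

```sql
-- Inner query touches only columns of idx_campaign_id_city_country_code,
-- so it can stay an index-only scan; the name is joined in afterwards
-- on the much smaller grouped result.
SELECT g.counted_events, g.city, g.country_code, c.country_name
FROM (
    SELECT COUNT(*) AS counted_events, city, country_code
    FROM events
    WHERE campaign_id IN (597)
    GROUP BY city, country_code
) AS g
JOIN countries AS c              -- hypothetical table: countries(country_code, country_name)
    ON c.country_code = g.country_code
ORDER BY g.counted_events;
```

Alternatively, appending country_name to the existing index should also make it covering for the original query, at the cost of a larger index and without fixing the normalization issue.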