SQL查询:加快大型表的速度

时间:2021-04-02 02:44:59

We have a table with about 25,000,000 rows called 'events' having the following schema:

我们有一个包含大约25,000,000行的表,称为“事件”,具有以下模式:

TABLE events
- campaign_id   : int(10)
- city      : varchar(60)
- country_code  : varchar(2)

The following query takes VERY long (> 2000 seconds):

以下查询需要很长时间(> 2000秒):

SELECT COUNT(*) AS counted_events, country_code
FROM events
WHERE campaign_id` in (597) 
GROUPY BY city, country_code
ORDER BY counted_events

We found out that it's because of the GROUP BY part.

我们发现这是因为GROUP BY部分。

There is already an index idx_campaign_id_city_country_code on (campaign_id, city, country_code) which is used.

已经使用了(campaign_id,city,country_code)上的索引idx_campaign_id_city_country_code。

Maybe someone can suggest a good solution to speed it up?

也许有人可以提出一个很好的解决方案来加速它?

Update:

更新:

'Explain' shows that out of many possible index MySql uses this one: 'idx_campaign_id_city_country_code', for rows it shows: '471304' and for 'Extra' it shows: 'Using where; Using temporary; Using filesort' –

'Explain'表明,在许多可能的索引中,MySql使用了这个:'idx_campaign_id_city_country_code',对于它显示的行:'471304'和'Extra'它显示:'使用where;使用临时;使用filesort' -

Here is the whole result of EXPLAIN:

这是EXPLAIN的整个结果:

  • id: '1'
  • id:'1'
  • select_type: 'SIMPLE'
  • select_type:'SIMPLE'
  • table: 'events'
  • 表:'事件'
  • type: 'ref'
  • 输入:'ref'
  • possible_keys: 'index_campaign,idx_campaignid_paid,idx_city_country_code,idx_city_country_code_campaign_id,idx_cid,idx_campaign_id_city_country_code'
  • possible_keys:'index_campaign,idx_campaignid_paid,idx_city_country_code,idx_city_country_code_campaign_id,idx_cid,idx_campaign_id_city_country_code'
  • key: 'idx_campaign_id_city_country_code'
  • key:'idx_campaign_id_city_country_code'
  • key_len: '4'
  • key_len:'4'
  • ref: 'const'
  • ref:'const'
  • rows: '471304'
  • 行:'471304'
  • Extra: 'Using where; Using temporary; Using filesort'
  • 额外:'使用地点;使用临时;使用filesort'

UPDATE:

更新:

Ok, I think it has been solved:

好的,我认为它已经解决了:

Looking at the pasted query here again I realized that I forget to mention here that there was one more column in the SELECT called 'country_name'. So the query was very slow then (including country_name), but I'll just leave it out and now the performance of the query is absolutely ok. Sorry for that mistake!

再次查看粘贴的查询,我意识到我忘记在这里提到SELECT中还有一个名为'country_name'的列。所以查询非常慢(包括country_name),但我会把它留下来,现在查询的性能绝对可以。抱歉,这个错误!

So thank you for all your helpful comments, I'll upvote all the good answers! There were some really helpful additions, that I probably also we apply (like changing types etc).

非常感谢您的所有有用的评论,我将提供所有好的答案!有一些非常有用的补充,我可能也应用(如改变类型等)。

4 个解决方案

#1


3  

without seeing what EXPLAIN says it's a long distance shot, anyway:

无论如何,没有看到EXPLAIN说它是远距离射击:

  1. make an index on (city,country_code)
  2. 在(city,country_code)上建立索引
  3. see if there's a way to use partitioning, your table is getting rather huge
  4. 看看是否有办法使用分区,你的表变得相当庞大
  5. if country code is always 2 chars change it to char
  6. 如果国家代码总是2个字符,则将其更改为char
  7. change numeric indexes to unsigned int
  8. 将数字索引更改为unsigned int

post entire EXPLAIN output

发布整个EXPLAIN输出

#2


0  

don't use IN() - better use:

不要使用IN() - 更好地使用:

WHERE campaign_id = 597
OR campaign_id = 231
OR ....

afaik IN() is very slow.

afaik IN()很慢。

update: like nik0lias commented - IN() is faster than concatenating OR conditions.

更新:像nik0lias一样评论 - IN()比连接OR条件更快。

#3


0  

Some ideas:

一些想法:

  • Given the nature and size of the table it would be a great candidate for partitioned tables by country. This way the events of every country would be stored in a different physical table even if it behaves as a virtual big table

    鉴于表格的性质和大小,它将成为按国家/地区分区表格的绝佳选择。这样,每个国家的事件都将存储在不同的物理表中,即使它表现为虚拟大表

  • Is country code an string? May be you have a country_id that could be easier to sort. (It may force you to create or change indexes)

    国家代码是字符串吗?可能你有一个country_id,可以更容易排序。 (它可能会强制您创建或更改索引)

  • Are you really using the city in the group by?

    你是真的在小组中使用这座城市吗?

#4


0  

  • partitioning - especially by country will not help
  • 分区 - 特别是按国家划分也无济于事
  • column IN (const-list) is not slow, it is in fact a case with special optimization
  • 列IN(const-list)并不慢,实际上是特殊优化的情况

The problem is, that MySQL doesn't use the index for sorting. I cannot say why, because it should. Could be a bug.

问题是,MySQL不使用索引进行排序。我不能说为什么,因为它应该。可能是一个错误。

The best strategy to execute this query is to scan that sub-tree of the index where event_id=597. Since the index is then sorted by city_id, country_code no extra sorting is needed and rows can be counted while scanning.

执行此查询的最佳策略是扫描索引的子树,其中event_id = 597。由于索引按city_id排序,因此country_code不需要额外排序,扫描时可以计算行数。

So the indexes are already optimal for this query. MySQL is just not using them correctly.

因此索引已经是此查询的最佳选择。 MySQL没有正确使用它们。


I'm getting more information off line. It seems this is not a database problem at all, but

我正在离线获取更多信息。看起来这根本不是数据库问题,但是

  1. the schema is not normalized. The table contains not only country_code, but also country_name (this should be in an extra table).
  2. 架构未规范化。该表不仅包含country_code,还包含country_name(这应该在一个额外的表中)。
  3. the real query contains country_name in the select list. But since that column is not indexed, MySQL cannot use an index scan.
  4. 真实查询在选择列表中包含country_name。但由于该列未编入索引,因此MySQL无法使用索引扫描。

As soon as country_name is dropped from the select list, the query reverts to an index-only scan ("using index" in EXPLAIN output) and is blazingly fast.

从选择列表中删除country_name后,查询将恢复为仅索引扫描(EXPLAIN输出中的“using index”)并且速度非常快。

#1


3  

without seeing what EXPLAIN says it's a long distance shot, anyway:

无论如何,没有看到EXPLAIN说它是远距离射击:

  1. make an index on (city,country_code)
  2. 在(city,country_code)上建立索引
  3. see if there's a way to use partitioning, your table is getting rather huge
  4. 看看是否有办法使用分区,你的表变得相当庞大
  5. if country code is always 2 chars change it to char
  6. 如果国家代码总是2个字符,则将其更改为char
  7. change numeric indexes to unsigned int
  8. 将数字索引更改为unsigned int

post entire EXPLAIN output

发布整个EXPLAIN输出

#2


0  

don't use IN() - better use:

不要使用IN() - 更好地使用:

WHERE campaign_id = 597
OR campaign_id = 231
OR ....

afaik IN() is very slow.

afaik IN()很慢。

update: like nik0lias commented - IN() is faster than concatenating OR conditions.

更新:像nik0lias一样评论 - IN()比连接OR条件更快。

#3


0  

Some ideas:

一些想法:

  • Given the nature and size of the table it would be a great candidate for partitioned tables by country. This way the events of every country would be stored in a different physical table even if it behaves as a virtual big table

    鉴于表格的性质和大小,它将成为按国家/地区分区表格的绝佳选择。这样,每个国家的事件都将存储在不同的物理表中,即使它表现为虚拟大表

  • Is country code an string? May be you have a country_id that could be easier to sort. (It may force you to create or change indexes)

    国家代码是字符串吗?可能你有一个country_id,可以更容易排序。 (它可能会强制您创建或更改索引)

  • Are you really using the city in the group by?

    你是真的在小组中使用这座城市吗?

#4


0  

  • partitioning - especially by country will not help
  • 分区 - 特别是按国家划分也无济于事
  • column IN (const-list) is not slow, it is in fact a case with special optimization
  • 列IN(const-list)并不慢,实际上是特殊优化的情况

The problem is, that MySQL doesn't use the index for sorting. I cannot say why, because it should. Could be a bug.

问题是,MySQL不使用索引进行排序。我不能说为什么,因为它应该。可能是一个错误。

The best strategy to execute this query is to scan that sub-tree of the index where event_id=597. Since the index is then sorted by city_id, country_code no extra sorting is needed and rows can be counted while scanning.

执行此查询的最佳策略是扫描索引的子树,其中event_id = 597。由于索引按city_id排序,因此country_code不需要额外排序,扫描时可以计算行数。

So the indexes are already optimal for this query. MySQL is just not using them correctly.

因此索引已经是此查询的最佳选择。 MySQL没有正确使用它们。


I'm getting more information off line. It seems this is not a database problem at all, but

我正在离线获取更多信息。看起来这根本不是数据库问题,但是

  1. the schema is not normalized. The table contains not only country_code, but also country_name (this should be in an extra table).
  2. 架构未规范化。该表不仅包含country_code,还包含country_name(这应该在一个额外的表中)。
  3. the real query contains country_name in the select list. But since that column is not indexed, MySQL cannot use an index scan.
  4. 真实查询在选择列表中包含country_name。但由于该列未编入索引,因此MySQL无法使用索引扫描。

As soon as country_name is dropped from the select list, the query reverts to an index-only scan ("using index" in EXPLAIN output) and is blazingly fast.

从选择列表中删除country_name后,查询将恢复为仅索引扫描(EXPLAIN输出中的“using index”)并且速度非常快。