Very slow MySQL query on a 2.5-million-row table

Date: 2022-09-19 19:06:25

I'm really struggling to get the query time down; it currently has to query 2.5 million rows and takes over 20 seconds.

Here is the query:

SELECT play_date AS date, COUNT(DISTINCT(email)) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date desc;

And here is the table definition (wrapped back into the CREATE TABLE it was dumped from):

CREATE TABLE `log` (
  `id` int(11) NOT NULL auto_increment,
  `instance` varchar(255) NOT NULL,
  `email` varchar(255) NOT NULL,
  `type` enum('play','claim','friend','email') NOT NULL,
  `result` enum('win','win-small','lose','none') NOT NULL,
  `timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
  `play_date` date NOT NULL,
  `email_refer` varchar(255) NOT NULL,
  `remote_addr` varchar(15) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `email` (`email`),
  KEY `result` (`result`),
  KEY `timestamp` (`timestamp`),
  KEY `email_refer` (`email_refer`),
  KEY `type_2` (`type`,`timestamp`),
  KEY `type_4` (`type`,`play_date`),
  KEY `type_result` (`type`,`play_date`,`result`)
);

Here is the EXPLAIN output:

id: 1
select_type: SIMPLE
table: log
type: ref
possible_keys: type_2,type_4,type_result
key: type_4
key_len: 1
ref: const
rows: 270404
Extra: Using where

The query is using the type_4 index.

Does anyone know how I could speed this query up?

Thanks Tom

8 Answers

#1 (15 votes)

That's relatively good already. The performance sink is that the query has to compare 270,404 varchars for equality for the COUNT(DISTINCT(email)), meaning that 270,404 rows have to be read.

You should be able to make the count faster by creating a covering index. This means that the actual rows do not need to be read, because all the required information is present in the index itself.

To do this, change the index as follows:

KEY `type_4` (`type`,`play_date`, `email`)
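
Applied to the table above, that could be done with an ALTER TABLE along these lines (a sketch; rebuilding an index over 2.5 million rows can itself take a while):

ALTER TABLE log
  DROP INDEX type_4,
  ADD INDEX type_4 (`type`, `play_date`, `email`);

With email in the index, MySQL can answer the whole query from the index alone, and EXPLAIN should then show "Using index" in the Extra column.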

I would be surprised if that didn't speed things up quite a bit.

(Thanks to MarkR for the proper term.)

#2 (5 votes)

Your indexing is probably as good as you can get it. You have a compound index on the two columns in your WHERE clause, and the EXPLAIN you posted indicates that it is being used. Unfortunately, 270,404 rows match the criteria in your WHERE clause, and they all need to be considered. Also, you're not returning unnecessary rows in your select list.

My advice would be to aggregate the data daily (or hourly, or whatever makes sense) and cache the results. That way you can access slightly stale data instantly. Hopefully this is acceptable for your purposes.
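
A minimal sketch of that approach, using a hypothetical summary table named daily_play_counts refreshed by a nightly job:

CREATE TABLE daily_play_counts (
  play_date date NOT NULL,
  distinct_emails int NOT NULL,
  PRIMARY KEY (play_date)
);

-- Nightly refresh (e.g. from cron); REPLACE overwrites already-existing days
REPLACE INTO daily_play_counts (play_date, distinct_emails)
SELECT play_date, COUNT(DISTINCT email)
FROM log
WHERE type = 'play'
GROUP BY play_date;

The report then becomes a trivial range query against daily_play_counts instead of the full log table.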

#3 (4 votes)

Try an index on (play_date, type) (the same columns as type_4, just with the fields reversed) and see if that helps.

There are 4 possible types, and I assume hundreds of possible dates. If the query uses the (type, play_date) index, it basically (not 100% accurate, but the general idea) does this:

(A) Find all the Play records (about 25% of the file)
(B) Now within that subset, find all of the requested dates

By reversing the index, the approach is:

> (A) Find all the dates within range (maybe 1-2% of the file)
> (B) Now find all PLAY types within that smaller portion of the file
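
In SQL, the suggestion is simply an index with the columns swapped (a sketch; the index name is arbitrary):

ALTER TABLE log ADD INDEX date_type (`play_date`, `type`);

Because play_date is the leading column, the range predicate narrows the scan to the requested dates first, and the type = 'play' filter is applied within that slice.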

Hope this helps.

#4 (3 votes)

Extracting email into a separate table should be a good performance boost, since counting distinct varchar fields takes a while. Other than that, the correct index is being used, and the query itself is as optimized as it can be (except for the email, of course).
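
A sketch of that normalization, with hypothetical names: emails are stored once in a lookup table, and log references them by an integer id, so the DISTINCT runs over 4-byte integers instead of varchar(255) values.

CREATE TABLE emails (
  id int NOT NULL auto_increment,
  email varchar(255) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY email (email)
);

-- log gains an `email_id` int column (populated from emails.id),
-- and the covering index becomes (`type`, `play_date`, `email_id`).
SELECT play_date AS date, COUNT(DISTINCT email_id) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
  AND type = 'play'
GROUP BY play_date
ORDER BY play_date DESC;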

#5 (1 vote)

The COUNT(DISTINCT(email)) part is the bit that's killing you. If you only truly need the first 2000 of the 270,404 results, perhaps it would help to do the email count only for those results instead of for the whole set.

SELECT shortlist.date, COUNT(DISTINCT log.email) AS count
FROM log,
(
    -- The derived table must also expose `id` for the outer join to work
    SELECT id, play_date AS date
      FROM log
     WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
       AND type = 'play'
     ORDER BY play_date DESC
     LIMIT 2000
) AS shortlist
WHERE shortlist.id = log.id
GROUP BY shortlist.date

#6 (0 votes)

Try creating an index only on play_date.
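
That is (a sketch; the index name is arbitrary):

ALTER TABLE log ADD INDEX play_date (`play_date`);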

#7 (0 votes)

Long term, I would recommend building a summary table with a primary key of play_date and a count of distinct emails.

Depending on how up to date you need it to be, either allow it to be updated daily (by play_date) or keep it live via a trigger on the log table.
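
COUNT(DISTINCT ...) cannot be maintained incrementally on its own, so a live version needs a helper table whose primary key enforces uniqueness. A sketch, with hypothetical names:

CREATE TABLE daily_emails (
  play_date date NOT NULL,
  email varchar(255) NOT NULL,
  PRIMARY KEY (play_date, email)
);

DELIMITER //
CREATE TRIGGER log_after_insert AFTER INSERT ON log
FOR EACH ROW
BEGIN
  IF NEW.type = 'play' THEN
    -- INSERT IGNORE makes repeat plays by the same email on the same day no-ops
    INSERT IGNORE INTO daily_emails (play_date, email)
    VALUES (NEW.play_date, NEW.email);
  END IF;
END//
DELIMITER ;

The report is then a plain COUNT(*) over daily_emails, grouped by day.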

#8 (0 votes)

There is a good chance a table scan will be quicker than random access to over 200,000 rows:

SELECT ... FROM log IGNORE INDEX (type_2,type_4,type_result) ...
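
Spelled out against the query from the question, that would be:

SELECT play_date AS date, COUNT(DISTINCT email) AS count
FROM log IGNORE INDEX (type_2, type_4, type_result)
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
  AND type = 'play'
GROUP BY play_date
ORDER BY play_date DESC;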

Also, for large grouped queries you may see better performance by forcing a filesort rather than a hashtable-based GROUP BY (since, if the hash table turns out to need more than tmp_table_size or max_heap_table_size, performance collapses):

SELECT SQL_BIG_RESULT ...
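
Again applied to the original query:

SELECT SQL_BIG_RESULT play_date AS date, COUNT(DISTINCT email) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
  AND type = 'play'
GROUP BY play_date
ORDER BY play_date DESC;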
