COUNT(DISTINCT) is slow even with a covering index

Date: 2021-05-12 03:47:02

We have a table in MySQL with around 30 million records; the following is the table structure:

CREATE TABLE `campaign_logs` (
  `domain` varchar(50) DEFAULT NULL,
  `campaign_id` varchar(50) DEFAULT NULL,
  `subscriber_id` varchar(50) DEFAULT NULL,
  `message` varchar(21000) DEFAULT NULL,
  `log_time` datetime DEFAULT NULL,
  `log_type` varchar(50) DEFAULT NULL,
  `level` varchar(50) DEFAULT NULL,
  `campaign_name` varchar(500) DEFAULT NULL,
  KEY `subscriber_id_index` (`subscriber_id`),
  KEY `log_type_index` (`log_type`),
  KEY `log_time_index` (`log_time`),
  KEY `campid_domain_logtype_logtime_subid_index` (`campaign_id`,`domain`,`log_type`,`log_time`,`subscriber_id`),
  KEY `domain_logtype_logtime_index` (`domain`,`log_type`,`log_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Following is my query


I'm doing a UNION ALL instead of using an IN operation:


SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       count(DISTINCT subscriber_id) AS COUNT,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_OPENED'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date

UNION ALL

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       COUNT(DISTINCT subscriber_id) AS COUNT,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_SENT'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date

UNION ALL

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       COUNT(DISTINCT subscriber_id) AS COUNT,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_CLICKED'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date

Following is my EXPLAIN output:


+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
| id | select_type  | table         | type  | possible_keys                             | key                                       | key_len | ref  | rows   | Extra                                    |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
|  1 | PRIMARY      | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL |  55074 | Using where; Using index; Using filesort |
|  2 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL | 330578 | Using where; Using index; Using filesort |
|  3 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL |   1589 | Using where; Using index; Using filesort |
| NULL | UNION RESULT | <union1,2,3>  | ALL   | NULL                                      | NULL                                      | NULL    | NULL |   NULL |                                          |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
  1. I changed COUNT(subscriber_id) to COUNT(*) and observed no performance gain.

  2. I removed COUNT(DISTINCT subscriber_id) from the query and got a huge performance gain: results now come back in about 1.5 seconds, whereas previously the query took 50 seconds to 1 minute. But I need the distinct count of subscriber_id from the query.


Following is the EXPLAIN output when I remove COUNT(DISTINCT subscriber_id) from the query:


+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
| id | select_type  | table         | type  | possible_keys                             | key                                       | key_len | ref  | rows   | Extra                                                     |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
|  1 | PRIMARY      | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL |  55074 | Using where; Using index; Using temporary; Using filesort |
|  2 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL | 330578 | Using where; Using index; Using temporary; Using filesort |
|  3 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468     | NULL |   1589 | Using where; Using index; Using temporary; Using filesort |
| NULL | UNION RESULT | <union1,2,3>  | ALL   | NULL                                      | NULL                                      | NULL    | NULL |   NULL |                                                           |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
  3. I ran the three queries individually by removing the UNION ALL. One query took 32 seconds; the others take about 1.5 seconds each, but the first query deals with around 350K records while the others deal with only 2K rows.

I could solve my performance problem by leaving out COUNT(DISTINCT...) but I need those values. Is there a way to refactor my query, or add an index, or something, to get the COUNT(DISTINCT...) values, but much faster?


UPDATE: the following information is about the data distribution of the above table:


For 1 domain and 1 campaign there are 20 log_types and 1k-200k subscribers.


The domain I'm running the above query for has 180k+ subscribers.


6 Solutions

#1 (5 votes)

If the query without the count(distinct) is going much faster, perhaps you can do nested aggregation:


SELECT log_type, log_date,
       count(*) AS COUNT, sum(cnt) AS total
FROM (SELECT log_type,
             DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
             subscriber_id, count(*) as cnt
      FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
      WHERE DOMAIN = 'xxx' AND
            campaign_id = '123' AND
            log_type IN ('EMAIL_SENT', 'EMAIL_OPENED', 'EMAIL_CLICKED') AND
            log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND 
                             CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
      GROUP BY log_type, log_date, subscriber_id
     ) l
GROUP BY log_type, log_date;

With a bit of luck, this will take 2-3 seconds rather than 50. However, you might need to break this out into subqueries to get full performance. So, if this does not yield a significant gain, change the IN back to = for one of the types. If that works, then the UNION ALL may be necessary.


EDIT:

Another approach is to use variables to number the rows before the GROUP BY:


SELECT log_type, log_date, count(*) as cnt,
       SUM(rn = 1) as sub_cnt
FROM (SELECT log_type,
             DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
             subscriber_id,
             (@rn := if(@grp = concat_ws(':', log_type, DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p'), subscriber_id), @rn + 1,
                        if(@grp := concat_ws(':', log_type, DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p'), subscriber_id), 1, 1)
                       )
              ) as rn
      FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index) CROSS JOIN
           (select @rn := 0, @grp := '') vars
      WHERE DOMAIN = 'xxx' AND
            campaign_id = '123' AND
            log_type IN ('EMAIL_SENT', 'EMAIL_OPENED', 'EMAIL_CLICKED') AND
            log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00', '+00:00', '+05:30') AND 
                             CONVERT_TZ('2015-03-01 23:59:58', '+00:00', '+05:30')
      ORDER BY log_type, log_date, subscriber_id
     ) t
GROUP BY log_type, log_date;

This still requires another sort of the data, but it might help.


#2 (3 votes)

To answer your question:


Is there a way to refactor my query, or add an index, or something, to get the COUNT(DISTINCT...) values, but much faster?


Yes: do not group by a calculated field (the result of a function). Instead, pre-calculate it, save it in a persistent column, and include that column in the index.


I would try to do the following and see if it changes performance significantly.


1) Simplify the query and focus on one part. Keep only the longest-running SELECT of the three and drop the UNION for the tuning period. Once the longest SELECT is optimized, add the others back and check how the full query performs.


2) Grouping by the result of a function doesn't let the engine use the index efficiently. Add another column to the table (temporarily at first, just to test the idea) holding the result of this function. As far as I can see you want to group by hour, so add a column log_time_hour DATETIME and set it to log_time rounded/truncated to the nearest hour (preserving the date component).


Add an index using the new column: (domain, campaign_id, log_type, log_time_hour, subscriber_id). The order of the first three columns should not matter (you compare them to constants with equality, not ranges), but keep them in the same order as in the query. Or, better, order them in both the index definition and the query by selectivity: if you have 100,000 campaigns, 1,000 domains, and 3 log types, put them in this order: campaign_id, domain, log_type. It should not matter much, but it is worth checking. log_time_hour has to come fourth in the index definition and subscriber_id last.

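A sketch of that migration, following the column and index ordering discussed above (the index name is ours, and as suggested, test this on a copy of the table first):

```sql
-- Add the precomputed hour column and backfill it by truncating log_time to the hour
ALTER TABLE campaign_logs ADD COLUMN log_time_hour DATETIME AFTER log_time;

UPDATE campaign_logs
   SET log_time_hour = DATE_FORMAT(log_time, '%Y-%m-%d %H:00:00');

-- Covering index with the precomputed column in place of raw log_time
ALTER TABLE campaign_logs
  ADD INDEX campid_domain_logtype_loghour_subid_index
      (campaign_id, domain, log_type, log_time_hour, subscriber_id);
```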

In the query, use the new column in both WHERE and GROUP BY. Make sure that you include all needed columns in the GROUP BY: both log_type and log_time_hour.


Do you need both COUNT and COUNT(DISTINCT)? Keep only COUNT first and measure the performance. Then keep only COUNT(DISTINCT) and measure. Then keep both and measure. See how they compare.


SELECT log_type,
       log_time_hour,
       count(DISTINCT subscriber_id) AS distinct_total,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type = 'EMAIL_OPENED'
  AND log_time_hour >= '2015-02-01 00:00:00' 
  AND log_time_hour <  '2015-03-02 00:00:00'
GROUP BY log_type, log_time_hour

#3 (1 vote)

SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       count(DISTINCT subscriber_id) AS COUNT,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_type, log_date

Add AND log_type IN ('EMAIL_OPENED', 'EMAIL_SENT', 'EMAIL_CLICKED') if needed.


#4 (1 vote)

I would try other orderings of the index you're using, moving subscriber_id around, and see what the effect is. You may get better results by moving higher-cardinality columns up.


At first, I thought it might only be using part of the index (not reaching subscriber_id at all). If it can't use subscriber_id, then moving it up the index tree will make the query run slower, which will at least tell you it can't use it.


I can't think of much else you can play with.


#5 (1 vote)

  1. subscriber_id is of no use in your key, because you group by a calculated field outside the key (log_date) before counting distinct subscribers. That explains why this is so slow: MySQL has to sort and de-duplicate subscribers without using the key.


  2. There might be an error in your log_time condition: you should use the opposite timezone conversion to the one in your SELECT (i.e. '+05:30','+00:00'), but it won't have any major impact on your query time.


  3. You can avoid the UNION ALL by using log_type IN (...) and grouping by log_type, log_date.


The most effective solution would be to add a mid-hour field to your database schema, storing one of the 48 half-hours of the day (and taking care of the half-hour timezone shift). Then you could use an index on (campaign_id, domain, log_type, log_mid_hour, subscriber_id).


This is somewhat redundant, but it will improve the speed.


This requires some initialization in your table. Be careful: do not test this on your production table.


ALTER TABLE campaign_logs
   ADD COLUMN log_mid_hour TINYINT AFTER log_time;

UPDATE campaign_logs SET log_mid_hour=2*HOUR(log_time)+IF(MINUTE(log_time)>29,1,0);

ALTER TABLE campaign_logs
ADD INDEX(`campaign_id`,`domain`,`log_type`,`log_mid_hour`,`subscriber_id`);

You'll also have to set the log_mid_hour in your script for future records.

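The half-hour bucket formula used in the UPDATE above can be sanity-checked outside MySQL; here is a small Python sketch of the same arithmetic (the helper name mid_hour is ours):

```python
from datetime import datetime

def mid_hour(ts: datetime) -> int:
    """Map a timestamp to one of 48 half-hour buckets (0-47),
    mirroring 2*HOUR(log_time) + IF(MINUTE(log_time) > 29, 1, 0)."""
    return 2 * ts.hour + (1 if ts.minute > 29 else 0)

print(mid_hour(datetime(2015, 2, 1, 0, 15)))   # hour 0, first half -> 0
print(mid_hour(datetime(2015, 2, 1, 13, 45)))  # hour 13, second half -> 27
print(mid_hour(datetime(2015, 2, 1, 23, 59)))  # last bucket of the day -> 47
```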

Your query would then become (for a +05:30 offset, i.e. an 11 half-hour shift):


SELECT log_type,
   MOD(log_mid_hour+11, 48) tz_log_mid_hour,
   COUNT(DISTINCT subscriber_id) AS COUNT,
   COUNT(subscriber_id) AS total
FROM stats.campaign_logs
WHERE DOMAIN='xxx'
   AND campaign_id='123'
   AND log_type IN('EMAIL_SENT', 'EMAIL_OPENED','EMAIL_CLICKED')
   AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+05:30','+00:00')   
   AND CONVERT_TZ('2015-03-01 23:59:58','+05:30','+00:00')
GROUP BY log_type, log_mid_hour;

This will give you the counts for each half-hour, taking full advantage of your index.


#6 (0 votes)

I had a very similar problem, posted here on SO, and got some great help. Here's the thread: MySQL MyISAM slow count() query despite covering index


In a nutshell, I discovered that my problem had NOTHING to do with the query or the indexes, and EVERYTHING to do with how I had set up the tables and MySQL. The exact same query became much faster when I:


  1. Switched to InnoDB (which you're already using).

  2. Switched the CHARSET to ASCII. If you don't need utf8, it takes 3x as much space (and search time).

  3. Made each column as small as possible, and NOT NULL where possible.

  4. Increased MySQL's InnoDB buffer pool size. A common recommendation is 70% of your RAM on a dedicated machine.

  5. Sorted the table by the covering index, wrote it out via SELECT ... INTO OUTFILE, and re-inserted it into a new table. This physically sorts all the records in search order.
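The buffer-pool tip above is set in my.cnf; a hypothetical fragment (the value is a placeholder, tune it to roughly 70% of RAM on a dedicated server):

```ini
# Hypothetical my.cnf fragment; the size shown is a placeholder, not a recommendation
[mysqld]
innodb_buffer_pool_size = 12G
```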

I have no idea which of these changes fixed my problem (I wasn't scientific about it and didn't try them one at a time), but it made my queries 50-100x faster. YMMV.

