How to optimise this MySQL query? Millions of rows

Date: 2021-03-10 09:15:49

I have the following query:

SELECT 
    analytics.source AS referrer, 
    COUNT(analytics.id) AS frequency, 
    SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094 
GROUP BY analytics.source 
ORDER BY frequency DESC 
LIMIT 10 

The analytics table has 60M rows and the transactions table has 3M rows.

When I run an EXPLAIN on this query, I get:

+----+-------------+--------------+------+-------------------------------------+-------------------+---------+------------------------+--------+----------------------------------------------+
| id | select_type | table        | type | possible_keys                       | key               | key_len | ref                    | rows   | Extra                                        |
+----+-------------+--------------+------+-------------------------------------+-------------------+---------+------------------------+--------+----------------------------------------------+
|  1 | SIMPLE      | analytics    | ref  | analytics_user_id,analytics_source  | analytics_user_id | 5       | const                  | 337662 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | transactions | ref  | tran_analytics                      | tran_analytics    | 5       | dijishop2.analytics.id | 1      | NULL                                         |
+----+-------------+--------------+------+-------------------------------------+-------------------+---------+------------------------+--------+----------------------------------------------+

I can't figure out how to optimise this query, as it's already very basic. It takes around 70 seconds to run.

Here are the indexes that exist:

+-------------+-------------+----------------------------+---------------+------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
|   # Table   |  Non_unique |          Key_name          |  Seq_in_index |    Column_name   |  Collation |  Cardinality |  Sub_part |  Packed |  Null  |  Index_type |  Comment |  Index_comment |
+-------------+-------------+----------------------------+---------------+------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
| 'analytics' |  '0'        |  'PRIMARY'                 |  '1'          |  'id'            |  'A'       |  '56934235'  |  NULL     |  NULL   |  ''    |  'BTREE'    |  ''      |  ''            |
| 'analytics' |  '1'        |  'analytics_user_id'       |  '1'          |  'user_id'       |  'A'       |  '130583'    |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            |
| 'analytics' |  '1'        |  'analytics_product_id'    |  '1'          |  'product_id'    |  'A'       |  '490812'    |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            |
| 'analytics' |  '1'        |  'analytics_affil_user_id' |  '1'          |  'affil_user_id' |  'A'       |  '55222'     |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            |
| 'analytics' |  '1'        |  'analytics_source'        |  '1'          |  'source'        |  'A'       |  '24604'     |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            |
| 'analytics' |  '1'        |  'analytics_country_name'  |  '1'          |  'country_name'  |  'A'       |  '39510'     |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            |
| 'analytics' |  '1'        |  'analytics_gordon'        |  '1'          |  'id'            |  'A'       |  '56934235'  |  NULL     |  NULL   |  ''    |  'BTREE'    |  ''      |  ''            |
| 'analytics' |  '1'        |  'analytics_gordon'        |  '2'          |  'user_id'       |  'A'       |  '56934235'  |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            |
| 'analytics' |  '1'        |  'analytics_gordon'        |  '3'          |  'source'        |  'A'       |  '56934235'  |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            |
+-------------+-------------+----------------------------+---------------+------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+


+----------------+-------------+-------------------+---------------+-------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
|    # Table     |  Non_unique |      Key_name     |  Seq_in_index |    Column_name    |  Collation |  Cardinality |  Sub_part |  Packed |  Null  |  Index_type |  Comment |  Index_comment |
+----------------+-------------+-------------------+---------------+-------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+
| 'transactions' |  '0'        |  'PRIMARY'        |  '1'          |  'id'             |  'A'       |  '2436151'   |  NULL     |  NULL   |  ''    |  'BTREE'    |  ''      |  ''            |
| 'transactions' |  '1'        |  'tran_user_id'   |  '1'          |  'user_id'        |  'A'       |  '56654'     |  NULL     |  NULL   |  ''    |  'BTREE'    |  ''      |  ''            |
| 'transactions' |  '1'        |  'transaction_id' |  '1'          |  'transaction_id' |  'A'       |  '2436151'   |  '191'    |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            |
| 'transactions' |  '1'        |  'tran_analytics' |  '1'          |  'analytics'      |  'A'       |  '2436151'   |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            |
| 'transactions' |  '1'        |  'tran_status'    |  '1'          |  'status'         |  'A'       |  '22'        |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            |
| 'transactions' |  '1'        |  'gordon_trans'   |  '1'          |  'status'         |  'A'       |  '22'        |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            |
| 'transactions' |  '1'        |  'gordon_trans'   |  '2'          |  'analytics'      |  'A'       |  '2436151'   |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            |
+----------------+-------------+-------------------+---------------+-------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+

Here is the simplified schema for the two tables, from before adding the extra indexes that were suggested (they didn't improve the situation).

CREATE TABLE `analytics` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) DEFAULT NULL,
  `affil_user_id` int(11) DEFAULT NULL,
  `product_id` int(11) DEFAULT NULL,
  `medium` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `source` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `terms` varchar(1024) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `is_browser` tinyint(1) DEFAULT NULL,
  `is_mobile` tinyint(1) DEFAULT NULL,
  `is_robot` tinyint(1) DEFAULT NULL,
  `browser` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `mobile` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `robot` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `platform` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `referrer` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `domain` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `ip` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `continent_code` varchar(10) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `country_name` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `city` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `analytics_user_id` (`user_id`),
  KEY `analytics_product_id` (`product_id`),
  KEY `analytics_affil_user_id` (`affil_user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=64821325 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

CREATE TABLE `transactions` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `transaction_id` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `user_id` int(11) NOT NULL,
  `pay_key` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `sender_email` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `amount` decimal(10,2) DEFAULT NULL,
  `currency` varchar(10) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `status` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `analytics` int(11) DEFAULT NULL,
  `ip_address` varchar(46) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `session_id` varchar(60) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `eu_vat_applied` int(1) DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `tran_user_id` (`user_id`),
  KEY `transaction_id` (`transaction_id`(191)),
  KEY `tran_analytics` (`analytics`),
  KEY `tran_status` (`status`)
) ENGINE=InnoDB AUTO_INCREMENT=10019356 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

If the above cannot be optimised any further, any implementation advice on summary tables would be great. We are using a LAMP stack on AWS; the above query is running on RDS (m1.large).

13 Answers

#1


10  

I would create the following indexes (b-tree indexes):

analytics(user_id, source, id) 
transactions(analytics, status)

This is different from Gordon's suggestion.

The order of columns in the index is important.

You filter by a specific analytics.user_id, so this field has to be first in the index. Then you group by analytics.source; to avoid sorting by source, this should be the next field of the index. You also reference analytics.id, so it is better to have this field as part of the index as well, so put it last. Is MySQL capable of reading just the index without touching the table? I don't know, but it is rather easy to test.

The index on transactions has to start with analytics, because it will be used in the JOIN. We also need status.

SELECT 
    analytics.source AS referrer, 
    COUNT(analytics.id) AS frequency, 
    SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094 
GROUP BY analytics.source 
ORDER BY frequency DESC 
LIMIT 10 
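
Applied to the schema above, the suggested indexes could be created like this (the index names are illustrative, not from the original post):

```sql
ALTER TABLE analytics
    ADD INDEX analytics_user_source_id (user_id, source, id);

ALTER TABLE transactions
    ADD INDEX tran_analytics_status (analytics, status);
```

If MySQL can serve the query from the first index alone, EXPLAIN should show "Using index" in the Extra column for the analytics table.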

#2


6  

For this query:

SELECT a.source AS referrer, 
       COUNT(*) AS frequency, 
       SUM( t.status = 'COMPLETED' ) AS sales
FROM analytics a LEFT JOIN
     transactions t
     ON a.id = t.analytics
WHERE a.user_id = 52094 
GROUP BY a.source 
ORDER BY frequency DESC 
LIMIT 10 ;

You want an index on analytics(user_id, id, source) and transactions(analytics, status).

#3


6  

First some analysis...

SELECT  a.source AS referrer,
        COUNT(*) AS frequency,  -- See question below
        SUM(t.status = 'COMPLETED') AS sales
    FROM  analytics AS a
    LEFT JOIN  transactions AS t  ON a.id = t.analytics
    WHERE  a.user_id = 52094
    GROUP BY  a.source
    ORDER BY  frequency DESC
    LIMIT  10 

If the mapping from a to t is "one-to-many", then you need to consider whether the COUNT and SUM have the correct values or inflated values. As the query stands, they are "inflated". The JOIN occurs before the aggregation, so you are counting the number of transactions and how many were completed. I'll assume that is desired.

Note: The usual pattern is COUNT(*); saying COUNT(x) implies checking x for being NULL. I suspect that check is not needed?

This index handles the WHERE and is "covering":

 analytics:  INDEX(user_id, source, id)   -- user_id first

 transactions:  INDEX(analytics, status)  -- in this order

The GROUP BY may or may not require a 'sort'. The ORDER BY, being different than the GROUP BY, definitely will need a sort. And the entire grouped set of rows will need to be sorted; there is no shortcut for the LIMIT.

Normally, summary tables are date-oriented. That is, the PRIMARY KEY includes a 'date' and some other dimensions. Perhaps keying by date and user_id would make sense? How many transactions per day does the average user have? If at least 10, then let's consider a summary table. Also, it is important not to be UPDATEing or DELETEing old records.

I would probably have

user_id ...,
source ...,
dy DATE ...,
status ...,
freq      MEDIUMINT UNSIGNED NOT NULL,
status_ct MEDIUMINT UNSIGNED NOT NULL,
PRIMARY KEY(user_id, status, source, dy)
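
Filled out, such a summary table might look like the sketch below; the types for source and status are assumptions carried over from the original schema:

```sql
CREATE TABLE Summary (
  user_id   INT NOT NULL,
  source    VARCHAR(45) NOT NULL,
  dy        DATE NOT NULL,
  status    VARCHAR(50) NOT NULL,
  freq      MEDIUMINT UNSIGNED NOT NULL,
  status_ct MEDIUMINT UNSIGNED NOT NULL,
  PRIMARY KEY (user_id, status, source, dy)
) ENGINE=InnoDB;
```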

Then the query becomes

SELECT  source AS referrer,
        SUM(freq) AS frequency,
        SUM(status_ct) AS completed_sales
    FROM  Summary
    WHERE  user_id = 52094
      AND  status = 'COMPLETED'
    GROUP BY source
    ORDER BY  frequency DESC
    LIMIT  10 
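
One way to maintain such a table is a nightly job that (re)aggregates the previous day. This is only a sketch: it assumes a day's rows are final once the day is over, and it folds the NULL statuses produced by the LEFT JOIN into a placeholder so they can live in the PRIMARY KEY:

```sql
INSERT INTO Summary (user_id, source, dy, status, freq, status_ct)
SELECT  a.user_id, a.source, DATE(a.date),
        IFNULL(t.status, 'NONE'),              -- PRIMARY KEY columns cannot be NULL
        COUNT(*),
        IFNULL(SUM(t.status = 'COMPLETED'), 0)
    FROM  analytics AS a
    LEFT JOIN  transactions AS t  ON a.id = t.analytics
    WHERE  a.date >= CURDATE() - INTERVAL 1 DAY
      AND  a.date <  CURDATE()
    GROUP BY  a.user_id, a.source, DATE(a.date), t.status
ON DUPLICATE KEY UPDATE
    freq = VALUES(freq),
    status_ct = VALUES(status_ct);
```

(The VALUES() form is the traditional pre-8.0.20 syntax; newer MySQL versions prefer a row alias.)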

The speed comes from many factors

  • Smaller table (fewer rows to look at)
  • No JOIN
  • More useful index

(It still needs the extra sort.)

Even without the summary table, there may be some speedups...

  • How big are the tables? How big is `innodb_buffer_pool_size`?
  • Normalizing some of the strings that are both bulky and repetitive could make that table not I/O-bound.
  • This is awful: KEY (transaction_id(191)); see here for 5 ways to fix it.
  • IP addresses do not need 255 bytes, nor utf8mb4_unicode_ci; VARCHAR(39) and ascii are sufficient.
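
As a sketch of the last point, the ip column could be shrunk like this (39 characters is enough for a textual IPv6 address; check the existing data first):

```sql
ALTER TABLE analytics
    MODIFY ip VARCHAR(39) CHARACTER SET ascii DEFAULT NULL;
```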

#4


4  

Try the query below and let me know if it helps.

SELECT 
    analytics.source AS referrer, 
    COUNT(analytics.id) AS frequency, 
    SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM (SELECT id, source FROM analytics WHERE user_id = 52094) analytics
LEFT JOIN (SELECT analytics, status FROM transactions) transactions ON analytics.id = transactions.analytics
GROUP BY analytics.source 
ORDER BY frequency DESC 
LIMIT 10

#5


3  

Could you try the approach below:

SELECT 
    analytics.source AS referrer, 
    COUNT(analytics.id) AS frequency, 
    SUM(sales) AS sales
FROM analytics
LEFT JOIN (
    SELECT transactions.analytics, (CASE WHEN transactions.status = 'COMPLETED' THEN 1 ELSE 0 END) AS sales
    FROM analytics INNER JOIN transactions ON analytics.id = transactions.analytics
) Tra
ON analytics.id = Tra.analytics
WHERE analytics.user_id = 52094 
GROUP BY analytics.source 
ORDER BY frequency DESC 
LIMIT 10 

#6


3  

This query potentially joins millions of analytics records with transactions records and calculates the sum (including the status check) on millions of records. If we could first apply the LIMIT 10 and then do the join and calculate the sum, we could speed up the query. Unfortunately, we need the analytics.id for the join, which gets lost after applying the GROUP BY. But maybe analytics.source is selective enough to boost the query anyway.

My idea is therefore to calculate the frequencies and limit by them in a subquery that returns analytics.source and frequency, and to use this result to filter analytics in the main query, which then does the rest of the joins and calculations on a hopefully much-reduced number of records.

Minimal subquery (note: no join, no sum, returns 10 records):

SELECT
    source,
    COUNT(id) AS frequency
FROM analytics
WHERE user_id = 52094
GROUP BY source
ORDER BY frequency DESC 
LIMIT 10

The full query using the above query as subquery x:

SELECT
    x.source AS referrer,
    x.frequency,
    SUM(IF(t.status = 'COMPLETED', 1, 0)) AS sales
FROM
    (<subquery here>) x
    INNER JOIN analytics a
       ON x.source = a.source  -- This reduces the number of records
    LEFT JOIN transactions t
       ON a.id = t.analytics
WHERE a.user_id = 52094      -- We could have several users per source
GROUP BY x.source, x.frequency
ORDER BY x.frequency DESC

If this does not yield the expected performance boost, it could be because MySQL applies the joins in an unexpected order. As explained in "Is there a way to force MySQL execution order?", you can replace the join with STRAIGHT_JOIN in that case.
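
Putting the pieces together, the full query with the subquery inlined and STRAIGHT_JOIN applied might look like this (a sketch of the same idea, not a tested rewrite):

```sql
SELECT
    x.source AS referrer,
    x.frequency,
    SUM(IF(t.status = 'COMPLETED', 1, 0)) AS sales
FROM
    (SELECT source, COUNT(id) AS frequency
     FROM analytics
     WHERE user_id = 52094
     GROUP BY source
     ORDER BY frequency DESC
     LIMIT 10) x
    STRAIGHT_JOIN analytics a      -- forces x to be read first
       ON x.source = a.source
    LEFT JOIN transactions t
       ON a.id = t.analytics
WHERE a.user_id = 52094
GROUP BY x.source, x.frequency
ORDER BY x.frequency DESC
```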

#7


2  

The only problem I find in your query is

GROUP BY analytics.source 
ORDER BY frequency DESC 

Because of this, the query is doing a filesort using a temporary table.

One way to avoid this is by creating another table, like:

CREATE TABLE `analytics_aggr` (
  `source` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `frequency` int(10) DEFAULT NULL,
  `sales` int(10) DEFAULT NULL,
  KEY `sales` (`sales`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Insert data into analytics_aggr using the query below:

insert into analytics_aggr SELECT 
    analytics.source AS referrer, 
    COUNT(analytics.id) AS frequency, 
    SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
    FROM analytics
    LEFT JOIN transactions ON analytics.id = transactions.analytics
    WHERE analytics.user_id = 52094 
    GROUP BY analytics.source 
    ORDER BY null 

Now you can easily get your data using:

select * from analytics_aggr order by sales desc

#8


2  

Try this

SELECT 
    a.source AS referrer, 
    COUNT(a.id) AS frequency, 
    SUM(t.sales) AS sales
FROM (Select id, source From analytics Where user_id = 52094) a
LEFT JOIN (Select analytics, case when status = 'COMPLETED' Then 1 else 0 end as sales 
           From transactions) t ON a.id = t.analytics
GROUP BY a.source 
ORDER BY frequency DESC 
LIMIT 10 

I'm proposing this because you said "they are massive tables", but this SQL uses only a few columns. If we use inline views with only the required columns, it should perform better.

Note: memory will also play an important role here, so check the available memory before deciding on the inline-view approach.

#9


2  

I would try to separate the querying of the two tables. Since you need only the top 10 sources, I would get them first and then query the sales column from transactions:

SELECT  source as referrer
        ,frequency
        ,(select count(*) 
          from   transactions t  
          where  t.analytics in (select distinct id 
                                 from   analytics 
                                 where  user_id = 52094
                                        and source = by_frequency.source) 
                 and status = 'completed'
         ) as sales
from    (SELECT analytics.source
                ,count(*) as frequency
        from    analytics 
        where   analytics.user_id = 52094
        group by analytics.source
        order by frequency desc
        limit 10
        ) by_frequency

It may also be faster without the DISTINCT.

#10


2  

I would try subquery:

SELECT a.source AS referrer, 
       COUNT(*) AS frequency,
       SUM((SELECT COUNT(*) FROM transactions t 
        WHERE a.id = t.analytics AND t.status = 'COMPLETED')) AS sales
FROM analytics a
WHERE a.user_id = 52094 
GROUP BY a.source
ORDER BY frequency DESC 
LIMIT 10; 

Plus indexes exactly as in @Gordon's answer: analytics(user_id, id, source) and transactions(analytics, status).

#11


2  

I am assuming the predicate user_id = 52094 is for illustration purposes and that, in the application, the selected user_id is a variable.

I also assume that ACID properties are not very important here.

(1) Therefore, I will maintain two replica tables with only the necessary fields (similar to the indexes Vladimir suggested above), plus a utility table.

CREATE TABLE mv_anal (
  `id` int(11) NOT NULL,
  `user_id` int(11) DEFAULT NULL,
  `source` varchar(45),
  PRIMARY KEY (`id`)
);

CREATE TABLE mv_trans (
  `id` int(11) NOT NULL,
  `status` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `analytics` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

CREATE TABLE util (
  last_updated_anal int (11) NOT NULL,
  last_updated_trans int (11) NOT NULL
);

INSERT INTO util VALUES (0, 0);

The gain here is that we will be reading relatively small projections of the original tables; hopefully, OS-level and DB-level caches work, and the data is read not from slower secondary storage but from faster RAM. This can be a big win.

Here is how I update the two tables (the below is a transaction run by a cron job):

-- TRANSACTION STARTS -- 

INSERT INTO mv_trans 
SELECT id, IF(status = 'COMPLETED', 1, 0) AS status, analytics 
FROM transactions JOIN util
ON util.last_updated_trans < transactions.id;

UPDATE util
SET last_updated_trans = (SELECT MAX(id) FROM mv_trans);

-- TRANSACTION COMMITS -- 

-- similar transaction for mv_anal.

(2) Now, I will tackle the selectivity to reduce sequential scan time. I will have to build a b-tree index on user_id, source and id (in this sequence) on mv_anal.

Note: the above can be achieved by just creating an index on the analytics table, but building such an index requires reading the big table with 60M rows. My method requires the index build to read only a very thin table. Thus, we can rebuild the b-tree more frequently (to counter the skew problem, as the table is append-only).

This is how I make sure high selectivity is achieved when querying, and how I counter the skewed-b-tree problem.

(3) In PostgreSQL, WITH subqueries are always materialized. I hope the same holds for MySQL. Therefore, as the last mile of optimization:

WITH sub_anal AS (
  SELECT user_id, source AS referrer, COUNT(id) AS frequency
  FROM mv_anal
  WHERE user_id = 52094
  GROUP BY user_id, source
  ORDER BY COUNT(id) DESC
  LIMIT 10
)
SELECT sa.referrer, sa.frequency, SUM(status) AS sales
FROM sub_anal AS sa 
JOIN mv_anal anal 
ON sa.referrer = anal.source AND sa.user_id = anal.user_id
JOIN mv_trans AS trans
ON anal.id = trans.analytics
GROUP BY sa.referrer, sa.frequency

#12


1  

Late to the party. I think you'll need to load one index into MySQL's cache. The NLJ is probably killing performance. Here's how I see it:

The Path

Your query is simple. It has two tables and the "path" is very clear:

  • The optimizer should plan on reading the analytics table first.
  • The optimizer should plan on reading the transactions table second. This is because you are using a LEFT OUTER JOIN. Not much discussion on this one.
  • Besides, the analytics table has 60 million rows, and the best path should filter rows as soon as possible on it.

The Access

Once the path is clear, you need to decide if you want to use an Index Access or a Table Access. Both have pros and cons. However, you want to improve the SELECT performance:

  • You should choose Index Access.
  • Avoid hybrid access. Therefore, you should avoid any Table Access (fetches) at all cost. Translation: place all the participating columns in indexes.

The Filtering

Again, you want high performance for the SELECT. Therefore:

  • You should perform the filtering at the index level, not at the table level.

Row Aggregation

After filtering, the next step is to aggregate rows by GROUP BY analytics.source. This can be improved by placing the source column as the first column in the index.

Optimal Indexes for Path, Access, Filtering, and Aggregation

Considering all the above, you should include all mentioned columns into indexes. The following indexes should improve the response time:

create index ix1_analytics on analytics (user_id, source, id);

create index ix2_transactions on transactions (analytics, status);

These indexes fulfill the "path", the "access", and the "filtering" strategies described above.

The Index Cache

Finally -- and this is critical -- load the secondary index into MySQL's memory cache. MySQL is performing a NLJ (Nested Loop Join) -- a 'ref' in MySQL lingo -- and needs to access the second one randomly nearly 200k times.

Unfortunately, I don't know for sure how to load the index into MySQL's cache. The use of FORCE may work, as in:

SELECT 
    analytics.source AS referrer, 
    COUNT(analytics.id) AS frequency, 
    SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions FORCE INDEX (ix2_transactions)
  ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094 
GROUP BY analytics.source 
ORDER BY frequency DESC 
LIMIT 10

Make sure you have enough cache space. Here's a short question/answer to help figure that out: "How to figure out if mysql index fits entirely in memory".
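
One way to check whether the index could fit is to compare its size against the buffer pool, using InnoDB's persistent statistics tables (available since MySQL 5.6); 'your_db' is a placeholder for the actual schema name:

```sql
-- Index sizes in MB (stat_value is a page count)
SELECT index_name,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024) AS size_mb
FROM mysql.innodb_index_stats
WHERE database_name = 'your_db'
  AND table_name = 'transactions'
  AND stat_name = 'size';

-- Buffer pool size in MB, for comparison
SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;
```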

Good luck! Oh, and post the results.

#13


1  

This question has definitely received a lot of attention so I'm sure all obvious solutions have been tried. I did not see something that addresses the LEFT JOIN in the query, though.

I have noticed that LEFT JOIN statements usually force query planners into hash joins, which are fast for a small number of results but terribly slow for a large number of results. As noted in @Rick James' answer, since the join in the original query is on the identity field analytics.id, it will generate a large number of results. A hash join would yield terrible performance. The suggestion below addresses this without any schema or processing changes.

Since the aggregation is by analytics.source, I would try a query that creates separate aggregations for frequency by source and sales by source and defers the left join until after aggregation is complete. This should allow the indexes to be used best (typically this is a merge join for large data sets).

Here is my suggestion:

SELECT t1.source AS referrer, t1.frequency, t2.sales
FROM (
  -- Frequency by source
  SELECT a.source, COUNT(a.id) AS frequency
  FROM analytics a
  WHERE a.user_id=52094
  GROUP BY a.source
) t1
LEFT JOIN (
  -- Sales by source
  SELECT a.source,
    SUM(IF(t.status = 'COMPLETED', 1, 0)) AS sales
  FROM analytics a
  JOIN transactions t
    ON a.id = t.analytics
  WHERE t.status = 'COMPLETED'
    AND a.user_id = 52094
  GROUP by a.source
) t2
  ON t1.source = t2.source
ORDER BY frequency DESC 
LIMIT 10 

Hope this helps.

#1


10  

I would create the following indexes (b-tree indexes):

我会创建以下索引(b树索引):

analytics(user_id, source, id) 
transactions(analytics, status)

This is different from Gordon's suggestion.

这与戈登的建议不同。

The order of columns in the index is important.

索引中列的顺序很重要。

You filter by specific analytics.user_id, so this field has to be the first in the index. Then you group by analytics.source. To avoid sorting by source this should be the next field of the index. You also reference analytics.id, so it is better to have this field as part of the index, put it last. Is MySQL capable of reading just the index and not touching the table? I don't know, but it is rather easy to test.

您可以按特定的analytics.user_id进行过滤,因此该字段必须是索引中的第一个。然后按analytics.source分组。为避免按源排序,这应该是索引的下一个字段。您还可以参考analytics.id,因此最好将此字段作为索引的一部分,将其放在最后。 MySQL是否能够只读取索引并且不能触及表格?我不知道,但测试相当容易。

Index on transactions has to start with analytics, because it would be used in the JOIN. We also need status.

事务索引必须从分析开始,因为它将在JOIN中使用。我们还需要地位。

SELECT 
    analytics.source AS referrer, 
    COUNT(analytics.id) AS frequency, 
    SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094 
GROUP BY analytics.source 
ORDER BY frequency DESC 
LIMIT 10 

#2


6  

For this query:

对于此查询:

SELECT a.source AS referrer, 
       COUNT(*) AS frequency, 
       SUM( t.status = 'COMPLETED' ) AS sales
FROM analytics a LEFT JOIN
     transactions t
     ON a.id = t.analytics
WHERE a.user_id = 52094 
GROUP BY a.source 
ORDER BY frequency DESC 
LIMIT 10 ;

You want an index on analytics(user_id, id, source) and transactions(analytics, status).

您需要有关分析(user_id,id,source)和事务(分析,状态)的索引。

#3


6  

First some analysis...

首先是一些分析......

SELECT  a.source AS referrer,
        COUNT(*) AS frequency,  -- See question below
        SUM(t.status = 'COMPLETED') AS sales
    FROM  analytics AS a
    LEFT JOIN  transactions AS t  ON a.id = t.analytics AS a
    WHERE  a.user_id = 52094
    GROUP BY  a.source
    ORDER BY  frequency DESC
    LIMIT  10 

If the mapping from a to t is "one-to-many", then you need to consider whether the COUNT and SUM have the correct values or inflated values. As the query stands, they are "inflated". The JOIN occurs before the aggregation, so you are counting the number of transactions and how many were completed. I'll assume that is desired.

如果从a到t的映射是“一对多”,那么您需要考虑COUNT和SUM是否具有正确的值或膨胀值。在查询中,它们是“膨胀的”。 JOIN发生在聚合之前,因此您可以计算事务数和已完成的事务数。我认为这是理想的。

Note: The usual pattern is COUNT(*); saying COUNT(x) implies checking x for being NULL. I suspect that check is not needed?

注意:通常的模式是COUNT(*);说COUNT(x)意味着检查x是否为NULL。我怀疑不需要支票?

This index handles the WHERE and is "covering":

该索引处理WHERE并且是“覆盖”:

 analytics:  INDEX(user_id, source, id)   -- user_id first

 transactions:  INDEX(analytics, status)  -- in this order

The GROUP BY may or may not require a 'sort'. The ORDER BY, being different than the GROUP BY, definitely will need a sort. And the entire grouped set of rows will need to be sorted; there is no shortcut for the LIMIT.


Normally, Summary tables are date-oriented. That is, the PRIMARY KEY includes a 'date' and some other dimensions. Perhaps, keying by date and user_id would make sense? How many transactions per day does the average user have? If at least 10, then let's consider a Summary table. Also, it is important not to be UPDATEing or DELETEing old records. More


I would probably have


user_id ...,
source ...,
dy DATE ...,
status ...,
freq      MEDIUMINT UNSIGNED NOT NULL,
status_ct MEDIUMINT UNSIGNED NOT NULL,
PRIMARY KEY(user_id, status, source, dy)
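A hedged sketch of maintaining such a Summary table with a periodic roll-up; the `created_at` column on analytics is an assumption, so adjust to the real schema:

```sql
-- Assumption: analytics has a created_at date/datetime column.
-- NULL status (no matching transaction) is mapped to 'NONE' so the PK holds.
INSERT INTO Summary (user_id, source, dy, status, freq, status_ct)
SELECT a.user_id,
       a.source,
       DATE(a.created_at),
       IFNULL(t.status, 'NONE'),
       COUNT(*),
       SUM(IF(t.status = 'COMPLETED', 1, 0))
FROM analytics a
LEFT JOIN transactions t ON a.id = t.analytics
WHERE a.created_at >= CURDATE() - INTERVAL 1 DAY
GROUP BY a.user_id, a.source, DATE(a.created_at), IFNULL(t.status, 'NONE')
ON DUPLICATE KEY UPDATE
    freq      = freq + VALUES(freq),
    status_ct = status_ct + VALUES(status_ct);
```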

Then the query becomes


SELECT  source AS referrer,
        SUM(freq) AS frequency,
        SUM(status_ct) AS completed_sales
    FROM  Summary
    WHERE  user_id = 52094
      AND  status = 'COMPLETED'
    GROUP BY source
    ORDER BY  frequency DESC
    LIMIT  10 

The speed comes from many factors


  • Smaller table (fewer rows to look at)
  • No JOIN
  • More useful index

(It still needs the extra sort.)


Even without the summary table, there may be some speedups...


  • How big are the tables? How big is `innodb_buffer_pool_size`?
  • Normalizing some of the strings that are both bulky and repetitive could make that table not I/O-bound.
  • This is awful: KEY (transaction_id(191)); See here for 5 ways to fix it.
  • IP addresses do not need 255 bytes, nor utf8mb4_unicode_ci. VARCHAR(39) and ascii are sufficient.
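For example, shrinking an IP column might look like this (the `ip` column name is an assumption):

```sql
-- 39 characters fits the longest textual IPv6 address; ascii avoids
-- the per-character overhead of utf8mb4.
ALTER TABLE analytics
    MODIFY ip VARCHAR(39) CHARACTER SET ascii;
```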

#4


4  

Try below and let me know if this helps.


SELECT 
    analytics.source AS referrer, 
    COUNT(analytics.id) AS frequency, 
    SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM (SELECT id, source FROM analytics WHERE user_id = 52094) analytics
LEFT JOIN (SELECT analytics, status FROM transactions) transactions ON analytics.id = transactions.analytics
GROUP BY analytics.source 
ORDER BY frequency DESC 
LIMIT 10

#5


3  

Could you try below Approach:


SELECT 
    analytics.source AS referrer, 
    COUNT(analytics.id) AS frequency, 
    SUM(sales) AS sales
FROM analytics
LEFT JOIN (
    SELECT transactions.analytics, (CASE WHEN transactions.status = 'COMPLETED' THEN 1 ELSE 0 END) AS sales
    FROM analytics INNER JOIN transactions ON analytics.id = transactions.analytics
) Tra
ON analytics.id = Tra.analytics
WHERE analytics.user_id = 52094 
GROUP BY analytics.source 
ORDER BY frequency DESC 
LIMIT 10 

#6


3  

This query potentially joins millions of analytics records with transactions records and calculates the sum (including the status check) on millions of records. If we could first apply the LIMIT 10 and then do the join and calculate the sum, we could speed up the query. Unfortunately, we need the analytics.id for the join, which gets lost after applying the GROUP BY. But maybe analytics.source is selective enough to boost the query anyway.


My Idea is therefore to calculate the frequencies, limit by them, to return the analytics.source and frequency in a subquery and to use this result to filter the analytics in the main query, which then does the rest of the joins and calculations on a hopefully much reduced number of records.


Minimal subquery (note: no join, no sum, returns 10 records):


SELECT
    source,
    COUNT(id) AS frequency
FROM analytics
WHERE user_id = 52094
GROUP BY source
ORDER BY frequency DESC 
LIMIT 10

The full query using the above query as subquery x:


SELECT
    x.source AS referrer,
    x.frequency,
    SUM(IF(t.status = 'COMPLETED', 1, 0)) AS sales
FROM
    (<subquery here>) x
    INNER JOIN analytics a
       ON x.source = a.source  -- This reduces the number of records
    LEFT JOIN transactions t
       ON a.id = t.analytics
WHERE a.user_id = 52094      -- We could have several users per source
GROUP BY x.source, x.frequency
ORDER BY x.frequency DESC
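For reference, a sketch of the assembled statement with the subquery inlined:

```sql
SELECT x.source AS referrer,
       x.frequency,
       SUM(IF(t.status = 'COMPLETED', 1, 0)) AS sales
FROM (
    SELECT source, COUNT(id) AS frequency
    FROM analytics
    WHERE user_id = 52094
    GROUP BY source
    ORDER BY frequency DESC
    LIMIT 10
) x
INNER JOIN analytics a ON x.source = a.source
LEFT JOIN transactions t ON a.id = t.analytics
WHERE a.user_id = 52094
GROUP BY x.source, x.frequency
ORDER BY x.frequency DESC;
```

If the optimizer still picks a poor join order, the INNER JOIN here is the one that could be swapped for STRAIGHT_JOIN.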

If this does not yield the expected performance boost, this could be due to MySQL applying the joins in an unexpected order. As explained here "Is there a way to force MySQL execution order?", you could replace the join by STRAIGHT_JOIN in this case.


#7


2  

Only Problem I find in your query is


GROUP BY analytics.source 
ORDER BY frequency DESC 

because of this, the query is doing a filesort using a temporary table.

One way to avoid this is by creating another table like


CREATE TABLE `analytics_aggr` (
  `source` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `frequency` int(10) DEFAULT NULL,
  `sales` int(10) DEFAULT NULL,
  KEY `sales` (`sales`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

insert data into analytics_aggr using below query


insert into analytics_aggr SELECT 
    analytics.source AS referrer, 
    COUNT(analytics.id) AS frequency, 
    SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
    FROM analytics
    LEFT JOIN transactions ON analytics.id = transactions.analytics
    WHERE analytics.user_id = 52094 
    GROUP BY analytics.source 
    ORDER BY null 

Now you can easily get your data using

select * from analytics_aggr order by sales desc

#8


2  

Try this


SELECT 
    a.source AS referrer, 
    COUNT(a.id) AS frequency, 
    SUM(t.sales) AS sales
FROM (Select id, source From analytics Where user_id = 52094) a
LEFT JOIN (Select analytics, case when status = 'COMPLETED' Then 1 else 0 end as sales 
           From transactions) t ON a.id = t.analytics
GROUP BY a.source 
ORDER BY frequency DESC 
LIMIT 10 

I'm proposing this because you said "they are massive tables", but this SQL uses only a few columns. If we use inline views with only the required columns, it should perform better.

Note: memory will also play an important role here, so check the available memory before deciding on the inline views.

#9


2  

I would try to separate querying from the two tables. Since you need only top 10 sources, I would get them first and then query from transactions the sales column:


SELECT  source as referrer
        ,frequency
        ,(select count(*) 
          from   transactions t  
          where  t.analytics in (select distinct id 
                                 from   analytics 
                                 where  user_id = 52094
                                        and source = by_frequency.source) 
                 and status = 'COMPLETED'
         ) as sales
from    (SELECT analytics.source
                ,count(*) as frequency
        from    analytics 
        where   analytics.user_id = 52094
        group by analytics.source
        order by frequency desc
        limit 10
        ) by_frequency

It may also be faster without the DISTINCT.

#10


2  

I would try subquery:


SELECT a.source AS referrer, 
       COUNT(*) AS frequency,
       SUM((SELECT COUNT(*) FROM transactions t 
        WHERE a.id = t.analytics AND t.status = 'COMPLETED')) AS sales
FROM analytics a
WHERE a.user_id = 52094 
GROUP BY a.source
ORDER BY frequency DESC 
LIMIT 10; 

Plus indexes exactly as in @Gordon's answer: analytics(user_id, id, source) and transactions(analytics, status).

#11


2  

I am assuming the predicate, user_id = 52094, is for illustration purpose and in application, the selected user_id is a variable.


I also assume that ACID property is not very important here.


(1) Therefore, I will maintain two replica tables with only the necessary fields (it is similar to the indices Vladimir had suggested above) using a utility table.


CREATE TABLE mv_anal (
  `id` int(11) NOT NULL,
  `user_id` int(11) DEFAULT NULL,
  `source` varchar(45),
  PRIMARY KEY (`id`)
);

CREATE TABLE mv_trans (
  `id` int(11) NOT NULL,
  `status` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `analytics` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
);

CREATE TABLE util (
  last_updated_anal int (11) NOT NULL,
  last_updated_trans int (11) NOT NULL
);

INSERT INTO util VALUES (0, 0);

The gain here is that we will be reading relatively small projections of the original tables -- hopefully, the OS-level and DB-level caches work, and the data is read not from slower secondary storage but from faster RAM. This can be a significant gain.

Here is how I updated the two tables (the below is a transaction run by a cron) :


-- TRANSACTION STARTS -- 

INSERT INTO mv_trans 
SELECT id, IF(status = 'COMPLETED', 1, 0) AS status, analytics 
FROM transactions JOIN util
ON util.last_updated_trans < transactions.id;

UPDATE util
SET last_updated_trans = (SELECT MAX(id) FROM mv_trans);

-- TRANSACTION COMMITS -- 

-- similar transaction for mv_anal.

(2) Now, I will tackle the selectivity to reduce sequential scan time. I will have to build a b-tree index on user_id, source and id (in this sequence) on mv_anal.


Note: the above can be achieved by just creating index on analytics table but building such an index requires reading big table with 60M rows. My method requires the index building to read only very thin table. Thus, we can rebuild the btree more frequently (to counter the skew problem as the table is append-only).

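In DDL form, the index from step (2) might be (the name is illustrative):

```sql
-- B-tree index on the thin replica table, in the stated column order.
CREATE INDEX ix_mv_anal_user_source_id ON mv_anal (user_id, source, id);
```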

This is how I make sure high selectivity is achieved when querying, and how I counter the b-tree skew problem.

(3) In PostgreSQL, WITH subqueries are always materialized. I hope the same holds for MySQL. Therefore, as the last mile of optimization:

WITH sub_anal AS (
  SELECT user_id, source AS referrer, COUNT (id) AS frequency
  FROM mv_anal
  WHERE user_id = 52094
  GROUP BY user_id, source
  ORDER BY COUNT (id) DESC
  LIMIT 10
)
SELECT sa.referrer, sa.frequency, SUM (status) AS sales
FROM sub_anal AS sa 
JOIN mv_anal anal 
ON sa.referrer = anal.source AND sa.user_id = anal.user_id
JOIN mv_trans AS trans
ON anal.id = trans.analytics

#12


1  

Late to the party. I think you'll need to load one index into MySQL's cache. The NLJ is probably killing performance. Here's how I see it:


The Path


Your query is simple. It has two tables and the "path" is very clear:


  • The optimizer should plan on reading the analytics table first.
  • The optimizer should plan on reading the transactions table second. This is because you are using a LEFT OUTER JOIN. Not much discussion on this one.
  • Besides, the analytics table is 60 million rows, and the best path should filter its rows as soon as possible.

The Access


Once the path is clear, you need to decide if you want to use an Index Access or a Table Access. Both have pros and cons. However, you want to improve the SELECT performance:

  • You should choose Index Access.
  • Avoid hybrid access. Therefore, you should avoid any Table Access (fetches) at all cost. Translation: place all the participating columns in indexes.

The Filtering


Again, you want high performance for the SELECT. Therefore:

  • You should perform the filtering at the index level, not at the table level.

Row Aggregation


After filtering, the next step is to aggregate rows by GROUP BY analytics.source. This can be improved by placing the source column as the first column in the index.


Optimal Indexes for Path, Access, Filtering, and Aggregation


Considering all the above, you should include all mentioned columns into indexes. The following indexes should improve the response time:


create index ix1_analytics on analytics (user_id, source, id);

create index ix2_transactions on transactions (analytics, status);

These indexes fulfill the "path", the "access", and the "filtering" strategies described above.

The Index Cache


Finally -- and this is critical -- load the secondary index into MySQL's memory cache. MySQL is performing an NLJ (Nested Loop Join) -- a 'ref' in MySQL lingo -- and needs to access the second index randomly nearly 200k times.

Unfortunately, I don't know for sure how to load the index into MySQL's cache. The use of FORCE may work, as in:


SELECT 
    analytics.source AS referrer, 
    COUNT(analytics.id) AS frequency, 
    SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions FORCE INDEX (ix2_transactions)
  ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094 
GROUP BY analytics.source 
ORDER BY frequency DESC 
LIMIT 10

Make sure you have enough cache space. Here's a short question/answer to help figure that out: How to figure out if mysql index fits entirely in memory
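One hedged way to estimate index sizes is MySQL's persistent statistics table (requires read access to the mysql schema; sizes are approximate):

```sql
-- Approximate on-disk size of each index, in MB.
SELECT database_name, table_name, index_name,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024) AS size_mb
FROM mysql.innodb_index_stats
WHERE stat_name = 'size'
  AND table_name IN ('analytics', 'transactions');
```

The resulting size_mb figures can then be compared against innodb_buffer_pool_size.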

Good luck! Oh, and post the results.


#13


1  

This question has definitely received a lot of attention so I'm sure all obvious solutions have been tried. I did not see something that addresses the LEFT JOIN in the query, though.


I have noticed that LEFT JOIN statements usually force query planners into a hash join, which is fast for a small number of results but terribly slow for a large number of results. As noted in @Rick James' answer, since the join in the original query is on the identity field analytics.id, this will generate a large number of results. A hash join will yield terrible performance. The suggestion below addresses this without any schema or processing changes.

Since the aggregation is by analytics.source, I would try a query that creates separate aggregations for frequency by source and sales by source and defers the left join until after aggregation is complete. This should allow the indexes to be used best (typically this is a merge join for large data sets).


Here is my suggestion:


SELECT t1.source AS referrer, t1.frequency, t2.sales
FROM (
  -- Frequency by source
  SELECT a.source, COUNT(a.id) AS frequency
  FROM analytics a
  WHERE a.user_id=52094
  GROUP BY a.source
) t1
LEFT JOIN (
  -- Sales by source
  SELECT a.source,
    SUM(IF(t.status = 'COMPLETED', 1, 0)) AS sales
  FROM analytics a
  JOIN transactions t
    ON a.id = t.analytics
  WHERE t.status = 'COMPLETED'
    AND a.user_id=52094
  GROUP by a.source
) t2
  ON t1.source = t2.source
ORDER BY frequency DESC 
LIMIT 10 

Hope this helps.
