Needed help in optimizing order by and count query, I have tables having millions (approx 3 millions) rows.

需要帮助优化order by和count查询，我有具有数百万行(大约300万)的表。

I have to join 4 tables and fetch the records, When i run the simple query it takes only millisecond to complete but as I try to count or order by having left join table it get stuck for unlimited of time.

我必须连接4个表并获取记录，当我运行简单的查询时，只需要毫秒就可以完成，但是当我尝试通过左连接表来计数或排序时，它会被阻塞无限的时间。

Please see the cases below.

请看下面的案例。

DB Server Configuration:

CPU Number of virtual cores: 4
Memory(RAM): 16 GiB
Network Performance: High

Rows in each table:

tbl_customers -  #Rows: 20 million.
tbl_customers_address -  #Row 25 million.
tbl_shop_setting - #Rows 50k
aio_customer_tracking - #Rows 5k

Tables Schema:

CREATE TABLE `tbl_customers` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
    `shop_id` BIGINT(20) UNSIGNED NOT NULL,
    `email` VARCHAR(225) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `accepts_marketing` TINYINT(1) NULL DEFAULT NULL,
    `first_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `last_name` VARCHAR(50) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `last_order_id` BIGINT(20) NULL DEFAULT NULL,
    `total_spent` DECIMAL(12,2) NULL DEFAULT NULL,
    `phone` VARCHAR(20) NULL DEFAULT NULL COLLATE 'latin1_swedish_ci',
    `verified_email` TINYINT(4) NULL DEFAULT NULL,
    `updated_at` DATETIME NULL DEFAULT NULL,
    `created_at` DATETIME NULL DEFAULT NULL,
    `date_updated` DATETIME NULL DEFAULT NULL,
    `date_created` DATETIME NULL DEFAULT NULL,
    PRIMARY KEY (`id`),
    UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`),
    INDEX `email` (`email`),
    INDEX `shopify_customer_id` (`shopify_customer_id`),
    INDEX `shop_id` (`shop_id`)
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;


CREATE TABLE `tbl_customers_address` (
    `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
    `customer_id` BIGINT(20) NULL DEFAULT NULL,
    `shopify_address_id` BIGINT(20) NULL DEFAULT NULL,
    `shopify_customer_id` BIGINT(20) NULL DEFAULT NULL,
    `first_name` VARCHAR(50) NULL DEFAULT NULL,
    `last_name` VARCHAR(50) NULL DEFAULT NULL,
    `company` VARCHAR(50) NULL DEFAULT NULL,
    `address1` VARCHAR(250) NULL DEFAULT NULL,
    `address2` VARCHAR(250) NULL DEFAULT NULL,
    `city` VARCHAR(50) NULL DEFAULT NULL,
    `province` VARCHAR(50) NULL DEFAULT NULL,
    `country` VARCHAR(50) NULL DEFAULT NULL,
    `zip` VARCHAR(15) NULL DEFAULT NULL,
    `phone` VARCHAR(20) NULL DEFAULT NULL,
    `name` VARCHAR(50) NULL DEFAULT NULL,
    `province_code` VARCHAR(5) NULL DEFAULT NULL,
    `country_code` VARCHAR(5) NULL DEFAULT NULL,
    `country_name` VARCHAR(50) NULL DEFAULT NULL,
    `longitude` VARCHAR(250) NULL DEFAULT NULL,
    `latitude` VARCHAR(250) NULL DEFAULT NULL,
    `default` TINYINT(1) NULL DEFAULT NULL,
    `is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    INDEX `customer_id` (`customer_id`),
    INDEX `shopify_address_id` (`shopify_address_id`),
    INDEX `shopify_customer_id` (`shopify_customer_id`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

CREATE TABLE `tbl_shop_setting` (
    `id` INT(11) NOT NULL AUTO_INCREMENT,   
    `shop_name` VARCHAR(300) NOT NULL COLLATE 'latin1_swedish_ci',
     PRIMARY KEY (`id`),
)
COLLATE='utf8mb4_general_ci'
ENGINE=InnoDB;


CREATE TABLE `aio_customer_tracking` (
    `id` BIGINT(20) UNSIGNED NOT NULL AUTO_INCREMENT,
    `shopify_customer_id` BIGINT(20) UNSIGNED NOT NULL,
    `email` VARCHAR(255) NULL DEFAULT NULL,
    `shop_id` BIGINT(20) UNSIGNED NOT NULL,
    `domain` VARCHAR(255) NULL DEFAULT NULL,
    `web_session_count` INT(11) NOT NULL,
    `last_seen_date` DATETIME NULL DEFAULT NULL,
    `last_contact_date` DATETIME NULL DEFAULT NULL,
    `last_email_open` DATETIME NULL DEFAULT NULL,
    `created_date` DATETIME NOT NULL,
    `is_geo_fetched` TINYINT(1) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`),
    INDEX `shopify_customer_id` (`shopify_customer_id`),
    INDEX `email` (`email`),
    INDEX `shopify_customer_id_shop_id` (`shopify_customer_id`, `shop_id`),
    INDEX `last_seen_date` (`last_seen_date`)
)
COLLATE='latin1_swedish_ci'
ENGINE=InnoDB;

Query Cases Running and Not Running:

1. Running:  Below query fetch the records by joining all the 4 tables, It takes only 0.300 ms.

SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20

2. Not running: Simply when try to get the count of these row stuk the query, I waited 10 min but still running.

SELECT 
     COUNT(DISTINCT c.shopify_customer_id)   -- what makes #2 different
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
LIMIT 20


3. Not running: In the #1 query we simply put the 1 Order by clause and it get stuck, I waited 10 min but still running. I study query optimization some article and tried by indexing, Right Join etc.. but still not working.

SELECT `c`.first_name,`c`.last_name,`c`.email, `t`.`last_seen_date`, `t`.`last_contact_date`, `ssh`.`shop_name`, ca.`company`, ca.`address1`, ca.`address2`, ca.`city`, ca.`province`, ca.`country`, ca.`zip`, ca.`province_code`, ca.`country_code`
FROM `tbl_customers` AS `c`
JOIN `tbl_shop_setting` AS `ssh` ON c.shop_id = ssh.id 
LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id
LEFT JOIN `tbl_customers_address` as ca ON (c.shopify_customer_id = ca.shopify_customer_id AND ca.default = 1)
GROUP BY c.shopify_customer_id
  ORDER BY `t`.`last_seen_date`    -- what makes #3 different
LIMIT 20

EXPLAIN QUERY #1:

解释查询# 1:

EXPLAIN QUERY #2:

解释查询# 2:

EXPLAIN QUERY #3:

解释查询# 3:

Any suggestion to optimize the query, table structure are welcome.

任何优化查询、表结构的建议都是受欢迎的。

WHAT I'M TRYING TO DO:

tbl_customers table contains the customer info, tbl_customer_address table contains the addresses of the customers(one customer may have multiple address), And aio_customer_tracking table contains visiting records of the customer last_seen_date is the visiting date.

tbl_customers表包含客户信息，tbl_customer_address表包含客户的地址(一个客户可能有多个地址)，aio_customer_tracking表包含客户last_seen_date的访问记录。

Now, simply I want to fetch and count the customers, with their one of the address, and visiting info. Also, I may order by any of the column from these 3 tables, In my example i am ordering by last_seen_date (the default order). Hope this explanation helps to understand what i am trying to do.

现在，我只想获取和计算客户的地址和访问信息。另外，我可以从这3个表中任意一个列进行排序，在我的示例中，我使用的是last_seen_date(默认顺序)排序。希望这个解释能帮助我理解我在做什么。

4 个解决方案

#1

In query #1, but not the other two, the optimizer can use

在查询#1中，但不是其他两个，优化器可以使用

UNIQUE INDEX `shopify_customer_id_unique` (`shopify_customer_id`)

to cut the query short for

缩短查询

GROUP BY c.shopify_customer_id
LIMIT 20

This is because it can stop after 20 items of the index. The query is not ultra-fast because of the derived table (subquery t) that hits about 51K rows.

这是因为它可以在索引的20项之后停止。查询不是超快的，因为派生表(子查询t)会访问大约51K行。

Query #2 may be slow simply because the Optimizer failed to notice and remove the redundant DISTINCT. Instead, it may be thinking it can't stop after 20.

查询#2可能很慢，因为优化器没有注意到并删除冗余的不同项。相反，它可能认为20岁之后就不能停止。

Query #3 must go entirely through table c to get every shopify_customer_id group. This is because the ORDER BY prevents a short curcuit to get to the LIMIT 20.

查询#3必须完全遍历表c才能获取每个shopify_customer_id组。这是因为顺序由防止短规到20的极限。

The columns in a GROUP BY must include all the non-aggregate columns in the SELECT except any that are uniquely defined by the group by columns. Since you have said that there can be multiple addresses for a single shopify_customer_id, then fetching ca.address1 is not proper in connection with GROUP BY shopify_customer_id. Similarly, the subquery seems to be improper with respect to last_seen_date, last_contact_date.

组BY中的列必须包含SELECT中的所有非聚合列，但任何列由组按列惟一定义的列除外。既然您已经说过一个shopify_customer_id可以有多个地址，那么通过shopify_customer_id获取ca.address1在与组的连接中是不合适的。类似地，对于last_seen_date、last_contact_date，子查询似乎不合适。

In aio_customer_tracking, this change (to a "covering" index) may help a little:

在aio_customer_tracking中，这个更改(到“覆盖”索引)可能会有一点帮助:

INDEX (`shopify_customer_id`)

来

INDEX (`shopify_customer_id`, `last_seen_date`, `last_contact_date`)

Dissecting the goal

解剖的目标

Now, simply I want to ... count the customers

现在，我只想……计算客户

To count the customers, do this, but don't try to combine it with "fetching":

要计算客户数量，请这样做，但不要试图将其与“取货”结合起来:

SELECT COUNT(*) FROM tbl_customers;

Now, simply I want to fetch ... the customers...

现在，我只想取回。客户……

tbl_customers - #Rows: 20 million.

tbl_customers - #行:2000万。

Surely you don't want to fetch 20 million rows! I don't want to think about how to try to do that. Please clarify. And I won't accept paginating through that many rows. Perhaps there is a WHERE clause?? The WHERE clause is (usually) the most important part of Optimization!

你肯定不希望获得2000万行!我不想去想怎么做。请澄清。我不接受这么多行的分页。也许有WHERE子句??WHERE子句(通常)是优化的最重要部分!

Now, simply I want to fetch ... the customers, with their one of the address, and visiting info.

现在，我只想取回。客户，有他们的一个地址，和访问信息。

Assuming that the WHERE filters down to a "few" customers, then JOINing to another table to get "any" address and "any" visiting info, may be problematical and/or inefficient. To require the "first" or "last" instead of "any" won't be any easier, but might be more meaningful.

假设在这里筛选“少数”客户，然后加入另一个表来获得“任何”地址和“任何”访问信息，可能会有问题，或者效率低下。要求用“第一个”或“最后一个”而不是“任何”不会更容易，但可能更有意义。

May I suggest that your UI first find a few customers, then if the user wants, go to another page with all the addresses and all the visits. Or can the visits be in the hundreds or more?

我建议你的UI先找几个客户，然后如果用户想要，去另一个有所有地址和所有访问的页面。或者参观人数可能在数百人以上?

Also, I may order by any of the column from these 3 tables, In my example i am ordering by last_seen_date (the default order).

而且，我可以从这3个表中的任何一个列排序，在我的示例中，我是通过last_seen_date(默认的顺序)排序的。

Let's focus on optimizing the WHERE, then tack last_seen_date on the end of any index.

让我们集中精力优化WHERE，然后在任何索引的末尾添加last_seen_date。

#2

shopify_customer_id is unique in tbl_customers table, then in 2nd query why you use distinct and group by in shopify_customer_id column?

shopifcustomery__id在tbl_customers表中是惟一的，那么在第二个查询中为什么要在shopify_customer_id列中使用distinct和group by呢?

Please get rid of that.

请把它处理掉。

#3

Query 2 contains a logical mistake as pointed out by others: the count(distinct(c.shopify_customer_id)) will return a single value, therefore your group by is only complicating the query (this might indeed make MySQL grouping by shopify_customer_id first and then executing the count(distinct(shopify_customer_id )) which could be the reason for the somehow long execution time

查询2包含一个逻辑错误所指出的:计数(不同(c.shopify_customer_id))将返回一个值,因此你的group by只是复杂查询(这可能确实让MySQL分组shopify_customer_id,然后执行计数(不同(shopify_customer_id)),可能的原因长时间执行

The order by of Query 3 can not be optimized as you are joining on a subselect which cannot be indexed. The time it takes is simply the time the system needs to order the result set.

查询3的顺序不能被优化，因为您正在加入一个不能被索引的子选择。它所花费的时间仅仅是系统订购结果集所需的时间。

The solution to your problem would be to:

解决你的问题的办法是:

change the index shopify_customer_id (shopify_customer_id) of table tbl_customers_address to shopify_customer_id (shopify_customer_id,default) to optimize the following query

将表tbl_customers_address的索引shopify_customer_id (shopify_customer_id)更改为shopify_customer_id (shopify_customer_id，默认)以优化以下查询
create a table with the result from Query 1 (result) but without

使用查询1(结果)的结果创建一个表，但是不使用

LEFT JOIN (SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id) as t ON t.shopify_customer_id = c.shopify_customer_id.

LEFT JOIN(选择shopify_customer_id, last_seen_date, last_contact_date，由shopify_customer_id提供给aio_customer_tracking组)，如t ON t。shopify_customer_id = c.shopify_customer_id。
alter the result table and add a column for last_seen_date and indexes for last_seen_date and shopify_customer_id

修改结果表，并为last_seen_date和shopify_customer_id添加列和索引last_seen_date
create a table for the result of this query (last_Date) :

为这个查询的结果(last_Date)创建一个表:

SELECT shopify_customer_id, last_seen_date, last_contact_date FROM aio_customer_tracking GROUP BY shopify_customer_id

通过shopify_customer_id从aio_customer_tracking组中选择shopify_customer_id、last_seen_date、last_contact_date

Update the result table with the values from table last_Date
从表last_Date的值更新结果表。

Now you can run a query against the result table ordered by last_Date using the index you created.

现在可以使用创建的索引对last_Date排序的结果表运行查询。

The whole process should take way less time than executing query 2 or query 3

整个过程所需的时间应该比执行查询2或查询3少得多

#4

You have too many indexes and that can be a real performance killer when it comes to inserts, updates, and deletes, as well as occasionally for selects depending on optimization settings.

您有太多的索引，当涉及到插入、更新和删除时，这可能是一个真正的性能杀手，有时还会根据优化设置进行选择。

Also, remove the GROUP BY statement.

另外，按语句删除组。

There's more I could say about correct use of clustered vs. nonclustered indexes, GROUP BY, ORDER BY, WHERE, and views, for query optimization. However, I think if you remove some indexes, your queries will speed up a lot. (Maybe also rework your queries to follow stricter SQL standards and be a bit more logical, but that's outside the scope of this question.)

关于正确地使用集群和非聚集索引(GROUP BY, ORDER BY, WHERE和views)来进行查询优化，我还能说得更多。但是，我认为如果您删除了一些索引，那么您的查询将会加快很多速度(可能还需要重新处理查询，以遵循更严格的SQL标准，并且更符合逻辑，但这超出了这个问题的范围)。

One more thing - what are you doing with the query results? Is this being stored somewhere and accessed for lookups, used for calculations, used for automated reports, displaying through web database connection, etc? This makes a difference because if you just need a report/backup or export to a flat file, then there are way more efficient ways to get this data out. Lots of different options depending on what you're doing.

还有一件事——您正在对查询结果做什么?它是否被存储在某处并用于查找、用于计算、用于自动报告、通过web数据库连接显示等等?这是有区别的，因为如果您只需要一个报告/备份或导出到一个平面文件，那么就有更有效的方法来获取这些数据。很多不同的选择取决于你在做什么。

#1