I have this query which, on a table with ~300,000 rows, takes about 14 seconds to extract data. The table will grow in the near future to over a million rows. I used the EXISTS clause instead of the IN clause and got an improvement, but the query is still too slow. Do you have any solution? Thanks in advance.
This is the query:
    SELECT
        flow,
        COUNT(*) tot
    FROM
    (
        SELECT
            ff.session_id,
            GROUP_CONCAT(ff.page, '#', ff.snippet_params, '$', ff.is_lead SEPARATOR '|') flow
        FROM table_a ff
        WHERE EXISTS
        (
            SELECT
                f.session_id
            FROM table_a f
            WHERE f.session_id = ff.session_id
              AND f.is_lead = 1
            GROUP BY f.user_id
            ORDER BY f.user_id, f.`timestamp`
        )
        GROUP BY ff.user_id
        ORDER BY ff.user_id, ff.`timestamp`, ff.session_id
    )
    AS flow
    GROUP BY flow
    ORDER BY tot DESC LIMIT 10
This is the EXPLAIN output:
    id  select_type         table       type  possible_keys       key         key_len  ref            rows    Extra
    --  ------------------  ----------  ----  ------------------  ----------  -------  -------------  ------  --------------------------------------------
    1   PRIMARY             <derived2>  ALL   (NULL)              (NULL)      (NULL)   (NULL)         532     Using temporary; Using filesort
    2   DERIVED             ff          ALL   (NULL)              (NULL)      (NULL)   (NULL)         322154  Using temporary; Using filesort
    3   DEPENDENT SUBQUERY  f           ref   is_lead,session_id  session_id  767      ff.session_id  3       Using where; Using temporary; Using filesort
4 Solutions
#1
2
The extra expressions in the ORDER BY don't make any sense, since the "GROUP BY user_id" is going to guarantee a unique value of user_id in each returned row.
The ORDER BY operation is applied after the GROUP BY operation. If my intent is to get the lowest session_id for each user_id, I would use a MIN aggregate. In the original query, the ORDER BY doesn't have any influence on which session_id is returned. The value returned for session_id is indeterminate.
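As a minimal sketch of the MIN approach mentioned above (the alias min_session_id is just an illustrative name, not something from the original query):

```sql
-- The aggregate, not the ORDER BY, decides which session_id is returned per user.
SELECT ff.user_id
     , MIN(ff.session_id) AS min_session_id
  FROM table_a ff
 GROUP BY ff.user_id
 ORDER BY ff.user_id;
```

Here the ORDER BY only sorts the already-aggregated rows; it plays no part in choosing the session_id.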
(Other databases would throw an error with this query. A MySQL-specific extension to GROUP BY allows the query to run, but we can get more standard behavior by including ONLY_FULL_GROUP_BY in the sql_mode.)
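One way to try this out, assuming you are allowed to change the session setting, is to append ONLY_FULL_GROUP_BY to the current sql_mode for the session only:

```sql
-- Inspect the current session setting.
SELECT @@SESSION.sql_mode;

-- Append ONLY_FULL_GROUP_BY for this session only.
-- CONCAT_WS skips the NULL produced when sql_mode is currently empty.
SET SESSION sql_mode = CONCAT_WS(',', NULLIF(@@SESSION.sql_mode, ''), 'ONLY_FULL_GROUP_BY');
```

With that mode in effect, I would expect the original inline view (which selects ff.session_id while grouping by ff.user_id) to be rejected rather than silently returning an indeterminate value.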
The GROUP BY within the EXISTS subquery doesn't make any sense. If a row is found, then a row exists. There's no need to do a GROUP BY and aggregate the rows that are found.
And looking at it more closely, there doesn't appear to be any need to return session_id in the SELECT list. (Either in the flow view query, or in the EXISTS subquery.)
If we remove the extraneous syntax and whittle the query down to its essence, to the parts that actually matter, we are left with a query that looks like this:
    SELECT flow.flow AS flow
         , COUNT(*) AS tot
      FROM (
             SELECT GROUP_CONCAT(ff.page,'#',ff.snippet_params,'$',ff.is_lead SEPARATOR '|') AS flow
               FROM table_a ff
              WHERE EXISTS
                    ( SELECT 1
                        FROM table_a f
                       WHERE f.is_lead = 1
                         AND f.session_id = ff.session_id
                    )
              GROUP BY ff.user_id
           ) flow
     GROUP BY flow.flow
     ORDER BY tot DESC
     LIMIT 10
The query basically says to get all rows from (the unfortunately named table) table_a which have a session_id that matches at least one row in table_a with the same session_id and an is_lead value of 1.
And then take all of the found rows, and aggregate them based on the value in the user_id column.
It's very odd that there isn't an ORDER BY in the GROUP_CONCAT, and somewhat odd that there isn't a DISTINCT keyword.
It's strange for the GROUP_CONCAT aggregation to return an indeterminate ordering of the rows, and also potentially include repeated values. (Given that the outer query is going to perform another aggregation based on the value returned from that GROUP_CONCAT aggregate.)
But, I'm not sure what question this query is supposed to be answering. And I don't have any knowledge of what's unique and what's not.
We do know that the EXISTS subquery could be re-written as a JOIN operation:
    SELECT flow.flow AS flow
         , COUNT(*) AS tot
      FROM (
             SELECT GROUP_CONCAT(ff.page,'#',ff.snippet_params,'$',ff.is_lead SEPARATOR '|') AS flow
               FROM ( SELECT d.session_id
                        FROM table_a d
                       WHERE d.is_lead = 1
                       GROUP BY d.session_id
                    ) e
               JOIN table_a ff
                 ON ff.session_id = e.session_id
              GROUP BY ff.user_id
           ) flow
     GROUP BY flow.flow
     ORDER BY tot DESC
     LIMIT 10
We could work on making the query run faster. But before I did that, I would want to make sure that the query is returning a set that matches the specification. I need to make sure the query is actually answering the question that it's designed to answer.
I suspect that the original query isn't correct. That is, I think that if the query is returning "correct" results, it's doing so accidentally, not because it's guaranteed to. Or because there is something peculiar about the uniqueness (cardinality) of rows in the table, or due to an accidental order that the rows are being processed in.
I want to be sure that the query is guaranteed to return correct results, before I spend time tuning it, and adding indexes.
Q: Why isn't there an ORDER BY in the GROUP_CONCAT? e.g.
GROUP_CONCAT( foo ORDER BY something)
Q: Is there a specific reason there isn't a DISTINCT keyword?
GROUP_CONCAT(DISTINCT foo ORDER BY something)
Q: Should we be concerned with the potential for the GROUP_CONCAT to (silently) return a truncated value? (Based on the setting of the group_concat_max_len variable?)
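To make those questions concrete, here is a hedged sketch. Ordering by the `timestamp` column is my assumption about the intended sequence, and the new limit value is arbitrary; neither comes from the original query.

```sql
-- Check the current truncation limit (the MySQL default is 1024).
SHOW VARIABLES LIKE 'group_concat_max_len';

-- Raise it for the current session only; 1000000 is an arbitrary illustration.
SET SESSION group_concat_max_len = 1000000;

-- Deterministic ordering inside the aggregate.
-- Ordering by `timestamp` is an assumption about what "flow" should mean.
SELECT GROUP_CONCAT(ff.page, '#', ff.snippet_params, '$', ff.is_lead
                    ORDER BY ff.`timestamp`
                    SEPARATOR '|') AS flow
  FROM table_a ff
 GROUP BY ff.user_id;
```

Without the ORDER BY, two users who visited the same pages could end up with different flow strings, and the outer GROUP BY would count them as different flows.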
FOLLOWUP
For best performance of that last query in the answer above, I recommend the following index be added:
... ON table_a (session_id, is_lead, page, snippet_params)
or any similar index which has session_id and is_lead as the leading columns (in that order), and also includes the page and snippet_params columns. If an ORDER BY is added to the GROUP_CONCAT, we may want a slightly different index.
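Spelled out as a full statement, that might look like the sketch below. The index name table_a_ix_session_lead is hypothetical. The column types aren't given in the question, so this assumes all four columns can be indexed at full length; the EXPLAIN shows a key_len of 767 for session_id, so long string columns may need prefix lengths to stay within the storage engine's key length limit.

```sql
-- Hypothetical index name; column order follows the recommendation above.
CREATE INDEX table_a_ix_session_lead
    ON table_a (session_id, is_lead, page, snippet_params);

-- Re-check the plan afterwards to see whether the new index is actually chosen.
EXPLAIN
SELECT d.session_id
  FROM table_a d
 WHERE d.is_lead = 1
 GROUP BY d.session_id;
```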
For the outer query, there's no getting around the "Using filesort" operation on the derived flow column. (Unless you are running a more recent version of MySQL, where an index might be created, or we're open to breaking the query into two separate operations: one query to materialize the inline view into a table, and a second query to run against that.)
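A rough sketch of that two-step approach, assuming a temporary table is acceptable (the table name tmp_flow is hypothetical):

```sql
-- Step 1: materialize the inline view into a temporary table.
CREATE TEMPORARY TABLE tmp_flow AS
SELECT GROUP_CONCAT(ff.page,'#',ff.snippet_params,'$',ff.is_lead SEPARATOR '|') AS flow
  FROM ( SELECT d.session_id
           FROM table_a d
          WHERE d.is_lead = 1
          GROUP BY d.session_id
       ) e
  JOIN table_a ff
    ON ff.session_id = e.session_id
 GROUP BY ff.user_id;

-- Step 2: run the outer aggregation against the materialized rows.
SELECT t.flow
     , COUNT(*) AS tot
  FROM tmp_flow t
 GROUP BY t.flow
 ORDER BY tot DESC
 LIMIT 10;

-- Clean up when done.
DROP TEMPORARY TABLE tmp_flow;
```

Splitting it this way also makes it easy to time each step separately and see where the 14 seconds are going.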
#2
1
In this subquery you are using GROUP BY, but you don't have an aggregation function.
For the EXISTS check, whether the result for f.session_id is grouped or not makes no difference, so you should remove the GROUP BY, and the ORDER BY too:
    WHERE EXISTS
    (
        SELECT
            f.session_id
        FROM table_a f
        WHERE f.session_id = ff.session_id
          AND f.is_lead = 1
        GROUP BY f.user_id
        ORDER BY f.user_id, f.`timestamp`
    )
this way:
    WHERE EXISTS
    (
        SELECT
            f.session_id
        FROM table_a f
        WHERE f.session_id = ff.session_id
          AND f.is_lead = 1
    )
Looking at your query, I think it could be refactored, e.g.:
    SELECT flow, COUNT(*) tot
    FROM (
        SELECT
            GROUP_CONCAT(ff.page, '#', ff.snippet_params, '$', ff.is_lead SEPARATOR '|') flow
        FROM table_a ff
        WHERE ff.is_lead = 1
        GROUP BY ff.user_id
    ) AS new_flow
    GROUP BY flow
    ORDER BY tot DESC LIMIT 10
#3
0
You need to make sure that f.session_id and f.is_lead are indexed. It is currently doing a table scan of f for each row in the interim result against the ff reference of table_a.
#4
0
- Get rid of the count(*). IIRC, MySQL can no longer cache queries if functions exist, so try another approach to this.
- Get rid of the subqueries. IIRC, MySQL can't cache subqueries either.
It's difficult to give an optimized version of this query (or these queries). You might want to change your database structure so it allows simpler queries. Perhaps some caching (redis etc.) for other values...