Does GROUP BY on a unique key calculate all the groups before the LIMIT clause is applied?

Time: 2021-08-15 07:36:43

If I GROUP BY on a unique key, and apply a LIMIT clause to the query, will all the groups be calculated before the limit is applied?

If I have a hundred records in the table (each with a unique key), will I have 100 records in the temporary table created (for the GROUP BY) before the LIMIT is applied?

A case study showing why I need this:

Take Stack Overflow for example.

Each query that shows a list of questions also shows the user who asked each question and the number of badges that user has.

So, while user<->question is one-to-one, user<->badges is one-to-many.

The only way to do it in one query (rather than one query on questions, another on users, and then combining the results) is to group the query by the primary key (question_id) and join the user_badges table with GROUP_CONCAT.

The same goes for the question tags.

Code example:

Table questions:
question_id (int, pk) | question_body (varchar)

Table tag_question:
question_id (int) | tag_id (int)

SELECT:

SELECT questions.question_id,
       questions.question_body,
       GROUP_CONCAT(tag_id, ' ') AS tags_ids
FROM
       questions
   JOIN
       tag_question
   ON
       questions.question_id = tag_question.question_id
GROUP BY
       questions.question_id
LIMIT 15

3 solutions

#1


1  

LIMIT does get applied after GROUP BY.

Whether the temporary table is created depends on how your indexes are built.

If you have an index on the grouping field and don't order by the aggregate results, then an INDEX SCAN FOR GROUP BY is applied, and each aggregate is counted on the fly.

That means that if you don't select an aggregate due to the LIMIT, it won't ever be calculated.

But if you order by an aggregate, then, of course, all of them need to be calculated before they can be sorted.

That's why they are calculated first and then the filesort is applied.

Update:

As for your query, see what EXPLAIN EXTENDED says for it.

Most probably, question_id is a PRIMARY KEY for your table, and most probably, it will be used in a scan.

That means no filesort will be applied, and the join itself will never run past the 15th row.

To make sure, rewrite your query as follows:

SELECT question_id,
       question_body,
       (
       SELECT  GROUP_CONCAT(tag_id, ' ')
       FROM    tag_question t
       WHERE   t.question_id = q.question_id
       )
FROM   questions q
ORDER BY
       question_id
LIMIT 15
  • First, it is more readable,
  • Second, it is more efficient, and
  • Third, it will return even untagged questions (which your current query doesn't).
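
The third point can be checked with a small, runnable sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the three sample rows are made up for illustration. Note one dialect difference: in SQLite the second argument of GROUP_CONCAT is the separator, which in MySQL would be written GROUP_CONCAT(tag_id SEPARATOR ' ').

```python
import sqlite3

# SQLite stands in for MySQL here; the sample rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (question_id INTEGER PRIMARY KEY, question_body TEXT);
    CREATE TABLE tag_question (question_id INTEGER, tag_id INTEGER);
    INSERT INTO questions VALUES (1, 'tagged twice'), (2, 'never tagged'), (3, 'tagged once');
    INSERT INTO tag_question VALUES (1, 10), (1, 11), (3, 12);
""")

# Original shape: inner join plus GROUP BY.
join_rows = conn.execute("""
    SELECT q.question_id, GROUP_CONCAT(tq.tag_id, ' ')
    FROM questions q
    JOIN tag_question tq ON q.question_id = tq.question_id
    GROUP BY q.question_id
    ORDER BY q.question_id
    LIMIT 15
""").fetchall()

# Rewritten shape: correlated subquery, no GROUP BY on the outer query.
subquery_rows = conn.execute("""
    SELECT q.question_id,
           (SELECT GROUP_CONCAT(t.tag_id, ' ')
            FROM tag_question t
            WHERE t.question_id = q.question_id)
    FROM questions q
    ORDER BY q.question_id
    LIMIT 15
""").fetchall()

join_ids = [row[0] for row in join_rows]
subquery_ids = [row[0] for row in subquery_rows]
untagged_tags = dict(subquery_rows)[2]  # tag list of the untagged question
print(join_ids, subquery_ids, untagged_tags)
```

The inner join silently drops question 2, while the subquery version keeps it with a NULL tag list.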

#2


4  

Yes, the logical order in which the query is evaluated is:

  • FROM
  • WHERE
  • GROUP BY
  • HAVING
  • SELECT
  • ORDER BY
  • LIMIT

LIMIT is the last thing calculated, so your grouping will be just fine.
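
As a rough illustration of this ordering, the sketch below builds 100 questions with two tags each and compares the grouped query with and without LIMIT. It runs on Python's built-in sqlite3 module rather than MySQL, and the generated data is an assumption made only for the demo. It shows the logical result, not what the optimizer physically computes.

```python
import sqlite3

# SQLite stands in for MySQL; 100 questions with 2 tags each are hypothetical data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE questions (question_id INTEGER PRIMARY KEY, question_body TEXT)")
conn.execute("CREATE TABLE tag_question (question_id INTEGER, tag_id INTEGER)")
conn.executemany(
    "INSERT INTO questions VALUES (?, ?)",
    [(i, f"question {i}") for i in range(1, 101)],
)
conn.executemany(
    "INSERT INTO tag_question VALUES (?, ?)",
    [(i, tag) for i in range(1, 101) for tag in (1, 2)],
)

grouped = """
    SELECT q.question_id, COUNT(tq.tag_id) AS tag_count
    FROM questions q
    JOIN tag_question tq ON q.question_id = tq.question_id
    GROUP BY q.question_id
    ORDER BY q.question_id
"""

all_groups = conn.execute(grouped).fetchall()
limited = conn.execute(grouped + " LIMIT 15").fetchall()
print(len(all_groups), len(limited))
```

The join produces 200 rows, yet the limited query returns 15 complete groups, each with both tags counted. If LIMIT cut the joined rows before grouping, 15 rows could cover at most 8 questions.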

Now, looking at your rephrased question, you won't have just one row per group, but many: in the case of Stack Overflow, you'll have just one user per row, but many badges, i.e.

(uid, badge_id, etc.)
(1, 2, ...)
(1, 3, ...)
(1, 12, ...)

all those would be grouped together.

To avoid a full table scan, all you need are indexes. Beyond that, if you need to SUM over the whole table, for example, you cannot avoid a full scan.

EDIT:

You'll need something like this (look at the WHERE clause):

SELECT
  q1.question_id,
  q1.question_body,
  GROUP_CONCAT(tq.tag_id, ' ') AS tags_ids
FROM
  questions q1
  JOIN tag_question tq
    ON q1.question_id = tq.question_id
WHERE
  q1.question_id IN (
    SELECT
      tq2.question_id
    FROM
      tag_question tq2
      JOIN tag t
        ON tq2.tag_id = t.tag_id
    WHERE
      t.name = 'the-misterious-tag'
  )
GROUP BY
  q1.question_id
LIMIT 15
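
To see what the IN subquery buys you, here is a minimal sketch of the same query shape, run on Python's built-in sqlite3 module instead of MySQL. The sample rows and the 'other-tag' name are hypothetical, and COUNT replaces GROUP_CONCAT only to keep the result easy to check.

```python
import sqlite3

# SQLite stands in for MySQL; rows and the 'other-tag' name are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (question_id INTEGER PRIMARY KEY, question_body TEXT);
    CREATE TABLE tag (tag_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tag_question (question_id INTEGER, tag_id INTEGER);
    INSERT INTO questions VALUES (1, 'has both tags'), (2, 'has only the other tag');
    INSERT INTO tag VALUES (10, 'the-misterious-tag'), (11, 'other-tag');
    INSERT INTO tag_question VALUES (1, 10), (1, 11), (2, 11);
""")

# The subquery picks the questions carrying the wanted tag; the outer join
# then aggregates over every tag of those questions, not just the matching one.
rows = conn.execute("""
    SELECT q1.question_id, COUNT(tq.tag_id) AS tag_count
    FROM questions q1
    JOIN tag_question tq ON q1.question_id = tq.question_id
    WHERE q1.question_id IN (
        SELECT tq2.question_id
        FROM tag_question tq2
        JOIN tag t ON tq2.tag_id = t.tag_id
        WHERE t.name = 'the-misterious-tag'
    )
    GROUP BY q1.question_id
    LIMIT 15
""").fetchall()
print(rows)
```

Only question 1 comes back, and both of its tags are still counted. Filtering on t.name directly in the outer query would have left it with a single tag.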

#3


1  

If the field you're grouping on is indexed, it shouldn't do a full table scan.
