How do I select the top row of each group based on multiple ordering columns?

I have a query that looks like the following:

SELECT time_start, some_count
    FROM foo
    WHERE user_id = 1
    AND DATE(time_start) = '2016-07-27'
    ORDER BY some_count DESC, time_start DESC LIMIT 1;

What this does is return me one row, where some_count is the highest count for user_id = 1. It also gives me the time stamp which is the most current for that some_count, as some_count could be the same for multiple time_start values and I want the most current one.

What I'm now trying to do is run a query that will figure this out for every user_id that occurred at least once on a specific date, in this case 2016-07-27. Ultimately it's probably going to require a GROUP BY, as I'm looking for a group maximum per user_id.

What's the best way to write a query of that nature?

8 Solutions

#1

I am sharing two of my approaches.

Approach #1 (scalable):

Using MySQL user-defined variables

SELECT
    t.user_id,
    t.time_start,
    t.time_stop,
    t.some_count
FROM 
(
    SELECT
        user_id,
        time_start,
        time_stop,
        some_count,
        IF(@sameUser = user_id, @rn := @rn + 1,
             IF(@sameUser := user_id, @rn := 1, @rn := 1)
        ) AS row_number

    FROM    foo
    CROSS JOIN (
        SELECT
            @sameUser := - 1,
            @rn := 1
    ) var
    WHERE   DATE(time_start) = '2016-07-27'
    ORDER BY    user_id,    some_count DESC,    time_stop DESC
) AS t
WHERE t.row_number <= 1
ORDER BY t.user_id;

It is scalable because, if you want the latest n rows for each user, you only need to change this line:

... WHERE t.row_number <= n...

(I can add an explanation later if the query provides the expected result.)
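
For comparison, on an engine that has window functions the same per-user row numbering can be written with ROW_NUMBER(). The sketch below is only an illustration under stated assumptions, not part of the approach above: it runs the SQL through Python's sqlite3 module (SQLite 3.25 or newer) as a stand-in for MySQL 8.0+, the sample rows are made up, and it orders by time_start as in the question. The names ROWS, TOP_N_SQL and top_n_per_user are hypothetical.

```python
import sqlite3

# Hypothetical sample rows: (user_id, time_start, some_count).
ROWS = [
    (1, "2016-07-27 08:00:00", 5),
    (1, "2016-07-27 09:00:00", 7),
    (1, "2016-07-27 10:00:00", 7),   # same count as 09:00, but more recent
    (2, "2016-07-27 11:00:00", 3),
    (2, "2016-07-26 12:00:00", 9),   # different day, filtered out
]

# Number the rows of each user from best to worst, then keep the first n.
TOP_N_SQL = """
SELECT user_id, time_start, some_count
FROM (
    SELECT user_id, time_start, some_count,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY some_count DESC, time_start DESC
           ) AS rn
    FROM foo
    WHERE date(time_start) = ?
) AS t
WHERE rn <= ?
ORDER BY user_id, rn
"""


def top_n_per_user(day, n):
    """Return the top n rows per user_id for the given day."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE foo (user_id INTEGER, time_start TEXT, some_count INTEGER)"
    )
    conn.executemany("INSERT INTO foo VALUES (?, ?, ?)", ROWS)
    result = conn.execute(TOP_N_SQL, (day, n)).fetchall()
    conn.close()
    return result
```

Calling top_n_per_user("2016-07-27", 1) keeps one row per user; passing 2 instead of 1 plays the role of changing the `<= n` condition.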

Approach #2 (not scalable):

Using INNER JOIN and GROUP BY

SELECT 
 F.user_id,
 F.some_count,
 F.time_start,
 MAX(F.time_stop) AS max_time_stop
FROM foo F
INNER JOIN 
(
    SELECT 
        user_id,
        MAX(some_count) AS max_some_count
    FROM foo
    WHERE DATE(time_start) = '2016-07-27'
    GROUP BY user_id
) AS t
ON F.user_id = t.user_id AND F.some_count = t.max_some_count
WHERE DATE(time_start) = '2016-07-27'
GROUP BY F.user_id

#2

You can use NOT EXISTS():

SELECT * FROM foo t
WHERE (DATE(time_start) = '2016-07-27'
   OR DATE(time_stop) = '2016-07-27') 
  AND NOT EXISTS(SELECT 1 FROM foo s
                 WHERE t.user_id = s.user_id
                 AND (s.some_count > t.some_count
                  OR (s.some_count = t.some_count
                      AND s.time_stop > t.time_stop)))

NOT EXISTS() keeps only those records for which no other record exists with a larger count, or with the same count but a newer time_stop.
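
A minimal sketch of this anti-join idea, assuming SQLite (via Python's sqlite3) as a stand-in for MySQL and made-up sample rows. Two things here are my assumptions rather than part of the query above: the tie-breaker uses time_start, as in the question, and the inner query is restricted to the same day so that rows from other dates cannot knock out a candidate. ROWS, NOT_EXISTS_SQL and best_row_per_user are hypothetical names.

```python
import sqlite3

# Hypothetical sample rows: (user_id, time_start, some_count).
ROWS = [
    (1, "2016-07-27 08:00:00", 5),
    (1, "2016-07-27 09:00:00", 7),
    (1, "2016-07-27 10:00:00", 7),   # wins the tie on some_count = 7
    (2, "2016-07-27 11:00:00", 3),
    (2, "2016-07-26 12:00:00", 9),   # higher count, but on another day
]

# Keep a row only if no "better" row exists for the same user and day.
NOT_EXISTS_SQL = """
SELECT t.user_id, t.time_start, t.some_count
FROM foo AS t
WHERE date(t.time_start) = :day
  AND NOT EXISTS (
      SELECT 1
      FROM foo AS s
      WHERE s.user_id = t.user_id
        AND date(s.time_start) = :day
        AND (s.some_count > t.some_count
             OR (s.some_count = t.some_count
                 AND s.time_start > t.time_start))
  )
ORDER BY t.user_id
"""


def best_row_per_user(day):
    """Return one row per user_id: highest count, latest time_start on ties."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE foo (user_id INTEGER, time_start TEXT, some_count INTEGER)"
    )
    conn.executemany("INSERT INTO foo VALUES (?, ?, ?)", ROWS)
    result = conn.execute(NOT_EXISTS_SQL, {"day": day}).fetchall()
    conn.close()
    return result
```

With the sample data, user 2 keeps the row with count 3 because the count-9 row belongs to a different day.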

#3

You can use your original query as a correlated subquery in the WHERE clause.

SELECT user_id, time_stop, some_count
FROM foo f
WHERE f.id = (
    SELECT f1.id
    FROM foo f1
    WHERE f1.user_id = f.user_id -- correlate
    AND DATE(f1.time_start) = '2016-07-27'
    ORDER BY f1.some_count DESC, f1.time_stop DESC LIMIT 1
)

MySQL should be able to cache the result of the subquery for every distinct user_id.

Another way is to use nested GROUP BY queries:

select f.user_id, f.some_count, max(f.time_stop) as time_stop
from (
    select f.user_id, max(f.some_count) as some_count
    from foo f
    where date(f.time_start) = '2016-07-27'
    group by f.user_id
) sub
join foo f using(user_id, some_count)
where date(f.time_start) = '2016-07-27'
group by f.user_id, f.some_count

#4

SELECT user_id,
       some_count,
       max(time_start) AS time_start
FROM
  (SELECT a.*
   FROM foo AS a
   INNER JOIN
     (SELECT user_id,
             max(some_count) AS some_count
      FROM foo
      WHERE DATE(time_start) = '2016-07-27'
      GROUP BY user_id) AS b ON a.user_id = b.user_id
   AND a.some_count = b.some_count
   WHERE DATE(a.time_start) = '2016-07-27') AS c
GROUP BY user_id,
         some_count;

Explaining from the inside out: the innermost table (b) gives you the max some_count per user. This is not enough, as you want the max for two columns, so I'm joining it with the full table (a) to get the records that have these max values (c), and from that I'm taking the max time_start for each user/some_count combination.
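
The two steps can be traced on a small data set. This is a sketch under assumptions: SQLite (through Python's sqlite3) stands in for MySQL, the rows are invented, and the joined rows are filtered to the same day so that a row from another date with the same count cannot supply the time_start. INNER_SQL, FULL_SQL and run_steps are hypothetical names.

```python
import sqlite3

# Hypothetical sample rows: (user_id, time_start, some_count).
ROWS = [
    (1, "2016-07-27 09:00:00", 7),
    (1, "2016-07-27 10:00:00", 7),
    (1, "2016-07-27 11:00:00", 5),
    (1, "2016-07-28 01:00:00", 7),   # same count, but on another day
    (2, "2016-07-27 11:00:00", 3),
]

# Step 1 (table b): max some_count per user for the day.
INNER_SQL = """
SELECT user_id, MAX(some_count) AS some_count
FROM foo
WHERE date(time_start) = :day
GROUP BY user_id
ORDER BY user_id
"""

# Step 2: join back to the full table and take the latest time_start.
FULL_SQL = """
SELECT a.user_id, a.some_count, MAX(a.time_start) AS time_start
FROM foo AS a
JOIN (
    SELECT user_id, MAX(some_count) AS some_count
    FROM foo
    WHERE date(time_start) = :day
    GROUP BY user_id
) AS b ON a.user_id = b.user_id AND a.some_count = b.some_count
WHERE date(a.time_start) = :day
GROUP BY a.user_id, a.some_count
ORDER BY a.user_id
"""


def run_steps(day):
    """Return (per-user maxima, final rows) for the given day."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE foo (user_id INTEGER, time_start TEXT, some_count INTEGER)"
    )
    conn.executemany("INSERT INTO foo VALUES (?, ?, ?)", ROWS)
    maxima = conn.execute(INNER_SQL, {"day": day}).fetchall()
    final = conn.execute(FULL_SQL, {"day": day}).fetchall()
    conn.close()
    return maxima, final
```

For user 1 the first step yields a count of 7, and the second step picks 10:00 on 2016-07-27 rather than the later row from 2016-07-28.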

#5

Strategy

In general it's more efficient to find maximum values rather than sorting groups of records. In this case, the ordering is on an integer (some_count) followed by a date/time (time_start) - so to find a single maximum row, we need to combine these in some way.

A simple way of doing this is to combine the two into a string but there is the usual snag of string comparison valuing "4" as higher than "12" for example. This is easily overcome by using LPAD to add leading zeros so 4 becomes "0000000004", which is lower than "0000000012" in a string comparison. Assuming time_start is a DATETIME field, it can simply be appended to this for a secondary ordering since its string conversion results in a sortable format (yyyy-mm-dd hh:MM:ss).
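
The padding trick is easy to check outside the database. Below is a small sketch in plain Python, where rjust plays the role of LPAD; sort_key and the sample values are hypothetical, and it assumes non-negative counts of at most 10 digits.

```python
def sort_key(some_count, time_start):
    """Zero-pad the count to 10 characters and append the timestamp string."""
    return str(some_count).rjust(10, "0") + time_start


# Plain string comparison gets the numbers wrong: "4" sorts after "12".
naive_is_wrong = "4" > "12"

# After padding, the string order matches the numeric order.
padded_is_right = sort_key(4, "") < sort_key(12, "")

# Hypothetical rows: (some_count, time_start). The best row is the one with
# the highest count and, among ties, the latest timestamp.
rows = [
    (4, "2016-07-27 09:00:00"),
    (12, "2016-07-27 08:00:00"),
    (12, "2016-07-27 10:00:00"),
]
best = max(rows, key=lambda r: sort_key(*r))
```

Here best ends up as the count-12 row at 10:00, because the timestamp only decides between rows whose padded counts are equal.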

SQL

Using this strategy, we can restrict via a simple subselect:

SELECT time_start, some_count
FROM foo f1
WHERE DATE(time_start) = '2016-07-27'
  AND CONCAT(LPAD(some_count, 10, '0'), time_start) = 
      (SELECT MAX(CONCAT(LPAD(some_count, 10, '0'), time_start))
       FROM foo f2
       WHERE DATE(f2.time_start) = '2016-07-27'
         AND f2.user_id = f1.user_id);

Demo

Rextester demo here: http://rextester.com/HCGY1362

#6

I believe you don't need to do anything fancy for this query. Just sort the table by user_id in ascending order and by some_count and time_start in descending order, then select the expected fields from the ordered table with GROUP BY user_id. It's simple. Try it and let me know if it works.

SELECT user_id, some_count, time_start
FROM (SELECT * FROM foo ORDER BY user_id ASC, some_count DESC, time_start DESC)sorted_foo
WHERE DATE( time_start ) = '2016-07-27'
GROUP BY user_id

#7

SELECT  c1.user_id, c1.some_count, MAX(c1.time_start) AS time_start
    FROM  foo AS c1
    JOIN
      ( SELECT  user_id, MAX(some_count) AS some_count
            FROM  foo
            WHERE time_start >= '2016-07-27'
              AND time_start  < '2016-07-27' + INTERVAL 1 DAY
            GROUP BY  user_id
      ) AS c2 USING (user_id, some_count)
    WHERE c1.time_start >= '2016-07-27'
      AND c1.time_start  < '2016-07-27' + INTERVAL 1 DAY
    GROUP BY c1.user_id, c1.some_count

Also, add these indexes for better performance:

INDEX(user_id, some_count, time_start)
INDEX(time_start)

The test for the time_start range was changed so that the second index could be used.
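
To illustrate the point about the range test, here is a rough sketch that asks a query planner which access path it picks for each form of the condition. It uses SQLite through Python's sqlite3 as a stand-in, so the plan text is SQLite's and not what MySQL's EXPLAIN prints; the principle that a function wrapped around the column blocks an ordinary index lookup on that column is, as far as I know, the same. idx_time_start and query_plans are hypothetical names.

```python
import sqlite3


def query_plans():
    """Return SQLite's plan text for the wrapped and the range condition."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE foo (user_id INTEGER, time_start TEXT, some_count INTEGER)"
    )
    conn.execute("CREATE INDEX idx_time_start ON foo (time_start)")

    # Condition with the column wrapped in a function.
    wrapped = "SELECT user_id FROM foo WHERE date(time_start) = '2016-07-27'"
    # Equivalent half-open range on the bare column.
    ranged = (
        "SELECT user_id FROM foo "
        "WHERE time_start >= '2016-07-27' AND time_start < '2016-07-28'"
    )

    plans = {}
    for name, sql in (("wrapped", wrapped), ("ranged", ranged)):
        rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
        # The human-readable detail is the fourth column of each plan row.
        plans[name] = " | ".join(row[3] for row in rows)
    conn.close()
    return plans
```

Printing query_plans() should show the index name only in the plan for the range form, while the wrapped form falls back to scanning the table.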

This was loosely derived from my blog on groupwise max.

#8

Your problem could be solved with something called window functions, but MySQL versions before 8.0 have no support for this kind of feature.

I have two solutions for you. One simulates a window function, and the other is the common way of writing queries to address these situations in MySQL.

This is the first one, which I gave in answer to this question:

-- simulates the window function
-- first_value(<col>) over(partition by user_id order by some_count DESC, time_start DESC)
SELECT
  user_id,
  substring_index(group_concat(time_start ORDER BY some_count DESC, time_start DESC), ',', 1) time_start,
  substring_index(group_concat(some_count ORDER BY some_count DESC, time_start DESC), ',', 1) some_count
FROM foo
WHERE DATE(time_start) = '2016-07-27'
GROUP BY user_id
;

Basically, you group your data by user_id, concatenate all values of a given column for each group using the , separator, ordered by the columns you want, and then take the substring containing only the first ordered value. This is not an optimal approach...

And this is the second one, which I gave in answer to this question:

SELECT 
  user_id,
  some_count,
  MAX(time_start) time_start
FROM foo outq
WHERE 1=1
  AND DATE(time_start) = '2016-07-27'
  AND NOT EXISTS
  (
    SELECT 1
    FROM foo 
    WHERE 1=1
      AND user_id    = outq.user_id
      AND some_count > outq.some_count
      AND DATE(time_start) = DATE(outq.time_start)
  )
GROUP BY
  user_id,
  some_count
;

Basically, for each user_id the subquery checks whether any some_count on that date is higher than the one currently being examined, and the main query requires that no such row exists (NOT EXISTS). You are left with the highest some_count per user_id for the date, but the same highest value may occur at several different time_start values on that date. Now things are simple: you can safely GROUP BY user and count, because they are already the data you want, and take the maximum time_start from each group.

This kind of subquery is the common way of solving problems like this in MySQL. I recommend you try both solutions, but choose the second one and remember the subquery syntax for future problems.

Also, in MySQL 5.7 and earlier, an implicit ORDER BY <columns> is applied to queries that have a GROUP BY <columns>. If you don't care about the result order, you can save some processing by declaring ORDER BY NULL, which disables the implicit ordering in your query.
