
Time: 2021-02-28 12:55:11

I am facing a very common issue regarding "Selecting top N rows for each group in a table".


Consider a table with id, name, hair_colour, score columns.


I want a result set that gives, for each hair colour, the names of the top 3 scorers.


To solve this, I found exactly what I needed in Rick Osborne's blog post "sql-getting-top-n-rows-for-a-grouped-query".


That solution doesn't work as expected when my scores are equal.


In the example above, the result is as follows.


 id  name     hair      score  ranknum
 12  Kit      Blonde    10     1
  9  Becca    Blonde     9     2
  8  Katie    Blonde     8     3
  3  Sarah    Brunette  10     1
  4  Deborah  Brunette   9     2   <-- row in question
  1  Kim      Brunette   8     3

Consider the row "4 Deborah Brunette 9 2". If Deborah also had the same score as Sarah (10), the ranknum values for the "Brunette" hair group would be 2, 2, 3.


What's the solution to this?


3 solutions



If you're using SQL Server 2005 or newer, you can use the ranking functions and a CTE to achieve this:


;WITH HairColors AS
(SELECT id, name, hair, score,
        ROW_NUMBER() OVER(PARTITION BY hair ORDER BY score DESC) AS RowNum
 FROM girls  -- table name assumed; the question does not name the table
)
SELECT id, name, hair, score
FROM HairColors
WHERE RowNum <= 3

This CTE "partitions" your data by the value of the hair column; each partition is then ordered by score (descending) and its rows are numbered, so the highest score in each partition gets row number 1, the next gets 2, and so on.


So if you want the TOP 3 of each group, select only those rows from the CTE that have a RowNum of 3 or less (1, 2, 3), and there you go!
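Since the question is specifically about ties, it may help to see how the ranking functions differ on tied scores. The following is a minimal sketch, not the answerer's code: it uses Python's built-in sqlite3 module (assuming the bundled SQLite is 3.25 or newer, which is when window functions were added) instead of SQL Server, the table name girls is an assumption, and the data is the question's table with Deborah's score raised to 10. The id tie-breaker in ROW_NUMBER is also an addition, there only to make the output deterministic.

```python
import sqlite3

# Hypothetical sample data: the question's table, with Deborah's score
# raised to 10 so that she ties with Sarah.
ROWS = [
    (12, "Kit", "Blonde", 10),
    (9, "Becca", "Blonde", 9),
    (8, "Katie", "Blonde", 8),
    (3, "Sarah", "Brunette", 10),
    (4, "Deborah", "Brunette", 10),
    (1, "Kim", "Brunette", 8),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE girls (id INTEGER PRIMARY KEY, name TEXT, hair TEXT, score INTEGER)"
)
conn.executemany("INSERT INTO girls VALUES (?, ?, ?, ?)", ROWS)

# Window functions require SQLite 3.25+ (assumption about the local build).
QUERY = """
SELECT name, hair, score,
       ROW_NUMBER() OVER (PARTITION BY hair ORDER BY score DESC, id) AS row_num,
       RANK()       OVER (PARTITION BY hair ORDER BY score DESC)     AS rnk,
       DENSE_RANK() OVER (PARTITION BY hair ORDER BY score DESC)     AS dense_rnk
FROM girls
ORDER BY hair, score DESC, id
"""

ranked = conn.execute(QUERY).fetchall()
for row in ranked:
    print(row)
```

ROW_NUMBER always numbers tied rows differently, so filtering on it returns exactly three rows per group but picks among ties arbitrarily unless a tie-breaker is given. RANK gives tied rows the same number and then skips, and DENSE_RANK gives them the same number without skipping, so filtering on either can return more than three rows per group.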




The algorithm derives the rank by counting the rows in the cross-product whose score is equal to or greater than that of the girl in question. Hence, in the problem case you're talking about, Sarah's grid would look like


a.name | a.score | b.name  | b.score
Sarah  | 9       | Sarah   | 9
Sarah  | 9       | Deborah | 9

and similarly for Deborah, which is why both girls get a rank of 2 here.
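The counting behaviour described above can be reproduced with a small sketch. This is a reconstruction under assumptions, not the blog post's actual query: it uses Python's sqlite3 module, the table name girl, and only the Brunette rows, with Sarah and Deborah both at 9 to match the grid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE girl (id INTEGER PRIMARY KEY, name TEXT, hair TEXT, score INTEGER)"
)
# Hypothetical rows matching the grid above: Sarah and Deborah tied at 9.
conn.executemany(
    "INSERT INTO girl VALUES (?, ?, ?, ?)",
    [
        (3, "Sarah", "Brunette", 9),
        (4, "Deborah", "Brunette", 9),
        (1, "Kim", "Brunette", 8),
    ],
)

# Rank = number of rows in the same hair group whose score is >= this row's
# score. Each row matches itself, and tied rows also match each other.
tied_ranks = conn.execute(
    """
    SELECT a.name, COUNT(*) AS ranknum
    FROM girl AS a
      INNER JOIN girl AS b ON a.hair = b.hair AND a.score <= b.score
    GROUP BY a.id, a.name
    ORDER BY ranknum, a.id
    """
).fetchall()

print(tied_ranks)
```

Sarah and Deborah each match two rows (themselves and each other), so both are counted as rank 2 and nobody in the group gets rank 1.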


The problem is that when there's a tie, all tied girls take the worst (numerically largest) rank in the tied range because of this count, when you'd want them to take the best (numerically smallest) rank instead. I think a simple change can fix this:


Instead of a greater-than-or-equal comparison, use a strict greater-than comparison to count the number of girls who are strictly better. Then, add one to that and you have your rank (which will deal with ties as appropriate). So the inner select would be:


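SELECT a.id, COUNT(b.id) + 1 AS ranknum  -- counts only strictly better rows
FROM girl AS a
  LEFT JOIN girl AS b ON (a.hair = b.hair) AND (a.score < b.score)  -- LEFT JOIN keeps top scorers, who have no better row
GROUP BY a.id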

Can anyone see any problems with this approach that have escaped my notice?
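One way to check the idea is to run it on the tied data. The sketch below is an illustration under assumptions (Python's sqlite3 module, table name girl, the question's rows with Deborah raised to 10). It uses a LEFT JOIN with COUNT(b.id) rather than an inner join with COUNT(*), because with a strict comparison the top scorer of each group has no matching row and an inner join would drop her from the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE girl (id INTEGER PRIMARY KEY, name TEXT, hair TEXT, score INTEGER)"
)
# Hypothetical data: the question's table with Deborah tied with Sarah at 10.
conn.executemany(
    "INSERT INTO girl VALUES (?, ?, ?, ?)",
    [
        (12, "Kit", "Blonde", 10),
        (9, "Becca", "Blonde", 9),
        (8, "Katie", "Blonde", 8),
        (3, "Sarah", "Brunette", 10),
        (4, "Deborah", "Brunette", 10),
        (1, "Kim", "Brunette", 8),
    ],
)

# Rank = 1 + number of strictly better rows in the same hair group.
# COUNT(b.id) ignores the NULLs produced by the LEFT JOIN for top scorers.
top3 = conn.execute(
    """
    SELECT a.hair, a.name, COUNT(b.id) + 1 AS ranknum
    FROM girl AS a
      LEFT JOIN girl AS b ON a.hair = b.hair AND a.score < b.score
    GROUP BY a.id, a.hair, a.name
    HAVING COUNT(b.id) + 1 <= 3
    ORDER BY a.hair, ranknum, a.id
    """
).fetchall()

for row in top3:
    print(row)
```

With this data the Brunette ranks come out as 1, 1, 3. Because tied rows share a rank, a filter of ranknum <= 3 can return more than three rows per group when the tie straddles the cut-off.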




Use this compound select, which handles the OP's problem properly:


SELECT g.* FROM girls as g
WHERE g.score > IFNULL( (SELECT g2.score FROM girls as g2
                WHERE g.hair=g2.hair ORDER BY g2.score DESC LIMIT 3,1), 0)

Note that you need IFNULL here to handle the case where the girls table has fewer rows for some hair type than the number we want in the result (3 in the OP's case).
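To see what the subquery does, here is a small sketch under assumptions: it runs the same query through Python's sqlite3 module (SQLite also accepts the "LIMIT offset, count" form and IFNULL), and it adds a hypothetical fourth Brunette, Zoe, so that one group has a fourth-highest score while the Blonde group does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE girls (id INTEGER PRIMARY KEY, name TEXT, hair TEXT, score INTEGER)"
)
# Hypothetical data: the question's rows with Deborah tied at 10, plus an
# invented fourth Brunette (Zoe) so the LIMIT 3,1 subquery finds a row.
conn.executemany(
    "INSERT INTO girls VALUES (?, ?, ?, ?)",
    [
        (12, "Kit", "Blonde", 10),
        (9, "Becca", "Blonde", 9),
        (8, "Katie", "Blonde", 8),
        (3, "Sarah", "Brunette", 10),
        (4, "Deborah", "Brunette", 10),
        (1, "Kim", "Brunette", 8),
        (2, "Zoe", "Brunette", 7),
    ],
)

# LIMIT 3,1 skips three rows and returns one: the 4th-highest score in the
# group. Rows scoring strictly above it are kept. When a group has fewer
# than four rows the subquery yields NULL and IFNULL falls back to 0.
names = [
    row[0]
    for row in conn.execute(
        """
        SELECT g.name FROM girls AS g
        WHERE g.score > IFNULL((SELECT g2.score FROM girls AS g2
                                WHERE g.hair = g2.hair
                                ORDER BY g2.score DESC LIMIT 3, 1), 0)
        ORDER BY g.hair, g.score DESC, g.id
        """
    )
]

print(names)
```

Two caveats follow from how the comparison works: the fallback of 0 assumes scores are positive, and if the third- and fourth-highest scores in a group are equal, the strict comparison drops both, so that group returns fewer than three rows.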




