为表中的每个组选择前N行

时间:2021-02-28 12:55:11

I am facing a very common issue regarding "Selecting top N rows for each group in a table".

我正面临一个非常常见的问题,即“为表中的每个组选择前N行”。

Consider a table with id, name, hair_colour, score columns.

考虑一个包含id,name,hair_colour,score列的表。

I want a resultset such that, for each hair colour, get me top 3 scorer names.

我想要一个结果集,对于每种头发颜色,让我获得前三名得分手的名字。

To solve this i got exactly what i need on Rick Osborne's blogpost "sql-getting-top-n-rows-for-a-grouped-query"

为了解决这个问题,我在Rick Osborne的博文“sql-getting-top-n-rows-for-a -class-query”中得到了我所需要的东西。

That solution doesn't work as expected when my scores are equal.

当我的分数相等时,该解决方案无法按预期工作。

In above example the result as follow.

在上面的例子中,结果如下。

 id  name  hair  score  ranknum
---------------------------------
 12  Kit    Blonde  10  1
  9  Becca  Blonde  9  2
  8  Katie  Blonde  8  3
  3  Sarah  Brunette 10  1    
  4  Deborah Brunette 9  2 - ------- - - > if
  1  Kim  Brunette 8  3

Consider the row 4 Deborah Brunette 9 2. If this also has same score (10) same as Sarah, then ranknum will be 2,2,3 for "Brunette" type of hair.

考虑第4行Deborah Brunette 9 2.如果这也与Sarah相同(10),那么“褐发女郎”型头发的排名将为2,2,3。

What's the solution to this?

这是什么解决方案?

3 个解决方案

#1


17  

If you're using SQL Server 2005 or newer, you can use the ranking functions and a CTE to achieve this:

如果您使用的是SQL Server 2005或更高版本,则可以使用排名函数和CTE来实现此目的:

;WITH HairColors AS
(SELECT id, name, hair, score, 
        ROW_NUMBER() OVER(PARTITION BY hair ORDER BY score DESC) as 'RowNum'
)
SELECT id, name, hair, score
FROM HairColors
WHERE RowNum <= 3

This CTE will "partition" your data by the value of the hair column, and each partition is then order by score (descending) and gets a row number; the highest score for each partition is 1, then 2 etc.

此CTE将按照发色列的值“分区”您的数据,然后每个分区按分数(降序)排序并获取行号;每个分区的最高分为1,然后是2等。

So if you want to the TOP 3 of each group, select only those rows from the CTE that have a RowNum of 3 or less (1, 2, 3) --> there you go!

因此,如果你想要每组的TOP 3,只选择CTE中RowNum为3或更少(1,2,3)的那些行 - >你去!

#2


0  

The way the algorithm comes up with the rank, is to count the number of rows in the cross-product with a score equal to or greater than the girl in question, in order to generate rank. Hence in the problem case you're talking about, Sarah's grid would look like

算法提出排名的方法是计算交叉积中的行数,其得分等于或大于所讨论的女孩,以便生成排名。因此,在您谈论的问题案例中,Sarah的网格看起来像

a.name | a.score | b.name  | b.score
-------+---------+---------+--------
Sarah  | 9       | Sarah   | 9
Sarah  | 9       | Deborah | 9

and similarly for Deborah, which is why both girls get a rank of 2 here.

同样对于Deborah来说,这就是为什么这两个女孩在这里获得2级的原因。

The problem is that when there's a tie, all girls take the lowest value in the tied range due to this count, when you'd want them to take the highest value instead. I think a simple change can fix this:

问题在于,当有一个平局时,所有女孩都会因为这个计数而在平局范围内取得最低值,而你希望他们取得最高值。我认为一个简单的改变可以解决这个问题

Instead of a greater-than-or-equal comparison, use a strict greater-than comparison to count the number of girls who are strictly better. Then, add one to that and you have your rank (which will deal with ties as appropriate). So the inner select would be:

而不是大于或等于比较,使用严格的大于比较来计算严格更好的女孩的数量。然后,添加一个,你有你的排名(将适当处理关系)。所以内部选择将是:

SELECT a.id, COUNT(*) + 1 AS ranknum
FROM girl AS a
  INNER JOIN girl AS b ON (a.hair = b.hair) AND (a.score < b.score)
GROUP BY a.id
HAVING COUNT(*) <= 3

Can anyone see any problems with this approach that have escaped my notice?

任何人都可以看到这种方法的任何问题都没有引起我的注意吗?

#3


0  

Use this compound select which handles OP problem properly

使用此复合选择正确处理OP问题

SELECT g.* FROM girls as g
WHERE g.score > IFNULL( (SELECT g2.score FROM girls as g2
                WHERE g.hair=g2.hair ORDER BY g2.score DESC LIMIT 3,1), 0)

Note that you need to use IFNULL here to handle case when table girls has less rows for some type of hair then we want to see in sql answer (in OP case it is 3 items).

请注意,你需要在这里使用IFNULL处理表格女孩的某些类型的头发行数较少的情况,然后我们想在sql答案中看到(在OP情况下它是3个项目)。

#1


17  

If you're using SQL Server 2005 or newer, you can use the ranking functions and a CTE to achieve this:

如果您使用的是SQL Server 2005或更高版本,则可以使用排名函数和CTE来实现此目的:

;WITH HairColors AS
(SELECT id, name, hair, score, 
        ROW_NUMBER() OVER(PARTITION BY hair ORDER BY score DESC) as 'RowNum'
)
SELECT id, name, hair, score
FROM HairColors
WHERE RowNum <= 3

This CTE will "partition" your data by the value of the hair column, and each partition is then order by score (descending) and gets a row number; the highest score for each partition is 1, then 2 etc.

此CTE将按照发色列的值“分区”您的数据,然后每个分区按分数(降序)排序并获取行号;每个分区的最高分为1,然后是2等。

So if you want to the TOP 3 of each group, select only those rows from the CTE that have a RowNum of 3 or less (1, 2, 3) --> there you go!

因此,如果你想要每组的TOP 3,只选择CTE中RowNum为3或更少(1,2,3)的那些行 - >你去!

#2


0  

The way the algorithm comes up with the rank, is to count the number of rows in the cross-product with a score equal to or greater than the girl in question, in order to generate rank. Hence in the problem case you're talking about, Sarah's grid would look like

算法提出排名的方法是计算交叉积中的行数,其得分等于或大于所讨论的女孩,以便生成排名。因此,在您谈论的问题案例中,Sarah的网格看起来像

a.name | a.score | b.name  | b.score
-------+---------+---------+--------
Sarah  | 9       | Sarah   | 9
Sarah  | 9       | Deborah | 9

and similarly for Deborah, which is why both girls get a rank of 2 here.

同样对于Deborah来说,这就是为什么这两个女孩在这里获得2级的原因。

The problem is that when there's a tie, all girls take the lowest value in the tied range due to this count, when you'd want them to take the highest value instead. I think a simple change can fix this:

问题在于,当有一个平局时,所有女孩都会因为这个计数而在平局范围内取得最低值,而你希望他们取得最高值。我认为一个简单的改变可以解决这个问题

Instead of a greater-than-or-equal comparison, use a strict greater-than comparison to count the number of girls who are strictly better. Then, add one to that and you have your rank (which will deal with ties as appropriate). So the inner select would be:

而不是大于或等于比较,使用严格的大于比较来计算严格更好的女孩的数量。然后,添加一个,你有你的排名(将适当处理关系)。所以内部选择将是:

SELECT a.id, COUNT(*) + 1 AS ranknum
FROM girl AS a
  INNER JOIN girl AS b ON (a.hair = b.hair) AND (a.score < b.score)
GROUP BY a.id
HAVING COUNT(*) <= 3

Can anyone see any problems with this approach that have escaped my notice?

任何人都可以看到这种方法的任何问题都没有引起我的注意吗?

#3


0  

Use this compound select which handles OP problem properly

使用此复合选择正确处理OP问题

SELECT g.* FROM girls as g
WHERE g.score > IFNULL( (SELECT g2.score FROM girls as g2
                WHERE g.hair=g2.hair ORDER BY g2.score DESC LIMIT 3,1), 0)

Note that you need to use IFNULL here to handle case when table girls has less rows for some type of hair then we want to see in sql answer (in OP case it is 3 items).

请注意,你需要在这里使用IFNULL处理表格女孩的某些类型的头发行数较少的情况,然后我们想在sql答案中看到(在OP情况下它是3个项目)。