Pandas: get the topmost n records within each group

Suppose I have a pandas DataFrame like this:

>>> import pandas as pd
>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1

I want to get a new DataFrame with the top 2 records for each id, like this:

   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

I can do it by numbering the records within each group after the groupby:

>>> dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
>>> dfN
   id  level_1  index  value
0   1        0      0      1
1   1        1      1      2
2   1        2      2      3
3   2        0      3      1
4   2        1      4      2
5   2        2      5      3
6   2        3      6      4
7   3        0      7      1
8   4        0      8      1
>>> dfN[dfN['level_1'] <= 1][['id', 'value']]
   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

But is there a more efficient/elegant way to do this? Also, is there a more elegant way to number the records within each group (like the SQL window function row_number())?

Thanks in advance.

2 solutions

#1

Did you try df.groupby('id').head(2)?

Output generated:

>>> df.groupby('id').head(2)
       id  value
id             
1  0   1      1
   1   1      2 
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1

(Keep in mind that you might need to sort the DataFrame first, depending on your data.)

EDIT: As mentioned by the questioner, use df.groupby('id').head(2).reset_index(drop=True) to remove the MultiIndex and flatten the results.

>>> df.groupby('id').head(2).reset_index(drop=True)
   id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
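
The parenthetical about sorting and the question's row_number() request can be sketched together as follows. This is only an illustration on the question's sample data: the variable names (ordered, top2, numbered, top2_via_rank) and the row_number column are made up here, and sorting by ['id', 'value'] is an assumption about what "top" should mean.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Sort first so that "the first two rows of each group" is well defined.
ordered = df.sort_values(['id', 'value'])

# Take the first two rows of every id and renumber the result from 0.
top2 = ordered.groupby('id').head(2).reset_index(drop=True)

# cumcount() numbers rows within each group starting at 0,
# so adding 1 mimics SQL's row_number() over (partition by id).
numbered = ordered.assign(row_number=ordered.groupby('id').cumcount() + 1)

# Filtering on that number gives the same rows, with the original labels kept.
top2_via_rank = numbered[numbered['row_number'] <= 2]

print(top2)
print(top2_via_rank)
```

Filtering on the cumcount-based number is a bit more verbose than head(2), but it may be handy when the row number itself is needed afterwards.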

#2

Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:

In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]: 
id   
1   2    3
    1    2
2   6    4
    5    3
3   7    1
4   8    1
dtype: int64

There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.

If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
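
For illustration, dropping that inner level on the sample data from the question might look like the sketch below. The names top2, flat and as_frame are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1]})

# Two largest values per id; the result is indexed by (id, original row label).
top2 = df.groupby('id')['value'].nlargest(2)

# Drop the inner level (the original row labels) and keep only id.
flat = top2.reset_index(level=1, drop=True)

# Optionally move id back into a column to get a plain two-column DataFrame.
as_frame = flat.reset_index()

print(flat)
print(as_frame)
```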

(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
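
In the meantime, one possible workaround for getting whole rows rather than a single column is to reuse the original row labels that the SeriesGroupBy result carries in its inner index level. This is a sketch, not a built-in feature: the label column is hypothetical and added only to show that the other columns come along, and idx and rows are made-up names.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2, 2, 3, 4],
                   'value': [1, 2, 3, 1, 2, 3, 4, 1, 1],
                   # Hypothetical extra column, not part of the original data.
                   'label': list('abcdefghi')})

# The innermost index level of the result holds the original row labels.
idx = df.groupby('id')['value'].nlargest(2).index.get_level_values(-1)

# Use those labels to pull the complete rows out of the original DataFrame.
rows = df.loc[idx]

print(rows)
```

This relies on the original index being unique, since df.loc would otherwise return every row that shares a label.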
