Basically the same as "Select first row in each GROUP BY group?", only in pandas.
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
                   'B': ['3', '1', '2', '4', '2', '4', '1', '3']})
Sorting looks promising:
df.sort_values('B')
A B
1 foo 1
6 bar 1
2 foo 2
4 bar 2
0 foo 3
7 bar 3
3 foo 4
5 bar 4
But then first won't give the desired result:
df.groupby('A').first()
B
A
bar 2
foo 3
3 Solutions
#1
Generally, if you want your data sorted within a groupby but the sort column is not one of the columns being grouped on, it's better to sort the df prior to performing the groupby:
In [5]:
df.sort_values('B').groupby('A').first()
Out[5]:
B
A
bar 1
foo 1
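A self-contained version of the sort-then-group approach, as a sketch only. It assumes a recent pandas and stores B as integers rather than the strings used in the question, so that the ordering is numeric; the name first_rows is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': [3, 1, 2, 4, 2, 4, 1, 3],  # assumption: integers, not strings
})

# Sort by B first, then take the first row of each group in A.
# groupby keeps the row order inside each group, so "first" is the smallest B.
first_rows = df.sort_values('B').groupby('A').first()
print(first_rows)
```

The result is indexed by A, so the original row labels are not kept.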
#2
The pandas groupby function could be used for what you want, but it's really meant for aggregation. This is a simple 'take the first' operation.
What you actually want is the pandas drop_duplicates function, which by default keeps the first row of each set of duplicates. Whatever you would normally use as the groupby key, you should pass as the subset= argument:
df.drop_duplicates(subset='A')
That should do what you want.
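Note that drop_duplicates keeps the first row in the frame's current order, so getting the smallest B per group still requires sorting first. A minimal sketch, assuming a recent pandas and integer B values instead of the question's strings; first_seen and smallest are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': [3, 1, 2, 4, 2, 4, 1, 3],  # assumption: integers, not strings
})

# Without sorting: the first occurrence of each A in the original row order.
first_seen = df.drop_duplicates(subset='A')

# Sorted by B first: the kept row is the one with the smallest B in each group.
smallest = df.sort_values('B').drop_duplicates(subset='A')

print(first_seen)
print(smallest)
```

Unlike the groupby version, this keeps the original index labels and every column of the frame.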
Also, df.sort_values('A') (df.sort('A') in old pandas versions) does not sort the DataFrame df itself; it returns a sorted copy. If you want to sort df in place, you have to add the inplace=True parameter:
df.sort_values('A', inplace=True)
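The difference between the sorted copy and the in-place sort can be checked directly. A small sketch, assuming a recent pandas where the method is called sort_values; the four-row frame and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical small frame, only for demonstrating copy vs. in-place sorting.
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'], 'B': [3, 1, 2, 4]})

# Returns a new, sorted DataFrame; df itself keeps its original row order.
sorted_copy = df.sort_values('B')
order_before = list(df.index)

# Sorts df itself and returns None.
result = df.sort_values('B', inplace=True)
order_after = list(df.index)

print(order_before, order_after, result)
```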
#3
Here's an alternative approach using groupby().rank():
df[ df.groupby('A')['B'].rank() == 1 ]
A B
1 foo 1
6 bar 1
This gives you the same answer as @EdChum's for the OP's sample dataframe, but could give a different answer if you have any ties during the sort, for example, with data like this:
df = pd.DataFrame({'A': ['foo', 'foo', 'bar', 'bar'],
'B': ['2', '1', '1', '1'] })
In this case you have some options using the optional method argument, depending on how you wish to handle sorting ties:
df[ df.groupby('A')['B'].rank(method='average') == 1 ] # the default
df[ df.groupby('A')['B'].rank(method='min') == 1 ]
df[ df.groupby('A')['B'].rank(method='first') == 1 ] # doesn't work, not sure why
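To see how the tie-handling options differ, here is a sketch on the tie example, assuming a recent pandas and integer B values rather than the strings above (so it says nothing about why method='first' failed on the original string data). The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'foo', 'bar', 'bar'],
    'B': [2, 1, 1, 1],  # assumption: integers, not strings
})

ranks = df.groupby('A')['B']

# 'average' (the default): the two tied bar rows both get rank 1.5,
# so no bar row has rank 1 and the bar group disappears from the result.
by_average = df[ranks.rank(method='average') == 1]

# 'min': tied rows share the lowest rank, so both bar rows are kept.
by_min = df[ranks.rank(method='min') == 1]

# 'first': ties are broken by order of appearance, so one row per group.
by_first = df[ranks.rank(method='first') == 1]

print(by_average, by_min, by_first, sep='\n\n')
```

With numeric data, method='first' is the only one of the three that returns exactly one row per group when there are ties.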