pandas: dropping duplicates within a groupby on 'date'

Time: 2021-10-29 04:21:50

I have the following dataframe:

import pandas as pd

url='https://raw.githubusercontent.com/108michael/ms_thesis/master/crsp.dime.mpl.df'

df=pd.read_csv(url)

df.groupby('date').cid.size()

date
2005       7
2006     237
2007    3610
2008    1318
2009    2664
2010     997
2011    6390
2012    2904
2013    7875
2014    3979

df.groupby('date').cid.nunique()

date
2005      3
2006     10
2007    227
2008     52
2009    142
2010     57
2011    219
2012     99
2013    238
2014    146
Name: cid, dtype: int64

I would like to eliminate the duplicate cid values within each date, so that the output of df.groupby('date').cid.size() matches the output of df.groupby('date').cid.nunique(). I have looked at this post, but it does not seem to offer a solid solution to the problem.

I have tried the following:

df.groupby([df['date']]).drop_duplicates(cols='cid')

But I get this error:

AttributeError: Cannot access callable attribute 'drop_duplicates' of 'DataFrameGroupBy' objects, try using the 'apply' method

I also tried this:

df.groupby(('date').drop_duplicates('cid'))

But I get this error:

AttributeError: 'str' object has no attribute 'drop_duplicates'

Does anyone have an idea how to do this?

1 Solution

#1


14  

You don't need groupby to drop duplicates based on a few columns; you can pass those columns to drop_duplicates as a subset instead:

df2 = df.drop_duplicates(["date", "cid"])
df2.groupby('date').cid.size()
Out[99]: 
date
2005      3
2006     10
2007    227
2008     52
2009    142
2010     57
2011    219
2012     99
2013    238
2014    146
dtype: int64
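
To check the idea without downloading the CSV, here is a minimal sketch on a small made-up frame. The column values and the extra `amount` column are assumptions for illustration only, not taken from the real data. It drops rows that repeat a (date, cid) pair and then compares the per-date row counts with the per-date unique counts of the original frame.

```python
import pandas as pd

# Hypothetical toy data standing in for the real CSV (values are made up).
df = pd.DataFrame({
    "date":   [2005, 2005, 2005, 2006, 2006, 2006, 2006],
    "cid":    ["a", "a", "b", "c", "c", "c", "d"],
    "amount": [1, 2, 3, 4, 5, 6, 7],
})

# Keep only the first row for each (date, cid) pair.
df2 = df.drop_duplicates(subset=["date", "cid"])

# Rows per date after de-duplication vs. unique cids per date before it.
sizes = df2.groupby("date")["cid"].size()
uniques = df.groupby("date")["cid"].nunique()

print(sizes)
print((sizes == uniques).all())
```

By default drop_duplicates keeps the first occurrence of each pair; passing keep="last" keeps the last one instead, which matters if the other columns differ between the duplicated rows.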
