I am new to pandas and need to complete the following task. Is there an efficient way to do it? There are two different dataframes, dfa and dfb:
I used this to merge them together:
df = pd.merge(dfa, dfb, left_on = ['a_retry','a_cca', 'a_rssif', 'a_lqif'], right_on = ['b_retry','b_cca', 'b_rssif', 'b_lqif'])
I got the following df output:
However, this is not what I expected. It is fine that the merged dataframe contains all columns, but the number of rows should not exceed that of the smaller dataframe (dfa), which means row 3 must be dropped. The expected result is:
How can I do that? Thanks.
1 solution
#1
This is expected, because there are duplicate rows across all 4 key columns, and merge returns every matching combination of rows.
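To see the effect in isolation, here is a minimal sketch. The data and the extra columns `a_val` and `b_val` are hypothetical (the original tables are not shown here); only the key column names come from the question. Two rows of dfb share the same key values, so the single matching row in dfa is repeated once for each of them.

```python
import pandas as pd

keys_a = ['a_retry', 'a_cca', 'a_rssif', 'a_lqif']
keys_b = ['b_retry', 'b_cca', 'b_rssif', 'b_lqif']

# Hypothetical data: the last two rows of dfb have identical key values.
dfa = pd.DataFrame({
    'a_retry': [0, 1],
    'a_cca': [1, 1],
    'a_rssif': [-70, -80],
    'a_lqif': [100, 90],
    'a_val': ['x', 'y'],
})
dfb = pd.DataFrame({
    'b_retry': [0, 1, 1],
    'b_cca': [1, 1, 1],
    'b_rssif': [-70, -80, -80],
    'b_lqif': [100, 90, 90],
    'b_val': ['p', 'q', 'r'],
})

df = pd.merge(dfa, dfb, left_on=keys_a, right_on=keys_b)

# Count the rows of dfb whose key combination occurs more than once.
n_dup_b = int(dfb.duplicated(subset=keys_b, keep=False).sum())

print(len(dfa), len(dfb), len(df))  # 2 3 3
print(n_dup_b)                      # 2
```

Checking `duplicated(subset=..., keep=False)` on each dataframe before merging is a quick way to find out which side causes the extra rows.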
So you need to remove the duplicate rows with drop_duplicates:
dfa = dfa.drop_duplicates(subset=['a_retry','a_cca', 'a_rssif', 'a_lqif'])
dfb = dfb.drop_duplicates(subset=['b_retry','b_cca', 'b_rssif', 'b_lqif'])
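As a rough illustration of what this does to the row count, the sketch below uses hypothetical data (the columns `a_val` and `b_val` are invented for the example). By default drop_duplicates keeps the first row of each duplicated key combination, so which row of dfb survives depends on row order. The `validate='one_to_one'` argument is optional; it makes merge raise an error if either side still has duplicate keys.

```python
import pandas as pd

keys_a = ['a_retry', 'a_cca', 'a_rssif', 'a_lqif']
keys_b = ['b_retry', 'b_cca', 'b_rssif', 'b_lqif']

# Hypothetical data: the last two rows of dfb have identical key values.
dfa = pd.DataFrame({
    'a_retry': [0, 1],
    'a_cca': [1, 1],
    'a_rssif': [-70, -80],
    'a_lqif': [100, 90],
    'a_val': ['x', 'y'],
})
dfb = pd.DataFrame({
    'b_retry': [0, 1, 1],
    'b_cca': [1, 1, 1],
    'b_rssif': [-70, -80, -80],
    'b_lqif': [100, 90, 90],
    'b_val': ['p', 'q', 'r'],
})

# Keep only the first row for each key combination (keep='first' is the default).
dfa_unique = dfa.drop_duplicates(subset=keys_a)
dfb_unique = dfb.drop_duplicates(subset=keys_b)

# validate='one_to_one' asserts that the keys are unique on both sides.
df = pd.merge(dfa_unique, dfb_unique, left_on=keys_a, right_on=keys_b,
              validate='one_to_one')

print(len(df))                       # 2
print(sorted(df['b_val'].tolist()))  # ['p', 'q']
```

Note that the row with `b_val == 'r'` is discarded entirely, which is only acceptable if the duplicated rows carry no information you need.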
But if you need to match the duplicate rows with each other, it is possible with a new counter column created by cumcount, which is then used as an extra key in the merge:
dfa['new'] = dfa.groupby(['a_retry','a_cca', 'a_rssif', 'a_lqif']).cumcount()
dfb['new'] = dfb.groupby(['b_retry','b_cca', 'b_rssif', 'b_lqif']).cumcount()
df = (pd.merge(dfa,
dfb,
left_on = ['a_retry','a_cca', 'a_rssif', 'a_lqif', 'new'],
right_on = ['b_retry','b_cca', 'b_rssif','b_lqif', 'new']).drop('new', axis=1))
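The following sketch compares a plain merge with the cumcount approach on hypothetical data (the values and the columns `a_val` and `b_val` are assumptions made for illustration). cumcount numbers the rows inside each key group 0, 1, 2, ..., so the first duplicate in dfa is paired only with the first duplicate in dfb, the second with the second, and so on. Rows without a partner at the same position are dropped by the default inner join.

```python
import pandas as pd

keys_a = ['a_retry', 'a_cca', 'a_rssif', 'a_lqif']
keys_b = ['b_retry', 'b_cca', 'b_rssif', 'b_lqif']

# Hypothetical data: one key combination appears twice in dfa and three times in dfb.
dfa = pd.DataFrame({
    'a_retry': [0, 1, 1],
    'a_cca': [1, 1, 1],
    'a_rssif': [-70, -80, -80],
    'a_lqif': [100, 90, 90],
    'a_val': ['x', 'y', 'z'],
})
dfb = pd.DataFrame({
    'b_retry': [0, 1, 1, 1],
    'b_cca': [1, 1, 1, 1],
    'b_rssif': [-70, -80, -80, -80],
    'b_lqif': [100, 90, 90, 90],
    'b_val': ['p', 'q', 'r', 's'],
})

# Plain merge: 1 match for the first key, 2 * 3 matches for the duplicated key.
plain = pd.merge(dfa, dfb, left_on=keys_a, right_on=keys_b)

# Number the rows within each key group, then merge on the keys plus that counter.
dfa['new'] = dfa.groupby(keys_a).cumcount()
dfb['new'] = dfb.groupby(keys_b).cumcount()
paired = (pd.merge(dfa, dfb,
                   left_on=keys_a + ['new'],
                   right_on=keys_b + ['new'])
            .drop('new', axis=1))

pairs = sorted(zip(paired['a_val'], paired['b_val']))

print(len(plain), len(paired))  # 7 3
print(pairs)                    # [('x', 'p'), ('y', 'q'), ('z', 'r')]
```

The pairing follows row order within each group, so sort both dataframes first if a particular pairing matters.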