I have two pandas dataframes that look kind of like the following:
df1:
RecorderID  GroupID  Location      ...  SomeColumn
CT-1000001  BV-      Cape Town     ...  SomeValue
CT-1000002  MP-      Johannesburg  ...  SomeValue
CT-1000003  BV-      Durban        ...  SomeValue
df2:
RecorderID  GroupID  Location  ...  SomeColumn
CT-1000001  BV-      Durban    ...  SomeValue
CT-1000003  BV-      Durban    ...  SomeValue
In reality these two dataframes are large, with many columns and many rows. I want to compare them and end up with a single dataframe that satisfies the following (RecorderID is my primary key):
- All rows whose values differ between the two dataframes must adopt df1's values and be kept.
- All rows present in df1 but not present in df2 must be inserted.
- All rows that are present and identical in both dataframes must be removed.
So, taking the above example, I would end up with the following dataframe:
RecorderID  GroupID  Location      ...  SomeColumn
CT-1000001  BV-      Cape Town     ...  SomeValue
CT-1000002  MP-      Johannesburg  ...  SomeValue
PS: I've noticed when writing out a dataframe to Excel, it inserts an index column as the first column. How do I specify that RecorderID is my primary key and that it should use that to index values? I've tried:
df = pd.read_excel('file.xlsx', 'sheet1', index_col='RecorderID')
but that just removes the RecorderID column and adds a numbered index column anyway when I write it out to Excel.
Thanks!
1 Answer
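If you're running a recent version of pandas (the indicator argument was added in 0.17.0), you can merge the two dataframes with how='left' and additionally set indicator=True. Because no key is given, merge joins on every column the two dataframes share, so a row of df1 only matches when all of its values are equal in df2. indicator=True adds a column _merge that tells you whether each row is left_only or present in both; we can then filter out the rows marked both: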
In [91]:
merged = pd.merge(df1, df2, indicator=True, how='left')
merged
Out[91]:
   RecorderID GroupID      Location SomeColumn     _merge
0  CT-1000001     BV-     Cape Town  SomeValue  left_only
1  CT-1000002     MP-  Johannesburg  SomeValue  left_only
2  CT-1000003     BV-        Durban  SomeValue       both
In [92]:
merged[merged['_merge'] == 'left_only']
Out[92]:
   RecorderID GroupID      Location SomeColumn     _merge
0  CT-1000001     BV-     Cape Town  SomeValue  left_only
1  CT-1000002     MP-  Johannesburg  SomeValue  left_only
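For completeness, here is a self-contained sketch of the same approach that can be pasted into a script. The sample data is retyped from the question with the elided ... columns left out, and result and save_result are names I made up for illustration. It also drops the helper _merge column. Regarding the PS, the helper passes index=False to to_excel so that no separate index column is written; I have not tried it against your actual file, so treat that part as an assumption about what you want in the sheet.

```python
import pandas as pd

# Sample frames mirroring the question's example (elided "..." columns omitted).
df1 = pd.DataFrame({
    "RecorderID": ["CT-1000001", "CT-1000002", "CT-1000003"],
    "GroupID": ["BV-", "MP-", "BV-"],
    "Location": ["Cape Town", "Johannesburg", "Durban"],
    "SomeColumn": ["SomeValue", "SomeValue", "SomeValue"],
})
df2 = pd.DataFrame({
    "RecorderID": ["CT-1000001", "CT-1000003"],
    "GroupID": ["BV-", "BV-"],
    "Location": ["Durban", "Durban"],
    "SomeColumn": ["SomeValue", "SomeValue"],
})

# Left merge on all shared columns; _merge records where each row was found.
merged = pd.merge(df1, df2, how="left", indicator=True)

# Keep rows that have no identical counterpart in df2, then drop the marker.
result = (
    merged[merged["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)


def save_result(frame, path):
    """Hypothetical helper: write the frame to Excel without an index column."""
    frame.to_excel(path, index=False)
```

If you would rather keep RecorderID as the index, frame.set_index('RecorderID').to_excel(path) should write it as the first column in place of the numbered index.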