Overlay one dataframe onto another, keeping only the new or changed rows

Time: 2021-08-22 21:32:12

I have two pandas dataframes that look kind of like the following:

df1:

RecorderID    GroupID    Location    ...    SomeColumn
CT-1000001    BV-        Cape Town          SomeValue
CT-1000002    MP-        Johannesburg       SomeValue
CT-1000003    BV-        Durban             SomeValue

df2:

RecorderID    GroupID    Location    ...    SomeColumn
CT-1000001    BV-        Durban      ...    SomeValue
CT-1000003    BV-        Durban      ...    SomeValue

In reality these two dataframes are large, with many columns and many rows. I want to compare them and end up with one dataframe that satisfies the following (RecorderID is my primary key):

  1. All rows whose values differ between the two dataframes must adopt df1's values and be kept.
  2. All rows present in df1 but not present in df2 must be inserted.
  3. All rows that are present and identical in both dataframes must be removed.

So, taking the above example, I would end up with the following dataframe:

RecorderID    GroupID    Location    ...    SomeColumn
CT-1000001    BV-        Cape Town          SomeValue
CT-1000002    MP-        Johannesburg       SomeValue

PS: I've noticed when writing out a dataframe to Excel, it inserts an index column as the first column. How do I specify that RecorderID is my primary key and that it should use that to index values? I've tried:

df = read_excel('file.xlsx', 'sheet1', index_col='RecorderID')

but that just removes the RecorderID column and adds a numbered index column anyway when I write it out to Excel.

Thanks!

1 solution

#1


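If you're running a recent version of pandas, you can merge the two dataframes with how='left' and set indicator=True. This adds a _merge column that tells you whether each row is present in left_only or in both. Because no on argument is given, the merge matches on every column the two dataframes share, so a row is marked both only when all of its values are identical. We can then keep just the left_only rows: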

In [91]:
merged = pd.merge(df1,df2,indicator=True, how='left' )
merged

Out[91]:
   RecorderID GroupID      Location SomeColumn     _merge
0  CT-1000001     BV-     Cape Town  SomeValue  left_only
1  CT-1000002     MP-  Johannesburg  SomeValue  left_only
2  CT-1000003     BV-        Durban  SomeValue       both

In [92]:
merged[merged['_merge'] == 'left_only']

Out[92]:
   RecorderID GroupID      Location SomeColumn     _merge
0  CT-1000001     BV-     Cape Town  SomeValue  left_only
1  CT-1000002     MP-  Johannesburg  SomeValue  left_only
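
For reference, here is a self-contained sketch of the same approach that rebuilds the example data from the question, so it can be run outside the IPython session. The column values are taken from the question; the name result and the final drop/reset_index steps are my own additions for tidying the output, not part of the session above. It assumes the shared columns have matching dtypes in both dataframes and that df2 has no duplicate rows (duplicates in df2 would repeat the matching df1 rows).

```python
import pandas as pd

# Example data copied from the question (the elided "..." columns are omitted).
df1 = pd.DataFrame({
    "RecorderID": ["CT-1000001", "CT-1000002", "CT-1000003"],
    "GroupID": ["BV-", "MP-", "BV-"],
    "Location": ["Cape Town", "Johannesburg", "Durban"],
    "SomeColumn": ["SomeValue"] * 3,
})
df2 = pd.DataFrame({
    "RecorderID": ["CT-1000001", "CT-1000003"],
    "GroupID": ["BV-", "BV-"],
    "Location": ["Durban", "Durban"],
    "SomeColumn": ["SomeValue"] * 2,
})

# With no `on` argument, merge joins on every column the frames share,
# so a df1 row is marked "both" only if an identical row exists in df2.
merged = df1.merge(df2, how="left", indicator=True)

# Keep rows that are new or changed in df1, then drop the helper column.
result = (
    merged[merged["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)

print(result)
```

On the PS about Excel: to_excel writes the dataframe's index as the first column by default, so one option is result.to_excel('out.xlsx', index=False), which leaves RecorderID as an ordinary first column. Alternatively, result.set_index('RecorderID') makes RecorderID the index that gets written. The file name out.xlsx is only a placeholder.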
