Pandas: merging two DataFrames with different columns

Time: 2021-12-19 22:55:06

I'm surely missing something simple here. Trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.

>df_may

  id  quantity  attr_1  attr_2
0  1        20       0       1
1  2        23       1       1
2  3        19       1       1
3  4        19       0       0

>df_jun

  id  quantity  attr_1  attr_3
0  5         8       1       0
1  6        13       0       1
2  7        20       1       1
3  8        25       1       1

I've tried joining with an outer join:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")

But that yields:

Left data columns not unique: Index([....

I've also specified a single column to join on (on = "id", e.g.), but that duplicates all columns except "id" like attr_1_x, attr_1_y, which is not ideal. I've also passed the entire list of columns (there are many) to "on":

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))

Which yields:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

What am I missing? I'd like to get a df with all rows appended, and attr_1, attr_2, attr_3 populated where possible, NaN where they don't show up. This seems like a pretty typical workflow for data munging, but I'm stuck.

Thanks in advance.

2 solutions

#1


26  

I think in this case concat is what you want:

In [12]:

pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

Passing axis=0 stacks the DataFrames on top of each other, which I believe is what you want. Any column that is missing from one of the DataFrames is filled with NaN for that DataFrame's rows.
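
To try this against the data from the question, here is a minimal sketch. It assumes the two frames are rebuilt by hand from the tables shown above (the variable names follow the question, not the `df`/`df1` used in this answer) and that a reasonably recent pandas is installed. Column order in the result can differ from the output above depending on the pandas version.

```python
import pandas as pd

# Sample frames reconstructed from the tables in the question.
df_may = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df_jun = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# Stack the rows; columns present in only one frame are filled with NaN.
# ignore_index=True renumbers the rows 0..7 instead of repeating 0..3.
mayjundf = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
print(mayjundf)
```

Note that attr_2 and attr_3 come back as floats, because NaN cannot be stored in a plain integer column.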

#2


0  

I had this problem today with each of concat, append and merge. I got around it by adding a sequentially numbered helper column to both DataFrames and then doing an outer join on it:

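```
# df1 and df2 are the two existing DataFrames to combine.
# Number the rows of both frames in one continuous sequence,
# so that no helper value appears in both frames.
df1['helper'] = range(1, len(df1) + 1)
df2['helper'] = range(len(df1) + 1, len(df1) + len(df2) + 1)

# Outer join on the helper column keeps every row from both frames.
df1.merge(df2, on='helper', how='outer')
```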
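
For comparison, here is a rough sketch of an outer join on the columns the two frames have in common, without a helper column. It assumes the sample frames from the question (rebuilt by hand here) and a recent pandas version. The `shared` list is an illustrative name, not something from the original posts.

```python
import pandas as pd

# Sample frames reconstructed from the tables in the question.
df_may = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df_jun = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# Columns that exist in both frames (hypothetical helper variable).
shared = [col for col in df_may.columns if col in df_jun.columns]

# Joining on every shared column avoids the _x/_y suffixes, because no
# overlapping column is left outside the join key.
merged = df_may.merge(df_jun, how="outer", on=shared)
print(merged)
```

Because no row in one frame matches a row in the other on all shared columns, every input row survives as its own output row, with NaN in the column it did not have.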
