I have two pandas DataFrames, df1 and df2. Both have at least two columns with the same names (c1 and c2), plus other columns, for example:
df1
-----------------
c1 c2 c4 c5
-----------------
10 1 hh 2231
11 1 fgf 2142
12 1 fg 1232
df2
-----------------
c1 c2 c6 c7
-----------------
10 2 110 231
10 3 111 332
11 2 112 123
11 3 113 432
12 2 114 432
12 3 115 432
13 2 116 432
13 3 117 432
14 2 118 432
14 3 119 432
I want to merge the two DataFrames such that:
- The result contains the union of the (c1, c2) pairs from both DataFrames.
- c4 and c5 from df1 are repeated for all rows that share the corresponding c1 value.
- The remaining columns, i.e., c6 and c7, are matched on c1 and c2 from df2 and added to the resulting DataFrame.
In a previous question, I was advised to perform an outer join. An outer join on c1 and c2 returns the following, for example:
c1 c2 c4 c5 c6 c7
--------------------------------
10 1 hh 2231
10 2 110 231
10 3 111 332
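For reference, the outer join described above can be reproduced with something like the following. This is only a sketch: the frames are rebuilt by hand from the tables in the question, and the name `outer` is just for illustration.

```python
import pandas as pd

# df1 and df2 rebuilt from the tables shown in the question.
df1 = pd.DataFrame({
    "c1": [10, 11, 12],
    "c2": [1, 1, 1],
    "c4": ["hh", "fgf", "fg"],
    "c5": [2231, 2142, 1232],
})
df2 = pd.DataFrame({
    "c1": [10, 10, 11, 11, 12, 12, 13, 13, 14, 14],
    "c2": [2, 3] * 5,
    "c6": list(range(110, 120)),
    "c7": [231, 332, 123, 432, 432, 432, 432, 432, 432, 432],
})

# Outer join on both key columns: every (c1, c2) pair from either frame
# is kept, but c4/c5 are only filled where df1 has that exact pair.
outer = (
    df1.merge(df2, on=["c1", "c2"], how="outer")
       .sort_values(["c1", "c2"])
       .reset_index(drop=True)
)
print(outer[outer["c1"] == 10])
```

Because the join key includes c2, the rows coming from df2 never pick up c4 and c5, which is why those cells stay empty.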
However, I would like the values of c4 and c5 to be repeated, so that the resulting DataFrame looks as follows:
resulting dataframe:
c1 c2 c4 c5 c6 c7
-----------------------------------
10 1 hh 2231
10 2 hh 2231 110 231
10 3 hh 2231 111 332
11 1 fgf 2142
11 2 fgf 2142 112 123
11 3 fgf 2142 113 432
12 1 fg 1232
12 2 fg 1232 114 432
12 3 fg 1232 115 432
13 1 dd 4532
13 2 dd 4532 116 432
13 3 dd 4532 117 432
14 2 118 432
14 3 119 432
Can anyone help me with this? Thanks in advance!
2 Solutions
#1
You can use:
- first, concat the columns c1 and c2, sort, and if necessary remove duplicates
- merge both DataFrames with a left join
- last, repeat the values with ffill and bfill
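import pandas as pd
# Build the union of (c1, c2) keys, then left-join both frames onto it
df = (pd.concat([df1[['c1','c2']], df2[['c1','c2']]])
        .sort_values(['c1','c2'])
        .drop_duplicates()
        .merge(df1, on=['c1','c2'], how='left')
        .merge(df2, on=['c1','c2'], how='left')
     )
# Repeat c4 and c5 within each c1 group (note the list of columns: [['c4','c5']])
df[['c4','c5']] = df.groupby('c1')[['c4','c5']].transform(lambda x: x.ffill().bfill())
print(df)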
c1 c2 c4 c5 c6 c7
0 10 1 hh 2231.0 NaN NaN
1 10 2 hh 2231.0 110.0 231.0
2 10 3 hh 2231.0 111.0 332.0
3 11 1 fgf 2142.0 NaN NaN
4 11 2 fgf 2142.0 112.0 123.0
5 11 3 fgf 2142.0 113.0 432.0
6 12 1 fg 1232.0 NaN NaN
7 12 2 fg 1232.0 114.0 432.0
8 12 3 fg 1232.0 115.0 432.0
9 13 2 NaN NaN 116.0 432.0
10 13 3 NaN NaN 117.0 432.0
11 14 2 NaN NaN 118.0 432.0
12 14 3 NaN NaN 119.0 432.0
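As an alternative to the ffill/bfill step, here is a sketch that merges c4 and c5 on c1 only. It is a different technique from the one above, and it assumes that c4 and c5 depend only on c1 (one df1 row per c1 value, as in the question's data). The names `keys`, `per_c1` and `result` are made up for this example.

```python
import pandas as pd

# df1 and df2 rebuilt from the tables shown in the question.
df1 = pd.DataFrame({
    "c1": [10, 11, 12],
    "c2": [1, 1, 1],
    "c4": ["hh", "fgf", "fg"],
    "c5": [2231, 2142, 1232],
})
df2 = pd.DataFrame({
    "c1": [10, 10, 11, 11, 12, 12, 13, 13, 14, 14],
    "c2": [2, 3] * 5,
    "c6": list(range(110, 120)),
    "c7": [231, 332, 123, 432, 432, 432, 432, 432, 432, 432],
})

# Union of all (c1, c2) key pairs from both frames.
keys = (
    pd.concat([df1[["c1", "c2"]], df2[["c1", "c2"]]])
      .drop_duplicates()
      .sort_values(["c1", "c2"])
)

# One row of c4/c5 per c1 value (assumption: c4/c5 depend only on c1).
per_c1 = df1.drop(columns="c2").drop_duplicates("c1")

result = (
    keys.merge(per_c1, on="c1", how="left")       # repeats c4/c5 for every c2
        .merge(df2, on=["c1", "c2"], how="left")  # adds c6/c7 where they exist
        .reset_index(drop=True)
)
print(result)
```

Rows with c1 equal to 13 or 14 still have NaN in c4 and c5, since df1 as posted has no rows for those values.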
#2
You just need to use pd.concat instead of merge or join. Here is an example:
import pandas as pd
import numpy as np
a = np.arange(1,4)
b = np.arange(5,8)
c = np.random.randint(0,10,size=3)
d = np.random.randint(0,10,size=3)
df_1 = pd.DataFrame({'a':a,'b':b,'c':c,'d':d})
out:
a b c d
0 1 5 5 1
1 2 6 7 5
2 3 7 6 9
a = np.arange(4,7)
b = np.arange(7,10)
e = np.random.randint(0,10,size=3)
f = np.random.randint(0,10,size=3)
df_2 = pd.DataFrame({'a':a,'b':b,'e':e,'f':f})
df_2
out:
a b e f
0 4 7 9 9
1 5 8 9 3
2 6 9 2 1
pd.concat([df_1,df_2])
out:
a b c d e f
0 1 5 5.0 1.0 NaN NaN
1 2 6 7.0 5.0 NaN NaN
2 3 7 6.0 9.0 NaN NaN
0 4 7 NaN NaN 9.0 9.0
1 5 8 NaN NaN 9.0 3.0
2 6 9 NaN NaN 2.0 1.0
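Applied to the frames from the question, pd.concat on its own only stacks the rows, so c4 and c5 would still be NaN on the df2 rows. A possible sketch that adds a per-c1 fill afterwards is shown below. It assumes that no (c1, c2) pair occurs in both frames (true for the posted data), otherwise concat would produce two rows for that pair. The names `stacked` and `filled` are only illustrative.

```python
import pandas as pd

# df1 and df2 rebuilt from the tables shown in the question.
df1 = pd.DataFrame({
    "c1": [10, 11, 12],
    "c2": [1, 1, 1],
    "c4": ["hh", "fgf", "fg"],
    "c5": [2231, 2142, 1232],
})
df2 = pd.DataFrame({
    "c1": [10, 10, 11, 11, 12, 12, 13, 13, 14, 14],
    "c2": [2, 3] * 5,
    "c6": list(range(110, 120)),
    "c7": [231, 332, 123, 432, 432, 432, 432, 432, 432, 432],
})

# Stack the rows of both frames; columns missing on one side become NaN.
stacked = (
    pd.concat([df1, df2], ignore_index=True)
      .sort_values(["c1", "c2"])
      .reset_index(drop=True)
)

# Fill c4/c5 forward and then backward, but only within each c1 group.
cols = ["c4", "c5"]
filled = stacked.copy()
filled[cols] = filled.groupby("c1")[cols].ffill()
filled[cols] = filled.groupby("c1")[cols].bfill()
print(filled)
```

Groups that have no df1 row at all (c1 equal to 13 or 14 here) keep NaN in c4 and c5.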