I am trying to merge about 300 DataFrames. The constraint is that they all share only 4 common columns; the remaining columns may or may not be common. This means I need to add a column to the merged DataFrame every time a new column is encountered during merging. I've simulated a toy dataset to illustrate.
Dataframe1:
Column_A : 'a', 'a', 'b', 'b', 'd'
Column_CounterName : 'Type1', 'Type2', 'Type3', 'Type4', 'Type1'
Column_CounterValue : 100, 300, 356, 233, 453
Dataframe2:
Column_A : 'm', 'm', 'n', 'n', 'o'
Column_CounterName : 'Type1', 'Type5', 'Type6','Type5', 'Type1'
Column_CounterValue : 100, 300, 356, 846, 7455
Merged Dataframe should be:
Column_A : 'a', 'b', 'd', 'm', 'n', 'o'
Type1 : 100, null, 453, 100, null, 7455
Type2 : 300, null, null, null, null, null
Type3 : null, 356, null, null, null, null
Type4 : null, 233, null, null, null, null
Type5 : null, null, null, 300, 846, null
Type6 : null, null, null, null, 356, null
Column_A, Type1, ... are all column names.
How do I do this?
Also, how do I fill in the null values after merging?
1 Solution
#1
I believe you need set_index with concat to join all DataFrames on the A column:
import pandas as pd

dfs = [df1, df2]
# index each DataFrame by its A column
dfs = [x.set_index('A') for x in dfs]
# to join on several columns instead:
# dfs = [x.set_index(['A', 'col1', 'col2']) for x in dfs]
df = pd.concat(dfs, axis=1).rename_axis('A').reset_index()
print(df)
     A    B    D
0  'a'  1.0  NaN
1  'b'  2.0  NaN
2  'c'  3.0  NaN
3  'd'  4.0  NaN
4  'm'  NaN  's'
5  'n'  NaN  'd'
6  'o'  NaN  'k'
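The inputs df1 and df2 are not shown above, so here is a minimal self-contained sketch of the same idea. The two frames are assumptions reconstructed from the printed result (without the literal quote characters), not the original data:

```python
import pandas as pd

# Assumed inputs: they share only the key column A
df1 = pd.DataFrame({'A': ['a', 'b', 'c', 'd'], 'B': [1, 2, 3, 4]})
df2 = pd.DataFrame({'A': ['m', 'n', 'o'], 'D': ['s', 'd', 'k']})

# Index every frame by A, then align them side by side on that index.
# concat defaults to an outer join, so keys missing from a frame become NaN.
dfs = [x.set_index('A') for x in [df1, df2]]
df = pd.concat(dfs, axis=1).rename_axis('A').reset_index()
print(df)
```

With a list of about 300 frames the same list comprehension should apply unchanged, as long as A is unique within each frame.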
EDIT:
dfs = [df1, df2]
# index each DataFrame by the (Column_A, Column_CounterName) pair
dfs = [x.set_index(['Column_A', 'Column_CounterName']) for x in dfs]
# axis must be passed by keyword in recent pandas versions
df = pd.concat(dfs)['Column_CounterValue'].unstack().rename_axis(None, axis=1).reset_index()
print(df)
  Column_A   Type1  Type2  Type3  Type4  Type5  Type6
0        a   100.0  300.0    NaN    NaN    NaN    NaN
1        b     NaN    NaN  356.0  233.0    NaN    NaN
2        d   453.0    NaN    NaN    NaN    NaN    NaN
3        m   100.0    NaN    NaN    NaN  300.0    NaN
4        n     NaN    NaN    NaN    NaN  846.0  356.0
5        o  7455.0    NaN    NaN    NaN    NaN    NaN
If you get:
ValueError: Index contains duplicate entries, cannot reshape
it means some (Column_A, Column_CounterName) pairs are duplicated, like:
d1 = {'Column_A': ['a', 'a', 'b', 'b', 'd'],
      'Column_CounterName': ['Type1', 'Type1', 'Type3', 'Type4', 'Type1'],
      'Column_CounterValue': [100, 300, 356, 233, 453]}
d2 = {'Column_A': ['m', 'm', 'n', 'n', 'o'],
      'Column_CounterName': ['Type1', 'Type5', 'Type6', 'Type5', 'Type1'],
      'Column_CounterValue': [100, 300, 356, 846, 7455]}
df1 = pd.DataFrame(d1)
print(df1)
  Column_A Column_CounterName  Column_CounterValue
0        a              Type1                  100  <- same a, Type1
1        a              Type1                  300  <- same a, Type1
2        b              Type3                  356
3        b              Type4                  233
4        d              Type1                  453
df2 = pd.DataFrame(d2)
print(df2)
  Column_A Column_CounterName  Column_CounterValue
0        m              Type1                  100
1        m              Type5                  300
2        n              Type6                  356
3        n              Type5                  846
4        o              Type1                 7455
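To see which rows would trigger that error before reshaping, one option is DataFrame.duplicated on the two key columns. This is only a sketch using the toy d1 data from above; the names keys and dupes are illustrative:

```python
import pandas as pd

# Toy frame with a repeated (Column_A, Column_CounterName) pair
df1 = pd.DataFrame({
    'Column_A': ['a', 'a', 'b', 'b', 'd'],
    'Column_CounterName': ['Type1', 'Type1', 'Type3', 'Type4', 'Type1'],
    'Column_CounterValue': [100, 300, 356, 233, 453],
})

keys = ['Column_A', 'Column_CounterName']
# keep=False flags every row of a repeated pair, not only the later copies
dupes = df1[df1.duplicated(subset=keys, keep=False)]
print(dupes)
```

Here dupes holds the two ('a', 'Type1') rows. With many frames, the same check could be run on the concatenated frame instead.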
Then a possible solution is to aggregate the duplicated pairs, e.g. by mean:
# rebuild the indexed list from the new df1 and df2
dfs = [x.set_index(['Column_A', 'Column_CounterName']) for x in [df1, df2]]
df = (pd.concat(dfs)['Column_CounterValue']
        .groupby(level=[0, 1])
        .mean()
        .unstack()
        .rename_axis(None, axis=1)
        .reset_index())
print(df)
  Column_A   Type1  Type3  Type4  Type5  Type6
0        a   200.0    NaN    NaN    NaN    NaN  <- (100 + 300) / 2 = 200
1        b     NaN  356.0  233.0    NaN    NaN
2        d   453.0    NaN    NaN    NaN    NaN
3        m   100.0    NaN    NaN  300.0    NaN
4        n     NaN    NaN    NaN  846.0  356.0
5        o  7455.0    NaN    NaN    NaN    NaN
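As for filling the null values after merging, the reshaped frame can be passed through fillna. A minimal sketch on the question's toy data follows. The fill value 0 is an assumption, since the question does not say what the nulls should become, and the names frames, wide and filled are illustrative:

```python
import pandas as pd

d1 = {'Column_A': ['a', 'a', 'b', 'b', 'd'],
      'Column_CounterName': ['Type1', 'Type2', 'Type3', 'Type4', 'Type1'],
      'Column_CounterValue': [100, 300, 356, 233, 453]}
d2 = {'Column_A': ['m', 'm', 'n', 'n', 'o'],
      'Column_CounterName': ['Type1', 'Type5', 'Type6', 'Type5', 'Type1'],
      'Column_CounterValue': [100, 300, 356, 846, 7455]}
frames = [pd.DataFrame(d1), pd.DataFrame(d2)]
keys = ['Column_A', 'Column_CounterName']

# Stack all frames, aggregate repeated key pairs, then pivot names to columns
wide = (pd.concat(frames)
          .groupby(keys)['Column_CounterValue']
          .mean()
          .unstack()
          .rename_axis(None, axis=1)
          .reset_index())

# Assumed fill value: 0. Swap in whatever default suits the counters.
filled = wide.fillna(0)
print(filled)
```

Whether 0 is appropriate depends on whether a missing counter really means zero; leaving NaN in place keeps "not measured" distinguishable from "measured as zero".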