Merging dataframes that share only some of their columns

Time: 2021-10-24 19:35:22

I am trying to merge ~300 dataframes. The constraint is that all of them share only 4 common columns; the remaining columns may or may not be common. This means I need to add columns to the dataframe every time a new column is encountered during merging. I've simulated a toy dataset to illustrate.

Dataframe1:

Column_A : 'a', 'a', 'b', 'b', 'd'
Column_CounterName : 'Type1', 'Type2', 'Type3', 'Type4', 'Type1'
Column_CounterValue : 100, 300, 356, 233, 453

Dataframe2:

Column_A : 'm', 'm', 'n', 'n', 'o'
Column_CounterName : 'Type1', 'Type5', 'Type6','Type5', 'Type1'
Column_CounterValue : 100, 300, 356, 846, 7455

Merged Dataframe should be:

Column_A : 'a', 'b', 'd', 'm', 'n', 'o'
Type1 : 100, null, 453, 100, null, 7455
Type2 : 300, null, null, null, null, null
Type3 : null, 356, null, null, null, null
Type4 : null, 233, null, null, null, null
Type5 : null, null, null, 300, 846, null
Type6 : null, null, null, null, 356, null

Column_A, Type1, ... are all column names.

How do I do this?

Also, how do I fill in the null values after merging?

1 solution

#1


I believe you need set_index with concat to join all DataFrames on column A:

dfs = [df1, df2]
# index each DataFrame by column A so concat can align rows on it
dfs = [x.set_index('A') for x in dfs]
# to join on more columns, index by all of them:
# dfs = [x.set_index(['A', 'col1', 'col2']) for x in dfs]

# axis=1 places the frames side by side, aligned on the index
df = pd.concat(dfs, axis=1).rename_axis('A').reset_index()
print(df)
     A    B    D
0  'a'  1.0  NaN
1  'b'  2.0  NaN
2  'c'  3.0  NaN
3  'd'  4.0  NaN
4  'm'  NaN  's'
5  'n'  NaN  'd'
6  'o'  NaN  'k'
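The answer does not show the input frames for this first snippet, so here is a minimal, self-contained sketch of the same idea. The contents of df1 and df2 below are assumptions reconstructed from the printed result (without the literal quote characters), not the answerer's actual data.

```python
import pandas as pd

# Hypothetical inputs, reconstructed from the printed output above.
df1 = pd.DataFrame({"A": ["a", "b", "c", "d"], "B": [1, 2, 3, 4]})
df2 = pd.DataFrame({"A": ["m", "n", "o"], "D": ["s", "d", "k"]})

# Index every frame by the shared key column.
dfs = [x.set_index("A") for x in (df1, df2)]

# axis=1 aligns rows on the index (the union of all A values) and
# puts each frame's remaining columns side by side; cells with no
# matching key become NaN.
wide = pd.concat(dfs, axis=1).rename_axis("A").reset_index()
print(wide)
```

With ~300 frames the same pattern should apply by building the list in a loop, though every frame's index has to be unique for the alignment to work.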

EDIT:

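dfs = [df1, df2]
# index each DataFrame by Column_A and Column_CounterName
dfs = [x.set_index(['Column_A','Column_CounterName']) for x in dfs]

# stack the frames, then pivot the counter names into columns;
# axis must be passed by keyword in recent pandas versions
df = pd.concat(dfs)['Column_CounterValue'].unstack().rename_axis(None, axis=1).reset_index()
print(df)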
  Column_A   Type1  Type2  Type3  Type4  Type5  Type6
0        a   100.0  300.0    NaN    NaN    NaN    NaN
1        b     NaN    NaN  356.0  233.0    NaN    NaN
2        d   453.0    NaN    NaN    NaN    NaN    NaN
3        m   100.0    NaN    NaN    NaN  300.0    NaN
4        n     NaN    NaN    NaN    NaN  846.0  356.0
5        o  7455.0    NaN    NaN    NaN    NaN    NaN
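For anyone who wants to run this end to end, the sketch below rebuilds the toy data from the question (five rows per frame, with counter values chosen to match the output printed above) and applies the same concat and unstack steps. The variable names keys, stacked and wide are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Column_A": ["a", "a", "b", "b", "d"],
    "Column_CounterName": ["Type1", "Type2", "Type3", "Type4", "Type1"],
    "Column_CounterValue": [100, 300, 356, 233, 453],
})
df2 = pd.DataFrame({
    "Column_A": ["m", "m", "n", "n", "o"],
    "Column_CounterName": ["Type1", "Type5", "Type6", "Type5", "Type1"],
    "Column_CounterValue": [100, 300, 356, 846, 7455],
})

keys = ["Column_A", "Column_CounterName"]

# Stack all frames vertically, indexed by (Column_A, Column_CounterName).
stacked = pd.concat([x.set_index(keys) for x in (df1, df2)])

# Move the counter names from the index into columns; pairs that
# never occur are filled with NaN.
wide = (stacked["Column_CounterValue"]
        .unstack()
        .rename_axis(None, axis=1)   # drop the leftover columns label
        .reset_index())
print(wide)
```

New counter names show up as new columns automatically, so nothing has to be added by hand as more frames are included.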

If you get:

ValueError: Index contains duplicate entries, cannot reshape

it means there are duplicate (Column_A, Column_CounterName) pairs, for example:

d1 = {'Column_A' : ['a', 'a', 'b', 'b', 'd'],
'Column_CounterName' : ['Type1', 'Type1', 'Type3', 'Type4', 'Type1'],
'Column_CounterValue' : [100, 300, 356,  233, 453]}

d2 = {'Column_A' :[ 'm', 'm', 'n', 'n', 'o'],
'Column_CounterName' : ['Type1', 'Type5', 'Type6','Type5', 'Type1'],
'Column_CounterValue' : [100, 300, 356, 846, 7455]}

df1 = pd.DataFrame(d1)
print (df1)
  Column_A Column_CounterName  Column_CounterValue
0        a              Type1                  100 <- same a, Type1
1        a              Type1                  300 <- same a, Type1
2        b              Type3                  356
3        b              Type4                  233
4        d              Type1                  453

df2 = pd.DataFrame(d2)
print (df2)
  Column_A Column_CounterName  Column_CounterValue
0        m              Type1                  100
1        m              Type5                  300
2        n              Type6                  356
3        n              Type5                  846
4        o              Type1                 7455
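Before deciding how to aggregate, it may help to see which pairs are actually duplicated. The following is one possible check using DataFrame.duplicated on the same example data; the names combined and dupes are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Column_A": ["a", "a", "b", "b", "d"],
    "Column_CounterName": ["Type1", "Type1", "Type3", "Type4", "Type1"],
    "Column_CounterValue": [100, 300, 356, 233, 453],
})
df2 = pd.DataFrame({
    "Column_A": ["m", "m", "n", "n", "o"],
    "Column_CounterName": ["Type1", "Type5", "Type6", "Type5", "Type1"],
    "Column_CounterValue": [100, 300, 356, 846, 7455],
})

keys = ["Column_A", "Column_CounterName"]
combined = pd.concat([df1, df2], ignore_index=True)

# keep=False flags every row of a duplicated pair, not just the repeats.
dupes = combined[combined.duplicated(subset=keys, keep=False)]
print(dupes)
```

Here only the two ('a', 'Type1') rows are reported, which is the pair that makes unstack fail.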

Then a possible solution is to aggregate the duplicated pairs, e.g. by mean:

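df = (pd.concat(dfs)['Column_CounterValue']
        .groupby(level=[0,1])        # group by (Column_A, Column_CounterName)
        .mean()                      # collapse duplicated pairs to their mean
        .unstack()
        .rename_axis(None, axis=1)   # axis must be a keyword in recent pandas
        .reset_index())
print(df)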
  Column_A   Type1  Type3  Type4  Type5  Type6
0        a   200.0    NaN    NaN    NaN    NaN <- (100 + 300) / 2 = 200
1        b     NaN  356.0  233.0    NaN    NaN
2        d   453.0    NaN    NaN    NaN    NaN
3        m   100.0    NaN    NaN  300.0    NaN
4        n     NaN    NaN    NaN  846.0  356.0
5        o  7455.0    NaN    NaN    NaN    NaN
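The question also asks how to fill the null values after merging, which the answer does not cover. One common option is fillna on the pivoted table. The sketch below reuses the example data with the duplicated pair; filling with 0 is purely an assumption, and another constant or a per-column value may suit the real data better.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Column_A": ["a", "a", "b", "b", "d"],
    "Column_CounterName": ["Type1", "Type1", "Type3", "Type4", "Type1"],
    "Column_CounterValue": [100, 300, 356, 233, 453],
})
df2 = pd.DataFrame({
    "Column_A": ["m", "m", "n", "n", "o"],
    "Column_CounterName": ["Type1", "Type5", "Type6", "Type5", "Type1"],
    "Column_CounterValue": [100, 300, 356, 846, 7455],
})

keys = ["Column_A", "Column_CounterName"]
stacked = pd.concat([x.set_index(keys) for x in (df1, df2)])

table = (stacked["Column_CounterValue"]
         .groupby(level=[0, 1])
         .mean()
         .unstack()
         .rename_axis(None, axis=1))

# Fill while Column_A is still the index, so only the numeric
# counter columns are touched; 0 is an assumed placeholder value.
filled = table.fillna(0).reset_index()
print(filled)
```

Whether 0 is appropriate depends on what a missing counter means; if "not measured" differs from "zero", leaving NaN in place may be the safer choice.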
