What is the most efficient way to merge multiple data frames (i.e., more than 2) in pandas? There are a few answers:
- pandas joining multiple dataframes on columns
- Pandas left outer join multiple dataframes on multiple columns
but these all involve multiple joins. If I have N data frames these would require N-1 joins.
If I weren't using pandas, another solution would be to put everything into a hash table keyed on the common index and build the final result from that. I believe this is basically a hash join in SQL. Is there something like that in pandas?
If not, would it be more efficient to just create a new data frame with the common index and pass it the raw data from each data frame? It seems like that would at least prevent you from creating a new data frame in each of the N-1 joins.
Thanks.
1 solution
#1
2
If you can join your data frames by index, you can do it in a single chained expression:
df1.join(df2).join(df3).join(df4)
example:
In [187]: df1
Out[187]:
   a  b
0  5  2
1  6  7
2  6  5
3  1  6
4  0  2
In [188]: df2
Out[188]:
   c  d
0  5  7
1  5  5
2  2  4
3  4  3
4  9  0
In [189]: df3
Out[189]:
   e  f
0  8  1
1  0  9
2  4  5
3  3  9
4  9  5
In [190]: df1.join(df2).join(df3)
Out[190]:
   a  b  c  d  e  f
0  5  2  5  7  8  1
1  6  7  5  5  0  9
2  6  5  2  4  4  5
3  1  6  4  3  3  9
4  0  2  9  0  9  5
This should be reasonably fast and efficient.
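For an arbitrary number of frames, the chain does not have to be written out by hand. A minimal sketch, assuming every frame shares the same index and has distinct column names (the `frames` list and its random data are made up for illustration):

```python
from functools import reduce

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
index = pd.RangeIndex(5)

# Hypothetical example data: four frames on one shared index, distinct column names.
frames = [
    pd.DataFrame(rng.integers(0, 10, size=(5, 2)), index=index, columns=list(cols))
    for cols in ("ab", "cd", "ef", "gh")
]

# Fold the list with pairwise index joins; this is the same as writing
# frames[0].join(frames[1]).join(frames[2]).join(frames[3]) by hand.
chained = reduce(lambda left, right: left.join(right), frames)

# DataFrame.join also accepts a list of frames and joins them all on the index.
joined_list = frames[0].join(frames[1:])

print(chained.columns.tolist())
print(joined_list.shape)
```

The `reduce` form still performs N-1 pairwise joins; the list form hands all frames to pandas in a single call, which may or may not be faster depending on the data.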
Alternatively, you can concatenate them along the columns:
In [191]: pd.concat([df1,df2,df3], axis=1)
Out[191]:
   a  b  c  d  e  f
0  5  2  5  7  8  1
1  6  7  5  5  0  9
2  6  5  2  4  4  5
3  1  6  4  3  3  9
4  0  2  9  0  9  5
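One caveat when choosing between the two: they only give the same result when the indexes match, as they do in the example. `DataFrame.join` defaults to `how="left"`, while `pd.concat(..., axis=1)` defaults to `join="outer"`. A small sketch with two made-up frames (`left` and `right` are hypothetical names) whose indexes only partly overlap:

```python
import pandas as pd

# Hypothetical frames whose indexes overlap only on labels 1 and 2.
left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"c": [10, 20, 30]}, index=[1, 2, 3])

# join defaults to how="left": only the labels of `left` are kept.
joined = left.join(right)

# concat along columns defaults to join="outer": the union of labels is kept.
concatenated = pd.concat([left, right], axis=1)

# Requesting an inner join in both calls keeps only the shared labels.
inner_join = left.join(right, how="inner")
inner_concat = pd.concat([left, right], axis=1, join="inner")

print(joined)
print(concatenated)
print(inner_join)
print(inner_concat)
```

If the frames are not guaranteed to share an index, it is probably safer to pass `how=` or `join=` explicitly than to rely on either default.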
Time comparison for 3 data frames with 100K rows each:
In [198]: %timeit pd.concat([df1,df2,df3], axis=1)
100 loops, best of 3: 5.67 ms per loop
In [199]: %timeit df1.join(df2).join(df3)
100 loops, best of 3: 3.93 ms per loop
So, as you can see, join is a bit faster in this test.
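These timings come from one machine and one pandas version, so it may be worth re-running the comparison locally. A rough sketch using the standard `timeit` module, with made-up random data of the same shape (3 frames, 100K rows each); the variable names and the repeat counts are arbitrary choices:

```python
import timeit

import numpy as np
import pandas as pd

n_rows = 100_000
rng = np.random.default_rng(0)

# Hypothetical data: three frames sharing the default integer index.
df1 = pd.DataFrame(rng.integers(0, 10, size=(n_rows, 2)), columns=["a", "b"])
df2 = pd.DataFrame(rng.integers(0, 10, size=(n_rows, 2)), columns=["c", "d"])
df3 = pd.DataFrame(rng.integers(0, 10, size=(n_rows, 2)), columns=["e", "f"])

calls = 10  # calls per timing run

# Take the best of 3 runs and convert to seconds per call.
concat_s = min(
    timeit.repeat(lambda: pd.concat([df1, df2, df3], axis=1), number=calls, repeat=3)
) / calls
join_s = min(
    timeit.repeat(lambda: df1.join(df2).join(df3), number=calls, repeat=3)
) / calls

print(f"concat: {concat_s * 1e3:.2f} ms per call")
print(f"join:   {join_s * 1e3:.2f} ms per call")
```

Which of the two comes out ahead can differ between pandas versions and data shapes, so treat the result as a local measurement rather than a general rule.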