I have 3 CSV files. In each, the first column holds the (string) names of people, and all the other columns are attributes of that person.
How can I "join" all three CSV documents into a single CSV in which each row holds all the attributes for each unique person name?
The join() function in pandas specifies that I need a multiindex, but I'm confused about what a hierarchical indexing scheme has to do with making a join based on a single index.
7 answers
#1
269
Assumed imports:
import pandas as pd
John Galt's answer is basically a reduce operation. If I have more than a handful of dataframes, I'd put them in a list like this (generated via list comprehensions or loops or whatnot):
dfs = [df0, df1, df2, dfN]
Assuming they have some common column, like name in your example, I'd do the following:
df_final = reduce(lambda left,right: pd.merge(left,right,on='name'), dfs)
That way, your code works with however many dataframes you want to merge.
Edit August 1, 2016: For those using Python 3: reduce has been moved into functools. So to use this function, you'll first need to import it from that module:
from functools import reduce
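As a minimal sketch of the reduce approach on concrete data: the three small frames and their column names below are made up for illustration and are not from the question. Passing how='outer' is an assumption about what the asker wants; it keeps every name that appears in any frame, which seems to match the "each unique value" requirement.

```python
from functools import reduce

import pandas as pd

# Hypothetical sample frames that share a 'name' column
df_a = pd.DataFrame({'name': ['ann', 'bob'], 'age': [30, 41]})
df_b = pd.DataFrame({'name': ['bob', 'cy'], 'city': ['Oslo', 'Rome']})
df_c = pd.DataFrame({'name': ['ann', 'cy'], 'score': [7.5, 9.0]})

dfs = [df_a, df_b, df_c]

# how='outer' keeps names that are missing from some frames; the gaps become NaN
df_final = reduce(
    lambda left, right: pd.merge(left, right, on='name', how='outer'),
    dfs,
)
print(df_final)
```

With the default how='inner', only names present in every frame would survive, which here would leave no rows at all.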
#2
61
You could try this if you have 3 dataframes:
# Merge multiple dataframes
import numpy as np
import pandas as pd
# Note: np.array on mixed str/int rows stores every value as a string
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])
pd.merge(pd.merge(df1, df2, on='name'), df3, on='name')
Alternatively, as mentioned by cwharland:
df1.merge(df2,on='name').merge(df3,on='name')
#3
11
This can also be done as follows for a list of dataframes df_list:
df = df_list[0]
for df_ in df_list[1:]:
    df = df.merge(df_, on='join_col_name')
Or, if the dataframes are in a generator object (e.g. to reduce memory consumption):
df = next(df_list)
for df_ in df_list:
    df = df.merge(df_, on='join_col_name')
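The generator variant might look like the sketch below when the frames come from CSV sources. The CSV contents, the helper name iter_frames, and the join column 'name' are all assumptions made for illustration; in-memory strings stand in for real file paths so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV contents; in practice these would be file paths
csv_texts = [
    "name,age\nann,30\nbob,41\n",
    "name,city\nann,Oslo\nbob,Rome\n",
    "name,score\nann,7.5\nbob,9.0\n",
]


def iter_frames(texts):
    """Yield one DataFrame at a time instead of building a full list."""
    for text in texts:
        yield pd.read_csv(io.StringIO(text))


frames = iter_frames(csv_texts)
merged = next(frames)      # the first frame seeds the result
for frame in frames:       # the loop resumes from the second frame
    merged = merged.merge(frame, on='name')
print(merged)
```

Only the running result and the frame currently being merged need to be held in memory at once.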
#4
6
This is an ideal situation for the join method
The join method is built exactly for these types of situations. You can join any number of DataFrames together with it. The calling DataFrame joins with the index of the collection of passed DataFrames. To work with multiple DataFrames, you must put the joining columns in the index.
The code would look something like this:
filenames = ['fn1', 'fn2', 'fn3', 'fn4',....]
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])
With @zero's data, you could do this:
df1 = pd.DataFrame(np.array([
    ['a', 5, 9],
    ['b', 4, 61],
    ['c', 24, 9]]),
    columns=['name', 'attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
    ['a', 5, 19],
    ['b', 14, 16],
    ['c', 4, 9]]),
    columns=['name', 'attr21', 'attr22'])
df3 = pd.DataFrame(np.array([
    ['a', 15, 49],
    ['b', 4, 36],
    ['c', 14, 9]]),
    columns=['name', 'attr31', 'attr32'])
dfs = [df1, df2, df3]
dfs = [df.set_index('name') for df in dfs]
dfs[0].join(dfs[1:])
     attr11 attr12 attr21 attr22 attr31 attr32
name
a         5      9      5     19     15     49
b         4     61     14     16      4     36
c        24      9      4      9     14      9
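Putting the join approach together with the CSV round trip from the question, a rough end-to-end sketch could look like this. The CSV contents and column names are hypothetical, and in-memory strings stand in for the real files. how='outer' is an assumption that every name should be kept even when it is missing from some files.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the three files
csv_texts = [
    "name,age\nann,30\nbob,41\n",
    "name,city\nbob,Oslo\ncy,Rome\n",
    "name,score\nann,7.5\ncy,9.0\n",
]

# index_col=0 puts the first column (the person's name) in the index
dfs = [pd.read_csv(io.StringIO(text), index_col=0) for text in csv_texts]

# Joining a list of frames works on the index; 'outer' keeps every name
joined = dfs[0].join(dfs[1:], how='outer')

# Write the combined table back out as one CSV (to a buffer here)
out = io.StringIO()
joined.to_csv(out, index_label='name')
print(out.getvalue())
```

Replacing the StringIO objects with file paths gives the same flow against real files.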
#5
4
Here is a method to merge a dictionary of data frames while keeping the column names in sync with the dictionary keys. It also fills in missing values if needed:
# This is the function to merge a dict of data frames
import pandas as pd

def MergeDfDict(dfDict, onCols, how='outer', naFill=None):
    # dict views are not subscriptable in Python 3, so materialize the keys
    keys = list(dfDict.keys())
    for i in range(len(keys)):
        key = keys[i]
        df0 = dfDict[key]
        cols = list(df0.columns)
        valueCols = list(filter(lambda x: x not in (onCols), cols))
        df0 = df0[onCols + valueCols]
        # suffix each value column with the dict key it came from
        df0.columns = onCols + [(s + '_' + key) for s in valueCols]
        if (i == 0):
            outDf = df0
        else:
            outDf = pd.merge(outDf, df0, how=how, on=onCols)
    if (naFill is not None):
        outDf = outDf.fillna(naFill)
    return(outDf)
OK, let's generate data and test this:
import numpy as np

def GenDf(size):
    df = pd.DataFrame({'categ1': np.random.choice(a=['a', 'b', 'c', 'd', 'e'], size=size, replace=True),
                       'categ2': np.random.choice(a=['A', 'B'], size=size, replace=True),
                       'col1': np.random.uniform(low=0.0, high=100.0, size=size),
                       'col2': np.random.uniform(low=0.0, high=100.0, size=size)
                       })
    df = df.sort_values(['categ2', 'categ1', 'col1', 'col2'])
    return(df)

size = 5
dfDict = {'US': GenDf(size), 'IN': GenDf(size), 'GER': GenDf(size)}
MergeDfDict(dfDict=dfDict, onCols=['categ1', 'categ2'], how='outer', naFill=0)
#6
3
You do not need a multiindex to perform join operations. You just need to set the index column on which to perform the join correctly (for example, with df.set_index('Name')).
The join operation is performed on the index by default. In your case, you just have to specify that the Name column corresponds to your index. Below is an example.
A tutorial may be useful.
# Simple example where dataframes index are the name on which to perform the join operations
import pandas as pd
import numpy as np
name = ['Sophia' ,'Emma' ,'Isabella' ,'Olivia' ,'Ava' ,'Emily' ,'Abigail' ,'Mia']
df1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=name)
df2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=name)
df3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=name)
df = df1.join(df2)
df = df.join(df3)
# If you have a 'Name' column that is not the index of your dataframe, you can set this column to be the index
# 1) Create a column 'Name' based on the previous index
df1['Name']=df1.index
# 2) Set the index from column 'Name'
df1=df1.set_index('Name')
# If the indexes differ, you may have to adjust the how parameter
gf1 = pd.DataFrame(np.random.randn(8, 3), columns=['A','B','C'], index=range(8))
gf2 = pd.DataFrame(np.random.randn(8, 1), columns=['D'], index=range(2,10))
gf3 = pd.DataFrame(np.random.randn(8, 2), columns=['E','F'], index=range(4,12))
gf = gf1.join(gf2, how='outer')
gf = gf.join(gf3, how='outer')
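To make the effect of the how parameter concrete, here is a small sketch with deterministic data. The frames and names are invented for illustration. It compares the default left join with inner and outer joins on two indexes that only partly overlap.

```python
import pandas as pd

# Hypothetical frames whose indexes overlap only on 'bob' and 'cy'
left = pd.DataFrame({'A': [1, 2, 3]}, index=['ann', 'bob', 'cy'])
right = pd.DataFrame({'B': [10, 20, 30]}, index=['bob', 'cy', 'dee'])

left_join = left.join(right)                 # default how='left': rows of `left` only
inner_join = left.join(right, how='inner')   # rows present in both frames
outer_join = left.join(right, how='outer')   # rows present in either frame

print(len(left_join), len(inner_join), len(outer_join))
```

Rows that exist on only one side get NaN in the columns coming from the other side.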
#7
1
There is another solution from the pandas documentation (that I don't see here), using .append:
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df2
   A  B
0  5  6
1  7  8
>>> df.append(df2, ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
ignore_index=True discards the index of the appended dataframe and continues numbering from the next index available in the source one.
If the column names differ, NaN will be introduced.
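One caveat for current pandas: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the same row-wise stacking is now written with pd.concat. Row-wise stacking also does not match rows by name. For the join in the question, concatenating along axis=1 over frames indexed by name is closer to what is wanted. The sketch below shows both; the second pair of frames is hypothetical.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# Row-wise stacking, the pd.concat counterpart of df.append(df2, ignore_index=True)
stacked = pd.concat([df, df2], ignore_index=True)

# Column-wise alignment on the index (hypothetical frames indexed by name)
left = pd.DataFrame({'age': [30, 41]}, index=['ann', 'bob'])
right = pd.DataFrame({'city': ['Oslo', 'Rome']}, index=['ann', 'bob'])
side_by_side = pd.concat([left, right], axis=1)

print(stacked)
print(side_by_side)
```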