Python - Merging multiple pandas data frames in a dictionary if the keys are strings/integers

Date: 2022-09-06 22:57:36

The data that I'm using looks like this:

csv1 = pd.DataFrame({'D': [1-10, 2-10, 3-10, 4-10, ...],  # dates
                     'C': [#, #, #, #, ...]})             # values

csv2 = pd.DataFrame({'D': [3-10, 4-10, 5-10, 6-10, ...],  # dates
                     'C': [#, #, #, #, ...]})             # values

csv3 = pd.DataFrame({'D': [5-10, 6-10, 7-10, 8-10, ...],  # dates
                     'C': [#, #, #, #, ...]})             # values
.
.
.
csv100 = pd.DataFrame({'D': [5-10, 6-10, 7-10, 8-10, ...],  # dates
                       'C': [#, #, #, #, ...]})             # values

I want a data frame like this:

df_merged = pd.DataFrame({'D': [1-10, 2-10, 3-10, 4-10, 5-10, 6-10, ...],  # dates
                          'C1': [#, #, #, #, #, #, ...],    # values
                          'C2': [#, #, #, #, #, #, ...],    # values
                          'C3': [#, #, #, #, #, #, ...],    # values
                          .
                          .
                          .
                          'C100': [#, #, #, #, #, #, ...]})  # values

I have been trying to merge multiple data frames, around 100, that have the same columns but different rows (and not in the same order). I would like to merge them on the date column 'D', so that rows with the same date end up on one row. Because the number of data frames is high and changes over time (today I could have 110, tomorrow I could have 90...), merging them one by one in a loop is too slow. While researching a solution, I found that the consensus is to use dictionaries. I applied this solution to my code, but I got an error that I don't know how to solve. The code is the following:

import pandas as pd
import subprocess
import os
from functools import reduce

path=r'C:\Users\ra\Desktop\Px\a' #Folder 'a' path

df = {} #Dictionary of data frames from csv files in Folder 'a'
x = [#vector that contains the name of the csv file as string]
i = 0
for j in range(len(x)):
    df['df%s' %j] = (pd.read_csv(os.path.join(path,r'%s.csv' % x[i]))) #Assigns a key to the data frame Ex.:'df1' (the key is a string and I think this is the problem)
    df['df%s' %j].rename(columns={'C': '%s' % x[i]}, inplace=True) #Renames the column 'C' of every data frame to the name of the file
    i += 1

df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['D'],how='outer'),df) #Merges every data frame to a single data frame 'df_merged' by column 'D' that represents the date.

The problem is in the last line, the output is the following:

---> df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['D'],how='outer'),df)
.
.
.
ValueError: can not merge DataFrame with instance of type <class 'str'>

If I change the key from string to integer (by changing the vector x to simple numbers 'j') I get the following output:

---> df_merged = reduce(lambda  left,right: pd.merge(left,right,on=['D'],how='outer'),df)
.
.
.
ValueError: can not merge DataFrame with instance of type <class 'int'>

To make the code work, I tried to find a way to convert the string keys to variable names. But, apparently, that is a sin. Also, according to @AnkitMalik, the 'reduce' method can't be used with dictionaries. How can I merge all these data frames on the column 'D' in a pythonic way if the keys in the dictionary are strings/integers? Or, how can I make a dynamic list of data frames if their number changes over time depending on the number of csv files in folder 'a'?

3 solutions

#1


Merging or appending each DataFrame is very expensive, so it's important to make as few calls as possible.

What you can do however, is make the date column of each DataFrame the index of the DataFrame, put them in a list, and then make one call to pandas.concat() for all of them.

You will of course have to fiddle with the column names and what they represent: the data frames share column names, so unless you rename them (or are happy with tuple-style column labels), the result will contain several columns with the same name.

Example:

>>> import pandas
>>> df_0 = pandas.DataFrame(
        {
            'a': pandas.date_range('20180101', '20180105'), 
            'b': range(5, 10)
        }, 
        index=range(5)
    )
>>> df_0
           a  b
0 2018-01-01  5
1 2018-01-02  6
2 2018-01-03  7
3 2018-01-04  8
4 2018-01-05  9
>>> df_1 = pandas.DataFrame(
        {
            'a': pandas.date_range('20180103', '20180107'), 
            'b': range(5, 10)
        }, 
        index=range(5)
    )
>>> df_2 = pandas.DataFrame(
        {
            'a': pandas.date_range('20180105', '20180109'), 
            'b': range(5, 10)
        }, 
        index=range(5)
    )
>>> df_0 = df_0.set_index('a')
>>> df_1 = df_1.set_index('a')
>>> df_2 = df_2.set_index('a')
>>> pandas.concat([df_0, df_1, df_2], axis=1)  # this is where the magic happens
              b    b    b
a
2018-01-01  5.0  NaN  NaN
2018-01-02  6.0  NaN  NaN
2018-01-03  7.0  5.0  NaN
2018-01-04  8.0  6.0  NaN
2018-01-05  9.0  7.0  5.0
2018-01-06  NaN  8.0  6.0
2018-01-07  NaN  9.0  7.0
2018-01-08  NaN  NaN  8.0
2018-01-09  NaN  NaN  9.0
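
One way to avoid the repeated 'b' labels in the output above is to hand pandas.concat() a dictionary, so that the dictionary keys become the column names. The following is only a minimal sketch: the names 'csv1', 'csv2', 'csv3' and the values are made up for illustration, and the column names 'D' and 'C' are taken from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the csv files: name -> data frame with columns 'D' and 'C'
frames = {
    'csv1': pd.DataFrame({'D': ['1-10', '2-10', '3-10'], 'C': [1.0, 2.0, 3.0]}),
    'csv2': pd.DataFrame({'D': ['3-10', '4-10'], 'C': [30.0, 40.0]}),
    'csv3': pd.DataFrame({'D': ['2-10', '4-10', '5-10'], 'C': [200.0, 400.0, 500.0]}),
}

# Turn every frame into a Series indexed by 'D'; the string keys are kept as they are
series_by_name = {name: frame.set_index('D')['C'] for name, frame in frames.items()}

# A single concat call aligns all Series on 'D' and uses the dict keys as column names
df_merged = pd.concat(series_by_name, axis=1).sort_index()
print(df_merged)
```

Dates that are missing from a file show up as NaN in that file's column, just like in the example above.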

#2


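reduce would work on a list instead of a dictionary. Iterating over a dictionary yields its keys, so reduce hands the strings (or integers) to pd.merge instead of the data frames, which is what the ValueError is reporting.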

Try this:

Create a list of data frames (df)

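import os
from functools import reduce

import pandas as pd

path = r'C:\Users\ra\Desktop\Px\a'  # raw string, so the backslashes are kept literally

df = []  # list of data frames
x = [...]  # placeholder: vector that contains the names of the csv files as strings
for j in x:
    # read each file; renaming 'C' to the file name keeps the merged columns from clashing
    df.append(pd.read_csv(os.path.join(path, j + '.csv')).rename(columns={'C': j}))

# reduce was imported by name above, so it is called as reduce(), not functools.reduce()
df_merged = reduce(lambda left, right: pd.merge(left, right, how='outer', on=['D']), df)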
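
If the data frames already live in a dictionary, reduce can still be used by iterating over the values rather than over the dictionary itself. This is a small sketch under assumptions: the keys 'df0' to 'df2' follow the naming in the question, while the value columns 'a', 'b', 'c' and the numbers are invented for the illustration.

```python
from functools import reduce

import pandas as pd

# Hypothetical dictionary of data frames keyed by strings, as in the question
frames = {
    'df0': pd.DataFrame({'D': ['1-10', '2-10'], 'a': [1, 2]}),
    'df1': pd.DataFrame({'D': ['2-10', '3-10'], 'b': [20, 30]}),
    'df2': pd.DataFrame({'D': ['3-10', '4-10'], 'c': [300, 400]}),
}

# Iterating over the dict itself yields the keys, which is what reduce(..., frames) would merge
print(list(frames))

# Iterating over .values() yields the data frames, so the merge works
df_merged = reduce(
    lambda left, right: pd.merge(left, right, on=['D'], how='outer'),
    frames.values(),
)
df_merged = df_merged.sort_values('D').reset_index(drop=True)
print(df_merged)
```

This keeps the dictionary, but it still performs one merge per data frame, so for around 100 files a single pandas.concat() call is likely to be the faster option.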

#3


First of all, I want to thank everyone who helped me find a solution. I have to say that this is my first time posting a question here, and the experience has been very nice. I also want to thank @AnkitMalik and @NoticeMeSenpai, because their effort helped me find a very good solution.

My question was about merging data frames in a dictionary {} by using functools.reduce(). But, as @AnkitMalik pointed out, reduce() needs to iterate over the data frames themselves, which a list [] provides (iterating over a dictionary gives its keys). @NoticeMeSenpai recommended the use of pandas.concat() in order to make this work. The code below is the one that works for me:

import os

import pandas as pd

path = r'C:\Users\ra\Desktop\Px\a'  # folder 'a'; raw string so the backslashes are kept literally

df = []  # makes a list of data frames
x = [...]  # placeholder: vector that contains the names of the csv files as strings
for j in x:
    # appends every csv file in folder 'a' as a data frame in list 'df', sets the
    # column 'D' as index and renames the column 'C' to the name of the csv file
    df.append(pd.read_csv(os.path.join(path, '%s.csv' % j)).set_index('D').rename(columns={'C': '%s' % j}))

df_concat = pd.concat(df, axis=1)  # concats every data frame in the list 'df', aligned on 'D'
df_concat.to_csv(os.path.join(path, 'xxx.csv'))  # saves the concatenated data frame in the 'xxx' csv file in folder 'a'
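
Since the number of csv files changes from day to day, the list of names does not have to be typed by hand; it can be discovered from the folder each time. The sketch below is one possible way to do that, not the code I actually ran: the helper name concat_csv_folder is made up, the demo uses a temporary folder with two invented files, and it assumes every file has the columns 'D' and 'C'.

```python
import tempfile
from pathlib import Path

import pandas as pd


def concat_csv_folder(folder):
    """Read every *.csv file in `folder` and align the 'C' columns on 'D'."""
    columns = {}
    for csv_path in sorted(Path(folder).glob('*.csv')):
        frame = pd.read_csv(csv_path)
        # the file name without '.csv' becomes the column name
        columns[csv_path.stem] = frame.set_index('D')['C']
    return pd.concat(columns, axis=1).sort_index()


# Demo with a throwaway folder holding two small, invented csv files
with tempfile.TemporaryDirectory() as tmp:
    pd.DataFrame({'D': ['1-10', '2-10'], 'C': [1, 2]}).to_csv(Path(tmp) / 'first.csv', index=False)
    pd.DataFrame({'D': ['2-10', '3-10'], 'C': [20, 30]}).to_csv(Path(tmp) / 'second.csv', index=False)
    result = concat_csv_folder(tmp)

print(result)
```

If the concatenated result is saved into the same folder, as with 'xxx.csv' above, the next run would pick it up as input too, so it is probably safer to write the output somewhere else or to skip that file name in the loop.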
