How to concatenate multiple pandas.DataFrames without running into MemoryError

Date: 2023-01-05 22:57:19

I have three DataFrames that I'm trying to concatenate.

concat_df = pd.concat([df1, df2, df3])

This results in a MemoryError. How can I resolve this?

Note that most of the existing similar questions are about MemoryErrors occurring when reading large files. I don't have that problem. I have already read my files into DataFrames; I just can't concatenate that data.

9 Solutions

#1


13  

I advise you to write your dataframes into a single csv file by concatenation, then read that csv file back.

Execute the following:

# write df1 content in file.csv
df1.to_csv('file.csv', index=False)
# append df2 content to file.csv (header=False avoids repeating the header row)
df2.to_csv('file.csv', mode='a', header=False, index=False)
# append df3 content to file.csv
df3.to_csv('file.csv', mode='a', header=False, index=False)

# free memory
del df1, df2, df3

# read all df1, df2, df3 contents
df = pd.read_csv('file.csv')

If this solution isn't performant enough, or if you need to concatenate even larger files, do this instead:

df1.to_csv('file.csv', index=False)
# write the remaining frames without headers so they can be appended as-is
df2.to_csv('file1.csv', index=False, header=False)
df3.to_csv('file2.csv', index=False, header=False)

del df1, df2, df3

Then run the bash commands:

cat file1.csv >> file.csv
cat file2.csv >> file.csv

Or concatenate the csv files in Python:

def concat(file1, file2):
    # append the contents of file2 to the end of file1
    with open(file2, 'r') as f2:
        data = f2.read()
    with open(file1, 'a') as f1:
        f1.write(data)

concat('file.csv', 'file1.csv')
concat('file.csv', 'file2.csv')

Then read the combined file:

df = pd.read_csv('file.csv')

#2


8  

Similar to what @glegoux suggests, pd.DataFrame.to_csv can also write in append mode, so you can do something like:

df1.to_csv(filename)
df2.to_csv(filename, mode='a', header=False)
df3.to_csv(filename, mode='a', header=False)

del df1, df2, df3
df_concat = pd.read_csv(filename)

#3


5  

Kinda taking a guess here, but maybe:

df1 = pd.concat([df1,df2])
del df2
df1 = pd.concat([df1,df3])
del df3

Obviously, you could do that as a loop, but the key is that you want to delete df2, df3, etc. as you go. As you are doing it in the question, you never clear out the old dataframes, so you are using about twice as much memory as you need to.
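As a loop, that idea might look like this (just a sketch; it assumes the frames have been collected in a Python list first):

dfs = [df1, df2, df3]
del df1, df2, df3  # keep only the list references so popped frames can actually be freed

concat_df = dfs.pop(0)
while dfs:
    # concatenate one frame at a time; pop() drops the last reference so it can be garbage-collected
    concat_df = pd.concat([concat_df, dfs.pop(0)])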

More generally, if you are reading and concatenating, I'd do it something like this (if you had 3 CSVs: foo0, foo1, foo2):

concat_df = pd.DataFrame()
for i in range(3):
    temp_df = pd.read_csv('foo'+str(i)+'.csv')
    concat_df = pd.concat( [concat_df, temp_df] )

In other words, as you are reading in files, you only keep the small dataframes in memory temporarily, until you concatenate them into the combined df, concat_df. As you currently do it, you are keeping around all the smaller dataframes, even after concatenating them.

#4


5  

Dask might be a good option to try for handling large dataframes - go through the Dask Docs.
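A minimal sketch of what that could look like, assuming the data lives in files named foo0.csv, foo1.csv, foo2.csv (those names are just an example, not from the question):

import dask.dataframe as dd

# read the source files lazily, in partitions, instead of loading everything at once
ddf = dd.read_csv('foo*.csv')

# operations stay lazy; only materialize results that actually fit in memory
print(len(ddf))

# or write the combined data back out without ever holding it all in RAM
ddf.to_csv('combined-*.csv', index=False)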

#5


4  

The problem is, as seen in the other answers, a memory problem. And a solution is to store the data on disk, then build a single dataframe.

With such huge data, performance is an issue.

CSV solutions are very slow, since conversion to text occurs. HDF5 solutions are shorter, more elegant and faster, since they use binary mode. I propose a third way in binary mode, with pickle, which seems to be even faster, but is more technical and needs some more room. And a fourth, by hand.

Here is the code:

import os
import pickle

import numpy as np
import pandas as pd

# a DataFrame factory:
dfs=[]
for i in range(10):
    dfs.append(pd.DataFrame(np.empty((10**5,4)),columns=range(4)))

# a csv solution
def bycsv(dfs):
    md,hd='w',True
    for df in dfs:
        df.to_csv('df_all.csv',mode=md,header=hd,index=None)
        md,hd='a',False
    #del dfs
    df_all=pd.read_csv('df_all.csv',index_col=None)
    os.remove('df_all.csv') 
    return df_all    

Better solutions:

def byHDF(dfs):
    store=pd.HDFStore('df_all.h5')
    for df in dfs:
        # data_columns=True indexes all columns so they can be queried later
        store.append('df',df,data_columns=True)
    #del dfs
    df=store.select('df')
    store.close()
    os.remove('df_all.h5')
    return df

def bypickle(dfs):
    c=[]
    with open('df_all.pkl','ab') as f:
        for df in dfs:
            pickle.dump(df,f)
            c.append(len(df))
    #del dfs
    with open('df_all.pkl','rb') as f:
        df_all=pickle.load(f)
        offset=len(df_all)
        # pre-allocate room for the remaining frames, then fill it in place
        df_all=pd.concat([df_all,pd.DataFrame(np.empty(sum(c[1:])*4).reshape(-1,4))],ignore_index=True)

        for size in c[1:]:
            df=pickle.load(f)
            df_all.iloc[offset:offset+size]=df.values
            offset+=size
    os.remove('df_all.pkl')
    return df_all

For homogeneous dataframes, we can do even better:

def byhand(dfs):
    mtot=0
    with open('df_all.bin','wb') as f:
        for df in dfs:
            m,n =df.shape
            mtot += m
            f.write(df.values.tobytes())
            typ=df.values.dtype                
    #del dfs
    with open('df_all.bin','rb') as f:
        buffer=f.read()
        data=np.frombuffer(buffer,dtype=typ).reshape(mtot,n)
        df_all=pd.DataFrame(data=data,columns=list(range(n))) 
    os.remove('df_all.bin')
    return df_all

And some tests on small (32 MB) data to compare performance. You have to multiply by about 128 for 4 GB.

In [92]: %time w=bycsv(dfs)
Wall time: 8.06 s

In [93]: %time x=byHDF(dfs)
Wall time: 547 ms

In [94]: %time v=bypickle(dfs)
Wall time: 219 ms

In [95]: %time y=byhand(dfs)
Wall time: 109 ms

A check:

In [195]: (x.values==w.values).all()
Out[195]: True

In [196]: (x.values==v.values).all()
Out[196]: True

In [197]: (x.values==y.values).all()
Out[197]: True

Of course all of that must be improved and tuned to fit your problem.

For example, df3 can be split into chunks of size 'total_memory_size - df_total_size' to be able to run bypickle.
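A rough sketch of that splitting step (chunk_rows is a placeholder you would derive from the memory you have left; it is not from the original answer):

chunk_rows = 10**5  # assumption: rows per chunk, chosen from the remaining memory budget
chunks = [df3.iloc[i:i+chunk_rows] for i in range(0, len(df3), chunk_rows)]
# each chunk can then be pickled and appended one at a time, as in bypickle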

I can edit this answer if you give more information on your data structure and size. Nice question!

#6


3  

You can store your individual dataframes in an HDF store, and then query that store just like one big dataframe.

import os
import pandas as pd

# name of store
fname = 'my_store'

with pd.HDFStore(fname) as store:

    # save individual dfs to store
    for df in [df1, df2, df3, df_foo]:
        store.append('df', df, data_columns=['FOO','BAR','ETC'])  # data_columns = columns you want to be able to query on

    # access the store as a single df
    df = store.select('df', where=['A>2'])  # change the where condition as required (see the documentation for examples)
    # Do other stuff with df #

# remove the store file when you're done
os.remove(fname)

#7


2  

Another option:

1) Write df1 to a .csv file: df1.to_csv('Big File.csv')

2) Open the .csv file, then append df2:

with open('Big File.csv','a') as f:
    df2.to_csv(f, header=False)

3) Repeat step 2 with df3:

with open('Big File.csv','a') as f:
    df3.to_csv(f, header=False)

#8


1  

I'm grateful to the community for their answers. However, in my case, I found out that the problem was actually that I was using 32-bit Python.

There are memory limits defined for 32-bit and 64-bit processes on Windows. For a 32-bit process it is only 2 GB. So even if your machine has more than 2 GB of RAM, and even if you're running a 64-bit OS, any 32-bit process is limited to just 2 GB of RAM - in my case that process was Python.

I upgraded to 64-bit Python, and haven't had a memory error since then!

Other relevant questions are: Python 32-bit memory limits on 64bit windows, Should I use Python 32bit or Python 64bit, Why is this numpy array too big to load?
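If you are not sure which interpreter you are running, a quick check (not part of the original answer, just a convenience):

import struct
import sys

# prints 32 or 64, depending on the pointer size of the running interpreter
print(struct.calcsize("P") * 8)
print(sys.version)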

#9


0  

I've had similar performance issues while trying to concatenate a large number of DataFrames to a 'growing' DataFrame.

My workaround was to append all sub DataFrames to a list, and then concatenate the list of DataFrames once processing of the sub DataFrames had been completed. This brought the runtime down to almost half.
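A minimal sketch of that pattern (the foo*.csv filenames are just an example, not from the original answer):

import pandas as pd

# collect the pieces in a plain list instead of concatenating on every iteration
frames = []
for i in range(3):
    frames.append(pd.read_csv(f'foo{i}.csv'))

# a single concat at the end avoids repeatedly copying a growing DataFrame
concat_df = pd.concat(frames, ignore_index=True)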
