How to read a 6 GB csv file with pandas

Date: 2022-01-22 21:39:39

I am trying to read a large csv file (approx. 6 GB) in pandas and I am getting the following memory error:

MemoryError                               Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
    450                     infer_datetime_format=infer_datetime_format)
    451 
--> 452         return _read(filepath_or_buffer, kwds)
    453 
    454     parser_f.__name__ = name

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
    242         return parser
    243 
--> 244     return parser.read()
    245 
    246 _parser_defaults = {

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
    693                 raise ValueError('skip_footer not supported for iteration')
    694 
--> 695         ret = self._engine.read(nrows)
    696 
    697         if self.options.get('as_recarray'):

C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
   1137 
   1138         try:
-> 1139             data = self._reader.read(nrows)
   1140         except StopIteration:
   1141             if nrows is None:

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader.read (pandas\parser.c:7145)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7369)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._read_rows (pandas\parser.c:8194)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_column_data (pandas\parser.c:9402)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_tokens (pandas\parser.c:10057)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser.TextReader._convert_with_dtype (pandas\parser.c:10361)()

C:\Python27\lib\site-packages\pandas\parser.pyd in pandas.parser._try_int64 (pandas\parser.c:17806)()

MemoryError: 

Any help on this?

8 Answers

#1


113  

The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at once. Assuming you do not need the entire dataset in memory all at once, one way to avoid the problem is to process the CSV in chunks (by specifying the chunksize parameter):

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
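
For example, process could filter each chunk and accumulate only the rows you actually need, so the final DataFrame stays small. A minimal sketch, assuming you only want rows matching some condition (the rf > 0 filter is hypothetical):

import pandas as pd

chunksize = 10 ** 6
filtered_parts = []
for chunk in pd.read_csv('aphro.csv', sep=';', chunksize=chunksize):
    # keep only the rows needed from each chunk; the filter is an assumption
    filtered_parts.append(chunk[chunk['rf'] > 0])

# the concatenated result is much smaller than the full file
data = pd.concat(filtered_parts, ignore_index=True)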

#2


25  

I proceeded like this:

chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                       names=['lat', 'long', 'rf', 'date', 'slno'],
                       index_col='slno', header=None, parse_dates=['date'])

# aggregate each chunk by (lat, long, year), then concatenate the partial results
%time df = pd.concat(chunk.groupby(['lat', 'long', chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)
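
Note that if the same (lat, long, year) group spans more than one chunk, the concatenated result holds one partial sum per chunk. A final aggregation over the resulting index combines them; a sketch, assuming the default MultiIndex produced by groupby:

df = df.groupby(level=[0, 1, 2]).sum()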

#3


11  

Chunking shouldn't always be the first port of call for this problem.

1. Is the file large due to repeated non-numeric data or unwanted columns?

If so, you can sometimes see massive memory savings by reading columns in as categories and selecting only the required columns via the usecols parameter of pd.read_csv (see the sketch after this list).

2. Does your workflow require slicing, manipulating, exporting?

If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.

3. If all else fails, read line by line via chunks.

Chunk via pandas or via the csv library as a last resort.
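
A minimal sketch of point 1, assuming the file contains a few low-cardinality string columns; the column names used here are hypothetical:

import pandas as pd

# Read only the columns that are actually needed, and load repeated string
# values as categories. The column names are assumptions for illustration.
df = pd.read_csv('aphro.csv', sep=';',
                 usecols=['lat', 'long', 'station'],
                 dtype={'station': 'category'})

print(df.memory_usage(deep=True))  # compare against the default dtypes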

#4


7  

The above answers already cover the topic. Anyway, if you need all the data in memory, have a look at bcolz. It compresses the data in memory. I have had really good experience with it, but it is missing a lot of pandas features.

Edit: I got compression rates of around 1/10 of the original size, I think, depending of course on the kind of data. Important missing features were aggregates.
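
A rough sketch of how bcolz could be plugged in, assuming the data fits in memory once so it can be handed over for compression:

import bcolz
import pandas as pd

# Load once with pandas, then keep a compressed, column-oriented copy in memory.
df = pd.read_csv('aphro.csv', sep=';')
ct = bcolz.ctable.fromdataframe(df)
print(ct)  # the repr shows compressed vs. uncompressed size

# Convert back to pandas when you need features bcolz lacks (e.g. aggregates).
subset = ct.todataframe()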

#5


5  

For large data I recommend you use the library "dask", e.g.:

# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
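
Dask evaluates lazily: a computation only runs when you call .compute(), which processes the files chunk by chunk and returns a plain pandas object. A small sketch with hypothetical column names:

# 'lat' and 'rf' are assumed column names for illustration
mean_rf_per_lat = df.groupby('lat')['rf'].mean().compute()
print(mean_rf_per_lat)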

#6


2  

You can try sframe, which has the same syntax as pandas but allows you to manipulate files that are bigger than your RAM.

#7


2  

The functions read_csv and read_table are almost the same, but you must assign the delimiter "," when you use read_table in your program, since read_table defaults to a tab delimiter.

def get_from_action_data(fname, chunk_size=100000):
    # read the file iteratively and keep only the needed columns from each chunk
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")

    df_ac = pd.concat(chunks, ignore_index=True)
    return df_ac
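
For comparison, an equivalent iterator created with read_table needs the delimiter spelled out; a sketch using the same file handle:

# read_table defaults to tab-separated input, so pass sep=',' explicitly
reader = pd.read_table(fname, sep=',', header=0, iterator=True)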

#8


2  

If you use pandas to read the large file in chunks and then yield the chunks one by one, here is what I have done:

import pandas as pd

def chunck_generator(filename, chunk_size=10 ** 5):
    # yield the file one chunk (DataFrame) at a time
    for chunk in pd.read_csv(filename, delimiter=',', iterator=True,
                             chunksize=chunk_size, parse_dates=[1]):
        yield chunk

def _generator(filename, chunk_size=10 ** 5):
    # re-yield each chunk; per-chunk processing would go here
    for chunk in chunck_generator(filename, chunk_size=chunk_size):
        yield chunk

if __name__ == "__main__":
    filename = r'file.csv'
    for chunk in _generator(filename):
        print(chunk)
