I have a file with 50 million records and a list of indexes that I need to drop from it. If I read the whole file into a pandas DataFrame, I can run into memory issues on a machine with limited memory. Say I do this:
df = pd.read_csv('input_file')
df = df.drop(df.index[example_ix_list])
df.to_csv('input_file', index=False)
I might run into memory issues:
File "/home/ec2-user/CloudMatcher/cloudmatcher/core/execution/user_interaction.py", line 768, in process
new_unlabel_df = unlabel_df.drop(unlabel_df.index[list_ix])
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/generic.py", line 2162, in drop
dropped = self.reindex(**{axis_name: new_axis})
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/frame.py", line 2733, in reindex
**kwargs)
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/generic.py", line 2515, in reindex
fill_value, copy).__finalize__(self)
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/frame.py", line 2679, in _reindex_axes
fill_value, limit, tolerance)
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/frame.py", line 2690, in _reindex_index
allow_dups=False)
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/generic.py", line 2627, in _reindex_with_indexers
copy=copy)
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/internals.py", line 3897, in reindex_indexer
for blk in self.blocks]
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/internals.py", line 1046, in take_nd
allow_fill=True, fill_value=fill_value)
File "/home/ec2-user/anaconda2/envs/cloudmatch/lib/python2.7/site-packages/pandas/core/algorithms.py", line 1467, in take_nd
out = np.empty(out_shape, dtype=dtype)
MemoryError
Q: Can I read the file in chunks with pandas and drop the indexes from a list? If so, how? Or is there a better way that I'm missing?
Thanks a lot.
2 Answers
#1 (score: 3)
Try this:
pd.read_csv('input_file', skiprows=example_ix_list).to_csv('input_file', index=False)
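One caveat worth flagging: skiprows counts physical file lines, and line 0 is the header, so the row with DataFrame label i sits on file line i + 1. A hedged adjustment, assuming example_ix_list holds labels from a default RangeIndex:
# Shift each index by one so the header (file line 0) is kept.
pd.read_csv('input_file', skiprows=[i + 1 for i in example_ix_list]) \
    .to_csv('input_file', index=False)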
If you still get a MemoryError, you can use the chunksize parameter:
example_ix_list = pd.Index(example_ix_list)
header = True
for df in pd.read_csv('input_file', chunksize=10**5):
    # With the default RangeIndex, pandas numbers rows continuously
    # across chunks, so df.index holds global row positions.
    df.loc[df.index.difference(example_ix_list)] \
        .to_csv('new_file_name', index=False, header=header, mode='a')
    header = False  # write the column names only for the first chunk
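If even chunked pandas proves too heavy, the same filtering can be done as a single streaming pass with the standard csv module, which holds only one row in memory at a time. A minimal Python 3 sketch, assuming example_ix_list contains zero-based data-row positions (header excluded):
import csv

drop = set(example_ix_list)  # a set gives O(1) membership tests

with open('input_file', newline='') as src, \
        open('new_file_name', 'w', newline='') as dst:
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))  # copy the header line through unchanged
    for i, row in enumerate(reader):
        if i not in drop:  # i is the zero-based data-row position
            writer.writerow(row)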
#2 (score: -1)
You can pass a chunksize parameter to the read_table() or read_csv() functions:
pd.read_csv('fname.csv', sep=',', chunksize=4)
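With chunksize set, read_csv returns an iterator of DataFrames rather than a single frame, so each piece can be processed and discarded in turn. A quick sketch (process() is a placeholder for your own logic):
for chunk in pd.read_csv('fname.csv', sep=',', chunksize=4):
    # each chunk is an ordinary DataFrame holding up to 4 rows
    process(chunk)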
There is further information in the documentation. Have you checked it?