I'm guessing this is an easy fix, but I'm running into an issue where saving a pandas DataFrame to a CSV file with to_csv() takes nearly an hour. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.
import os
import glob
import pandas as pd

src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

# 1 - Takes 2 min to read 20m records from 30 files
stage = pd.DataFrame()
for file_ in sorted(src_files):
    iter_csv = pd.read_csv(file_
                           , sep=','
                           , index_col=False
                           , header=0
                           , low_memory=False
                           , iterator=True
                           , chunksize=100000
                           , compression='gzip'
                           , memory_map=True
                           , encoding='utf-8')
    df = pd.concat([chunk for chunk in iter_csv])
    stage = stage.append(df, ignore_index=True)
# 2 - Takes 55 min to write 20m records from one dataframe
stage.to_csv('output.csv'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , encoding='utf-8')
del stage
I've confirmed the hardware and memory aren't the problem, but these are fairly wide tables (~100 columns) of mostly numeric (decimal) data.
Thank you,
1 Answer
You are reading compressed files but writing a plaintext file, so this could be an I/O bottleneck.
Writing a compressed file could speed up the write by as much as 10x:
stage.to_csv('output.csv.gz'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , compression='gzip'
             , encoding='utf-8')
Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz').
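The compression trade-off is easy to measure yourself. The sketch below is a minimal, hypothetical benchmark (not from the answer): the frame size is made up for illustration, and 'xz' support assumes a Python build with the lzma module. It writes the same synthetic numeric frame with each compression method and reports elapsed time and file size.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Hypothetical benchmark: a small synthetic frame of numeric columns,
# written once per compression method supported by to_csv().
df = pd.DataFrame(np.random.rand(20000, 10))

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for comp in (None, 'gzip', 'bz2', 'xz'):
        path = os.path.join(tmp, 'out.csv' + ('.' + comp if comp else ''))
        start = time.perf_counter()
        df.to_csv(path, sep='|', index=False, compression=comp)
        # Record write time and on-disk size for this method
        results[comp] = (time.perf_counter() - start, os.path.getsize(path))

for comp, (secs, size) in results.items():
    print('%-5s  %6.2f s  %10d bytes' % (comp, secs, size))
```

On spinning disks or network filesystems the smaller compressed output often wins overall despite the CPU cost; on fast local SSDs plain text may still be quickest, which is why measuring on your own hardware matters.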