Pandas to_csv() is slow when saving a large dataframe

I'm guessing this is an easy fix, but it's taking nearly an hour to save a pandas dataframe to a CSV file using the to_csv() function. I'm using Anaconda Python 2.7.12 with pandas 0.19.1.

import os
import glob
import pandas as pd

src_files = glob.glob(os.path.join('/my/path', "*.csv.gz"))

# 1 - Takes 2 min to read 20m records from 30 files
# Create the accumulator once, before the loop, so every file is kept
stage = pd.DataFrame()
for file_ in sorted(src_files):
    iter_csv = pd.read_csv(file_
                     , sep=','
                     , index_col=False
                     , header=0
                     , low_memory=False
                     , iterator=True
                     , chunksize=100000
                     , compression='gzip'
                     , memory_map=True
                     , encoding='utf-8')

    df = pd.concat([chunk for chunk in iter_csv])
    stage = stage.append(df, ignore_index=True)

# 2 - Takes 55 min to write 20m records from one dataframe
stage.to_csv('output.csv'
             , sep='|'
             , header=True
             , index=False
             , chunksize=100000
             , encoding='utf-8')

del stage

I've confirmed the hardware and memory are working, but these are fairly wide tables (~100 columns) of mostly numeric (decimal) data.

Thank you,

1 solution

#1

You are reading compressed files and writing a plaintext file, so this could be an I/O bottleneck.

Writing a compressed file could speed up the write by up to 10x:

    stage.to_csv('output.csv.gz'
         , sep='|'
         , header=True
         , index=False
         , chunksize=100000
         , compression='gzip'
         , encoding='utf-8')

Additionally, you could experiment with different chunk sizes and compression methods ('bz2', 'xz').
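One way to run that experiment is to time each compression method on a small sample before committing to a full 20m-row write. The following is only a rough sketch: the helper name time_to_csv, the synthetic sample frame, and the temporary output directory are made up for illustration, and it assumes Python 3 (for time.perf_counter and tempfile.TemporaryDirectory). The timings depend heavily on disk and CPU, so the "up to 10x" figure may or may not show up on your machine.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def time_to_csv(df, out_dir, compressions=(None, "gzip", "bz2", "xz"),
                chunksize=100000):
    """Write df once per compression method; return {method: (seconds, bytes)}."""
    suffix = {None: "", "gzip": ".gz", "bz2": ".bz2", "xz": ".xz"}
    results = {}
    for method in compressions:
        path = os.path.join(out_dir, "output.csv" + suffix[method])
        start = time.perf_counter()
        df.to_csv(path, sep="|", header=True, index=False,
                  chunksize=chunksize, compression=method, encoding="utf-8")
        elapsed = time.perf_counter() - start
        results[method] = (elapsed, os.path.getsize(path))
    return results


# Hypothetical stand-in for the real data: 2,000 rows x 100 decimal columns.
rng = np.random.RandomState(0)
sample = pd.DataFrame(rng.rand(2000, 100),
                      columns=["c%d" % i for i in range(100)])

with tempfile.TemporaryDirectory() as tmp:
    results = time_to_csv(sample, tmp)

# Print elapsed time and file size for each method (values vary by machine).
for method, (seconds, size) in results.items():
    print("%-5s %6.2f s %12d bytes" % (method, seconds, size))
```

To try different chunk sizes as well, call time_to_csv again with another chunksize value and compare the elapsed times. If the disk really is the bottleneck, the compressed writes should finish sooner despite the extra CPU work; if they are slower, the time is probably going into formatting the numbers rather than into I/O.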
