Concatenating multiple csv files into a single csv with the same header - Python

Time: 2021-06-17 20:30:40

I am currently using the below code to import 6,000 csv files (with headers) and export them into a single csv file (with a single header row).

import glob
import pandas as pd

# import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
stockstats_data = pd.DataFrame()
list_ = []

for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None)
    list_.append(df)
    stockstats_data = pd.concat(list_)
    print(file_ + " has been imported.")

This code works fine, but it is slow. It can take up to 2 days to process.
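
A likely reason for the slowness is that pd.concat(list_) runs inside the loop, so the entire accumulated data is copied again for every one of the 6,000 files. A minimal sketch of the same import with a single concatenation at the end; load_csv_folder is a hypothetical helper name, not something from the original post, and it assumes the folder contains at least one CSV file (pd.concat raises ValueError on an empty list):

```python
import glob

import pandas as pd


def load_csv_folder(folder):
    """Read every *.csv in `folder` and return one combined DataFrame.

    Hypothetical helper for illustration; assumes all files share the
    same columns and that at least one file exists.
    """
    paths = sorted(glob.glob(folder + "/*.csv"))
    # Parse each file exactly once; nothing is concatenated in the loop.
    frames = [pd.read_csv(p, index_col=None) for p in paths]
    # One concat at the end copies the rows a single time.
    return pd.concat(frames, ignore_index=True)
```

With the folder from the question this would be called as stockstats_data = load_csv_folder(r'data/US/market/merged_data'), followed by stockstats_data.to_csv(...) to write the merged file with a single header row.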

I was given a one-line shell script for the terminal that does the same thing (but without headers). It takes 20 seconds.

 for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> merged.csv; done 

Does anyone know how I can speed up the first Python script? To cut the time down, I have thought about not importing it into a DataFrame and just concatenating the CSVs, but I cannot figure it out.

Thanks.

3 Solutions

#1


6  

If you don't need the CSV data in memory and are only copying from input to output, it is much cheaper to skip parsing entirely and stream each file straight to the output without accumulating anything in memory:

import glob
import shutil

# import csv files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
            print(fname + " has been imported.")

That's it; shutil.copyfileobj copies the data efficiently in blocks, which removes nearly all of the Python-level work of parsing and reserializing.

This assumes all the CSV files have the same format, encoding, line endings, etc., and the header doesn't contain embedded newlines, but if that's the case, it's a lot faster than the alternatives.
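
Some of these assumptions can be checked cheaply during the copy. A minimal sketch, not part of the original answer: merge_csv_raw is a hypothetical helper that compares each file's header line with the first one and appends a newline when a file does not end with one (a case where a raw byte copy would otherwise glue the last row of one file to the first row of the next). It assumes "\n" or "\r\n" line endings and skips empty files:

```python
import os
import shutil


def merge_csv_raw(paths, out_path):
    """Byte-copy CSV files into out_path, keeping only the first header.

    Hypothetical helper for illustration; raises ValueError if a file's
    header line differs from the header of the first non-empty file.
    """
    first_header = None
    with open(out_path, "wb") as outfile:
        for path in paths:
            with open(path, "rb") as infile:
                header = infile.readline()
                if not header:
                    continue  # Empty file: nothing to copy.
                if first_header is None:
                    first_header = header
                    outfile.write(header)
                elif header.rstrip(b"\r\n") != first_header.rstrip(b"\r\n"):
                    raise ValueError("header mismatch in %s" % path)
                # Block copy the remaining bytes without parsing them.
                shutil.copyfileobj(infile, outfile)
                # If the file did not end with a newline, add one so the
                # next file's first row starts on its own line.
                infile.seek(-1, os.SEEK_END)
                if infile.read(1) != b"\n":
                    outfile.write(b"\n")
```

It would be called as merge_csv_raw(allFiles, 'someoutputfile.csv'). The header comparison costs one short bytes comparison per file, so the copy should stay close to the speed of the plain version.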

#2


4  

Are you required to do this in Python? If you are open to doing this entirely in shell, all you'd need to do is first cat the header row from a randomly selected input .csv file into merged.csv before running your one-liner:

head -n 1 a-randomly-selected-csv-file.csv > merged.csv
# merged.csv now matches *.csv too, so skip it inside the loop
for f in *.csv; do [ "$f" = merged.csv ] || tail -n +2 "$f" >> merged.csv; done

#3


0  

You don't need pandas for this; the standard csv module works fine.

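import csv
import glob

# Same input folder as in the question
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")

df_out_filename = 'df_out.csv'
write_headers = True
# newline='' lets the csv module handle line endings itself (Python 3)
with open(df_out_filename, 'w', newline='') as fout:
    writer = csv.writer(fout)
    for filename in allFiles:
        with open(filename, newline='') as fin:
            reader = csv.reader(fin)
            headers = next(reader)  # First row of each file is its header.
            if write_headers:
                write_headers = False  # Only write headers once.
                writer.writerow(headers)
            writer.writerows(reader)  # Write all remaining rows.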
