A fast way to merge large files (>= 7 GB) into one

Time: 2020-12-19 16:48:56

I have three huge files, each with just two columns, and I need both columns. I want to merge them into one file, which I can then write to a SQLite database.

I used Python and got the job done, but it took over 30 minutes and also hung my system for 10 of those minutes. I was wondering if there is a faster way using awk or any other Unix tool. A faster way within Python would be great too. The code is below:

'''We have tweets of three months in 3 different files.
Combine them to a single file '''
import sys, os
data1 = open(sys.argv[1], 'r')
data2 = open(sys.argv[2], 'r')
data3 = open(sys.argv[3], 'r')
data4 = open(sys.argv[4], 'w')
for line in data1:
    data4.write(line)
data1.close()
for line in data2:
    data4.write(line)
data2.close()
for line in data3:
    data4.write(line)
data3.close()
data4.close()

3 Answers

#1 (12 votes)

The standard Unix way to merge files is cat. It may not be much faster, but it will be faster.

cat file1 file2 file3 > bigfile

Rather than making a temporary file, you may be able to pipe cat directly into sqlite:

cat file1 file2 file3 | sqlite database
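
If the end goal is the SQLite database rather than a merged text file, another option is to skip the intermediate file and load the rows directly with Python's built-in sqlite3 module. This is only a sketch: the database name, the tweets table, its two columns, and the tab delimiter below are assumptions for illustration and need to match the real data.

import csv
import sqlite3
import sys

# Assumed schema: two tab-separated columns per line; names below are illustrative.
conn = sqlite3.connect('tweets.db')
conn.execute('CREATE TABLE IF NOT EXISTS tweets (col1 TEXT, col2 TEXT)')
for path in sys.argv[1:]:                       # e.g. file1 file2 file3
    with open(path, newline='') as src:
        rows = csv.reader(src, delimiter='\t')  # adjust the delimiter to the real format
        conn.executemany('INSERT INTO tweets VALUES (?, ?)', rows)
    conn.commit()                               # one commit per input file
conn.close()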

In Python, you will probably get better performance if you copy the files in blocks rather than line by line. Use file.read(65536) to read 64 KB of data at a time, rather than iterating through the files with a for loop.

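For reference, here is a minimal sketch of that block-copy approach, assuming the same command-line convention as the original script (input files first, output file last); the 64 KB block size is the one suggested above:

import sys

# Copy each input file into the output in 64 KB blocks instead of line by line.
with open(sys.argv[-1], 'wb') as dst:           # last argument is the output file
    for path in sys.argv[1:-1]:                 # remaining arguments are the inputs
        with open(path, 'rb') as src:
            while True:
                block = src.read(65536)         # read 64 KB at a time
                if not block:
                    break
                dst.write(block)

shutil.copyfileobj(src, dst, 65536) from the standard library performs the same copy loop and could replace the inner while loop.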

#2 (2 votes)

On UNIX-like systems:

cat file1 file2 file3 > file4

#3 (1 vote)

I'm assuming that you need to repeat this process and that speed is a critical factor.

Try opening the files as binary files and experiment with the size of the block that you are reading. Try 4096 and 8192 bytes as these are common underlying buffer sizes.

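A small sketch of how such an experiment might look; the file names and the candidate block sizes are illustrative, and the timings will depend heavily on the disk and the OS cache:

import time

def copy_with_block_size(src_path, dst_path, block_size):
    '''Copy src to dst in binary mode with the given block size; return elapsed seconds.'''
    start = time.perf_counter()
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            dst.write(block)
    return time.perf_counter() - start

for size in (4096, 8192, 65536):                # candidate block sizes to compare
    elapsed = copy_with_block_size('file1', 'scratch.out', size)
    print(f'{size:>6} bytes: {elapsed:.2f} s')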

There is a similar question, Is it possible to speed-up python IO?, that might be of interest too.
