Processing very large (>20GB) text files line by line

Time: 2021-08-16 21:37:39

I have a number of very large text files which I need to process, the largest being about 60GB.


Each line has 54 characters in seven fields and I want to remove the last three characters from each of the first three fields - which should reduce the file size by about 20%.


I am brand new to Python and have a code which will do what I want to do at about 3.4 GB per hour, but to be a worthwhile exercise I really need to be getting at least 10 GB/hr - is there any way to speed this up? This code doesn't come close to challenging my processor, so I am making an uneducated guess that it is limited by the read and write speed to the internal hard drive?


def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:
        x = l.split(' ')[0]
        y = l.split(' ')[1]
        z = l.split(' ')[2]
        w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
        l = r.readline()
    r.close()
    w.close()

Any help would be really appreciated. I am using the IDLE Python GUI on Windows 7 and have 16GB of memory - perhaps a different OS would be more efficient?

Edit: Here is an extract of the file to be processed.


70700.642014 31207.277115 -0.054123 -1585 255 255 255
70512.301468 31227.990799 -0.255600 -1655 155 158 158
70515.727097 31223.828659 -0.066727 -1734 191 187 180
70566.756699 31217.065598 -0.205673 -1727 254 255 255
70566.695938 31218.030807 -0.047928 -1689 249 251 249
70536.117874 31227.837662 -0.033096 -1548 251 252 252
70536.773270 31212.970322 -0.115891 -1434 155 158 163
70533.530777 31215.270828 -0.154770 -1550 148 152 156
70533.555923 31215.341599 -0.138809 -1480 150 154 158
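
For example, trimming the last three characters from each of the first three fields would turn the first sample line into:

70700.642 31207.277 -0.054 -1585 255 255 255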

11 Answers

#1


22  

It's more idiomatic to write your code like this


def ProcessLargeTextFile():
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            w.write(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))

The main saving here is to just do the split once, but if the CPU is not being taxed, this is likely to make very little difference


It may help to save up a few thousand lines at a time and write them in one hit to reduce thrashing of your harddrive. A million lines is only 54MB of RAM!


def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z = line.split(' ')[:3]
            bunch.append(line.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)

As suggested by @Janne, here is an alternative way to generate the lines:

def ProcessLargeTextFile():
    bunchsize = 1000000     # Experiment with different sizes
    bunch = []
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            bunch.append(' '.join((x[:-3], y[:-3], z[:-3], rest)))
            if len(bunch) == bunchsize:
                w.writelines(bunch)
                bunch = []
        w.writelines(bunch)

#2


12  

Measure! You have received quite a few useful hints on how to improve your Python code, and I agree with them. But you should first figure out what your real problem is. My first steps to find your bottleneck would be:

  • Remove any processing from your code. Just read and write the data and measure the speed (a minimal timing sketch follows this list). If just reading and writing the files is already too slow, it's not a problem with your code.
  • If just reading and writing is already slow, try using multiple disks. You are reading and writing at the same time. On the same disk? If yes, try different disks and measure again.
  • Some async I/O library (Twisted?) might help too.
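
As a minimal sketch of that first measurement (the file paths and the function name are placeholders), a pure copy loop with a timer around it would look roughly like this:

import time

def measure_copy(inpath, outpath):
    # Copy the file line by line with no processing at all, just to time raw I/O
    start = time.time()
    nbytes = 0
    with open(inpath, "r") as r, open(outpath, "w") as w:
        for line in r:
            nbytes += len(line)
            w.write(line)
    elapsed = time.time() - start
    print("copied %.1f MB in %.1f s (%.2f MB/s)" % (nbytes / 1e6, elapsed, nbytes / 1e6 / elapsed))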

Once you have figured out the exact problem, ask again about optimizing that specific part.

#3


7  

As you don't seem to be limited by CPU, but rather by I/O, have you tried with some variations on the third parameter of open?


Indeed, this third parameter can be used to set the buffer size used for file operations!

Simply writing open("filepath", "r", 16777216) will use 16 MB buffers when reading from the file. It should help.

Use the same for the output file, then measure and compare, keeping everything else identical.
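
As a sketch (the paths and the BUF name are placeholders), that would amount to something like:

BUF = 16 * 1024 * 1024   # 16 MB buffer for both files

def ProcessLargeTextFile():
    # The third argument of open() sets the buffer size used for the file
    with open("filepath", "r", BUF) as r, open("outfilepath", "w", BUF) as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))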

Note: this is the same kind of optimization suggested by others, but you get it here for free, without changing your code and without having to do the buffering yourself.

#4


7  

I'll add this answer to explain why buffering makes sense, and to offer one more solution.

You are getting breathtakingly bad performance. The article "Is it possible to speed-up python IO?" shows that a 10 GB read should take in the neighborhood of 3 minutes, and sequential write is the same speed. So you're missing a factor of 30, and your performance target is still 10 times slower than what ought to be possible.

Almost certainly this kind of disparity lies in the number of head seeks the disk is doing. A head seek takes milliseconds, and a single seek corresponds to several megabytes of sequential read/write. Enormously expensive. Copy operations on the same disk require seeking between input and output. As has been stated, one way to reduce seeks is to buffer in such a way that many megabytes are read before writing to disk, and vice versa. If you can convince the Python I/O system to do this, great. Otherwise you can read and process lines into a string array and then write after perhaps 50 MB of output are ready. At that size, a seek will induce a <10% performance hit with respect to the data transfer itself.
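
A rough sketch of that size-based batching (the 50 MB threshold and the file paths are only illustrative):

FLUSH_BYTES = 50 * 1024 * 1024   # flush the output roughly every 50 MB

def ProcessLargeTextFile():
    pending = []
    pending_bytes = 0
    with open("filepath", "r") as r, open("outfilepath", "w") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            out = ' '.join((x[:-3], y[:-3], z[:-3], rest))
            pending.append(out)
            pending_bytes += len(out)
            if pending_bytes >= FLUSH_BYTES:
                w.writelines(pending)
                pending = []
                pending_bytes = 0
        w.writelines(pending)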

The other very simple way to eliminate seeks between input and output files altogether is to use a machine with two physical disks and fully separate I/O channels for each. Input from one, output to the other. If you're doing lots of big file transformations, it's good to have a machine with this feature.

#5


3  

Your code is rather un-idiomatic and makes far more function calls than needed. A simpler version is:


def ProcessLargeTextFile():
    with open("filepath") as r, open("output", "w") as w:
        for line in r:
            fields = line.split(' ')
            fields[0:3] = [fields[0][:-3],
                           fields[1][:-3],
                           fields[2][:-3]]
            w.write(' '.join(fields))

and I don't know of a modern filesystem that is slower than Windows. Since it appears you are using these huge data files as databases, have you considered using a real database?


Finally, if you are just interested in reducing file size, have you considered compressing / zipping the files?


#6


3  

ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("filepath", "w")
    l = r.readline()
    while l:

As has been suggested already, you may want to use a for loop to make this more optimal.


    x = l.split(' ')[0]
    y = l.split(' ')[1]
    z = l.split(' ')[2]

You are performing the split operation three times here; depending on the size of each line, this will have a detrimental impact on performance. You should split once and assign x, y, z from the entries of the list that comes back.

    w.write(l.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3]))

You are writing each line to the file immediately after reading it, which is very I/O intensive. You should consider buffering your output in memory and pushing it to disk periodically. Something like this:

BUFFER_SIZE_LINES = 1024  # Maximum number of lines to buffer in memory

def ProcessLargeTextFile():
    r = open("filepath", "r")
    w = open("outfilepath", "w")
    buf = ""
    bufLines = 0
    for lineIn in r:
        x, y, z = lineIn.split(' ')[:3]
        lineOut = lineIn.replace(x,x[:-3]).replace(y,y[:-3]).replace(z,z[:-3])
        buf += lineOut          # lineIn already ends with a newline
        bufLines += 1

        if bufLines >= BUFFER_SIZE_LINES:
            # Flush buffer to disk
            w.write(buf)
            buf = ""
            bufLines = 0

    # Flush remaining buffer to disk
    w.write(buf)
    r.close()
    w.close()

You can tweak BUFFER_SIZE_LINES to find an optimal balance between memory usage and speed.

#7


2  

Those seem like very large files... Why are they so large? What processing are you doing per line? Why not use a database, with some map-reduce calls (if appropriate) or simple operations on the data? The point of a database is to abstract the handling and management of large amounts of data that can't all fit in memory.

You can start to play with the idea with sqlite3 which just uses flat files as databases. If you find the idea useful then upgrade to something a little more robust and versatile like postgresql.


Create a database


import sqlite3

conn = sqlite3.connect('pts.db')
c = conn.cursor()

Create a table

c.execute('''CREATE TABLE ptsdata (filename, line, x, y, z)''')

Then use one of the algorithms above to insert all the lines and points into the database by calling

c.execute("INSERT INTO ptsdata VALUES (filename, lineNumber, x, y, z)")

Now how you use it depends on what you want to do. For example, to work with all the points in a file, run a query:

c.execute("SELECT lineNumber, x, y, z FROM ptsdata WHERE filename=file.txt ORDER BY lineNumber ASC")

And get n lines at a time from this query with


c.fetchmany(size=n)

I'm sure there is a better wrapper for the sql statements somewhere, but you get the idea.

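Putting those pieces together, a small end-to-end sketch (the function name, batch size, and IF NOT EXISTS clause here are just illustrative) could look like:

import sqlite3

def load_points(path):
    # Load one big text file into sqlite, trimming the first three fields on the way in
    conn = sqlite3.connect('pts.db')
    c = conn.cursor()
    c.execute('CREATE TABLE IF NOT EXISTS ptsdata (filename, line, x, y, z)')
    rows = []
    with open(path) as f:
        for lineno, l in enumerate(f):
            x, y, z = l.split(' ')[:3]
            rows.append((path, lineno, x[:-3], y[:-3], z[:-3]))
            if len(rows) == 100000:   # insert in batches to keep memory bounded
                c.executemany("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)", rows)
                rows = []
    c.executemany("INSERT INTO ptsdata VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()
    conn.close()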

#8


2  

Since you only mention saving space as a benefit, is there some reason you can't just store the files gzipped? That should save 70% and up on this data. Or consider getting NTFS to compress the files if random access is still important. You'll get much more dramatic savings on I/O time after either of those.

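As a sketch of the gzip route (Python 3, with placeholder file names), the processing loop barely changes because gzip.open returns a file-like object:

import gzip

def ProcessLargeTextFileGz():
    # "rt"/"wt" open the compressed files in text mode
    with gzip.open("input.txt.gz", "rt") as r, gzip.open("output.txt.gz", "wt") as w:
        for line in r:
            x, y, z, rest = line.split(' ', 3)
            w.write(' '.join((x[:-3], y[:-3], z[:-3], rest)))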

More importantly, where is your data stored such that you're only getting 3.4 GB/hr? That's down around USBv1 speeds.

#9


1  

Read the file using for l in r: to benefit from buffering.


#10


0  

You can try saving your split result the first time you do it, rather than re-splitting every time you need a field. Maybe this will speed things up.

You can also try not running it in the GUI; run it from cmd instead.

#11


0  

Here's code for loading text files of any size without causing memory issues. It supports gigabyte-sized files and will run smoothly on any kind of machine; you just need to configure CHUNK_SIZE based on your system RAM. The larger the CHUNK_SIZE, the more data is read at a time.

https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d


Download the file data_loading_utils.py and import it into your code.

Usage:

import data_loading_utils

file_name = 'file_name.ext'
CHUNK_SIZE = 1000000


def process_lines(line, eof, file_name):
    # check if end of file reached
    if not eof:
        # process data; line is one single line of the file
        pass
    else:
        # end of file reached
        pass

data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)

process_lines is the callback function. It will be called for every line, with the line parameter representing one single line of the file at a time.

You can configure the variable CHUNK_SIZE depending on your machine hardware configurations.

