Python text file processing speed problem

Time: 2022-09-20 12:16:44

I'm having a problem with processing a largeish file in Python. All I'm doing is

import gzip

counter = 0
f = gzip.open(pathToLog, 'r')
for line in f:
        counter = counter + 1
        if (counter % 1000000 == 0):
                print counter
f.close()

This takes around 10m25s just to open the file, read the lines and increment this counter.

In perl, dealing with the same file and doing quite a bit more (some regular expression stuff), the whole process takes around 1m17s.

Perl Code:

open(LOG, "/bin/zcat $logfile |") or die "Cannot read $logfile: $!\n";
while (<LOG>) {
        if (m/.*\[svc-\w+\].*login result: Successful\.$/) {
                $_ =~ s/some regex here/$1,$2,$3,$4/;
                push @an_array, $_
        }
}
close LOG;

Can anyone advise what I can do to make the Python solution run at a similar speed to the Perl solution?

EDIT I've tried just uncompressing the file and dealing with it using open instead of gzip.open, but that only changes the total time to around 4m14.972s, which is still too slow.

I also removed the modulo and print statements and replaced them with pass, so all that is being done now is iterating through the file.

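In other words, the loop being timed is now roughly the following (my reconstruction from the description above; the exact stripped-down code isn't shown):

counter = 0
f = open(pathToLog, 'r')   # uncompressed copy of the log
for line in f:
    counter = counter + 1
    pass                   # modulo check and print replaced with pass
f.close()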

5 Answers

#1


9  

In Python (at least <= 2.6.x), gzip format parsing is implemented in Python (on top of zlib). What's more, it appears to be doing some strange things, namely decompressing to the end of the file into memory and then discarding everything beyond the requested read size (then doing it again for the next read). DISCLAIMER: I've only looked at gzip.read() for 3 minutes, so I could be wrong here. Regardless of whether my understanding of gzip.read() is correct or not, the gzip module does not appear to be optimized for large data volumes. Try doing the same thing as in Perl, i.e. launching an external process (e.g. see the subprocess module).

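For illustration, a minimal sketch of that approach (a sketch only, assuming zcat is on the PATH and reusing pathToLog from the question):

import subprocess

counter = 0
# assumption: zcat is installed; pathToLog names the compressed log
p = subprocess.Popen(["zcat", pathToLog], stdout=subprocess.PIPE)
for line in p.stdout:  # stream the decompressed output line by line
    counter = counter + 1
    if (counter % 1000000 == 0):
        print counter
p.stdout.close()
p.wait()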


EDIT Actually, I missed the OP's remark about plain file I/O being just as slow as compressed (thanks to ire_and_curses for pointing it out). This struck me as unlikely, so I did some measurements...

from timeit import Timer
import gzip

def w(n):
    L = "*"*80+"\n"
    with open("ttt", "w") as f:
        for i in xrange(n) :
            f.write(L)

def r():
    with open("ttt", "r") as f:
        for n,line in enumerate(f) :
            if n % 1000000 == 0 :
                print n

def g():
    f = gzip.open("ttt.gz", "r")
    for n,line in enumerate(f) :
        if n % 1000000 == 0 :
            print n

Now, running it...

>>> Timer("w(10000000)", "from __main__ import w").timeit(1)
14.153118133544922
>>> Timer("r()", "from __main__ import r").timeit(1)
1.6482770442962646
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)

...and after having a tea break and discovering that it's still running, I've killed it, sorry. Then I tried 100'000 lines instead of 10'000'000:

>>> Timer("w(100000)", "from __main__ import w").timeit(1)
0.05810999870300293
>>> Timer("r()", "from __main__ import r").timeit(1)
0.09662318229675293
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)
11.939290046691895

The gzip module's time is O(file_size**2), so with the number of lines on the order of millions, gzip read time simply cannot be the same as plain read time (as the experiment confirms). Anonymouslemming, please check again.

#2


5  

If you Google "why is python gzip slow" you'll find plenty of discussion of this, including patches for improvements in Python 2.7 and 3.2. In the meantime, use zcat as you did in Perl, which is wicked fast. Your (first) function takes about 4.19s for me with a 5MB compressed file, and the second function takes 0.78s. However, I don't know what's going on with your uncompressed files. If I uncompress the log files (apache logs) and run the two functions on them with a simple Python open(file) and Popen('cat'), Python is faster (0.17s) than cat (0.48s).

#!/usr/bin/python

import gzip
from subprocess import PIPE, Popen
import sys
import timeit

#pathToLog = 'big.log.gz' # 50M compressed (*10 uncompressed)
pathToLog = 'small.log.gz' # 5M ""

def test_ori():
    counter = 0
    f = gzip.open(pathToLog, 'r')
    for line in f:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line
    f.close()

def test_new():
    counter = 0
    content = Popen(["zcat", pathToLog], stdout=PIPE).communicate()[0].split('\n')
    for line in content:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line

if '__main__' == __name__:
    to = timeit.Timer('test_ori()', 'from __main__ import test_ori')
    print "Original function time", to.timeit(1)

    tn = timeit.Timer('test_new()', 'from __main__ import test_new')
    print "New function time", tn.timeit(1)

#3


2  

I spent a while on this. Hopefully this code will do the trick. It uses zlib and no external calls.

The gunzipchunks method reads the compressed gzip file in chunks which can be iterated over (generator).

The gunziplines method reads these uncompressed chunks and provides you with one line at a time which can also be iterated over (another generator).

Finally, the gunziplinescounter method gives you what you're looking for.

Cheers!

import zlib

file_name = 'big.txt.gz'
#file_name = 'mini.txt.gz'

#for i in gunzipchunks(file_name): print i
def gunzipchunks(file_name,chunk_size=4096):
    inflator = zlib.decompressobj(16+zlib.MAX_WBITS)
    f = open(file_name,'rb')
    while True:
        packet = f.read(chunk_size)
        if not packet: break
        to_do = inflator.unconsumed_tail + packet
        while to_do:
            decompressed = inflator.decompress(to_do, chunk_size)
            if not decompressed:
                to_do = None
                break
            yield decompressed
            to_do = inflator.unconsumed_tail
    leftovers = inflator.flush()
    if leftovers: yield leftovers
    f.close()

#for i in gunziplines(file_name): print i
def gunziplines(file_name,leftovers="",line_ending='\n'):
    for chunk in gunzipchunks(file_name): 
        chunk = "".join([leftovers,chunk])
        while line_ending in chunk:
            line, leftovers = chunk.split(line_ending,1)
            yield line
            chunk = leftovers
    if leftovers: yield leftovers

def gunziplinescounter(file_name):
    for counter,line in enumerate(gunziplines(file_name)):
        if (counter % 1000000 != 0): continue
        print "%12s: %10d" % ("checkpoint", counter)
    print "%12s: %10d" % ("final result", counter)
    print "DEBUG: last line: [%s]" % (line)

gunziplinescounter(file_name)

This should run a whole lot faster than using the builtin gzip module on extremely large files.

#4


0  

It took your computer 10 minutes? It must be your hardware. I wrote this function to write 5 million lines:

def write():
    fout = open('log.txt', 'w')
    for i in range(5000000):
        fout.write(str(i/3.0) + "\n")
    fout.close()

Then I read it with a program much like yours:

def read():
    fin = open('log.txt', 'r')
    counter = 0
    for line in fin:
        counter += 1
        if counter % 1000000 == 0:
            print counter
    fin.close()

It took my computer about 3 seconds to read all 5 million lines.

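One way to reproduce that measurement (a sketch, reusing the timeit pattern from the first answer; write() and read() are the functions defined above):

from timeit import Timer

# time one run of each function, assuming they live in __main__
print Timer("write()", "from __main__ import write").timeit(1)
print Timer("read()", "from __main__ import read").timeit(1)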

#5


0  

Try using StringIO to buffer the output from the gzip module. The following code to read a gzipped pickle cut the execution time of my code by well over 90%.

Instead of...

import cPickle
import gzip

# Use gzip to open/read the pickle.
lPklFile = gzip.open("test.pkl", 'rb')
lData = cPickle.load(lPklFile)
lPklFile.close()

Use...

import cStringIO, cPickle, gzip, os

# Use gzip to open the pickle.
lPklFile = gzip.open("test.pkl", 'rb')

# Copy the pickle into a cStringIO.
lInternalFile = cStringIO.StringIO()
lInternalFile.write(lPklFile.read())
lPklFile.close()

# Set the seek position to the start of the StringIO, and read the
# pickled data from it.
lInternalFile.seek(0, os.SEEK_SET)
lData = cPickle.load(lInternalFile)
lInternalFile.close()
