I want to read some quite huge files (to be precise: the Google Ngram 1-word dataset) and count how many times each character occurs. Now I wrote this script:
import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0, 9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files):
    line = line.strip()
    data = line.split('\t')
    for character in list(data[0]):
        if character not in charcounts:
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if fileinput.filename() != lastfile:
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if fileinput.filelineno() % 100000 == 0:
        print(fileinput.filelineno())
print(charcounts)
which works fine until it reaches approximately line 700,000 of the first file, where I get this error:
../../datasets/googlebooks-eng-all-1gram-20090715-0.csv
100000
200000
300000
400000
500000
600000
700000
Traceback (most recent call last):
  File "charactercounter.py", line 5, in <module>
    for line in fileinput.input(files):
  File "C:\Python31\lib\fileinput.py", line 254, in __next__
    line = self.readline()
  File "C:\Python31\lib\fileinput.py", line 349, in readline
    self._buffer = self._file.readlines(self._bufsize)
  File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7771: character maps to <undefined>
To solve this I searched the web a bit, and came up with this code:
import fileinput
files = ['../../datasets/googlebooks-eng-all-1gram-20090715-%i.csv' % value for value in range(0, 9)]
charcounts = {}
lastfile = ''
for line in fileinput.input(files, False, '', 0, 'r', fileinput.hook_encoded('utf-8')):
    line = line.strip()
    data = line.split('\t')
    for character in list(data[0]):
        if character not in charcounts:
            charcounts[character] = 0
        charcounts[character] += int(data[1])
    if fileinput.filename() != lastfile:
        print(fileinput.filename())
        lastfile = fileinput.filename()
    if fileinput.filelineno() % 100000 == 0:
        print(fileinput.filelineno())
print(charcounts)
but the hook I now use tries to read the entire 990 MB file into memory at once, which pretty much crashes my PC. Does anyone know how to rewrite this code so that it actually works?
P.S.: the code hasn't even run all the way through yet, so I don't even know if it does what it has to do, but for that to happen I first need to fix this bug.
Oh, and I use Python 3.2
6 Answers
#1
7
I do not know why fileinput does not work as expected.
I suggest you use the open function instead. The return value can be iterated over and will return lines, just like fileinput.
The code will then be something like:
for filename in files:
    print(filename)
    for filelineno, line in enumerate(open(filename, encoding="utf-8")):
        line = line.strip()
        data = line.split('\t')
        # ...
Some documentation links: enumerate, open, io.TextIOWrapper (open returns an instance of TextIOWrapper).
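For completeness, a hedged sketch of how the question's counting loop might look with plain open(); everything below is taken from the question's own code, only restructured around per-file iteration:

charcounts = {}
for filename in files:
    print(filename)
    with open(filename, encoding="utf-8") as f:
        for filelineno, line in enumerate(f, start=1):
            data = line.strip().split('\t')
            # data[0] is the ngram, data[1] the count column used in the question
            for character in data[0]:
                charcounts[character] = charcounts.get(character, 0) + int(data[1])
            if filelineno % 100000 == 0:
                print(filelineno)
print(charcounts)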
#2
2
The problem is that fileinput doesn't use file.xreadlines(), which reads line by line, but file.readlines(bufsize), which reads bufsize bytes at once (and turns them into a list of lines). You are providing 0 for the bufsize parameter of fileinput.input() (which is also the default value). Bufsize 0 means that the whole file is buffered.
Solution: provide a reasonable bufsize.
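A minimal sketch of that fix, assuming the Python 3.2-era signature in which bufsize was still the fourth positional parameter of fileinput.input() (it was deprecated later and removed in Python 3.8):

import fileinput

# Same call as in the question, but with an explicit 1 MB buffer instead of 0.
hook = fileinput.hook_encoded('utf-8')
for line in fileinput.input(files, False, '', 1024 * 1024, 'r', hook):
    pass  # process the line as before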
#3
1
This works for me: you can specify the encoding in the hook definition. I used it on a 50 GB, 200M-line file with no problem:
fi = fileinput.FileInput(openhook=fileinput.hook_encoded("iso-8859-1"))
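The resulting FileInput object can then be iterated like the module-level fileinput.input(); a hedged usage sketch, borrowing the files list from the question:

import fileinput

fi = fileinput.FileInput(files, openhook=fileinput.hook_encoded("iso-8859-1"))
for line in fi:
    pass  # process the line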
#4
0
Could you try to read not the whole file, but a part of it as binary, then decode(), then process it, then call the function again to read another part?
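A minimal sketch of that chunked approach, assuming UTF-8 input; an incremental decoder is used so that a multi-byte character split across a chunk boundary still decodes correctly (the function name and chunk size are illustrative):

import codecs

def iter_lines(path, encoding='utf-8', chunk_size=1 << 20):
    # Read raw bytes in 1 MB chunks and decode them incrementally.
    decoder = codecs.getincrementaldecoder(encoding)()
    remainder = ''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            text = remainder + decoder.decode(chunk, final=not chunk)
            if not chunk:
                if text:
                    yield text  # trailing line without a newline
                break
            lines = text.split('\n')
            remainder = lines.pop()  # last piece may be a partial line
            for line in lines:
                yield line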
#5
0
I don't know if the one I have is the latest version (and I don't remember how I read them), but...
$ file -i googlebooks-eng-1M-1gram-20090715-0.csv
googlebooks-eng-1M-1gram-20090715-0.csv: text/plain; charset=us-ascii
Have you tried fileinput.hook_encoded('ascii') or fileinput.hook_encoded('latin_1')? Not sure why this would make a difference, since I think these are just subsets of Unicode with the same mapping, but worth a try.
EDIT: I think this might be a bug in fileinput; neither of these works.
#6
0
If you are worried about memory usage, why not read line by line using readline()? That will get rid of the memory issues you are running into. Currently you are reading the full file before performing any actions on the file object; with readline() you are not saving the data, merely searching it on a per-line basis.
def charCount1(_file, _char):
    result = []
    file = open(_file, encoding="utf-8")
    data = file.read()  # reads the entire file into memory at once
    file.close()
    for index, line in enumerate(data.split("\n")):
        if _char in line:
            result.append(index)
    return result
def charCount2(_file, _char):
    result = []
    count = 0
    file = open(_file, encoding="utf-8")
    while 1:
        line = file.readline()  # reads one line at a time
        if not line:
            break
        if _char in line:
            result.append(count)
        count += 1
    file.close()
    return result
I didn't have a chance to really look over your code, but the above samples should give you an idea of how to make the appropriate changes to your structure. charCount1() demonstrates your method, which caches the entire file in a single call to read(). I tested it on a 400+ MB text file and the python.exe process went as high as 900+ MB. When you run charCount2(), the python.exe process shouldn't exceed more than a few MB (provided you haven't bulked up the size with other code) ;)