UnicodeDecodeError when reading the dictionary words file with a simple Python script

Time: 2023-01-04 23:45:53

This is my first time using Python in a while, and I'm having trouble doing a simple scan of a file. When I run the following script with Python 3.0.1,

with open("/usr/share/dict/words", 'r') as f:
    for line in f:
        pass

I get this exception:

Traceback (most recent call last):
  File "/home/matt/install/test.py", line 2, in <module>
    for line in f:
  File "/home/matt/install/root/lib/python3.0/io.py", line 1744, in __next__
    line = self.readline()
  File "/home/matt/install/root/lib/python3.0/io.py", line 1817, in readline
    while self._read_chunk():
  File "/home/matt/install/root/lib/python3.0/io.py", line 1565, in _read_chunk
    self._set_decoded_chars(self._decoder.decode(input_chunk, eof))
  File "/home/matt/install/root/lib/python3.0/io.py", line 1299, in decode
    output = self.decoder.decode(input, final=final)
  File "/home/matt/install/root/lib/python3.0/codecs.py", line 300, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1689-1692: invalid data

The line in the file it blows up on is "Argentinian", which doesn't seem to be unusual in any way.

Update: I added

encoding="iso-8859-1"

to the open() call, and it fixed the problem.
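For illustration, here is a minimal sketch of why passing an explicit ISO-8859-1 (Latin-1) encoding makes the error go away. The sample bytes and the temporary file are assumptions standing in for a non-UTF-8 entry in /usr/share/dict/words; they are not taken from the actual file.

```python
import os
import tempfile

# Hypothetical sample: a word list containing Latin-1 bytes (0xC5, 0xF6),
# standing in for whatever non-UTF-8 entry the real file contains.
sample = b"Argentinian\n\xc5ngstr\xf6m\n"

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Reading as UTF-8 fails because 0xC5 is not followed by a continuation byte.
    try:
        with open(path, "r", encoding="utf-8") as f:
            for line in f:
                pass
        utf8_ok = True
    except UnicodeDecodeError:
        utf8_ok = False

    # ISO-8859-1 maps every byte value to a code point, so decoding never
    # raises. The text is only correct if the file really is Latin-1.
    with open(path, "r", encoding="iso-8859-1") as f:
        words = [line.rstrip("\n") for line in f]
finally:
    os.remove(path)
```

Note that a Latin-1 read succeeding does not prove the file is Latin-1; it only means no byte was rejected.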

2 answers

#1


How have you determined from "position 1689-1692" what line in the file it has blown up on? Those numbers would be offsets in the chunk that it's trying to decode. You would have had to determine what chunk it was -- how?

Try this at the interactive prompt:

buf = open('the_file', 'rb').read()
len(buf)
ubuf = buf.decode('utf8')
# splat ... but it will give you the byte offset into the file
offset = 1689  # replace with the start position from the error message above
buf[offset-50:offset+60] # should show you where/what the problem is
# By the way, from the error message, looks like a bad
# FOUR-byte UTF-8 character ... interesting
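The same idea can be sketched without copying the offset by hand, since UnicodeDecodeError carries `start` and `end` attributes. The helper name `find_bad_bytes` and the sample bytes below are hypothetical, chosen only for illustration.

```python
def find_bad_bytes(data, encoding="utf-8", context=20):
    """Return (line_number, start, end, surrounding_bytes) for the first
    undecodable span in data, or None if data decodes cleanly."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # Because the whole buffer is decoded at once, exc.start is an
        # offset into the file, not into an internal read chunk.
        line_number = data.count(b"\n", 0, exc.start) + 1
        lo = max(exc.start - context, 0)
        return line_number, exc.start, exc.end, data[lo:exc.end + context]
    return None


# Hypothetical sample: the third line contains Latin-1 bytes.
sample = b"Argentina\nArgentinian\n\xc5ngstr\xf6m\n"
result = find_bad_bytes(sample)
```

For a real file, pass it the result of `open('the_file', 'rb').read()`. Counting newlines before `exc.start` gives the line that actually contains the bad bytes, which may differ from the last line read successfully.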

#2


Can you check to make sure it is valid UTF-8? A way to do that is given at this SO question:

iconv -f UTF-8 /usr/share/dict/words -o /dev/null

There are other ways to do the same thing.
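One such alternative, sketched in Python with the standard library's incremental decoder so the file does not have to fit in memory. The function name `is_valid_utf8` and the chunk size are assumptions for illustration.

```python
import codecs
import io


def is_valid_utf8(stream, chunk_size=65536):
    """Return True if the binary stream decodes as UTF-8 from start to end."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            # The incremental decoder buffers a multi-byte sequence that is
            # split across two chunks instead of rejecting it.
            decoder.decode(chunk)
        # final=True raises if the stream ends in the middle of a character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


valid = is_valid_utf8(io.BytesIO("na\u00efve\n".encode("utf-8")), chunk_size=1)
invalid = is_valid_utf8(io.BytesIO(b"na\xefve\n"))
truncated = is_valid_utf8(io.BytesIO(b"caf\xc3"))
```

To check the file from the question, open it in binary mode, for example `with open("/usr/share/dict/words", "rb") as f:`, and pass `f` to the function.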
