Unicode problem with web pages in Python's urllib

I seem to have the all-too-familiar problem of correctly reading and viewing a web page. It looks like Python reads the page in UTF-8, but when I try to convert it to something more viewable (iso-8859-1) I get this error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)

The code looks like this:

#!/usr/bin/python
from urllib import urlopen
import re

url_address = 'http://www.eurohockey.net/players/show_player.cgi?serial=4722'

finished = 0
begin_record = 0
col = 0
str = ''

for line in urlopen(url_address):
    if '</tr' in line:
        begin_record = 0                   
        print str
        str = ''
        continue

    if begin_record == 1:
        col = col + 1
        tmp_match =  re.search('<td>(.+)</td>', line.strip())
        str = str + ';' + unicode(tmp_match.group(1), 'iso-8859-1')

    if '<tr class=\"even\"' in line or '<tr class=\"odd\"' in line: 
        begin_record = 1
        col = 0
        continue

How should I handle the contents? Firefox at least thinks it's iso-8859-1 and it would make sense looking at the contents of that page. The error comes from the 'ä' character clearly.

And if I was to save that data to a database, should I not bother with changing the codec and then converting when showing it?

3 solutions

#1

It doesn't look like Python is "reading it in UTF-8" at all. As already pointed out, you have an encoding problem, NOT a decoding problem. It is impossible for that error to have arisen from that line that you say. When asking a question like this, always give the full traceback and error message.

Kathy's suspicion is correct; in fact the print str line is the only possible source of that error, and that can only happen when sys.stdout.encoding is not set so Python punts on 'ascii'.

Variables that may affect the outcome are what version of Python you are using, what platform you are running on and exactly how you run your script -- none of which you have told us; please do.

Example: I'm using Python 2.6.2 on Windows XP and I'm running your script with some diagnostic additions: (1) import sys; print sys.stdout.encoding up near the front (2) print repr(str) before print str so that I can see what you've got before it crashes.

In a Command Prompt window, if I do \python26\python hockey.py it prints cp850 as the encoding and just works.

However, if I do either of the following:

\python26\python hockey.py | more

\python26\python hockey.py >hockey.txt

it prints None as the encoding and crashes with your error message on the first line with the a-with-diaeresis:

C:\junk>\python26\python hockey.py >hockey.txt
Traceback (most recent call last):
  File "hockey.py", line 18, in <module>
    print str
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 2: ordinal not in range(128)

If that fits your case, the fix in general is to explicitly encode your output with an encoding suited to the display mechanism you plan to use.
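As one illustration of encoding explicitly for the destination, here is a minimal sketch that writes the records to a file itself rather than relying on shell redirection. The helper name save_lines, the sample record and the temporary path are made up for the example, and the choice of iso-8859-1 is only an assumption about what the eventual viewer expects. It should run under Python 2.6+ as well as Python 3.3+.

```python
import io
import os
import tempfile


def save_lines(lines, path, encoding='utf-8'):
    """Write Unicode lines to path, encoding them explicitly (hypothetical helper)."""
    # io.open returns a text-mode file that encodes Unicode text on write
    with io.open(path, 'w', encoding=encoding) as out:
        for line in lines:
            out.write(line + u'\n')


# Made-up sample record containing an a-with-diaeresis
path = os.path.join(tempfile.mkdtemp(), 'hockey.txt')
save_lines([u';J\xe4rvinen'], path, encoding='iso-8859-1')

# Read the raw bytes back to see what was actually stored
with open(path, 'rb') as f:
    data = f.read()
```

Because the encoding is named at the point of writing, the result no longer depends on whether sys.stdout.encoding happens to be set.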

#2

As noted by Lennart, your problem is not the decoding. It is trying to encode into "ascii", which is often a problem with print statements. I suspect the line

print str

is your problem. You need to encode the str into whatever your console is using to have that line work.
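A rough sketch of what that could look like, assuming you want to fall back to UTF-8 when the console encoding is unknown. The function name encode_for_console, the fallback value and the sample string are illustrative, not anything from the original script.

```python
import sys


def encode_for_console(text, fallback='utf-8'):
    """Encode Unicode text with the console's encoding, or a fallback (hypothetical helper)."""
    # sys.stdout.encoding can be None, e.g. when output is piped or redirected
    encoding = getattr(sys.stdout, 'encoding', None) or fallback
    # 'replace' substitutes '?' for characters the target encoding cannot represent
    return text.encode(encoding, 'replace')


# Made-up sample record containing an a-with-diaeresis
line = u';J\xe4rvinen'
encoded = encode_for_console(line)
```

Under Python 2, writing print encode_for_console(str) in place of print str would then hand print bytes that are already encoded, so it has no reason to fall back to 'ascii'.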

#3

That text is indeed iso-8859-1, and I can decode it without a problem, and indeed your code runs without a hitch.

Your error, however, is an ENCODE error, not a decode error. And you don't do any encoding in your code, so the error can't be coming from the line you are pointing at. Possibly you have gotten encoding and decoding confused; it's a common problem.

You DECODE from Latin1 to Unicode. You ENCODE the other way. Remember that Latin1, UTF8 etc are called "encodings".
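To make the two directions concrete, here is a small sketch using a made-up byte string (not data from the page in question). It should run under Python 2.6+ and Python 3.3+.

```python
# Bytes as they might arrive from a Latin-1 (iso-8859-1) page; the value is made up
raw = b'J\xe4rvinen'

# DECODE: bytes in a known encoding -> Unicode text
text = raw.decode('iso-8859-1')

# ENCODE: Unicode text -> bytes in a chosen encoding
as_utf8 = text.encode('utf-8')
as_latin1 = text.encode('iso-8859-1')

# Encoding to ASCII fails, because the a-with-diaeresis has no ASCII representation
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

The last step raises the same kind of exception as in the question, which is why the message says "can't encode" rather than "can't decode".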
