I'm running a Python program which fetches a UTF-8-encoded web page, and I extract some text from the HTML using BeautifulSoup.
However, when I write this text to a file (or print it on the console), it gets written in an unexpected encoding.
Sample program:
import urllib2
from BeautifulSoup import BeautifulSoup
# Fetch URL
url = 'http://www.voxnow.de/'
request = urllib2.Request(url)
request.add_header('Accept-Encoding', 'utf-8')
# Response has UTF-8 charset header,
# and HTML body which is UTF-8 encoded
response = urllib2.urlopen(request)
# Parse with BeautifulSoup
soup = BeautifulSoup(response)
# Print title attribute of a <div> which uses umlauts (e.g. können)
print repr(soup.find('div', id='navbutton_account')['title'])
Running this gives the result:
# u'Hier k\u0102\u015bnnen Sie sich kostenlos registrieren und / oder einloggen!'
But I would expect a Python Unicode string to render ö in the word können as \xf6:
# u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'
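For reference, a minimal check (not specific to this page): in a correctly decoded Unicode string, ö is a single code point, U+00F6, so it shows up as the escape \xf6 rather than as a pair of stray characters:

```python
# In a correctly decoded Unicode string, 'ö' is the single
# code point U+00F6, i.e. the escape \xf6.
s = u'k\xf6nnen'
assert u'\xf6' in s
assert len(s) == 6  # six code points, not seven bytes
```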
I've tried passing the 'fromEncoding' parameter to BeautifulSoup, and trying to read() and decode() the response object, but it either makes no difference or throws an error.
With the command curl www.voxnow.de | hexdump -C, I can see that the web page is indeed UTF-8 encoded (i.e. it contains 0xc3 0xb6 for the ö character):
20 74 69 74 6c 65 3d 22 48 69 65 72 20 6b c3 b6 | title="Hier k..|
6e 6e 65 6e 20 53 69 65 20 73 69 63 68 20 6b 6f |nnen Sie sich ko|
73 74 65 6e 6c 6f 73 20 72 65 67 69 73 74 72 69 |stenlos registri|
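The same check can be done from Python (a small sketch, independent of the page itself): encoding ö to UTF-8 yields exactly the two bytes visible in the hexdump:

```python
# UTF-8 encodes U+00F6 ('ö') as the two-byte sequence 0xc3 0xb6,
# which matches the bytes shown in the hexdump above.
encoded = u'\xf6'.encode('utf-8')
print(repr(encoded))
assert encoded == b'\xc3\xb6'
```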
I'm beyond the limit of my Python abilities, so I'm at a loss as to how to debug this further. Any advice?
2 Answers
#1
22
As justhalf points out above, my question here is essentially a duplicate of this question.
The HTML content reported itself as UTF-8 encoded and, for the most part, it was, except for one or two rogue invalid UTF-8 characters. This apparently confused BeautifulSoup about which encoding was in use, and when I tried to decode the content as UTF-8 before passing it to BeautifulSoup, like this:
soup = BeautifulSoup(response.read().decode('utf-8'))
I would get the error:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813:
invalid continuation byte
Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.
As the currently highest-rated answer on that question suggests, the invalid UTF-8 characters can be removed while parsing, so that only valid data is passed to BeautifulSoup:
soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
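To illustrate (a small sketch using a hypothetical byte string that mirrors the bad 0xe3 0x9c sequence above): strict decoding raises UnicodeDecodeError, while the 'ignore' error handler drops the invalid bytes and 'replace' substitutes U+FFFD:

```python
# Hypothetical byte string containing the same invalid sequence as the page:
# 0xe3 0x9c is an incorrect encoding of 'U-umlaut' (correct: 0xc3 0x9c).
raw = b'Hier k\xe3\x9cnnen'

try:
    raw.decode('utf-8')  # strict mode: fails on the bad bytes
except UnicodeDecodeError:
    pass  # 'invalid continuation byte'

print(raw.decode('utf-8', 'ignore'))   # bad bytes dropped
print(raw.decode('utf-8', 'replace'))  # bad bytes become U+FFFD
```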
#2
3
Encoding the result to utf-8 seems to work for me:
print (soup.find('div', id='navbutton_account')['title']).encode('utf-8')
It yields:
Hier können Sie sich kostenlos registrieren und / oder einloggen!
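A note on why this works (a sketch, assuming a UTF-8 terminal): .encode('utf-8') turns the Unicode string into a UTF-8 byte string, which print then writes out verbatim:

```python
# Encoding a Unicode string to UTF-8 yields a byte string; on a UTF-8
# terminal, printing those bytes renders the umlaut correctly.
title = u'Hier k\xf6nnen Sie sich kostenlos registrieren und / oder einloggen!'
encoded = title.encode('utf-8')
assert isinstance(encoded, bytes)
assert b'k\xc3\xb6nnen' in encoded
```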