I am using the following code to scrape a webpage that contains Japanese characters:
import urllib2
import bs4
import time
url = 'http://www.city.sapporo.jp/eisei/tiiki/toban.html'
pagecontent = urllib2.urlopen(url)
soup = bs4.BeautifulSoup(pagecontent.read().decode("utf8"))
print(soup.prettify())
print(soup)
On some machines the code works fine, and the last two statements print the result successfully. On others, however, the second-to-last statement gives the error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 485-496: ordinal not in range(128),
and the last statement prints strange squares for all Japanese characters.
Why does the same code behave differently on the two machines? How can I fix this?
Python version 2.6.6
bs4 version: 4.1.0
1 Answer
#1
7
You need to configure your environment locale correctly; once your locale is set, Python will pick it up automatically when printing to a terminal.
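To see what Python has actually picked up on each machine, a small diagnostic along these lines may help. It is only a sketch: the helper name describe_output_encoding is made up for illustration, and it relies only on the standard sys and locale modules.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

import locale
import sys


def describe_output_encoding():
    """Return the encodings Python reports for this process.

    Hypothetical helper, not part of the original question or answer.
    """
    return {
        # Encoding used when printing to the terminal. Under Python 2 this
        # can be None when output is piped or redirected.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding derived from the locale environment variables.
        "preferred": locale.getpreferredencoding(),
    }


if __name__ == "__main__":
    for name, value in sorted(describe_output_encoding().items()):
        print("{0}: {1}".format(name, value))
```

Running this on both machines and comparing the output should show whether the failing machine reports an ASCII-only encoding.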
Check your locale with the locale command:
$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL="en_US.UTF-8"
Note the .UTF-8 in my locale settings; it tells programs running in the terminal that my terminal uses the UTF-8 codec, one that supports all of Unicode.
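The difference between the two codecs can be illustrated without a terminal at all. The following is a minimal sketch; the sample string and the helper can_encode are assumptions chosen for illustration, not taken from the scraped page.

```python
# -*- coding: utf-8 -*-

# Sample Japanese text ("Sapporo City") as a Unicode string.
text = u"札幌市"


def can_encode(value, codec):
    """Return True if the Unicode string can be encoded with the codec."""
    try:
        value.encode(codec)
        return True
    except UnicodeEncodeError:
        # Raised when the codec has no representation for a character,
        # which is what happens with 'ascii' and Japanese text.
        return False


if __name__ == "__main__":
    for codec in ("ascii", "utf-8"):
        print(codec, can_encode(text, codec))
```

When the locale does not advertise UTF-8, printing to the terminal goes through the 'ascii' path, which is where the UnicodeEncodeError in the question comes from.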
You can set all of your locale in one step with the LANG environment variable:
export LANG="en_US.UTF-8"
This selects a US locale (which controls how dates and numbers are printed) with the UTF-8 codec. To be precise, the LC_CTYPE setting determines the output codec; it defaults to the LANG value when not set explicitly.
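The fallback order can be sketched in a few lines. This assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG, with empty values treated as unset); the function names effective_ctype and codeset are hypothetical and only mirror the rule, they are not how Python itself resolves the encoding.

```python
import os


def effective_ctype(env):
    """Return the locale name that decides the output codec, or None.

    Checks LC_ALL, then LC_CTYPE, then LANG; empty values count as unset.
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return None


def codeset(locale_name):
    """Extract the codeset part, e.g. 'UTF-8' from 'en_US.UTF-8'."""
    if not locale_name or "." not in locale_name:
        return None
    # Drop an optional modifier such as '@euro'.
    return locale_name.split(".", 1)[1].split("@", 1)[0]


if __name__ == "__main__":
    name = effective_ctype(os.environ)
    print(name, codeset(name))
```

If this prints a name without a codeset (for example C or POSIX), the terminal is being advertised as ASCII-only, which matches the failing machine described in the question.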
Also see the very comprehensive UTF-8 and Unicode FAQ for Unix/Linux.