Reading special characters from the web in Python

I am scraping an XML web page for people's names using regular-expression searches, but when a name contains special characters, Python does not read it correctly. For example:

Güngüneş A

comes out as:

G\xc3\xbcng\xc3\xbcne\xc5\x9f A

How can I get this to display correctly in my output?

2 solutions

#1

Use decode() on the byte string:

>>> b'G\xc3\xbcng\xc3\xbcne\xc5\x9f A'.decode()
'Güngüne\u015f A'

(My machine has problems displaying 'ş', so it is shown as the escape \u015f.)
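
For the scraping case, a minimal sketch might look like the following, assuming Python 3 and a page encoded as UTF-8: decode the raw bytes first, then run the regular expression on the decoded text. The raw bytes and the name tag here are made up for illustration and stand in for whatever the real page returns.

```python
import re

# Hypothetical response body: raw UTF-8 bytes as they might arrive from the web.
raw = b'<name>G\xc3\xbcng\xc3\xbcne\xc5\x9f A</name>'

# Decode to text before searching, naming the encoding explicitly.
text = raw.decode('utf-8')

# Search the decoded text; the captured group is a str, not escaped bytes.
match = re.search(r'<name>(.*?)</name>', text)
name = match.group(1)
```

If the page declares a different encoding, that encoding would have to be passed to decode() instead of 'utf-8'.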

#2

How are you reading these in? Which OS are you using, and is it Python 2 or 3? When I run

myStr = 'G\xc3\xbcng\xc3\xbcne\xc5\x9f A'
print myStr

I get 'Güngüneş A'.

Further, when I make a test file with the contents 'Güngüneş A' and run

mystr = open('test', 'r').read()
print mystr

I get 'Güngüneş A'.
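
If the file test behaves differently on your system, one thing to try is stating the encoding when opening the file rather than relying on the default. This is only a sketch, assuming the file is UTF-8 encoded. It uses io.open, which exists in Python 2.6 and later, and the temporary path is made up for the example.

```python
import io
import os
import tempfile

# Write a throwaway test file containing the name as UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), 'test')
with io.open(path, 'wb') as f:
    f.write(b'G\xc3\xbcng\xc3\xbcne\xc5\x9f A')

# Read it back as text, stating the encoding instead of relying on the default.
with io.open(path, 'r', encoding='utf-8') as f:
    mystr = f.read()
```

Here mystr holds decoded text (10 characters) rather than the 13 raw bytes stored in the file.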

I'm using Ubuntu 10.04 with Python 2.6 and can't reproduce the problem from the information you've provided. Posting the actual code you're using might help. That said, you could try specifying the type of string:

myStr = 'String'
myStr = u'Unicode string'
myStr = r'Raw string: escape sequences are not processed'

Or, if you want to include Unicode characters in your source code, you can add this line at the top of your file, as stated in this answer:

# -*- coding: utf-8 -*-
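
As a rough illustration, assuming the source file itself is saved as UTF-8, a file using that declaration could look like this. The variable names are just for the example.

```python
# -*- coding: utf-8 -*-
# The declaration above has to be on the first or second line of the file.

# A Unicode literal containing the non-ASCII characters directly.
name = u'Güngüneş A'

# Encoding it to UTF-8 gives the byte sequence shown in the question.
encoded = name.encode('utf-8')
```

Python 3 already treats source files as UTF-8 by default, so the declaration mainly matters on Python 2.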
