Converting a unicode string that contains UTF-8 bytes to str

Time: 2021-06-30 20:13:46

I'm using pyquery to parse a page:

from pyquery import PyQuery

dom = PyQuery('http://zh.wikipedia.org/w/index.php', {'title': 'CSS', 'printable': 'yes', 'variant': 'zh-cn'})
content = dom('#mw-content-text > p').eq(0).text()  # text of the first paragraph

But what I get in content is a unicode string whose code points are actually UTF-8 encoded bytes:

u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8...'

How can I convert it to str without losing the content?

To be clear:

I want content == '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

not content == u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'

1 answer

#1


21  

If you have a unicode value with UTF-8 bytes, encode to Latin-1 to preserve the 'bytes':

content = content.encode('latin1')

This works because the Unicode code points U+0000 to U+00FF map one-to-one onto the Latin-1 encoding, so encoding with it yields your data as literal bytes.
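
The one-to-one mapping can be checked directly. The following is a minimal sketch written for Python 3 (an assumption; the session in this answer is Python 2), where str is Unicode text and bytes is a separate type. The variable names are illustrative only.

```python
# Every code point U+0000..U+00FF encodes to the single byte with the
# same numeric value under Latin-1.
all_low_codepoints = ''.join(chr(i) for i in range(256))
encoded = all_low_codepoints.encode('latin-1')

assert encoded == bytes(range(256))
assert encoded.decode('latin-1') == all_low_codepoints  # lossless round trip

# Code points above U+00FF have no Latin-1 byte, so encoding them fails.
try:
    '\u5c42'.encode('latin-1')
    raised = False
except UnicodeEncodeError:
    raised = True
```

This is why the trick is safe only for strings made entirely of code points below U+0100: anything higher cannot be turned back into a single byte.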

For your example this gives me:

>>> content = u'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1')
'\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
>>> content.encode('latin1').decode('utf8')
u'\u5c42\u53e0\u6837\u5f0f\u8868'
>>> print content.encode('latin1').decode('utf8')
层叠样式表
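
For readers on Python 3, the same round trip might look like the sketch below. The helper name fix_mojibake is hypothetical (it is not part of pyquery or the standard library), and falling back to the unchanged input when the string does not look like mis-decoded UTF-8 is a design choice of this sketch, not something the answer above prescribes.

```python
def fix_mojibake(text):
    """Re-interpret text whose code points are really UTF-8 bytes.

    Hypothetical helper for illustration. Returns the input unchanged
    if it cannot be read as mis-decoded UTF-8.
    """
    try:
        # Latin-1 turns each code point back into the byte of the same
        # value; those bytes are then decoded as the UTF-8 they really are.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = '\xe5\xb1\x82\xe5\x8f\xa0\xe6\xa0\xb7\xe5\xbc\x8f\xe8\xa1\xa8'
fixed = fix_mojibake(broken)
print(fixed)  # 层叠样式表
```

If only the raw bytes are wanted, as in the question, stopping after .encode('latin-1') gives a bytes object in Python 3, which corresponds to Python 2's str.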
