How should I convert a string containing unicode characters to unicode?

Time: 2021-03-23 20:13:15

I thought I had mastered all the Unicode stuff in Python 2, but it seems there's something I don't understand. I have this user input from HTML that goes to my Python script:

a = "m\xe9dico"

I want this to be médico (which means doctor). To convert it to unicode I'm doing:

a.decode("utf-8")

Or:

unicode(a, "utf-8")

But this raises:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 1: ordinal not in range(128) 

How can I achieve this?

2 Answers

#1


5  

This is not UTF-8. The byte \xe9 is é in ISO-8859-1 (Latin-1):

>>> txt = 'm\xe9dico'
>>> print txt.decode('iso8859-1')
médico

If you want a UTF-8 encoded string, use:

>>> txt.decode('iso8859-1').encode('utf-8')
'm\xc3\xa9dico'
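
The same round trip can be sketched as a small script. This is only an illustration, assuming the input really is ISO-8859-1 bytes; the helper name `latin1_to_utf8` is made up for this example, and the `b''` / `u''` prefixes are there so the snippet runs unchanged on Python 2.7 and Python 3.

```python
def latin1_to_utf8(raw):
    """Re-encode a byte string from ISO-8859-1 to UTF-8 (hypothetical helper)."""
    text = raw.decode("iso-8859-1")   # bytes -> unicode text
    return text.encode("utf-8")       # unicode text -> UTF-8 bytes


raw = b"m\xe9dico"                    # the bytes from the question
text = raw.decode("iso-8859-1")       # u'm\xe9dico'
utf8_bytes = latin1_to_utf8(raw)      # 'm\xc3\xa9dico'

# The original bytes are not valid UTF-8, so decoding them as UTF-8 fails.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
```

Decoding first and encoding second keeps the two steps separate: the decode step is where you state what the incoming bytes are, and the encode step is where you choose what to send out.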

#2


2  

You can prefix your string with a u to mark it as a unicode literal:

>>> a = u'm\xe9dico'
>>> print a
médico
>>> type(a)
<type 'unicode'>

Or, to convert an existing string:

>>> a = 'm\xe9dico'
>>> type(a)
<type 'str'>
>>> new_a = unicode(a,'iso-8859-1')
>>> print new_a
médico
>>> type(new_a)
<type 'unicode'>
>>> new_a == u'm\xe9dico'
True
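
One common cause of the exact message in the question is calling `.decode()` on a value that is already unicode: Python 2 first encodes it implicitly with the ascii codec, and that step fails on é. A minimal sketch of a guard against this follows. The name `ensure_text` is hypothetical, and the default encoding is an assumption based on the 0xE9 byte in the question.

```python
def ensure_text(value, encoding="iso-8859-1"):
    """Return unicode text, decoding only when given a byte string.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(value, bytes):      # `bytes` is an alias of `str` on Python 2
        return value.decode(encoding)
    return value                      # already text: leave it untouched


from_bytes = ensure_text(b"m\xe9dico")   # decoded from Latin-1 bytes
from_text = ensure_text(u"m\xe9dico")    # passed through unchanged
```

Both calls give the same unicode value, so the rest of the script does not need to know which kind of input it received.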

Further reading: Python docs - Unicode HOWTO.
