How to fix broken UTF-8 encoding in Python?

Date: 2023-01-06 11:07:08

My string is Niệm Bồ Tát (Thiá»n sÆ° Nhất Hạnh) and I want to decode it to Niệm Bồ Tát (Thiền sư Nhất Hạnh). I found that this site can do it: http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx

So I started trying to do the same thing in Python:

mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
mystr.decode('utf-8')

But the result is not correct: the original string is UTF-8, yet the string that is displayed is not what I expect.

Note: the text is Vietnamese.

How can I resolve this? Is it a Windows encoding issue or something else? How can I detect the encoding here? Thanks in advance.

2 solutions

#1


8  

I'm not sure what you can do with this kind of data in general, but for the example in your original post, this works:

>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
>>> s
u'09. B\xe1t Nh\xe3 T\xe2m Kinh'
>>> print(s)
09. Bát Nhã Tâm Kinh
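The session above is Python 2 (note the u'' prefix in the output). On Python 3, str has no decode method, so the same round trip shortens to a single encode/decode pair. The following is a minimal sketch of that idea rather than part of the original answer; the helper name fix_mojibake is made up for illustration, and it assumes the damage came from UTF-8 bytes being read as Latin-1.

```python
def fix_mojibake(text):
    """Undo UTF-8 bytes that were mistakenly decoded as Latin-1.

    Hypothetical helper: returns the input unchanged when the text does
    not look like a clean Latin-1/UTF-8 mix-up.
    """
    try:
        # Recover the original bytes, then decode them the right way.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the
        # recovered bytes are not valid UTF-8: leave it alone.
        return text


# Reproduce the damage from a known-good string, then repair it.
broken = "09. Bát Nhã Tâm Kinh".encode("utf-8").decode("latin-1")
print(fix_mojibake(broken))
```

A string that is only partly damaged, like the first example in the question, contains characters that cannot be encoded to Latin-1, so this helper returns it unchanged instead of repairing it.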

#2


8  

The only thing that helped me with a broken Cyrillic string was ftfy: https://github.com/LuminosoInsight/python-ftfy

This module fixes pretty much everything and works much better than online decoders.

>>> from ftfy import fix_encoding
>>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
>>> fix_encoding(mystr)
'09. Bát Nhã Tâm Kinh'

It can be installed with pip install ftfy.
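If installing ftfy is not an option, the standard library alone can at least show which legacy codec the text was wrongly decoded with, which is one way to approach the "how to detect the encoding" part of the question. This is a rough sketch and not how ftfy works internally; the function name guess_repairs and the list of candidate codecs are assumptions chosen for illustration.

```python
# Assumed shortlist of codecs that commonly cause this kind of damage.
CANDIDATE_CODECS = ("latin-1", "cp1252")


def guess_repairs(text):
    """Return {codec: repaired_text} for every codec whose round trip works.

    Hypothetical helper: for each candidate codec, re-encode the text to
    recover the original bytes and try to read those bytes as UTF-8.
    """
    repairs = {}
    for codec in CANDIDATE_CODECS:
        try:
            repaired = text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this codec cannot explain the damage
        if repaired != text:
            repairs[codec] = repaired
    return repairs


# Build two damaged samples from known-good strings.
vietnamese = "09. Bát Nhã Tâm Kinh".encode("utf-8").decode("latin-1")
cyrillic = "Привет".encode("utf-8").decode("cp1252")

print(guess_repairs(vietnamese))
print(guess_repairs(cyrillic))
```

In this sketch the Vietnamese sample is repaired by both candidates, while the Cyrillic sample is only repaired through cp1252, because its damaged form contains characters such as € that do not exist in Latin-1.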
