Converting a byte string into a Unicode string

Time: 2021-11-28 20:13:50

I have the following code:

a = "\u0432"
b = u"\u0432"
c = b"\u0432"
d = c.decode('utf8')

print(type(a), a)
print(type(b), b)
print(type(c), c)
print(type(d), d)

And the output:

<class 'str'> в
<class 'str'> в
<class 'bytes'> b'\\u0432'
<class 'str'> \u0432

Why do I see the character's escape code in the last case instead of the character itself? How can I convert the byte string to a Unicode string so that printing it shows the character rather than its code?

2 answers

#1


35  

In strings (or Unicode objects in Python 2), \u has a special meaning: it says, "here comes a Unicode character specified by its code point." Hence u"\u0432" results in the character в.
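As a small illustration of that point (a sketch, not part of the original answer; the variable name s is only illustrative), the escape sequence and the literal character produce the same one-character string in Python 3:

```python
import unicodedata

s = "\u0432"                  # escape sequence inside a str literal
print(s == "в")               # True: same string as the literal character
print(len(s))                 # 1
print(hex(ord(s)))            # 0x432
print(unicodedata.name(s))    # CYRILLIC SMALL LETTER VE
```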

The b'' prefix tells you this is a sequence of 8-bit bytes. A bytes object holds no Unicode characters, so the \u escape has no special meaning there. Hence, b"\u0432" is just the sequence of six bytes \, u, 0, 4, 3 and 2.
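To make the difference concrete, here is a minimal sketch that inspects the bytes. The literal is written with a doubled backslash only to avoid the invalid-escape warning that newer Python 3 versions give for b"\u0432"; the resulting bytes are the same. The name encoded is illustrative.

```python
c = b"\\u0432"                     # same six bytes as b"\u0432"
print(len(c))                      # 6
print(list(c))                     # [92, 117, 48, 52, 51, 50]

# For comparison, the actual character encoded as UTF-8 is two different bytes.
encoded = "\u0432".encode("utf-8")
print(encoded)                     # b'\xd0\xb2'
```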

Essentially, you have a byte string that contains not a Unicode character but the escape sequence that specifies one.

You can convert this escape sequence into the character it names by decoding the bytes with the unicode_escape codec.

>>> c.decode('unicode_escape')
'в'
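A slightly fuller sketch of the same decode, with one caveat worth knowing: unicode_escape interprets bytes that are not part of an escape sequence as Latin-1, so it may mangle input that already contains real UTF-8 text. If the escapes come from JSON, json.loads is an alternative that handles them. The variable names below are illustrative.

```python
import json

c = b"\\u0432"                          # the six ASCII bytes \ u 0 4 3 2
d = c.decode("unicode_escape")
print(d)                                # в

# Caveat: genuine UTF-8 bytes are read as Latin-1 and come out garbled.
utf8_bytes = "в".encode("utf-8")        # b'\xd0\xb2'
mangled = utf8_bytes.decode("unicode_escape")
print(mangled == "\xd0\xb2")            # True: two Latin-1 characters, not в

# If the data is a JSON string literal, json.loads resolves the escape.
from_json = json.loads('"\\u0432"')
print(from_json)                        # в
```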

#2


1  

Loved Lennart's answer. It put me on the right track for solving the particular problem I had faced. What I added was the ability to produce HTML-compatible code for \u???? escape sequences in strings. Basically, only one line was needed:

results = results.replace('\\u','&#x')
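One thing to be aware of: an HTML numeric character reference normally ends with a semicolon, which the plain replace does not add. A possible variant, offered as a sketch rather than as the answer's own method, uses a regular expression to match the four hex digits and append the semicolon. The function name escapes_to_html and the sample string are hypothetical.

```python
import re

def escapes_to_html(text):
    """Turn literal \\uXXXX sequences into HTML hex character references."""
    return re.sub(r"\\u([0-9a-fA-F]{4})", r"&#x\1;", text)

sample = '{"title": "\\u0432"}'         # literal backslash-u, as in raw JSON text
converted = escapes_to_html(sample)
print(converted)                        # {"title": "&#x0432;"}
```

This still works one \uXXXX unit at a time, so characters that JSON writes as a surrogate pair (two consecutive \u escapes) would not convert to a valid reference.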

This all came about from a need to convert JSON results into something that displays well in a browser. Here is some test code that is integrated with a cloud application:

# References:
# http://*.com/questions/9746303/how-do-i-send-a-post-request-as-a-json
# https://docs.python.org/3/library/http.client.html
# http://docs.python-requests.org/en/v0.10.7/user/quickstart/#custom-headers
# http://*.com/questions/606191/convert-bytes-to-a-python-string
# http://www.w3schools.com/charsets/ref_utf_punctuation.asp
# http://*.com/questions/13837848/converting-byte-string-in-unicode-string

import urllib.request
import json

body = [ { "query": "co-development and language.name:English", "page": 1, "pageSize": 100 } ]
myurl = "https://core.ac.uk:443/api-v2/articles/search?metadata=true&fulltext=false&citations=false&similar=false&duplicate=false&urls=true&extractedUrls=false&faithfulMetadata=false&apiKey=SZYoqzk0Vx5QiEATgBPw1b842uypeXUv"
req = urllib.request.Request(myurl)
req.add_header('Content-Type', 'application/json; charset=utf-8')
jsondata = json.dumps(body)
jsondatabytes = jsondata.encode('utf-8') # needs to be bytes
req.add_header('Content-Length', len(jsondatabytes))
print('\n', jsondatabytes, '\n')
response = urllib.request.urlopen(req, jsondatabytes)
results = response.read()
results = results.decode('utf-8')
results = results.replace('\\u','&#x') # produces html hex version of \u???? unicode characters
print(results)
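An alternative that needs no network access to try out, again only a sketch under stated assumptions: parse the JSON first so the \u escapes become real characters, then let Python's xmlcharrefreplace error handler write every non-ASCII character as a decimal HTML reference. The response body raw and its "title" field are made up for illustration and are not the actual API output.

```python
import html
import json

raw = b'{"title": "\\u0432 <test>"}'    # hypothetical response body
data = json.loads(raw.decode("utf-8"))  # \u0432 becomes the character в
title = data["title"]

# Escape HTML metacharacters, then replace non-ASCII characters with &#NNNN;
html_safe = html.escape(title).encode("ascii", "xmlcharrefreplace").decode("ascii")
print(html_safe)                        # &#1074; &lt;test&gt;
```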
