This is a Python 101 type question, but it had me baffled for a while when I tried to use a package that seemed to convert my string input into bytes.
As you will see below, I found the answer for myself, but I felt it was worth recording here because of the time it took me to unearth what was going on. It seems to be generic to Python 3, so I have not referred to the original package I was playing with; it does not seem to be an error (just that the particular package had a .tostring() method that was clearly not producing what I understood as a string...)
My test program goes like this:
import mangler # spoof package
stringThing = """
<Doc>
<Greeting>Hello World</Greeting>
<Greeting>你好</Greeting>
</Doc>
"""
# print out the input
print('This is the string input:')
print(stringThing)
# now make the string into bytes
bytesThing = mangler.tostring(stringThing) # pseudo-code again
# now print it out
print('\nThis is the bytes output:')
print(bytesThing)
The output from this code gives this:
This is the string input:
<Doc>
<Greeting>Hello World</Greeting>
<Greeting>你好</Greeting>
</Doc>
This is the bytes output:
b'\n<Doc>\n <Greeting>Hello World</Greeting>\n <Greeting>\xe4\xbd\xa0\xe5\xa5\xbd</Greeting>\n</Doc>\n'
So, there is a need to be able to convert between bytes and strings, to avoid ending up with non-ASCII characters being turned into gobbledegook.
4 Answers
#1
94
The 'mangler' in the above code sample was doing the equivalent of this:
bytesThing = stringThing.encode(encoding='UTF-8')
There are other ways to write this (notably using bytes(stringThing, encoding='UTF-8')), but the above syntax makes it obvious what is going on, and also what to do to recover the string:
newStringThing = bytesThing.decode(encoding='UTF-8')
When we do this, the original string is recovered.
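As a rough check of the round trip, here is a minimal sketch. The variable names are illustrative and the sample string is a shortened version of the one in the question.

```python
# Round trip: str -> bytes -> str, using the same encoding both ways.
string_thing = "<Greeting>你好</Greeting>"

bytes_thing = string_thing.encode(encoding="UTF-8")
new_string_thing = bytes_thing.decode(encoding="UTF-8")

print(type(bytes_thing).__name__)         # bytes
print(new_string_thing == string_thing)   # True

# Each of the two Chinese characters takes three bytes in UTF-8,
# so the bytes object is four items longer than the string.
print(len(string_thing), len(bytes_thing))
```

The length difference is a quick reminder that a string counts code points while a bytes object counts bytes.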
Note, using str(bytesThing) just transcribes all the gobbledegook (the b'...' form, \x escapes included) into a string without converting it back into Unicode, unless you specifically request UTF-8, viz., str(bytesThing, encoding='UTF-8'). No error is reported if the encoding is not specified.
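The difference between the two forms of str() is easy to miss, so here is a small sketch of it. The names as_repr and as_text are made up for the illustration.

```python
bytes_thing = "你好".encode("UTF-8")

# Without an encoding, str() falls back to repr(): the result is the text
# of the bytes literal, including the b'' wrapper and the \x escapes.
as_repr = str(bytes_thing)
print(as_repr)                  # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(len(as_repr))             # 27 characters of escape text

# With an encoding, str() decodes, like bytes_thing.decode("UTF-8").
as_text = str(bytes_thing, encoding="UTF-8")
print(as_text == "你好")         # True
print(len(as_text))             # 2
```

If Python is started with the -b option, the first form emits a BytesWarning, which can help to catch this kind of accidental conversion.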
#2
12
In Python 3, there is a bytes() constructor that takes the same encoding argument as encode().
str1 = b'hello world'
str2 = bytes("hello world", encoding="UTF-8")
print(str1 == str2) # Returns True
I didn't read anything about this in the docs, but perhaps I wasn't looking in the right place. This way you can explicitly turn strings into bytes, which I find more readable than using encode and decode, and without having to prefix b in front of quotes.
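For comparison, a short sketch that puts the three spellings side by side. The variable names are only for illustration. One difference worth knowing: encode() has a default encoding, but bytes() called on a string does not.

```python
text = "hello world"

# All three spellings produce the same bytes object.
from_literal = b"hello world"
from_constructor = bytes(text, encoding="UTF-8")
from_method = text.encode("UTF-8")
print(from_literal == from_constructor == from_method)   # True

# Unlike encode(), bytes() refuses a str argument with no encoding.
message = None
try:
    bytes(text)
except TypeError as exc:
    message = str(exc)
    print("TypeError:", message)
```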
#3
0
Try this:
StringVariable = ByteVariable.decode('UTF-8', 'ignore')
To test the type:
print(type(StringVariable))
Here 'StringVariable' is a string and 'ByteVariable' is a bytes object. These names are not related to the variables in the question.
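It may help to see what the 'ignore' policy actually does to damaged input. The sketch below uses a deliberately truncated byte sequence, invented for the example, and compares 'ignore' with 'replace'.

```python
# Cut a valid UTF-8 sequence short: the last character loses its final byte.
byte_variable = "你好".encode("UTF-8")[:-1]

# 'ignore' silently drops the bytes that cannot be decoded.
ignored = byte_variable.decode("UTF-8", "ignore")

# 'replace' substitutes U+FFFD so the damage stays visible.
replaced = byte_variable.decode("UTF-8", "replace")

print(type(ignored).__name__, ascii(ignored))
print(type(replaced).__name__, ascii(replaced))
```

Both calls return a str without raising, but 'ignore' loses data without leaving a trace, so it is probably best kept for cases where that loss is acceptable.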
#4
0
This is a Python 101 type question,
It's a simple question but one where the answer is not so simple.
In Python 3, a "bytes" object represents a sequence of bytes; a "string" object represents a sequence of Unicode code points.
To convert from "bytes" to "string" and from "string" back to "bytes", you use the bytes.decode and str.encode methods. These methods take two parameters: an encoding and an error handling policy.
Sadly there are an awful lot of cases where sequences of bytes are used to represent text, but it is not necessarily well-defined what encoding is being used.
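To illustrate why the encoding has to be known, here is a small sketch of what happens when UTF-8 bytes are decoded with the wrong codec. Latin-1 is chosen only as an example of a wrong guess.

```python
original = "你好"
data = original.encode("UTF-8")

# Latin-1 maps every byte to a code point, so decoding never fails;
# it just yields six unrelated characters instead of two.
wrong = data.decode("latin-1")
print(len(original), len(wrong))   # 2 6
print(wrong == original)           # False

# The damage is reversible here only because we know which wrong codec was used.
recovered = wrong.encode("latin-1").decode("UTF-8")
print(recovered == original)       # True
```

No exception is raised anywhere, which is what makes this kind of mistake hard to notice.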
If you want to write robust software then you need to think carefully about those parameters. You need to think carefully about what encoding the bytes are supposed to be in and how you will handle the case where they turn out not to be a valid sequence of bytes for the encoding you thought they should be in. Python defaults to UTF-8 and to raising an error on any byte sequence that is not valid UTF-8.
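A minimal sketch of those defaults in action follows. The invalid input is a made-up example (a Latin-1 byte in otherwise ASCII data), and the name failure is only there to keep hold of the exception.

```python
valid = "你好".encode()             # encode() defaults to UTF-8
print(valid.decode() == "你好")     # decode() defaults to UTF-8 too: True

invalid = b"caf\xe9"                # 'cafe' with a Latin-1 e-acute, not valid UTF-8
failure = None
try:
    invalid.decode()                # same as decode("utf-8", "strict")
except UnicodeDecodeError as exc:
    # The exception records which codec failed and at which byte offset.
    failure = exc
    print(exc.encoding, exc.start, exc.reason)
```

Catching UnicodeDecodeError at the point where bytes enter the program is one way to decide deliberately between rejecting the input and retrying with another encoding or error policy.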
print(bytesThing)
Python uses "repr" as a fallback conversion to string. repr attempts to produce Python code that will recreate the object. In the case of a bytes object this means, among other things, escaping bytes outside the printable ASCII range.
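The sketch below shows that fallback on a small bytes object. It uses ast.literal_eval from the standard library to check that the printed text really is Python code for the same object; the sample data is invented for the example.

```python
import ast

bytes_thing = "Hi 你好\n".encode("UTF-8")

# print() calls str(), and for bytes str() gives the same text as repr().
shown = repr(bytes_thing)
print(shown)                       # b'Hi \xe4\xbd\xa0\xe5\xa5\xbd\n'
print(str(bytes_thing) == shown)   # True

# Printable ASCII stays as-is; everything else becomes an escape.
# Evaluating the text as a Python literal rebuilds the same object.
print(ast.literal_eval(shown) == bytes_thing)   # True
```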