I know this looks embarrassingly easy, and I guess the problem is that I just don't have a clear understanding of all this bytes-str-unicode (and encoding-decoding, speaking frankly) stuff yet.
I've been trying to get my working code to run on Python 3. The part I'm stuck on is where I parse an XML document with lxml and decode a base64 string that is in that XML.
The code now works in the following manner:
I retrieve the binary data with an XPath query '.../binary/text()'. This produces a one-element list containing an lxml.etree._ElementUnicodeResult object. Then, with Python 2, I was able to do:
decoded = source.decode('base64')
and finally
output = numpy.frombuffer(decoded)
However, on Python 3 I get an error message saying
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'decode'
This is not so surprising, because lxml.etree._ElementUnicodeResult is a subclass of str.
Another way would be to get a real str with the same data in it:
binary = tree.xpath('//binary')[0]
binary_string = binary.text
That would be essentially the same. So what do I do to decode it from base64? I've looked at the base64 module, but it takes a bytes object as an argument, and I can't think of a way to present a str as bytes, because if I try to construct a bytes object, Python will try to encode the string, which I don't need.
Googling further, I came across the binascii module (which is invoked indirectly from base64 anyway, if I'm not mistaken), but calling binascii.b2a_base64() on my string produces
TypeError: 'str' does not support the buffer interface
P.S. I've even found an answered question on how to decode a hex string in Python 3, but this is done with a dedicated method, bytes.fromhex(), so I don't see how it would be helpful.
Could someone please tell me what I'm missing? I'm afraid most of the post is irrelevant and only aggravates my shame, but at least you guys know what I tried.
2 answers
#1
2
I don't have Python 3 installed, but it sounds like you need to convert the Unicode returned from lxml to bytes, perhaps by calling .encode('ascii')?
#2
6
OK, I think I'm going to summarize my current understanding of things (feel free to correct me). Hopefully it will help someone else out there as confused as I've been.
The credit totally goes to thebjorn and delnan, of course.
So, starting with the most common things: there's Unicode, and it's a global standard that assigns codes (or code points) to all the exotic characters you can imagine. Those codes are just integer numbers. As of Unicode 6.1 there are 109,975 graphic characters, says Wikipedia.
Then there are encodings that define how to designate Unicode characters with byte codes. One byte isn't enough to designate an arbitrary Unicode char. Although, if you only take a small subset of them (English alphabet, digits, punctuation, some control characters), you can do with one byte per character (or even 7 bits; see ASCII).
To pass a Unicode string anywhere, one needs to encode it to bytes; it can then be decoded on the other end.
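As a small illustration of that round trip, here is a minimal Python 3 sketch. The sample text and the choice of UTF-8 are my own assumptions for the example; any encoding that both ends agree on would do.

```python
# Python 3: encode a Unicode string to bytes, then decode it on "the other end".
text = "na\u00efve caf\u00e9"            # 10 characters, two of them non-ASCII

utf8_bytes = text.encode("utf-8")        # str -> bytes
round_trip = utf8_bytes.decode("utf-8")  # bytes -> str

# The non-ASCII characters take two bytes each in UTF-8,
# so the byte length differs from the character count.
print(len(text), len(utf8_bytes))
print(round_trip == text)
```

The point is only that the number of characters and the number of bytes are different things, and that the encoding name has to match on both sides.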
In Python 2, str is actually bytes, and unicode is Unicode, but Python 2 will do implicit encoding/decoding for you when needed. It will try to use ASCII encoding.
In Python 3, str is always a Unicode string, and bytes is a new data type for actual bytes. Python 3 never does an implicit conversion between the two: you always need to do it yourself and specify the encoding. That means that your program won't work until you understand what's going on, which totally happened to me.
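To make the "no implicit conversion" point concrete, here is a short Python 3 sketch (the variable names are just placeholders I picked). Mixing str and bytes raises a TypeError until one side is converted explicitly.

```python
# Python 3: str and bytes do not mix implicitly.
text = "abc"
data = b"abc"

error = None
try:
    combined = text + data           # str + bytes is refused
except TypeError as exc:
    error = exc
    print("TypeError:", exc)

# An explicit decode (or encode) makes the intent clear and works.
combined = text + data.decode("ascii")
print(combined)
```

On Python 2 the same concatenation of a unicode and a str would have been attempted silently with the ASCII codec, which is exactly the behaviour that went away.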
Now, that being more or less clear, let's move on to base64 encoding, which is also an encoding of sorts, but has a slightly different meaning. Suppose you have some binary data (i.e. bytes) that may mean anything (in my case it's a bunch of floats). Now you want to represent this binary array with a string. That's what base64 encoding means: you have your bytes represented as an ASCII string.
Base64 uses an alphabet of 64 characters, so in a base64-encoded string a single character stands for 6 bits of your data (2^6 = 64). That is why a base64-encoded string needs to have a length that is a multiple of 4: four characters carry 24 bits, i.e. exactly 3 bytes, and otherwise the number of encoded bytes would not be an integer.
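A quick way to see the multiple-of-4 rule is to encode inputs of different lengths with the standard base64 module. This is only a sketch; the sample byte strings are arbitrary.

```python
import base64

samples = (b"a", b"ab", b"abc", b"abcd")

# Every 3 input bytes become 4 output characters; a shorter final
# group is padded with "=" so the total stays a multiple of 4.
encoded_samples = [base64.b64encode(raw) for raw in samples]
encoded_lengths = [len(enc) for enc in encoded_samples]

for raw, enc in zip(samples, encoded_samples):
    print(len(raw), len(enc), enc)
```

The padding characters carry no data; they only fill the last group of four.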
Finally, to decode from base64 you need an ASCII string. An arbitrary Unicode string won't do: there can only be characters from the base64 alphabet. The base64 module does the job in Python. The base64.b64decode() function takes a byte string as the argument. In Python 2 that means str. In Python 3 that means bytes. So if you have a str, such as
>>> s = 'U3RhY2sgT3ZlcmZsb3c='
In Python 2 you could just do
>>> s.decode('base64')
because s is already in ASCII. In Python 3, you need to encode it in ASCII first, so you'll have to do:
>>> base64.b64decode(s.encode('ascii'))
By the way, this returns a bytes object, so it's really up to you how to treat those bytes afterwards. Maybe they are floats, as in my case, or maybe you should try to decode them as ASCII :) In Python 2, however, the result will be just a str. Anyway, have a look at the struct module for tools to unpack your data from those bytes.
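For the floats case, a hedged sketch with struct could look like the following. The sample values, the little-endian byte order and the 8-byte double format are assumptions made up for this example; real data has to be unpacked with whatever layout its producer actually used.

```python
import base64
import struct

# Hypothetical producer side: pack three float64 values and base64-encode them.
values = (1.0, 2.5, -3.25)
raw = struct.pack("<3d", *values)                # 24 bytes, little-endian doubles
encoded = base64.b64encode(raw).decode("ascii")  # a str, as it might sit in XML

# Consumer side: str -> bytes -> decoded bytes -> floats.
decoded = base64.b64decode(encoded.encode("ascii"))
count = len(decoded) // 8                        # 8 bytes per double
unpacked = struct.unpack("<%dd" % count, decoded)
print(unpacked)
```

numpy.frombuffer(decoded), as used in the question, does a similar job in one call; its default dtype is float, so it also assumes 8-byte floats, in the machine's native byte order.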
So if you need the code to work on both Python 2 and 3, go with the last form, base64.b64decode(s.encode('ascii')). To make sure you end up with Unicode (if you are decoding text from base64), you'll have to decode the result:
>>> base64.b64decode(s.encode('ascii')).decode('ascii')
On Python 2, encode('ascii') effectively does nothing, because it is applied to a str: Python first does an implicit conversion to Unicode and then does what you asked for (converts it back to ASCII). decode('ascii') will return a unicode object on Python 2.
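Tying this back to the original question, here is a rough end-to-end sketch for Python 3. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages, and the XML snippet is invented for the example; with lxml, the text of the binary element is likewise a str, so the same encode step should apply.

```python
import base64
import xml.etree.ElementTree as ET

# Made-up document; the payload is the same sample string as above.
xml_source = "<root><binary>U3RhY2sgT3ZlcmZsb3c=</binary></root>"

tree = ET.fromstring(xml_source)
binary_string = tree.find(".//binary").text      # a str, not bytes

# str -> ASCII bytes -> decoded payload (bytes)
decoded = base64.b64decode(binary_string.encode("ascii"))
print(decoded)

# Only if the payload is known to be text: bytes -> str
as_text = decoded.decode("ascii")
print(as_text)
```

According to the base64 documentation, Python 3.3 and later also accept an ASCII-only str in b64decode() directly, so the explicit encode is mainly needed for older 3.x releases; keeping it does no harm.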