What does the 'b' character do in front of a string literal?

Time: 2021-09-18 23:09:34

Apparently, the following is valid syntax:

my_string = b'The string'

I would like to know:

  1. What does this b character in front of the string mean?
  2. What are the effects of using it?
  3. What are appropriate situations to use it?

I found a related question right here on SO, but that question is about PHP. It states that the b prefix indicates the string is binary, as opposed to Unicode, which was needed to keep code written for PHP < 6 compatible when migrating to PHP 6. I don't think this applies to Python.

I did find this documentation on the Python site about using a u character in the same syntax to specify a string as Unicode. Unfortunately, it doesn't mention the b character anywhere in that document.

Also, just out of curiosity, are there more symbols than the b and u that do other things?

6 answers

#1


218  

To quote the Python 2.x documentation:

A prefix of 'b' or 'B' is ignored in Python 2; it indicates that the literal should become a bytes literal in Python 3 (e.g. when code is automatically converted with 2to3). A 'u' or 'b' prefix may be followed by an 'r' prefix.

The Python 3.3 documentation states:

Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
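
As a minimal Python 3 sketch of the rule quoted above (the names `data` and `rejected` are made up for this example): a byte with a value of 128 or more has to be written as an escape, and a non-ASCII character typed directly into a bytes literal is rejected at compile time.

```python
# Bytes >= 128 must be written as escapes inside a bytes literal.
data = b'caf\xc3\xa9'            # the UTF-8 encoding of 'café'
print(type(data).__name__)       # bytes
print(data.decode('utf-8'))      # café

# A non-ASCII character written directly in a bytes literal is a SyntaxError.
# compile() is used here only so the error can be caught at run time.
try:
    compile("b'caf\u00e9'", '<example>', 'eval')
    rejected = False
except SyntaxError:
    rejected = True
print(rejected)                  # True
```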

#2


366  

Python 3.x makes a clear distinction between the types:

  • str = '...' literals = a sequence of Unicode code points (how they are stored internally is an implementation detail; before Python 3.3, CPython used UTF-16 or UTF-32 depending on how it was compiled)
  • bytes = b'...' literals = a sequence of octets (integers between 0 and 255)
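
The "sequence of integers" part can be seen directly in Python 3. This is only an illustrative sketch; `raw` and `out_of_range_ok` are names invented for the example.

```python
raw = b'abc'
print(list(raw))           # [97, 98, 99]
print(raw[0])              # 97 (indexing gives an int)
print(raw[0:1])            # b'a' (slicing gives a bytes object)
print(bytes([72, 105]))    # b'Hi'

# Every element has to fit in one octet.
try:
    bytes([256])
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
print(out_of_range_ok)     # False
```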

If you're familiar with Java or C#, think of str as String and bytes as byte[]. If you're familiar with SQL, think of str as NVARCHAR and bytes as BINARY or BLOB. If you're familiar with the Windows registry, think of str as REG_SZ and bytes as REG_BINARY. If you're familiar with C(++), then forget everything you've learned about char and strings, because A CHARACTER IS NOT A BYTE. That idea is long obsolete.

You use str when you want to represent text.

print('שלום עולם')

You use bytes when you want to represent low-level binary data like structs.

import struct
NaN = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]
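
A slightly fuller sketch of the same idea, assuming Python 3. The two-field "header" (a version number and a payload length) is a hypothetical layout chosen only to show packing and unpacking; it does not come from any real format.

```python
import math
import struct

# The eight bytes above are the big-endian IEEE 754 encoding of a NaN.
nan = struct.unpack('>d', b'\xff\xf8\x00\x00\x00\x00\x00\x00')[0]
print(math.isnan(nan))                   # True

# Hypothetical header: unsigned 16-bit version, unsigned 32-bit length.
packed = struct.pack('>HI', 1, 4096)
print(packed)                            # b'\x00\x01\x00\x00\x10\x00'
print(len(packed))                       # 6

version, length = struct.unpack('>HI', packed)
print(version, length)                   # 1 4096
```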

You can encode a str to a bytes object.

>>> '\uFEFF'.encode('UTF-8')
b'\xef\xbb\xbf'

And you can decode a bytes object into a str.

>>> b'\xE2\x82\xAC'.decode('UTF-8')
'€'
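
Which bytes you get depends on the encoding you name. The following sketch (Python 3, with variable names invented for the example) encodes the same one-character string twice and shows that decoding with the wrong codec fails.

```python
text = '€'
utf8 = text.encode('utf-8')
cp1252 = text.encode('cp1252')
print(utf8)                     # b'\xe2\x82\xac'
print(cp1252)                   # b'\x80'

# Decoding each with its own codec gives back the same text.
print(utf8.decode('utf-8') == cp1252.decode('cp1252'))   # True

# Decoding with a codec that cannot represent these bytes raises an error.
try:
    utf8.decode('ascii')
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False
print(decoded_as_ascii)         # False
```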

But you can't freely mix the two types.

>>> b'\xEF\xBB\xBF' + 'Text with a UTF-8 BOM'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: can't concat bytes to str
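
One way around the error, sketched for Python 3, is to convert one side explicitly so both operands have the same type. The names `bom`, `as_bytes` and `as_text` are illustrative only.

```python
bom = b'\xEF\xBB\xBF'
text = 'Text with a UTF-8 BOM'

# Option 1: work in bytes by encoding the text.
as_bytes = bom + text.encode('utf-8')
print(as_bytes)                        # b'\xef\xbb\xbfText with a UTF-8 BOM'

# Option 2: work in str by decoding the bytes.
as_text = bom.decode('utf-8') + text
print(as_text == '\ufeff' + text)      # True
```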

The b'...' notation is somewhat confusing in that it allows the bytes 0x01-0x7F to be specified with ASCII characters instead of hex numbers.

>>> b'A' == b'\x41'
True

But I must emphasize, a character is not a byte.

>>> 'A' == b'A'
False
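
A comparison only succeeds once both sides are the same type. A short Python 3 sketch (variable names made up for the example):

```python
direct = ('A' == b'A')                      # str vs bytes: never equal
via_encode = ('A'.encode('ascii') == b'A')  # compare as bytes
via_decode = (b'A'.decode('ascii') == 'A')  # compare as str
via_number = (ord('A') == b'A'[0])          # compare code point with byte value

print(direct, via_encode, via_decode, via_number)   # False True True True
```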

In Python 2.x

Pre-3.0 versions of Python lacked this kind of distinction between text and binary data. Instead, there were:

  • unicode = u'...' literals = sequence of Unicode characters = 3.x str
  • str = '...' literals = sequences of confounded bytes/characters
    • Usually text, encoded in some unspecified encoding.
    • But also used to represent binary data like struct.pack output.

To ease the 2.x-to-3.x transition, the b'...' literal syntax was backported to Python 2.6 so that binary strings (which should be bytes in 3.x) can be distinguished from text strings (which should be str in 3.x). The b prefix does nothing in 2.x, but tells the 2to3 script not to convert the literal to a Unicode string in 3.x.

So yes, b'...' literals in Python have the same purpose that they do in PHP.

Also, just out of curiosity, are there more symbols than the b and u that do other things?

The r prefix creates a raw string (e.g., r'\t' is a backslash + t instead of a tab), and triple quotes '''...''' or """...""" allow multi-line string literals.
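
A quick Python 3 sketch of both features, plus the combined br prefix mentioned in the documentation quoted in answer #1. The variable name `multi` is made up for the example.

```python
print(len('\t'))       # 1 (a single tab character)
print(len(r'\t'))      # 2 (a backslash followed by 't')

# The raw prefix can be combined with b: the escape is not interpreted.
print(br'\x41' == b'\\x41')   # True
print(len(br'\x41'))          # 4

# Triple quotes let a literal span several lines.
multi = """first line
second line"""
print(multi.splitlines())     # ['first line', 'second line']
```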

#3


10  

The b denotes a byte string.

Bytes are the actual data. Strings are an abstraction.

If you had a multi-character string object and took a single character from it, the result would be a string, and it might be more than 1 byte in size depending on the encoding.

If you took 1 byte from a byte string, you'd get a single 8-bit value from 0-255, and it might not represent a complete character if the encoding uses more than 1 byte for that character.
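
The difference can be sketched in Python 3 like this, assuming UTF-8 as the encoding. The variable names are invented for the example.

```python
text = 'é€'
print(text[1])                   # €
print(type(text[1]).__name__)    # str

encoded = text.encode('utf-8')
print(encoded)                   # b'\xc3\xa9\xe2\x82\xac'
print(encoded[2])                # 226 (the first of the three bytes of '€')

# One byte out of a multi-byte character is not a complete character.
try:
    encoded[2:3].decode('utf-8')
    complete = True
except UnicodeDecodeError:
    complete = False
print(complete)                  # False
```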

TBH I'd use strings unless I had some specific low-level reason to use bytes.

#4


6  

It turns the string into a bytes literal (or a str in 2.x), and is valid in Python 2.6+.

The r prefix causes backslashes to be "uninterpreted" (not ignored, and the difference does matter).

#5


6  

Here's an example where the absence of 'b' would throw a TypeError exception in Python 3.x:

>>> f=open("new", "wb")
>>> f.write("Hello Python!")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' does not support the buffer interface

Adding a 'b' prefix would fix the problem.
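
A minimal sketch of the working version in Python 3. It writes into a temporary directory instead of the current one, which is a choice made for this example only; the alternative at the end (opening in text mode with an explicit encoding) is another way to avoid the error when the data is really text.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'new')

# Binary mode expects bytes, so the literal gets a b prefix.
with open(path, 'wb') as f:
    written = f.write(b'Hello Python!')
print(written)                   # 13

with open(path, 'rb') as f:
    content = f.read()
print(content)                   # b'Hello Python!'

# Alternative: text mode accepts str and encodes it for you.
with open(path, 'w', encoding='utf-8') as f:
    f.write('Hello Python!')
```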

#6


0  

In addition to what others have said, note that a single Unicode character can take up multiple bytes once it is encoded.

The way UTF-8, the most common Unicode encoding, works is that it kept the old ASCII format (7-bit codes that look like 0xxx xxxx) and added multi-byte sequences in which every byte starts with 1 (1xxx xxxx) to represent characters beyond ASCII, so that UTF-8 is backwards-compatible with ASCII.

>>> len('Öl')  # German word for 'oil' with 2 characters
2
>>> 'Öl'.encode('UTF-8')  # convert str to bytes 
b'\xc3\x96l'
>>> len('Öl'.encode('UTF-8'))  # 3 bytes encode 2 characters !
3
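
To make the bit patterns visible, the bytes from the session above can be printed in binary. This is just an illustrative Python 3 sketch; `bits` is a name made up here.

```python
encoded = 'Öl'.encode('utf-8')

# Format every byte as 8 binary digits.
bits = [format(byte, '08b') for byte in encoded]
print(bits)   # ['11000011', '10010110', '01101100']
```

The two bytes that encode 'Ö' both start with 1, while the ASCII letter 'l' is a single byte starting with 0.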
