Can I safely use std::string for binary data in C++11?

Time: 2021-09-23 06:52:17

There are several posts on the internet that suggest that you should use std::vector<unsigned char> or something similar for binary data.

But I'd much rather use a std::basic_string variant for that, since it provides many convenient string manipulation functions. And AFAIK, since C++11, the standard guarantees what every known C++03 implementation already did: that std::basic_string stores its contents contiguously in memory.

At first glance then, std::basic_string<unsigned char> might be a good choice.

I don't want to use std::basic_string<unsigned char>, however, because almost all operating system functions only accept char*, making an explicit cast necessary. Also, string literals are const char*, so I would need an explicit cast to const unsigned char* every time I assigned a string literal to my binary string, which I would also like to avoid. Also, functions for reading from and writing to files or networking buffers similarly accept char* and const char* pointers.

This leaves std::string, which is basically a typedef for std::basic_string<char>.

The only potential remaining issue (that I can see) with using std::string for binary data is that std::string uses char (which can be signed).

char, signed char, and unsigned char are three different types and char can be either unsigned or signed.

So, when an actual byte value of 11111111b is returned from std::string::operator[] as char and you want to check its value, that value can be either 255 (if char is unsigned) or "something negative" (if char is signed, depending on your number representation).

Similarly, if you want to explicitly append the actual byte value 11111111b to a std::string, simply appending (char) (255) might be implementation-defined (and even raise a signal) if char is signed and the int to char conversion results in an overflow.
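
The reading side of this concern can be sketched as follows. This is a minimal illustration, assuming 8-bit bytes; the helper names byte_at and byte_equals are hypothetical and not part of the question. It relies on the fact that converting a char to unsigned char is defined modulo 2^CHAR_BIT, so the result does not depend on whether plain char is signed.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: return the byte at index i as a value in [0, UCHAR_MAX].
// The char -> unsigned char conversion is modular, so a "negative" char
// holding the bit pattern 11111111 comes back as 255 on 8-bit platforms.
inline unsigned int byte_at(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]);
}

// Hypothetical helper: compare a stored byte against an expected value
// without ever comparing a possibly negative char against 255 directly.
inline bool byte_equals(const std::string& s, std::size_t i,
                        unsigned char expected) {
    return static_cast<unsigned char>(s[i]) == expected;
}
```

Comparing s[i] == 255 directly would silently be false on platforms where char is signed, which is why the cast is applied before the comparison.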

So, is there a safe way around this, that makes std::string binary-safe again?

§3.10/15 states:

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

  • [...]
  • a type that is the signed or unsigned type corresponding to the dynamic type of the object,
  • [...]
  • a char or unsigned char type.

Which, if I understand it correctly, seems to allow using an unsigned char* pointer to access and manipulate the contents of a std::string, and makes doing so well-defined. It just reinterprets the bit pattern as an unsigned char, without any change or information loss, the latter because all bits in a char, signed char, and unsigned char must be used for the value representation.

I could then use this unsigned char* interpretation of the contents of std::string as a means to access and change the byte values in the [0, 255] range, in a well-defined and portable manner, regardless of the signedness of char itself.
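
A minimal sketch of that idea, assuming C++11's contiguous-storage guarantee for std::string. The helper names bytes_of and fill_bytes are hypothetical, chosen only for illustration:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: view the string's storage as unsigned bytes.
// &s[0] is used because data() returns a const pointer before C++17.
// Accessing char objects through unsigned char is allowed by the
// aliasing rule quoted above.
inline unsigned char* bytes_of(std::string& s) {
    return s.empty() ? nullptr : reinterpret_cast<unsigned char*>(&s[0]);
}

// Hypothetical helper: overwrite every byte of s with the given value,
// writing through the unsigned char view instead of converting to char.
inline void fill_bytes(std::string& s, unsigned char value) {
    unsigned char* p = bytes_of(s);
    for (std::size_t i = 0; i < s.size(); ++i) {
        p[i] = value;
    }
}
```

The pointer is only valid until the string reallocates or is otherwise modified in a way that invalidates references, so it should be re-obtained after any resize or append.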

This should solve any problems arising from a potentially signed char.

Are my assumptions and conclusions correct?

Also, is the unsigned char* interpretation of the same bit pattern (i.e. 11111111b or 10101010b) guaranteed to be the same on all implementations? Put differently, does the standard guarantee that "looking through the eyes of an unsigned char", the same bit pattern always leads to the same numerical value (assuming the number of bits in a byte is the same)?

Can I thus safely (that is, without any undefined or implementation-defined behavior) use std::string for storing and manipulating binary data in C++11?

3 Answers

#1


17  

The conversion static_cast<char>(uc), where uc is of type unsigned char, is always valid: according to 3.9.1 [basic.fundamental], the representations of char, signed char, and unsigned char are identical, with char being identical to one of the two other types:

Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set. If a character from this set is stored in a character object, the integral value of that character object is equal to the value of the single character literal form of that character. It is implementation-defined whether a char object can hold negative values. Characters can be explicitly declared unsigned or signed. Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types. A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (3.11); that is, they have the same object representation. For narrow character types, all bits of the object representation participate in the value representation. For unsigned narrow character types, all possible bit patterns of the value representation represent numbers. These requirements do not hold for other types. In any particular implementation, a plain char object can take on either the same values as a signed char or an unsigned char; which one is implementation-defined.

Converting values outside the range of unsigned char to char will, of course, be problematic: if char is signed, the result of such a conversion is implementation-defined. That is, as long as you don't try to store funny values into the std::string you'd be OK. With respect to bit patterns, you can rely on the nth bit translating into 2^n. There shouldn't be a problem storing binary data in a std::string when it is processed carefully.

That said, I don't buy into your premise: processing binary data mostly requires dealing with bytes, which are best manipulated using unsigned values. The few cases where you need to convert between char* and unsigned char* produce convenient compile-time errors when the conversion is not written explicitly, whereas accidentally misusing char fails silently! That is, dealing with unsigned char will prevent errors. I also don't buy into the premise that you get all those nice string functions: for one, you are generally better off using the algorithms anyway, and besides, binary data is not string data. In summary: the recommendation for std::vector<unsigned char> isn't just coming out of thin air! It is deliberate, to avoid building hard-to-find traps into the design!
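
As a rough illustration of "using the algorithms" on a byte vector, the sketch below mimics std::string::find with std::search. The alias Bytes and the function find_bytes are hypothetical names introduced only for this example:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

typedef std::vector<unsigned char> Bytes;  // hypothetical alias

// Hypothetical helper: offset of the first occurrence of needle in
// haystack, or haystack.size() if there is none.
inline std::size_t find_bytes(const Bytes& haystack, const Bytes& needle) {
    Bytes::const_iterator it = std::search(haystack.begin(), haystack.end(),
                                           needle.begin(), needle.end());
    return static_cast<std::size_t>(it - haystack.begin());
}
```

Because the elements are unsigned char, comparisons against values such as 0xFF behave the same on every platform, with no dependence on the signedness of plain char.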

The only mildly reasonable argument in favor of using char could be the one about string literals, but even that doesn't hold water given the user-defined string literals introduced in C++11:

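#include <cstddef>

// User-defined literal: reinterpret a string literal as unsigned bytes.
// The length argument is required by the signature but unused here.
unsigned char const* operator""_u(char const* s, std::size_t)
{
    return reinterpret_cast<unsigned char const*>(s);
}

// The pointer refers to the literal's static storage.
unsigned char const* hello = "hello"_u;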
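
A possible variation on the same idea, offered only as a sketch: the hypothetical suffix _bytes below (not from the answer) uses the length argument and returns an owning std::vector<unsigned char>, so embedded '\0' bytes in the literal are preserved.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical literal suffix _bytes: copies the literal's n characters
// (excluding the terminating '\0') into an owning byte buffer.
inline std::vector<unsigned char> operator""_bytes(const char* s,
                                                   std::size_t n) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s);
    return std::vector<unsigned char>(p, p + n);
}
```

With this, an expression such as "a\0\xFF"_bytes yields a three-element vector rather than a pointer whose length has to be tracked separately.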

#2


1  

Yes, your assumptions are correct. Store binary data as a sequence of unsigned char in std::string.

#3


-1  

I've run into trouble using std::string to handle binary data in Microsoft Visual Studio. I've seen the strings get inexplicably truncated, so I wouldn't do this regardless of what the standards documents say.
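
The answer does not say what caused the truncation. One well-known cause, offered here only as a possibility and not as a diagnosis of that case, is constructing a std::string from a char pointer without a length, which stops at the first '\0' byte. The helper names below are hypothetical:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: copy a raw buffer into a std::string, keeping all
// n bytes, including any embedded '\0'.
inline std::string from_buffer(const char* data, std::size_t n) {
    return std::string(data, n);
}

// For contrast: the single-argument constructor treats its input as a
// C string and stops at the first '\0'.
inline std::string from_c_string(const char* data) {
    return std::string(data);
}
```

The same applies when passing s.c_str() to an API that expects a C string: the std::string still holds every byte, but the receiving function only sees the part before the first '\0'.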

