How do I properly use std::string on UTF-8 in C++?

Date: 2021-12-14 15:01:02

My platform is a Mac and C++11 (or above). I'm a C++ beginner and working on a personal project which processes Chinese and English. UTF-8 is the preferred encoding for this project.

I read some posts on Stack Overflow, and many of them suggest using std::string when dealing with UTF-8 and avoiding wchar_t, as there's no char8_t right now for UTF-8.

However, none of them talk about how to properly deal with things like str[i], std::string::size(), std::string::find_first_of() or std::regex, as these usually return unexpected results when facing UTF-8.

Should I go ahead with std::string or switch to std::wstring? If I should stay with std::string, what's the best practice for one to handle the above problems?

3 Answers

#1


61  

Unicode Glossary

Unicode is a vast and complex topic. I do not wish to wade too deep there; however, a quick glossary is necessary:

  1. Code Points: Code Points are the basic building blocks of Unicode; a code point is just an integer mapped to a meaning. The integer fits into 32 bits (well, 21 bits really, since the largest code point is U+10FFFF), and the meaning can be a letter, a diacritic, a white space, a sign, a smiley, half a flag, ... and it can even be "the next portion reads right to left".
  2. Grapheme Clusters: Grapheme Clusters are groups of semantically related Code Points. For example, a flag in Unicode is represented by associating two Code Points; each of those two, in isolation, has no meaning, but associated together in a Grapheme Cluster they represent a flag. Grapheme Clusters are also used to pair a letter with a diacritic in some scripts.

These are the basics of Unicode. The distinction between Code Point and Grapheme Cluster can be mostly glossed over because for most modern languages each "character" is mapped to a single Code Point (there are dedicated accented forms for commonly used letter+diacritic combinations). Still, if you venture into smileys, flags, etc., then you may have to pay attention to the distinction.


UTF Primer

Then, a series of Unicode Code Points has to be encoded; the common encodings are UTF-8, UTF-16 and UTF-32, the latter two existing in both Little-Endian and Big-Endian forms, for a total of 5 common encodings.

In UTF-X, X is the size in bits of the Code Unit. Each Code Point is represented as one or several Code Units, depending on its magnitude:

  • UTF-8: 1 to 4 Code Units,
  • UTF-16: 1 or 2 Code Units,
  • UTF-32: 1 Code Unit.
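
To make the "1 to 4 Code Units" rule for UTF-8 concrete, here is a minimal sketch of an encoder for a single Code Point. The function name encode_utf8 is made up for this illustration and is not part of the standard library; it returns an empty string for values that are not encodable (surrogates, or anything above U+10FFFF).

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8.
// The number of bytes produced (1 to 4) depends on the code point's magnitude.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 code unit: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 code units
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 code units (most Chinese characters)
        if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {           // 4 code units (emoji, rare ideographs)
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, encode_utf8(0x54C8) (the character 哈) yields the three bytes E5 93 88, while encode_utf8('A') yields a single byte.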

std::string and std::wstring.

  1. Do not use std::wstring if you care about portability (wchar_t is only 16 bits on Windows); use std::u32string instead (aka std::basic_string<char32_t>).
  2. The in-memory representation (std::string or std::wstring) is independent of the on-disk representation (UTF-8, UTF-16 or UTF-32), so prepare yourself for having to convert at the boundary (reading and writing).
  3. While a 32-bit wchar_t ensures that a Code Unit represents a full Code Point, it still does not represent a complete Grapheme Cluster.

If you are only reading or composing strings, you should have little to no issue with std::string or std::wstring.

Troubles start when you start slicing and dicing; then you have to pay attention to (1) Code Point boundaries (in UTF-8 or UTF-16) and (2) Grapheme Cluster boundaries. The former can be handled easily enough on your own; the latter requires using a Unicode-aware library.
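
As a rough illustration of handling Code Point boundaries by hand in UTF-8, the sketch below relies on the fact that continuation bytes always have the bit pattern 10xxxxxx. The helper names (is_continuation, count_code_points, truncate_utf8) are invented for this example, and the code assumes the input is already well-formed UTF-8. It does nothing about Grapheme Clusters.

```cpp
#include <cstddef>
#include <string>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation(unsigned char byte) {
    return (byte & 0xC0) == 0x80;
}

// Count code points: every byte that is not a continuation byte starts one.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if (!is_continuation(byte)) ++n;
    }
    return n;
}

// Keep at most max_bytes bytes without cutting a code point in half.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) return s;
    std::size_t end = max_bytes;
    // If the cut would land inside a code point, back up to its lead byte.
    while (end > 0 && is_continuation(static_cast<unsigned char>(s[end]))) --end;
    return s.substr(0, end);
}
```

With the five-byte string "a哈b", count_code_points gives 3, and truncate_utf8 with a limit of 2 returns just "a" rather than "a" plus a dangling lead byte.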


Picking std::string or std::u32string?

If performance is a concern, it is likely that std::string will perform better due to its smaller memory footprint, though heavy use of Chinese may change the deal: most Chinese characters take 3 bytes in UTF-8, against 4 in UTF-32. As always, profile.

If Grapheme Clusters are not a problem, then std::u32string has the advantage of simplifying things: 1 Code Unit -> 1 Code Point means that you cannot accidentally split Code Points, and all the functions of std::basic_string (indexing, size(), find, ...) work out of the box on whole Code Points.
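
If you go the std::u32string route, you need a conversion from UTF-8 at the boundary. The following is only a sketch of a hand-written decoder; decode_utf8 is a hypothetical name, and this version is deliberately lenient: it substitutes U+FFFD for stray or truncated sequences but does not reject overlong encodings or encoded surrogates, which a production decoder (or a library such as ICU) would.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into one char32_t per code point.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else { out += U'\uFFFD'; ++i; continue; }   // stray continuation or invalid lead
        if (i + len > in.size()) { out += U'\uFFFD'; break; }  // truncated at the end
        bool ok = true;
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);            // append 6 payload bits
        }
        if (ok) { out += cp; i += len; }
        else    { out += U'\uFFFD'; ++i; }          // resynchronize on the next byte
    }
    return out;
}
```

After decoding, indexing is per Code Point: for the UTF-8 bytes of "a哈b", the result has size 3 and element 1 is U+54C8.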

If you interface with software taking std::string or char*/char const*, then stick to std::string to avoid back-and-forth conversions. It'll be a pain otherwise.


UTF-8 in std::string.

UTF-8 actually works quite well in std::string.

Most operations work out of the box because the UTF-8 encoding is self-synchronizing and backward compatible with ASCII.

Due to the way Code Points are encoded, looking for a Code Point cannot accidentally match the middle of another Code Point:

  • str.find('\n') works,
  • str.find("...") works for matching byte by byte [1],
  • str.find_first_of("\r\n") works if searching for ASCII characters.
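
A small sketch of what byte-wise searching looks like in practice: the hypothetical helper count_occurrences below uses nothing but std::string::find, and because UTF-8 is self-synchronizing a multi-byte needle can only match at a real Code Point boundary (assuming both strings are valid UTF-8).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count non-overlapping occurrences of needle in
// haystack using plain byte-wise std::string::find.
std::size_t count_occurrences(const std::string& haystack, const std::string& needle) {
    if (needle.empty()) return 0;
    std::size_t count = 0;
    for (std::size_t pos = haystack.find(needle);
         pos != std::string::npos;
         pos = haystack.find(needle, pos + needle.size())) {
        ++count;
    }
    return count;
}
```

Note that the positions returned by find are byte offsets, not Code Point indices, which is fine as long as you only feed them back into other std::string functions.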

Similarly, regex should mostly work out of the box. Since a sequence of characters ("哈哈") is just a sequence of bytes, basic search patterns should work out of the box.

Be wary, however, of character classes (such as [:alnum:]), as depending on the regex flavor and implementation they may or may not match Unicode characters.

Similarly, be wary of applying quantifiers to non-ASCII "characters": "哈?" may only consider the last byte to be optional. Use parentheses to clearly delineate the repeated sequence of bytes in such cases: "(哈)?".
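
The quantifier pitfall can be sketched with std::regex, which works on bytes when used with std::string. The names kHa, match_ungrouped and match_grouped are made up for this example; 哈 is written as its three UTF-8 bytes so the snippet does not depend on the source file's encoding. Both patterns are meant to accept "ok" optionally preceded by 哈.

```cpp
#include <regex>
#include <string>

// "哈" (U+54C8) is the three bytes E5 93 88 in UTF-8.
const std::string kHa = "\xE5\x93\x88";

// Quantifier applied directly: only the final byte 0x88 becomes optional,
// so the first two bytes of the character are still required.
bool match_ungrouped(const std::string& text) {
    static const std::regex re(kHa + "?ok");
    return std::regex_match(text, re);
}

// Quantifier applied to a group: the whole three-byte sequence is optional.
bool match_grouped(const std::string& text) {
    static const std::regex re("(" + kHa + ")?ok");
    return std::regex_match(text, re);
}
```

With these definitions, match_grouped("ok") succeeds while match_ungrouped("ok") does not; worse, the ungrouped pattern accepts the malformed input consisting of the bytes E5 93 followed by "ok".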

[1] The key concepts to look up are normalization and collation; this affects all comparison operations. std::string will always compare (and thus sort) byte by byte, without regard for comparison rules specific to a language or a usage. If you need to handle full normalization/collation, you need a complete Unicode library, such as ICU.
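
To see why normalization and collation matter, consider this small sketch. The constants and the sort_bytewise helper are invented for illustration: the same visible "é" can be stored as one precomposed Code Point or as "e" plus a combining accent, and std::string sees two different byte sequences. Sorting is likewise by byte value, so every non-ASCII letter lands after "z".

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Two encodings of the same visible character "é":
const std::string kPrecomposed = "\xC3\xA9";   // U+00E9
const std::string kDecomposed  = "e\xCC\x81";  // 'e' + combining acute U+0301

// Hypothetical helper: sort with std::string's default byte-wise ordering.
std::vector<std::string> sort_bytewise(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    return words;
}
```

Here kPrecomposed == kDecomposed is false, and sorting "zebra", "école", "apple" byte-wise puts "école" last, which is not what a French dictionary would do.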

#2


9  

Both std::string and std::wstring must use a UTF encoding to represent Unicode. On macOS specifically, std::string holds UTF-8 (8-bit code units), and std::wstring holds UTF-32 (32-bit code units); note that the size of wchar_t is platform-dependent.

For both, size() tracks the number of code units rather than the number of code points or grapheme clusters. (A code point is one named Unicode entity, one or more of which form a grapheme cluster. Grapheme clusters are the visible characters that users interact with, like letters or emojis.)
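
As a quick illustration that size() counts code units, the sketch below stores the same single code point, U+1F600 (a grinning-face emoji), in three string types. The struct and function names are hypothetical; the literals use escapes so nothing depends on the source file's encoding.

```cpp
#include <cstddef>
#include <string>

// Hypothetical struct: number of code units for one code point per encoding.
struct CodeUnitCounts {
    std::size_t utf8;
    std::size_t utf16;
    std::size_t utf32;
};

CodeUnitCounts grinning_face_sizes() {
    const std::string    s8  = "\xF0\x9F\x98\x80";  // UTF-8 bytes of U+1F600
    const std::u16string s16 = u"\U0001F600";       // becomes a surrogate pair
    const std::u32string s32 = U"\U0001F600";       // a single 32-bit unit
    return {s8.size(), s16.size(), s32.size()};
}
```

The three sizes come out as 4, 2 and 1 even though the user sees one character in every case.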

Although I'm not familiar with the Unicode representation of Chinese, it's very possible that when you use UTF-32, the number of code units is often very close to the number of grapheme clusters. Obviously, however, this comes at the cost of using up to 4x more memory.

The most accurate solution would be to use a Unicode library, such as ICU, to calculate the Unicode properties that you are after.

Finally, UTF strings in human languages that don't use combining characters usually do pretty well with find/regex. I'm not sure about Chinese, but English is one of them.

#3


8  

std::string and friends are encoding-agnostic. The only difference between std::wstring and std::string is that std::wstring uses wchar_t as the individual element, not char. For most compilers the latter is 8-bit. The former is supposed to be large enough to hold any Unicode character, but in practice on some systems it isn't (Microsoft's compiler, for example, uses a 16-bit type). You can't store UTF-8 in std::wstring; that's not what it's designed for. It's designed to be an equivalent of UTF-32: a string where each element is a single Unicode codepoint.

If you want to index UTF-8 strings by Unicode codepoint or composed unicode glyph (or some other thing), count the length of a UTF-8 string in Unicode codepoints or some other unicode object, or find by Unicode codepoint, you're going to need to use something other than the standard library. ICU is one of the libraries in the field; there may be others.

Something that's probably worth noting is that if you're searching for ASCII characters, you can mostly process a UTF-8 bytestream byte by byte. Each ASCII character encodes the same in UTF-8 as it does in ASCII, and every multi-byte sequence in UTF-8 is guaranteed not to include any bytes in the ASCII range.
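
This guarantee is what makes simple tokenizing safe. As a sketch, the hypothetical split_on helper below splits on a single ASCII delimiter with plain byte-wise find; since no byte of a multi-byte sequence is in the ASCII range, a field boundary can never fall inside a Chinese character.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: split UTF-8 text on one ASCII delimiter byte.
// Empty fields are kept; the delimiter must be in the ASCII range.
std::vector<std::string> split_on(const std::string& text, char delim) {
    std::vector<std::string> fields;
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = text.find(delim, start);
        if (pos == std::string::npos) {
            fields.push_back(text.substr(start));  // last field
            break;
        }
        fields.push_back(text.substr(start, pos - start));
        start = pos + 1;                           // skip the delimiter
    }
    return fields;
}
```

Splitting "哈,ab,中" on ',' yields three fields, each still a valid UTF-8 string.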
