(Encoding) String handling in C++ - problems/best practices?

Date: 2021-10-23 20:20:10

What are the best practices for handling strings in C++? In particular, I'm wondering how to handle the following cases:

  • File input/output of text and XML files, which may be written in different encodings. What is the recommended way of handling this, and how do I retrieve the values? I guess an XML node may contain UTF-16 text, which I then have to work with somehow.

  • How to handle char* strings. After all, char can be signed or unsigned, and I wonder how to determine what encoding such strings use (ANSI?) and how to convert them to UTF-8. Is there any recommended reading on this, where the basic guarantees C/C++ makes about strings are documented?

  • String algorithms for UTF-8 and similar encodings -- computing the length, parsing, etc. How is this best done?

  • What character type is really portable? I've learned that wchar_t can be anything from 8 to 32 bits wide, which makes it a poor choice if I want to be consistent across platforms (especially when moving data between platforms -- this seems to be a problem, as described for example in EASTL, item #13).

At the moment, I'm using std::string everywhere, with a small helper utility to convert to UTF-16 when calling Unicode APIs, but I'm pretty sure that this is not really the best way. Using something like Qt's QString or the ICU string class seems right, but I wonder whether there is a more lightweight approach (i.e. if my char strings are ANSI-encoded, and the subset of ANSI that is used is equal to UTF-8, then I can simply treat the data as UTF-8, provide converters from/to UTF-8, and store it in std::string -- unless there are problems with this approach).
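
Something like the following is what I mean by the helper utility -- a minimal sketch assuming the std::string holds UTF-8 and the target is the wide-char Windows API (utf8_to_utf16 is my own name, not a standard function):

    #include <string>
    #include <windows.h>

    // Minimal sketch: convert a UTF-8 std::string to UTF-16 for Windows
    // wide-char APIs. Error handling is reduced to returning an empty string.
    std::wstring utf8_to_utf16(const std::string& s) {
        if (s.empty()) return std::wstring();
        int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                    s.data(), static_cast<int>(s.size()),
                                    nullptr, 0);            // measure only
        if (n <= 0) return std::wstring();
        std::wstring out(static_cast<size_t>(n), L'\0');
        MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            s.data(), static_cast<int>(s.size()),
                            &out[0], n);                    // actual conversion
        return out;
    }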

2 Answers

#1


For a shorter answer, I would just recommend using UTF-16 for simplicity; Java, C#, and Python 3.0 switched to that model exactly for simplicity. I've always expected wchar_t to be 16 or 32 bits wide, and many platforms support that; indeed, APIs like wcrtomb() do not allow an implementation to support a shift state for wchar_t*, but since UTF-8 needs none, it may be used, while other encodings are ruled out.
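
To make the width issue concrete, a small check (assuming a C++11 compiler, where char16_t provides a 16-bit code unit regardless of how wide wchar_t happens to be):

    #include <cstdio>
    #include <string>

    int main() {
        // wchar_t is 16 bits on Windows but 32 bits on most Unix-like
        // systems; char16_t is 16 bits on common platforms everywhere.
        std::printf("sizeof(wchar_t)  = %zu\n", sizeof(wchar_t));
        std::printf("sizeof(char16_t) = %zu\n", sizeof(char16_t));

        std::u16string s = u"UTF-16 text"; // portable UTF-16 storage
        std::printf("%zu UTF-16 code units\n", s.size());
    }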

Now, let me answer the question about XML.

File input/output of text and XML files, which may be written in different encodings. What is the recommended way of handling this, and how do I retrieve the values? I guess an XML node may contain UTF-16 text, which I then have to work with somehow.

I'm not sure, but I don't think so. Mixing two encodings in the same file is asking for trouble and data corruption. Encoding a file in UTF-16 is usually a bad choice, since most programs rely on using ASCII everywhere. The issue is: an XML file might use any single encoding, maybe even UTF-16, but then the initial encoding declaration, and even the tags, must be in UTF-16 as well. The problem I see with UTF-16 is: how should one reliably parse the initial declaration? The answer comes in the XML specification, § 4.3.3:

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

When reading that, note that an XML file is also an entity, called the document entity; in general, an entity is a storage unit for the document. From the whole specification, I'd say that only one encoding declaration is allowed per entity, and I'd convert all entities to UTF-16 when reading them, for easier handling.
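
To make the BOM rule concrete, here is a sketch of how the document entity's encoding could be detected (sniff_encoding is an illustrative name, not a library function; a real parser would also parse the encoding declaration itself):

    #include <fstream>
    #include <string>

    // Sketch of § 4.3.3 in code: look for a Byte Order Mark first;
    // without one, the entity must begin with an encoding declaration
    // (readable as ASCII-compatible bytes) or be UTF-8.
    std::string sniff_encoding(const std::string& path) {
        std::ifstream f(path, std::ios::binary);
        unsigned char b[3] = {0, 0, 0};
        f.read(reinterpret_cast<char*>(b), 3);
        std::streamsize got = f.gcount();
        if (got >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return "UTF-8";                    // UTF-8 BOM
        if (got >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return "UTF-16BE";                 // big-endian BOM
        if (got >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return "UTF-16LE";                 // little-endian BOM
        return "UTF-8";                        // default per the spec
    }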

#2


String algorithms for UTF-8 and similar encodings -- computing the length, parsing, etc. How is this best done?

mbrlen tells you the length of a multibyte character in a C string, so you can step through a string character by character. I don't think std::string can be used for multibyte strings; you should use std::wstring for wide ones.
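
For example, counting the characters (not bytes) in a multibyte string with mbrlen might look like this sketch, assuming the current locale's multibyte encoding is UTF-8:

    #include <clocale>
    #include <cstdio>
    #include <cstring>
    #include <cwchar>

    // Count the characters (not bytes) in a multibyte C string.
    size_t count_chars(const char* s) {
        std::mbstate_t state = std::mbstate_t();
        size_t count = 0;
        size_t left = std::strlen(s);
        while (left > 0) {
            size_t n = std::mbrlen(s, left, &state);
            if (n == (size_t)-1 || n == (size_t)-2)
                return (size_t)-1; // invalid or truncated sequence
            s += n;
            left -= n;
            ++count;
        }
        return count;
    }

    int main() {
        std::setlocale(LC_CTYPE, "en_US.UTF-8"); // assumes this locale exists
        // "naïve": 5 characters, 6 bytes in UTF-8 (ï is a two-byte sequence)
        std::printf("%zu\n", count_chars("na\xC3\xAFve"));
    }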

In general, you should probably stick with UTF-16 inside your program and use UTF-8 only for I/O (I don't know the other options well, but they are surely more complex and error-prone).

How to handle char* strings. After all, char can be signed or unsigned, and I wonder how to determine what encoding such strings use (ANSI?) and how to convert them to UTF-8. Is there any recommended reading on this, where the basic guarantees C/C++ makes about strings are documented?

Basically, you can use any encoding, and you will typically end up using the native encoding of the system you are running on, as long as it's an 8-bit encoding. C was born for ASCII, and locale handling was an afterthought. For years, each system understood mostly one native encoding, say ISO-8859-x, and files in another encoding could even be unrepresentable.

Since for UTF-8 strings one byte is not always one character, I guess the safest bet is to use multibyte strings for them. The C manuals I used described multibyte strings in the abstract, without details on these issues (in particular, on the encoding used). For C, see functions like mbrlen and mbrtowc. On my Linux system, the documentation notes that their behaviour depends on LC_CTYPE, which probably means that the encoding of multibyte strings is locale-dependent. From the documentation one can also infer that the API supports shift-state encodings, where you can shift from one-byte to two-byte mode and back.
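
As a sketch of what such handling looks like in code, here is a conversion of a locale-encoded char* string to a wide string with mbsrtowcs (to_wide is my own name; error handling is kept minimal):

    #include <cwchar>
    #include <stdexcept>
    #include <string>

    // Convert a multibyte C string (in the current locale's encoding) to a
    // wide string. Assumes setlocale(LC_CTYPE, ...) was called beforehand.
    std::wstring to_wide(const char* s) {
        std::mbstate_t state = std::mbstate_t();
        const char* src = s;
        size_t n = std::mbsrtowcs(nullptr, &src, 0, &state); // measure only
        if (n == (size_t)-1)
            throw std::runtime_error("invalid multibyte sequence");
        std::wstring out(n, L'\0');
        src = s;
        state = std::mbstate_t();
        std::mbsrtowcs(&out[0], &src, n, &state);            // convert
        return out;
    }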

How to handle char* strings. After all, char can be signed or unsigned,

If you rely on the signedness of char, you're doing it wrong. The signedness of char only matters if you use char as a numeric type, and then you should always use either unsigned char or signed char explicitly; in fact, you should pretend that plain char is neither unsigned nor signed, and that an expression like a > 0 (where a is a char) has unspecified semantics. But what would relying on it be useful for, anyway?
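
A small sketch of the pitfall, whose behaviour differs between platforms where char is signed and where it is unsigned:

    #include <cstdio>

    int main() {
        char c = '\xE9'; // a byte >= 0x80, e.g. from a UTF-8 sequence
        // Where char is signed, c is negative and (c > 0) is false;
        // where char is unsigned, (c > 0) is true. Don't rely on either.
        std::printf("c > 0 is %d\n", c > 0);

        // Viewing the byte as unsigned char is always well-defined.
        unsigned char u = static_cast<unsigned char>(c);
        std::printf("byte value: %u\n", static_cast<unsigned>(u));
    }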
