Cross-platform generic text handling in C/C++

Time: 2021-12-21 12:02:29

What's the current best practice for handling generic text in a platform independent way?

For example, on Windows there are the "A" and "W" versions of APIs. Down at the C layer we have the "_tcs" functions (like _tcscpy) which map to either "wcscpy" or "strcpy". And in the STL I've frequently used something like:

typedef std::basic_string<TCHAR> tstring;

What issues if any arise from these sorts of patterns on other systems?

3 Solutions

#1


Standard C++ has no generic (variable-width) character type like TCHAR. C++ does have wchar_t, but its encoding isn't guaranteed. C++11 (still called C++0x when this was first written) improves things considerably with char16_t and char32_t as well as UTF-8, UTF-16, and UTF-32 literals.
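
To make the difference between the encodings concrete, here is a minimal sketch that counts code units (not characters) for one code point, U+1F600, which lies outside the Basic Multilingual Plane. The function names are made up for illustration; only the literal prefixes and the string types come from the standard.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers: each returns the number of code units needed to
// store U+1F600 in one encoding. sizeof is used for the u8 literal so the
// code compiles both before and after C++20 (where u8 literals are char8_t).
inline std::size_t utf8_units()  { return sizeof(u8"\U0001F600") - 1; }            // minus the terminator
inline std::size_t utf16_units() { return std::u16string(u"\U0001F600").size(); }  // stored as a surrogate pair
inline std::size_t utf32_units() { return std::u32string(U"\U0001F600").size(); }  // one unit per code point
```

The same character takes 4, 2, and 1 code units respectively, which is why "length" only means something once the encoding is fixed.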

I personally am not a big fan of generic characters because they lead to some nasty problems (like conversion). What's more, if you are using a type (like TCHAR) that might ever be only 8 bits wide, you might as well write your code with char. If you really need that backwards compatibility, just use UTF-8; it is specifically designed to be a strict superset of ASCII. You may have to call conversion APIs (especially on Windows, which for some bizarre reason uses UTF-16), but at least it'll be consistent.
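
As a rough illustration of what such a conversion involves, here is a hand-written UTF-8 to UTF-16 sketch in portable C++. On Windows you would more likely call the platform's MultiByteToWideChar instead. The name utf8_to_utf16 is hypothetical, and the validation is deliberately minimal: it rejects bad lead bytes, bad continuation bytes, truncation, and values above U+10FFFF, but does not reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes and re-encode them as UTF-16.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        // The lead byte determines the sequence length and the first payload bits.
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte (10xxxxxx) contributes six more bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Keeping UTF-8 everywhere internally and calling something like this only at the edge of a UTF-16 API keeps the conversion cost in one place.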

EDIT: To actually answer the original question, other platforms typically have no such construct. You will have to define your TCHAR on that platform, or else use a library that provides one (but as you should no doubt be able to guess, I'm not a big fan of that concept in libraries either).
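
If you do go the "define your own TCHAR" route, a shim might look roughly like the sketch below. It only mimics a tiny part of what the Windows header tchar.h provides, it keys off the _UNICODE macro the way that header does, and greeting_length is a made-up function just to exercise it. One thing to keep in mind: wchar_t is 16 bits on Windows but typically 32 bits on Linux and macOS, so the "wide" build does not mean the same encoding everywhere.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <string>

#if defined(_WIN32)
  #include <tchar.h>   // the real definitions
#else
  // Hypothetical stand-ins for platforms without <tchar.h>. The names copy
  // Microsoft's, which are technically reserved identifiers elsewhere.
  #if defined(_UNICODE)
    typedef wchar_t TCHAR;
    #define _T(x) L##x
    #define _tcslen std::wcslen
  #else
    typedef char TCHAR;
    #define _T(x) x
    #define _tcslen std::strlen
  #endif
#endif

typedef std::basic_string<TCHAR> tstring;

// Illustrative only: the same source line works in both configurations.
inline std::size_t greeting_length() {
    const TCHAR* s = _T("hello");
    return _tcslen(s);
}
```

Every _tcs function and every literal you use needs the same treatment, which is a large part of why this approach gets tedious.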

#2


One thing to be careful of: make sure that all of your static libraries, and the modules that use them, are built with the same char format. Otherwise your code will compile but fail to link.

I typically create my own t types based on the STL types: tstring, tstringstream, and even Boost-based types like tpath_t.
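
A minimal sketch of such a family of t types follows. APP_WIDE_TEXT is a hypothetical project-wide build flag and number_to_tstring is a made-up example function; neither comes from a real library. Because tstring is a different type in each configuration, a function taking a tstring gets a different mangled name, which is exactly what produces the link errors when a library and its caller disagree on the flag.

```cpp
#include <sstream>
#include <string>

// Assumption: APP_WIDE_TEXT is defined (or not) identically for every
// library and module in the build.
#if defined(APP_WIDE_TEXT)
typedef wchar_t tchar;
#else
typedef char tchar;
#endif

typedef std::basic_string<tchar>        tstring;
typedef std::basic_stringstream<tchar>  tstringstream;
typedef std::basic_ostringstream<tchar> tostringstream;

// Formats an integer using whichever character type the build selected.
inline tstring number_to_tstring(int value) {
    tostringstream out;
    out << value;
    return out.str();
}
```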

#3


Unicode character set + the encoding that makes the most sense for your data. I typically use UTF-8 because it's convenient with traditional C / C++ functions and the data I deal with doesn't cause too much bloat.

Some APIs (Windows) and cross-language tools (Java) use UTF-16, so that might be a consideration.

One practice I wish we had been better at is to leave text as an array of bytes for low-tech operations like copying, simple comparison, simple searching, etc. When you need richer, more character-aware operations, you can convert to some super string (ICU strings are nice -- but heavy) and define the layers / entry points that need to do this, as opposed to naively doing it everywhere. The needless conversions killed our performance -- especially when combined with an XML DOM library which also uses the "super" strings.
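
As a small sketch of that split, assuming the text is well-formed UTF-8 held in a std::string: searching can stay at the byte level, while anything that needs to know about characters is a separate, explicit step. Both function names are hypothetical, and counting code points is only a stand-in for the kind of work you would really hand to a library such as ICU.

```cpp
#include <cstddef>
#include <string>

// Low-tech operation: plain byte search. For well-formed UTF-8 this is safe,
// because a well-formed needle can only match at code point boundaries.
inline bool contains(const std::string& haystack, const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}

// Character-aware operation, called only where it is actually needed:
// count code points by skipping continuation bytes (10xxxxxx).
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

Nothing is converted for the search; the bytes are only interpreted when a caller asks a question that is really about characters.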
