What are the best practices for handling Unicode strings in C#?

Time: 2021-01-01 20:18:20

Can somebody please point out the important aspects I should be aware of when handling Unicode strings in C#?

7 answers

#1


11  

Keep in mind that C# strings are sequences of Char values, which are UTF-16 code units, not Unicode code points. Some Unicode code points (those outside the Basic Multilingual Plane) require two Chars, a surrogate pair, and you should not split a string between those two Chars.
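
To make the code unit versus code point distinction concrete, here is a minimal sketch. The class and method names (`CodeUnitDemo`, `CountCodePoints`) are made up for illustration; only `string.Length` and `char.IsSurrogatePair` come from the framework.

```csharp
using System;

public static class CodeUnitDemo
{
    // Counts Unicode code points by treating each surrogate pair as one unit.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            // A high surrogate followed by a low surrogate encodes one code point.
            if (char.IsSurrogatePair(s, i))
            {
                i++; // skip the low surrogate
            }
            count++;
        }
        return count;
    }

    public static void Main()
    {
        // 'a', U+1F600 (outside the Basic Multilingual Plane), 'b'
        string s = "a\U0001F600b";
        Console.WriteLine(s.Length);                   // 4 (UTF-16 code units)
        Console.WriteLine(CountCodePoints(s));         // 3 (code points)
        Console.WriteLine(char.IsHighSurrogate(s[1])); // True
    }
}
```

Cutting this string with `s.Substring(0, 2)` would end in the middle of the pair and leave a lone high surrogate at the end of the result.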

In addition, Unicode code points may combine to form a single user-perceived 'character': for instance, a 'u' Char followed by a combining umlaut (diaeresis) Char. So you can't split strings between arbitrary code points either.
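
One way to avoid splitting inside a combined character is to walk the string by text element with `System.Globalization.StringInfo`. The sketch below uses a hypothetical helper, `SplitTextElements`, to show the idea. How more complex clusters (such as emoji sequences) are grouped may differ between .NET versions, so treat it as an illustration rather than a complete solution.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Splits a string into text elements (a base character plus its combining marks).
    public static List<string> SplitTextElements(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
        {
            result.Add(e.GetTextElement());
        }
        return result;
    }

    public static void Main()
    {
        // 'u' followed by U+0308 COMBINING DIAERESIS, then "ber"
        string s = "u\u0308" + "ber";
        List<string> elements = SplitTextElements(s);
        Console.WriteLine(s.Length);           // 5 (Chars)
        Console.WriteLine(elements.Count);     // 4 (text elements)
        Console.WriteLine(elements[0].Length); // 2 ('u' plus the combining mark)
    }
}
```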

Basically, it's a mess of issues, where any given issue may in practice only affect languages you don't know.

#2


7  

C# (and .NET in general) handles Unicode strings transparently, and you won't have to do anything special unless your application needs to read or write files with specific encodings. In those cases, you can convert managed strings to byte arrays in the encoding of your choice by using the Encoding classes in the System.Text namespace.
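
As a rough illustration of that conversion, the sketch below round-trips a string through UTF-8 bytes with `System.Text.Encoding`. The wrapper names (`EncodingRoundTrip`, `ToUtf8`, `FromUtf8`) are invented for the example.

```csharp
using System;
using System.Text;

public static class EncodingRoundTrip
{
    // Encodes a managed (UTF-16) string as UTF-8 bytes.
    public static byte[] ToUtf8(string s) => Encoding.UTF8.GetBytes(s);

    // Decodes UTF-8 bytes back into a managed string.
    public static string FromUtf8(byte[] bytes) => Encoding.UTF8.GetString(bytes);

    public static void Main()
    {
        string original = "caf\u00E9"; // "café"
        byte[] bytes = ToUtf8(original);
        Console.WriteLine(bytes.Length);                // 5 (the 'é' takes two bytes)
        Console.WriteLine(FromUtf8(bytes) == original); // True
    }
}
```

Swapping `Encoding.UTF8` for `Encoding.Unicode` (UTF-16) or `Encoding.UTF32` changes the byte layout but not the calling pattern.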

#3


2  

System.String already handles Unicode internally, so you are covered there. Best practice would be to use UTF-8 (System.Text.Encoding.UTF8, or the System.Text.UTF8Encoding class) when reading and writing files. It's about more than just files, however: anything that streams data out, including network connections, depends on the encoding. If you're using WCF, it defaults to UTF-8 for most of the bindings (in fact most don't allow ASCII at all).

UTF-8 is a good choice because, while it still supports the entire Unicode character set, it encodes every ASCII character as the same single byte that ASCII uses. Thus naive applications that don't support Unicode have some chance of reading and writing your application's data. Those applications will only begin to fail when you start using characters outside the ASCII range.

System.Text.Encoding.Unicode writes UTF-16, which takes a minimum of two bytes per character, making it larger for ASCII-heavy text and incompatible with ASCII. And System.Text.Encoding.UTF32, as you can guess, is larger still. I'm not sure of the real-world use case for UTF-16 and UTF-32, but perhaps they perform better when you have large numbers of extended characters. That's just a theory, but if it is true, then Japanese/Chinese developers making a product that will be used primarily in those languages might find UTF-16/32 a better choice.
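
The size trade-off can be checked directly with `Encoding.GetByteCount`. The following sketch (with a hypothetical `ByteCounts` helper) compares an ASCII string with a short Japanese one. It only measures encoded size, not speed, so it speaks to just one part of the theory above.

```csharp
using System;
using System.Text;

public static class EncodingSizeDemo
{
    // Returns the encoded size in bytes as { UTF-8, UTF-16, UTF-32 }, without any byte order mark.
    public static int[] ByteCounts(string s)
    {
        return new[]
        {
            Encoding.UTF8.GetByteCount(s),
            Encoding.Unicode.GetByteCount(s),
            Encoding.UTF32.GetByteCount(s),
        };
    }

    public static void Main()
    {
        string ascii = "hello";
        string japanese = "\u65E5\u672C\u8A9E"; // three characters, all in the Basic Multilingual Plane

        Console.WriteLine(string.Join(", ", ByteCounts(ascii)));    // 5, 10, 20
        Console.WriteLine(string.Join(", ", ByteCounts(japanese))); // 9, 6, 12
    }
}
```

For this sample, UTF-16 is smaller than UTF-8 on the Japanese text, while UTF-32 is the largest in both cases.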

#4


1  

Only think about encoding when reading and writing streams. Use TextReader and TextWriter to read and write text in different encodings. Always use UTF-8 if you have a choice.
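
A minimal sketch of that advice, using `StreamWriter` and `StreamReader` (the stream-backed TextWriter and TextReader) with an explicit encoding. A `MemoryStream` stands in for a file so the example is self-contained, and the `RoundTrip` helper is hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamEncodingDemo
{
    // Writes text to a stream in the given encoding, then reads it back with the same encoding.
    public static string RoundTrip(string text, Encoding encoding)
    {
        using (var buffer = new MemoryStream())
        {
            // leaveOpen: true keeps the MemoryStream usable after the writer is disposed.
            using (var writer = new StreamWriter(buffer, encoding, 1024, leaveOpen: true))
            {
                writer.Write(text);
            }

            buffer.Position = 0;
            using (var reader = new StreamReader(buffer, encoding))
            {
                return reader.ReadToEnd();
            }
        }
    }

    public static void Main()
    {
        string text = "na\u00EFve"; // "naïve"
        // UTF8Encoding(false) writes UTF-8 without a byte order mark.
        Console.WriteLine(RoundTrip(text, new UTF8Encoding(false)) == text); // True
        Console.WriteLine(RoundTrip(text, Encoding.Unicode) == text);        // True
    }
}
```

The key point is that the writer and the reader are given the same encoding explicitly instead of relying on defaults.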

Don't get confused by languages and cultures - that's a completely separate issue from Unicode.

#5


0  

.NET has relatively good i18n support. You don't really need to think about Unicode that much, as .NET strings and the built-in string functions generally do the right thing with Unicode. The only thing to bear in mind is that most formatting functions, for example DateTime.ToString(), use the thread's culture by default, which in turn defaults to the Windows culture. You can specify a different culture for formatting either on the current thread or on each method call.
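
Both options can be sketched as follows. The class and method names are hypothetical, and setting `CultureInfo.CurrentCulture` directly assumes .NET Framework 4.6 or later (or .NET Core / .NET 5+).

```csharp
using System;
using System.Globalization;

public static class CultureFormattingDemo
{
    // Per-call override: the culture is passed explicitly to ToString.
    public static string FormatInvariant(DateTime value) =>
        value.ToString("d", CultureInfo.InvariantCulture);

    // No culture argument: the current thread's culture decides the format.
    public static string FormatWithCurrentCulture(DateTime value) =>
        value.ToString("d");

    public static void Main()
    {
        var date = new DateTime(2021, 1, 1);

        // Output here depends on the machine's regional settings.
        Console.WriteLine(FormatWithCurrentCulture(date));

        // Explicit culture on the method call.
        Console.WriteLine(FormatInvariant(date)); // 01/01/2021

        // Changing the culture for the current thread affects later default formatting.
        CultureInfo.CurrentCulture = CultureInfo.InvariantCulture;
        Console.WriteLine(FormatWithCurrentCulture(date)); // 01/01/2021
    }
}
```

Passing the culture per call is usually the safer choice for data that is stored or exchanged, since the result no longer depends on the machine it runs on.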

The only time Unicode is an issue is when encoding/decoding strings to and from bytes.

#6


0  

As mentioned, .NET strings handle Unicode transparently. Besides file I/O, the other consideration is the database layer. SQL Server, for instance, distinguishes between VARCHAR (non-Unicode) and NVARCHAR (which handles Unicode). You also need to pay attention to the types of stored procedure parameters.

#7


0  

More details can be found in this thread:

http://discuss.joelonsoftware.com/default.asp?dotnet.12.189999.12
