Storing UTF-16 / Unicode data in SQL Server

Time: 2023-02-05 10:15:56

According to this, SQL Server 2005 uses UCS-2 internally. It can store UTF-16 data as UCS-2 (using the appropriate data types, nchar etc.), but a supplementary character is stored as two UCS-2 characters.
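
To make the "two UCS-2 characters" point concrete, here is a minimal sketch in Python rather than T-SQL. It assumes Python's `utf-16-le` codec is a fair stand-in for the 2-byte units held in an nchar/nvarchar value; the helper name `utf16_code_units` is made up for this illustration.

```python
# Illustration only: the "utf-16-le" codec stands in for the 2-byte units
# stored in an nchar/nvarchar value. utf16_code_units is a hypothetical helper.
def utf16_code_units(text):
    """Return the 16-bit code units that encode `text` in UTF-16."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

bmp_char = "A"                 # U+0041, inside the Basic Multilingual Plane
supplementary = "\U0001D11E"   # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in utf16_code_units(bmp_char)])       # ['0x41']
print([hex(u) for u in utf16_code_units(supplementary)])  # ['0xd834', '0xdd1e']
```

The second result is a surrogate pair: one logical character occupying two 16-bit units, which a UCS-2-only string function would count as two characters.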

This causes obvious problems with the string functions: what is logically one character is treated as two by SQL Server.

I am somewhat surprised that SQL Server is basically only able to handle UCS-2, and even more so that this is not fixed in SQL 2K8. I do appreciate that some of these characters may not be all that common.

Aside from the functions suggested in the article, are there any suggestions on the best approach for dealing with the (broken) string functions and UTF-16 data in SQL Server 2005?

3 solutions

#1


5  

SQL Server 2012 now supports UTF-16 including surrogate pairs. See http://msdn.microsoft.com/en-us/library/ms143726(v=sql.110).aspx, especially the section "Supplementary characters".

So one fix for the original problem is to adopt SQL Server 2012.

#2


2  

The string functions work fine with Unicode character strings; the ones that care about the number of characters treat a two-byte character as a single character, not two characters. The only ones to watch for are len() and datalength(), which return different values when used on Unicode data. Both return correct values, of course: len() returns the length in characters, and datalength() returns the length in bytes. They just happen to differ because of the two-byte characters.
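
The len()/datalength() distinction can be modelled outside the database. This is only a sketch in Python, not T-SQL: `model_len` and `model_datalength` are hypothetical names, and the model assumes a 2-byte-per-unit nvarchar value and that len() ignores trailing spaces while datalength() does not.

```python
# Hypothetical models of T-SQL LEN() and DATALENGTH() for an nvarchar value.
# Assumption: the value is held as UTF-16 with 2 bytes per code unit.
def model_len(value):
    """Count 2-byte units, ignoring trailing spaces as LEN() does."""
    stripped = value.rstrip(" ")
    return len(stripped.encode("utf-16-le")) // 2

def model_datalength(value):
    """Count bytes, including those of trailing spaces, as DATALENGTH() does."""
    return len(value.encode("utf-16-le"))

sample = "Привет "  # six Cyrillic letters followed by one trailing space
print(model_len(sample))         # 6
print(model_datalength(sample))  # 14
```

For text that stays inside the Basic Multilingual Plane, the two results differ only by the factor of two and any trailing spaces.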

So, as long as you use the proper functions in your code, everything should work transparently.

EDIT: Just double-checked Books Online; Unicode data has worked seamlessly with string functions since SQL Server 2000.

EDIT 2: As pointed out in the comments, SQL Server's string functions do not support the full Unicode character set, because they cannot parse the surrogate pairs that encode characters outside of plane 0 (in other words, SQL Server's string functions only recognize up to 2 bytes per character). SQL Server will store and return the data correctly; however, any string function that relies on character counts will not return the expected values. The most common way around this seems to be either processing the string outside SQL Server, or using CLR integration to add Unicode-aware string processing functions.
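
As a rough illustration of the "process the string outside SQL Server" option, the sketch below uses Python, whose str type indexes by code point. It assumes the client receives the nvarchar value as raw UTF-16-LE bytes; `count_code_points` and `left_code_points` are hypothetical helper names, not part of any SQL Server or driver API.

```python
# Sketch of surrogate-aware handling outside the database.
# Assumption: the column value arrives as raw UTF-16-LE bytes.
def count_code_points(utf16le):
    """Number of Unicode characters, counting a surrogate pair once."""
    return len(utf16le.decode("utf-16-le"))

def left_code_points(utf16le, n):
    """First n characters, re-encoded, without splitting a surrogate pair."""
    return utf16le.decode("utf-16-le")[:n].encode("utf-16-le")

value = "a\U0001D11Eb".encode("utf-16-le")  # 'a', G clef (surrogate pair), 'b'

print(len(value) // 2)                  # 4 code units, what a UCS-2 count sees
print(count_code_points(value))         # 3 actual characters
print(len(left_code_points(value, 2)))  # 6 bytes: 'a' plus the whole pair
```

Slicing by code units instead (taking the first two 2-byte units) would cut the pair in half and leave a lone high surrogate, which is the failure mode described above.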

#3


-2  

Something to add that I just learned the hard way:

If you use an "n" field in Oracle (I'm running 9i) and access it via the .NET OracleClient, it seems that only parameterized SQL will work. The N'string' Unicode prefix doesn't seem to do the trick if you have some inline SQL.

And by "work", I mean that inline SQL will lose any characters not supported by the base charset. In my case, English characters work fine, while Cyrillic turns into question marks/garbage.
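
The symptom can be reproduced in isolation. The Python sketch below is not Oracle's actual code path; it only models what happens when text is forced through a narrower character set on the way in. The choice of latin-1 as the base charset and the name `through_base_charset` are assumptions made for the illustration.

```python
# Models a lossy round trip through a narrower charset.
# Assumptions: latin-1 plays the role of the database's base charset, and
# unencodable characters are replaced rather than rejected.
def through_base_charset(text, charset="latin-1"):
    """Encode to the base charset, replacing unsupported characters, then decode."""
    return text.encode(charset, errors="replace").decode(charset)

print(through_base_charset("hello"))   # hello
print(through_base_charset("привет"))  # ??????
```

Once the replacement has happened the original characters are gone, which is consistent with the data arriving damaged in the column rather than merely being displayed incorrectly.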

This is a fuller discussion on the subject: http://forums.oracle.com/forums/thread.jspa?threadID=376847

I wonder whether the ORA_NCHAR_LITERAL_REPLACE variable can be set in the connection string or something.
