How do I convert a Unicode string into a UTF-8 or UTF-16 string?

My VS2005 project uses the Unicode character set, while the SQLite C/C++ interface provides

int sqlite3_open(
  const char *filename,   /* Database filename (UTF-8) */
  sqlite3 **ppDb          /* OUT: SQLite db handle */
);
int sqlite3_open16(
  const void *filename,   /* Database filename (UTF-16) */
  sqlite3 **ppDb          /* OUT: SQLite db handle */
);

for opening a database file. How can I convert a string, CString, or wstring into UTF-8 or UTF-16?

Thanks very much!

5 answers

#1

Short answer:

No conversion is required if you use Unicode strings such as CString or wstring. Use sqlite3_open16(). You will have to make sure you pass a WCHAR pointer to the API, cast to void *. For a CString, for example: (void*)(LPCWSTR)strFilename. (The cast seems lame! Even if this lib is cross-platform, I guess they could have defined a wide char type that depends on the platform and is friendlier than a void *.)

The longer answer:

You don't have a Unicode string that you want to convert to UTF8 or UTF16. You have a Unicode string represented in your program using a given encoding: Unicode is not a binary representation per se. Encodings say how the Unicode code points (numerical values) are represented in memory (binary layout of the number). UTF8 and UTF16 are the most widely used encodings. They are very different though.

When a VS project says "Unicode charset", it actually means "characters are encoded as UTF16". Therefore, you can use sqlite3_open16() directly. No conversion required. Characters are stored in the WCHAR type (as opposed to char), which takes 16 bits. WCHAR falls back on the standard C type wchar_t, which takes 16 bits on Win32 but might be different on other platforms. Thanks for the correction, Checkers.

There's one more detail that you might want to pay attention to: UTF16 exists in 2 flavors, Big Endian and Little Endian. That's the byte ordering of each 16-bit unit. The function prototype you give for UTF16 doesn't say which ordering is used, but the SQLite documentation describes the sqlite3_open16() filename as UTF-16 in the native byte order, so you're pretty safe passing the same endianness that Windows uses (Little Endian IIRC. I know the order but have always had a problem with the names :-) ).
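
To make the byte-ordering point concrete, here is a minimal sketch in portable C++. The helper names (to_utf16le, to_utf16be, native_is_little_endian) are made up for this illustration and are not part of SQLite or the Windows API. It shows how one 16-bit code unit is laid out in each flavor and how to check which one the current machine uses natively.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper: lay out one UTF-16 code unit in little-endian
// order (low byte first), the order used by UTF-16LE.
std::array<unsigned char, 2> to_utf16le(char16_t unit) {
    return {static_cast<unsigned char>(unit & 0xFF),
            static_cast<unsigned char>(unit >> 8)};
}

// Hypothetical helper: same code unit in big-endian order (high byte first).
std::array<unsigned char, 2> to_utf16be(char16_t unit) {
    return {static_cast<unsigned char>(unit >> 8),
            static_cast<unsigned char>(unit & 0xFF)};
}

// Hypothetical helper: true when this machine stores the low byte of a
// 16-bit integer first, i.e. its native UTF-16 flavor is little endian.
bool native_is_little_endian() {
    const std::uint16_t probe = 0x0001;
    unsigned char first = 0;
    std::memcpy(&first, &probe, 1);  // read the first byte in memory
    return first == 0x01;
}
```

For U+20AC (the euro sign), to_utf16le gives the bytes 0xAC 0x20 and to_utf16be gives 0x20 0xAC. A wide string held in memory is already in the machine's native order, so no swapping is needed before passing it to an API that expects native-order UTF-16.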

EDIT: Answer to comment by Checkers:

UTF16 uses 16-bit code units. Under Win32 (and only on Win32), wchar_t is used as that storage unit. The trick is that some Unicode characters require a sequence of 2 such 16-bit code units. These sequences are called surrogate pairs.

In the same way, UTF8 represents 1 character using a sequence of 1 to 4 bytes, yet UTF8 strings are stored using the char type.
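
As a rough illustration of the two points above, the following sketch encodes a single code point as UTF-16 and reports how many bytes it would take in UTF-8. The function names encode_utf16 and utf8_length are invented for this example, and the input is assumed to be a valid code point (at most U+10FFFF and not itself a surrogate).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Assumes cp is a valid Unicode scalar value.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Needs a surrogate pair: split the remaining 20 bits in two halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}

// Hypothetical helper: number of bytes the same code point takes in UTF-8.
std::size_t utf8_length(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}
```

For example, 'A' is one UTF-16 code unit and one UTF-8 byte, while U+1F600 becomes the surrogate pair 0xD83D 0xDE00 in UTF-16 and four bytes in UTF-8.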

#2

Use the WideCharToMultiByte function. Specify CP_UTF8 for the CodePage parameter.

CHAR buf[256]; // or whatever size you need
WideCharToMultiByte(
  CP_UTF8,         // target encoding: UTF-8
  0,               // no special conversion flags
  StringToConvert, // the wide (UTF-16) string you have
  -1,              // input length; -1 means the string is null-terminated
  buf,             // output buffer
  _countof(buf),   // output buffer size in bytes; pass 0 to get the required size as the return value
  NULL,            // default char: must be NULL for CP_UTF8
  NULL             // used-default flag: must be NULL for CP_UTF8
);

Also, the default encoding for Unicode apps on Windows is UTF-16LE, so you might not need to perform any translation at all and can just use the second version, sqlite3_open16.
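
If you want to see what such a conversion involves, or need it somewhere the Win32 API is not available, here is a portable sketch. It is not how WideCharToMultiByte is implemented; utf16_to_utf8 is a name made up for this example, and replacing unpaired surrogates with U+FFFD is a choice of this sketch.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: convert UTF-16 code units to a UTF-8 byte string.
// Unpaired surrogates are replaced with U+FFFD (a choice of this sketch).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;  // consumed the low surrogate as well
            } else {
                cp = 0xFFFD;  // high surrogate with no partner
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate with no preceding high surrogate
        }
        // Emit the code point as 1 to 4 UTF-8 bytes.
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The result of utf16_to_utf8(path).c_str() is a null-terminated UTF-8 string of the kind sqlite3_open() expects, assuming path is a std::u16string holding the file name.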

#3

All the C++ string types are charset neutral. They just settle on a character width and make no further assumptions. A wstring uses 16-bit characters on Windows, corresponding roughly to utf-16, but it still depends on what you store in the string. The wstring doesn't in any way enforce that the data you put in it must be valid utf16. Windows uses utf16 when UNICODE is defined though, so most likely your strings are already utf16, and you don't need to do anything.
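
Since the string class itself enforces nothing, any validity check has to be done by hand. A minimal sketch, assuming the data is held in a std::u16string (on Windows, a wstring holds the same 16-bit units); is_well_formed_utf16 is a name invented for this example:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true if every surrogate in s is part of a proper
// high + low pair, i.e. the sequence is well-formed UTF-16.
bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            // A high surrogate must be followed by a low surrogate.
            if (i + 1 == s.size() || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF) {
                return false;
            }
            ++i;  // skip the low surrogate that was just checked
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;  // low surrogate without a preceding high one
        }
    }
    return true;
}
```

The string type happily stores a lone 0xD83D, which this check rejects, while a complete pair such as 0xD83D 0xDE00 passes.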

A few others have suggested using the WideCharToMultiByte function, which is (one of) the way(s) to go to convert utf16 to utf8. But since sqlite can handle utf16, that shouldn't be necessary.

#4

utf-8 and utf-16 are both "unicode" character encodings. What you are probably talking about is utf-32, which is a fixed-size character encoding. Maybe searching for

"Convert utf-32 into utf-8 or utf-16"

will turn up some results or other articles on this.

#5

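The simplest way to do this is to use CStringA. The CString class is a typedef for either CStringA (ANSI, narrow char version) or CStringW (wide char version). Both of these classes have constructors to convert between string types. Note that the conversion to CStringA goes through the current ANSI code page rather than UTF-8, so it is only safe for file names made of plain ASCII characters. I typically use: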

sqlite3_open(CStringA(L"MyWideCharFileName"), ...);
