Storing Unicode UTF-8 strings in std::string

Date: 2023-01-05 22:26:18

In response to the discussions in:

Cross-platform strings (and Unicode) in C++

How to deal with Unicode strings in C/C++ in a cross-platform friendly way?

I'm trying to assign a UTF-8 string to a std::string variable in the Visual Studio 2010 environment:

std::string msg = "महसुस";

However, when I view the string in the debugger, I only see "?????". I have the file saved as Unicode (UTF-8 with signature), and the project's character set is "Use Unicode Character Set".

"महसुस" is a Nepali word. It contains 5 characters (code points) and should occupy 15 bytes in UTF-8, but the Visual Studio debugger shows the size of msg as 5.
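
To make the byte arithmetic concrete, here is a minimal sketch (the names countCodePoints and kSample are illustrative, not from the question). std::string::size() counts bytes, not characters, so a correctly stored UTF-8 copy of the word should report 15. A reported size of 5 would be consistent with the compiler having replaced each character with a single '?' while converting the literal to the local code page, though that is an inference from the symptoms rather than something the question confirms.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 encoded std::string by skipping
// continuation bytes (those of the form 10xxxxxx). Assumes valid UTF-8.
std::size_t countCodePoints(const std::string &utf8)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < utf8.size(); ++i)
    {
        unsigned char byte = static_cast<unsigned char>(utf8[i]);
        if ((byte & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// The question's word, spelled as explicit UTF-8 bytes so the result does
// not depend on how the source file is saved.
const std::string kSample =
    "\xE0\xA4\xAE\xE0\xA4\xB9\xE0\xA4\xB8\xE0\xA5\x81\xE0\xA4\xB8";
```

With these definitions, kSample.size() is the byte count and countCodePoints(kSample) is the character count, so comparing the two against the debugger's value is a quick way to tell whether the bytes were stored intact.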

My question is:

How do I use std::string to just store the UTF-8 bytes, without needing to manipulate them?

5 Answers

#1


8  

If you were using C++11 (through C++17; in C++20 a u8 literal is an array of char8_t and will not initialize a std::string), this would be easy:

std::string msg = u8"महसुस";

But since you are not, you can use escape sequences instead of relying on the source file's charset to manage the encoding for you. This also makes your code more portable (in case you accidentally save the file in a non-UTF-8 format):

std::string msg = "\xE0\xA4\xAE\xE0\xA4\xB9\xE0\xA4\xB8\xE0\xA5\x81\xE0\xA4\xB8"; // "महसुस"
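
One pitfall with this approach is worth a short sketch: a \x escape consumes every hexadecimal digit that follows it, so mixing escaped bytes with ordinary text that starts with a-f or 0-9 can silently change the escape. The function name makeLabel below is hypothetical and exists only to illustrate the workaround of splitting the literal.

```cpp
#include <string>

// "\xE0\xA4\xAE" is the UTF-8 encoding of U+092E. Writing "\xE0\xA4\xAEabc"
// in one literal would make the compiler read the last escape as \xAEabc,
// because 'a', 'b' and 'c' are all hex digits. Ending the literal ends the
// escape; adjacent literals are concatenated at compile time.
std::string makeLabel()
{
    return "\xE0\xA4\xAE" "abc";
}
```

The result holds the three UTF-8 bytes followed by the three ASCII letters, which is what the single-literal form appears to say but does not produce.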

Otherwise, you might consider doing a conversion at runtime instead (Windows-only, since it uses the Win32 WideCharToMultiByte function):

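#include <string>
#include <windows.h>

// Converts a UTF-16 std::wstring to a UTF-8 std::string (Windows only).
std::string toUtf8(const std::wstring &str)
{
    std::string ret;
    // First call: ask how many bytes the UTF-8 result needs.
    int len = WideCharToMultiByte(CP_UTF8, 0, str.c_str(), static_cast<int>(str.length()), NULL, 0, NULL, NULL);
    if (len > 0)
    {
        ret.resize(len);
        // Second call: write the converted bytes into the buffer.
        WideCharToMultiByte(CP_UTF8, 0, str.c_str(), static_cast<int>(str.length()), &ret[0], len, NULL, NULL);
    }
    return ret;
}

std::string msg = toUtf8(L"महसुस");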
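
Where the Win32 call is not available, the same runtime conversion can be sketched by hand. The following is not part of the original answer: appendUtf8 is a hypothetical helper that encodes one Unicode code point at a time, so UTF-16 input (as in a Windows std::wstring) would first need its surrogate pairs combined into code points.

```cpp
#include <string>

// Appends the UTF-8 encoding of one Unicode code point to `out`.
// Returns false for surrogates and values above U+10FFFF, leaving `out`
// unchanged in those cases.
bool appendUtf8(std::string &out, unsigned long cp)
{
    if (cp <= 0x7F)
    {
        out += static_cast<char>(cp);
    }
    else if (cp <= 0x7FF)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0xFFFF)
    {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return false; // surrogate halves are not valid code points
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0x10FFFF)
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        return false;
    }
    return true;
}
```

Calling appendUtf8 with 0x092E, the first letter of the question's word, appends the bytes E0 A4 AE, which matches the start of the escaped literal shown earlier in this answer.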

#2


4  

If you have C++11, you can write u8"महसुस". Otherwise, you'll have to write the actual byte sequence, using a hex escape of the form \xNN for each byte in the UTF-8 sequence.

Typically, you're better off reading such text from a configuration file.
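
As a rough illustration of that suggestion, the sketch below loads a file's raw bytes into a std::string. The helper name readFileBytes is an assumption made for this example, and it treats the whole file as the text rather than parsing any particular configuration format.

```cpp
#include <fstream>
#include <iterator>
#include <string>

// Reads a whole file into a std::string without any character conversion.
// Opening in binary mode keeps the bytes exactly as stored on disk.
// Returns an empty string if the file cannot be opened.
std::string readFileBytes(const char *path)
{
    std::ifstream in(path, std::ios::in | std::ios::binary);
    if (!in)
        return std::string();
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

If the file was saved as UTF-8 with a signature, the returned string begins with the three BOM bytes EF BB BF, which the caller may want to skip before using the text.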

#3


4  

You can write msg.c_str(),s8 in the Watch window to see the UTF-8 string correctly.

#4


1  

There is a way to display the right values thanks to the 's8' format specifier. If we append ',s8' to the variable names in the Watch window, Visual Studio reparses the text as UTF-8 and renders it correctly.

If you are using Microsoft Visual Studio 2008 Service Pack 1, you need to apply this hotfix:

http://support.microsoft.com/kb/980263

#5


1  

If you set the system locale to English and the file is in UTF-8 without a BOM, Visual C++ will let you store the string as-is. I have written an article about this here.
