How to use UTF-8 in C++ and convert other encodings to UTF-8

Time: 2023-01-06 14:33:29

I don't know how to solve this:

Imagine we have 4 websites:

  • A: UTF-8
  • B: ISO-8859-1
  • C: ASCII
  • D: UTF-16

My program, written in C++, does the following: it downloads a website and parses it. But it has to understand the content. My problem is not the parsing, which is done with ASCII characters like ">" or "<".

The problem is that the program should extract all words from the website's text. A word is any combination of alphanumeric characters. I then send these words to a server. The database and the web frontend use UTF-8. So my questions are:

  • How can I convert "any" (or the most used) character encoding to UTF-8?
  • How can I work with UTF-8 strings in C++? I think wchar_t does not work because it is 2 bytes long. Code points in UTF-8 are up to 4 bytes long...
  • Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8 strings?

Please note: I do not produce any output (like std::cout) in C++. I just filter out the words and send them to the server.

I know about UTF8-CPP, but it has no is*() functions. And from what I have read, it does not convert from other character encodings to UTF-8, only from UTF-* to UTF-8.

Edit: I forgot to say that the program has to be portable: Windows, Linux, ...

4 answers

#1


9  

How can I convert "any" (or the most used) character encoding to UTF-8?

ICU (International Components for Unicode) is the solution here. It is generally considered to be the last word in Unicode support. Even Boost.Locale and Boost.Regex use it when it comes to Unicode. See my comment on Dory Zidon's answer as to why I recommend using ICU directly instead of wrappers (like Boost).

You create a converter for a given encoding...

#include <unicode/ucnv.h>

UErrorCode err = U_ZERO_ERROR;
// Open a converter for the source encoding; err reports failure.
UConverter * converter = ucnv_open( "ISO-8859-1", &err );
if ( U_SUCCESS( err ) )
{
    // ...
    ucnv_close( converter );
}

...and then use the UnicodeString class as appropriate.

I think wchar_t does not work because it is 2 bytes long.

The size of wchar_t is implementation-defined. AFAICR, it is 2 bytes on Windows (UCS-2 / UTF-16, depending on the Windows version) and 4 bytes on Linux (UTF-32). In any case, since the standard doesn't define Unicode semantics for wchar_t, using it is non-portable guesswork. Don't guess, use ICU.

Are there functions like isspace(), isalnum(), strlen(), tolower() for such UTF-8-strings?

Not for the UTF-8 encoding itself, but you don't use that internally anyway. UTF-8 is good for external representation, but internally UTF-16 or UTF-32 is the better choice. The abovementioned functions do exist for Unicode code points (i.e., UChar32), for example u_isspace(), u_isalnum() and u_tolower(); see uchar.h.
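
To illustrate what "UTF-32 internally" means, here is a minimal standard-library sketch that decodes a UTF-8 byte string into code points. The function name decode_utf8 is hypothetical and not part of ICU; the sketch replaces malformed input with U+FFFD and does not reject overlong forms or surrogates, which a real decoder such as ICU's would.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-32 code points (hypothetical helper, not ICU API).
// Malformed bytes become U+FFFD; overlong forms are NOT rejected here.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte
        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size()) { ok = false; break; }           // truncated
            unsigned char c = static_cast<unsigned char>(in[i]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }       // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : static_cast<char32_t>(0xFFFD));
    }
    return out;
}
```

Each element of the result is one code point, so it could be handed to a per-code-point classifier such as ICU's u_isalnum().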

Please note: I do not do any output(like std::cout) in C++. Just filtering out the words and send them to the server.

Have a look at ICU's BreakIterator, which can split text at word boundaries.

Edit: I forgot to say, that the program has to be portable: Windows, Linux, ...

In case I haven't said it already: do use ICU and save yourself tons of trouble. Even if it seems a bit heavyweight at first glance, it is the best implementation out there, it is extremely portable (I use it on Windows, Linux, and AIX myself), and you will use it again and again in projects to come, so time invested in learning its API is not wasted.

#2


3  

Not sure if this will give you everything you're looking for, but it might help a little. Have you tried looking at:

1) The Boost.Locale library? Boost.Locale was released in Boost 1.48 (November 15th, 2011) and makes it easier to convert from and to UTF-8/UTF-16.

Here are some convenient examples from the docs:

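// Separate excerpts, not one program; the functions live in
// namespace boost::locale::conv (#include <boost/locale.hpp>).
string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);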

2) Or at the conversions that are part of C++11?

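#include <codecvt>
#include <locale>
#include <string>
#include <cassert>

// Note: std::wstring_convert and <codecvt> are deprecated since C++17.
int main() {
  // Converts between UTF-32 code points and UTF-8 bytes.
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
  // U+05E9 (Hebrew letter shin) encodes to two bytes in UTF-8.
  std::string utf8 = convert.to_bytes(0x5e9);
  assert(utf8.length() == 2);
  assert(utf8[0] == '\xD7');
  assert(utf8[1] == '\xA9');
}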

#3


1  

How can I work with UTF-8-strings in C++? I think wchar_t does not work because it is 2 bytes long. Code-Points in UTF-8 are up to 4 bytes long...

This is easy: there is a project named tinyutf8, which is a drop-in replacement for std::string/std::wstring.

You can then operate on code points, while their representation always stays encoded in chars.


How can I convert "any" (or the most used) character encoding to UTF-8?

You might want to have a look at std::codecvt_utf8 and similar templates from <codecvt> (C++11).

#4


0  

UTF-8 is an encoding that uses multiple bytes for non-ASCII characters; ASCII is a 7-bit code, and every byte of a multi-byte sequence has the 8th bit set. As such you won't find '\' or '/' inside a multi-byte sequence. isdigit also works (though not for Arabic and other non-ASCII digits).

It is a superset of ASCII and can hold all Unicode characters, so it is definitely fine to use with char and std::string.
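
These two properties can be sketched in a few lines of standard C++. The helper names utf8_length and count_ascii are hypothetical, and the sketch assumes the input is already valid UTF-8: continuation bytes always look like 10xxxxxx, so counting the other bytes gives the number of code points, and searching for an ASCII byte can never match in the middle of a multi-byte character.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Continuation bytes of a multi-byte sequence have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Number of code points in a string that is assumed to be valid UTF-8
// (a code-point-aware stand-in for strlen; hypothetical helper).
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s)
        if (!is_continuation(b)) ++n;  // count lead bytes and ASCII bytes only
    return n;
}

// Count occurrences of an ASCII character. A plain byte search is safe
// because bytes below 0x80 never appear inside a multi-byte sequence.
std::size_t count_ascii(const std::string& s, char ascii) {
    return static_cast<std::size_t>(std::count(s.begin(), s.end(), ascii));
}
```

This is why parsing on '<' and '>' keeps working unchanged on UTF-8 input, while anything that needs per-character semantics has to look at code points.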

Inspect the HTTP headers (header names are case-insensitive). They are in ISO-8859-1 and are followed by an empty line and then the HTML content.

Content-Type: text/html; charset=UTF-8

If the header does not specify a charset, the HTML itself might contain one of:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="UTF-8">      <!-- HTML5 -->
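
A minimal C++ sketch of pulling the charset out of such a Content-Type value follows; the same function also works on the content attribute of the http-equiv meta tag. The name charset_from_content_type is hypothetical, and real-world headers can be messier than this handles.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Returns the lower-cased charset named in a Content-Type value,
// or an empty string if none is present (hypothetical helper).
std::string charset_from_content_type(const std::string& value) {
    // Lower-case a copy so the search for "charset=" is case-insensitive.
    std::string lower(value.size(), '\0');
    std::transform(value.begin(), value.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos) return std::string();
    pos += key.size();

    // Skip an optional opening quote.
    if (pos < lower.size() && (lower[pos] == '"' || lower[pos] == '\'')) ++pos;

    // The value ends at a quote, a semicolon, or whitespace.
    std::size_t end = pos;
    while (end < lower.size()) {
        char c = lower[end];
        if (c == '"' || c == '\'' || c == ';' || c == ' ' || c == '\t') break;
        ++end;
    }
    return lower.substr(pos, end - pos);
}
```

The returned name can then be used to pick a converter; an empty result means falling back to the meta tags or to a default.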

ISO-8859-1 is Latin-1, and you might do better to convert from Windows-1252, the Windows extension of Latin-1 that uses 0x80 - 0x9F for some special characters such as typographic (comma-shaped) quotes. Even browsers on macOS will interpret these bytes that way although ISO-8859-1 was specified.
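
For plain ISO-8859-1 no table is needed, because each byte value is also the Unicode code point. A minimal sketch (the function name latin1_to_utf8 is hypothetical); note that it does not apply the Windows-1252 reinterpretation of 0x80 - 0x9F described above:

```cpp
#include <string>

// Convert ISO-8859-1 bytes to UTF-8. In Latin-1 the byte value equals
// the code point, so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte doubles
    for (unsigned char c : in) {
        if (c < 0x80) {
            out.push_back(static_cast<char>(c));                  // ASCII as-is
        } else {
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));    // 110000xx
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));  // 10xxxxxx
        }
    }
    return out;
}
```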

Conversion libraries: already mentioned by @syam.

Conversion

Let's not consider UTF-16. One can read the headers and the start of the document, up to a meta statement for the charset, as single-byte chars.

The conversion from a single-byte encoding to UTF-8 can happen via a table, for instance one generated with Java: a const char* table[] indexed by the byte value.

table[157] = "\xEF\xBF\xBD";


import java.io.UnsupportedEncodingException;

// Wrapper class (the name is arbitrary) so the snippet compiles on its own.
public class TableGenerator {
    // Prints a C/C++ table mapping each windows-1252 byte to its UTF-8 bytes.
    public static void main(String[] args) {
        final String SOURCE_ENCODING = "windows-1252";
        byte[] sourceBytes = new byte[1];
        System.out.println("    const char* table[] = {");
        for (int c = 0; c < 256; ++c) {
            String comment = "";
            System.out.printf("       /* %3d */ \"", c);
            if (32 <= c && c < 127) {
                // Printable ASCII: emit as-is, escaping quote and backslash.
                if (c == '\"' || c == '\\')
                    System.out.print("\\");
                System.out.print((char)c);
            } else {
                if (c == 0) {
                    // "\x00" is an empty C string, so entry 0 cannot be used.
                    comment = " // Unusable";
                }
                sourceBytes[0] = (byte)c;
                try {
                    byte[] targetBytes = new String(sourceBytes, SOURCE_ENCODING).getBytes("UTF-8");
                    for (int j = 0; j < targetBytes.length; ++j) {
                        int b = targetBytes[j] & 0xFF;
                        System.out.printf("\\x%02X", b);
                    }
                } catch (UnsupportedEncodingException ex) {
                    comment = " // " + ex.getMessage().replaceAll("\\s+", " "); // No newlines.
                }
            }
            System.out.print("\"");
            if (c < 255) {
                System.out.print(",");
            }
            System.out.println(comment); // Append the comment, if any.
        }
        System.out.println("    };");
    }
}
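
On the C++ side, the table approach described above might look like the following sketch. Instead of 256 pre-encoded strings it stores only the code points for 0x80 - 0x9F, where Windows-1252 differs from Latin-1, and encodes on the fly. The names kCp1252High, append_utf8 and cp1252_to_utf8 are hypothetical; the five bytes Windows-1252 leaves undefined are mapped to U+FFFD, matching the table[157] entry shown earlier.

```cpp
#include <cstdint>
#include <string>

// Code points for Windows-1252 bytes 0x80..0x9F; undefined bytes -> U+FFFD.
static const std::uint16_t kCp1252High[32] = {
    0x20AC, 0xFFFD, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0xFFFD, 0x017D, 0xFFFD,
    0xFFFD, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0xFFFD, 0x017E, 0x0178
};

// Append a code point below U+10000 as UTF-8 (enough for this table).
static void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Convert Windows-1252 bytes to UTF-8 using the lookup table above;
// all other bytes equal their Latin-1 code point.
std::string cp1252_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 3);
    for (unsigned char c : in) {
        std::uint32_t cp = (c >= 0x80 && c <= 0x9F) ? kCp1252High[c - 0x80] : c;
        append_utf8(out, cp);
    }
    return out;
}
```

A full 256-entry string table as generated by the Java program trades a little memory for skipping the encoding step; both give the same bytes.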
