UTF-8 to ASCII using the ICU library

Date: 2023-01-10 09:51:51

I have a std::string with UTF-8 characters in it.
I want to convert the string to its closest equivalent with ASCII characters.

For example:

Łódź => Lodz
Assunção => Assuncao
Schloß => Schloss

Unfortunately, the ICU library is really unintuitive and I haven't found good documentation on its usage, so it would take me too much time to learn to use it. Time I don't have.

Could someone give a small example of how this can be done?
Thanks.

5 answers

#1


3  

I don't know about ICU, but iconv does this and it's quite easy to learn. It only takes about 3-4 calls, and what you need in your case is to set the ICONV_SET_TRANSLITERATE flag using iconvctl().

#2


3  

Try this: ucnv_convert("US-ASCII", "UTF-8", target, targetsize, source, sourcesize, pError)

#3


1  

I wrote a callback that decomposes and then does some substitution. It could probably be implemented as a transliteration. The code is here: decompcb.c, and the header is nearby. Install it as follows on a Unicode-to-ASCII converter:

ucnv_setFromUCallBack(gConverter, &UCNV_FROM_U_CALLBACK_DECOMPOSE, &status);

Then use gConverter to convert from Unicode to ASCII.

#4


0  

This isn't an area I'm an expert in, but if you don't have a library handy that does it for you easily, then you might be better off just creating a lookup table/map from UTF-8 to ASCII values, i.e. the key is the UTF-8 character and the value is the ASCII sequence of chars.
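
A minimal sketch of that lookup-table idea, using only the standard library. The names utf8_len and to_ascii_lookup are made up for illustration, the table covers only the characters from the question's examples, and replacing unmapped characters with '?' is an assumption rather than anything the answer specifies.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper: byte length of the UTF-8 sequence that starts with lead byte c.
static std::size_t utf8_len(unsigned char c) {
    if (c < 0x80) return 1;          // plain ASCII
    if ((c >> 5) == 0x06) return 2;  // 110xxxxx
    if ((c >> 4) == 0x0E) return 3;  // 1110xxxx
    if ((c >> 3) == 0x1E) return 4;  // 11110xxx
    return 1;                        // invalid lead byte: consume one byte
}

// Hypothetical function: maps each UTF-8 character to an ASCII replacement.
std::string to_ascii_lookup(const std::string& in) {
    // Key: UTF-8 bytes of one character. Value: its ASCII replacement.
    static const std::unordered_map<std::string, std::string> table = {
        {"\xC5\x81", "L"},   // U+0141 (L with stroke)
        {"\xC3\xB3", "o"},   // U+00F3 (o with acute)
        {"\xC5\xBA", "z"},   // U+017A (z with acute)
        {"\xC3\xA7", "c"},   // U+00E7 (c with cedilla)
        {"\xC3\xA3", "a"},   // U+00E3 (a with tilde)
        {"\xC3\x9F", "ss"},  // U+00DF (sharp s)
    };
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t n = utf8_len(lead);
        if (n > in.size() - i) n = in.size() - i;  // truncated sequence at the end
        if (lead < 0x80) {
            out += in[i];  // ASCII passes through unchanged
        } else {
            auto it = table.find(in.substr(i, n));
            // Assumption: unmapped characters become '?'.
            out += (it != table.end()) ? it->second : std::string("?");
        }
        i += n;
    }
    return out;
}
```

Calling to_ascii_lookup on the UTF-8 bytes of "Schloß" returns "Schloss". The obvious cost of this approach is that the table has to be extended by hand for every character you care about.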

#5


0  

The ß->ss decomposition tells me you want the compatibility decomposition. In ICU, you need the Normalizer class for that. Afterwards, you will end up with something like L'odz'. From this string, you can simply remove the non-ASCII characters. That last step doesn't need ICU; plain STL will do.
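
The final "remove the non-ASCII characters" step could look roughly like the sketch below. It assumes the string has already been decomposed elsewhere (for example by ICU normalization, which is not shown here), and strip_non_ascii is a made-up name. Note that characters with no Unicode decomposition, such as Ł (U+0141) and ß (U+00DF), are simply dropped by this step rather than mapped to L or ss.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: drops every byte outside the ASCII range 0x00-0x7F.
// In UTF-8, every byte of a non-ASCII code point has its high bit set,
// so this removes whole characters (such as combining accents), never half of one.
std::string strip_non_ascii(std::string s) {
    s.erase(std::remove_if(s.begin(), s.end(),
                           [](unsigned char c) { return c >= 0x80; }),
            s.end());
    return s;
}
```

For an input such as "o" followed by the combining acute accent U+0301 (bytes CC 81), only the "o" survives.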
