How do I convert UTF-8 combining characters into a single UTF-8 character in Ruby?

Time: 2022-10-29 13:24:32

Some characters, such as the Unicode character 'LATIN SMALL LETTER C WITH CARON' (U+010D), can be encoded in UTF-8 as 0xC4 0x8D, but can also be represented by the two code points 'LATIN SMALL LETTER C' and 'COMBINING CARON', which are encoded as 0x63 0xCC 0x8C.
More info here: http://www.fileformat.info/info/unicode/char/10d/index.htm

I wonder whether there is a library that can convert 'LATIN SMALL LETTER C' + 'COMBINING CARON' into 'LATIN SMALL LETTER C WITH CARON'. Or is there a table containing these conversions?

3 Answers

#1


6  

Generally, you use Unicode Normalization to do this.


Calling UnicodeUtils.nfkc from the unicode_utils gem (http://unicode-utils.rubyforge.org/) should give you the specific behavior you're asking for. Unicode Normalization Form KC applies a compatibility decomposition and then recomposes the string into precomposed characters where they exist, which is basically what your example asks for. (You may also get close to what you want with Normalization Form C, usually abbreviated NFC.)
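As a minimal sketch of that composition step, and assuming Ruby 2.2 or later, the built-in String#unicode_normalize can stand in for the gem. This is not the UnicodeUtils API named above, and the helper name compose is hypothetical, chosen only for illustration.

```ruby
# Hypothetical helper: compose with NFKC using Ruby's built-in normalizer
# (assumes Ruby >= 2.2; this is a stand-in for UnicodeUtils.nfkc).
def compose(str)
  str.unicode_normalize(:nfkc)
end

decomposed = "c\u030C"            # LATIN SMALL LETTER C + COMBINING CARON
composed   = compose(decomposed)

p composed == "\u010D"            # compare with LATIN SMALL LETTER C WITH CARON
p composed.bytes.map { |b| format('0x%02X', b) }  # UTF-8 bytes of the result
```

Passing :nfc instead of :nfkc gives the same result for this particular input, because the pair has a canonical precomposed form.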

The question "How to replace the Unicode gem on Ruby 1.9?" has additional details.

In Ruby 1.8.7, you'd need to run gem install unicode instead; the Unicode gem provides a similar function.

Edited to add: The main reason you'll probably want Normalization Form KC rather than just Normalization Form C is that ligatures (characters that are joined together for historical or typographical reasons) are first decomposed into their individual characters, which is sometimes desirable if you're doing lexicographic ordering or searching.
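The difference between the two forms can be illustrated with a ligature. The sketch below again assumes Ruby 2.2 or later and uses String#unicode_normalize rather than the gem; the helper names to_nfc and to_nfkc are hypothetical.

```ruby
# Hypothetical helpers wrapping Ruby's built-in normalizer (assumes Ruby >= 2.2).
def to_nfc(str)
  str.unicode_normalize(:nfc)
end

def to_nfkc(str)
  str.unicode_normalize(:nfkc)
end

ligature = "\uFB01"                          # LATIN SMALL LIGATURE FI

p to_nfc(ligature)  == ligature              # NFC keeps the ligature as one code point
p to_nfkc(ligature) == "fi"                  # NFKC splits it into "f" + "i"
p to_nfc("c\u030C") == to_nfkc("c\u030C")    # both forms compose c + caron identically
```

If the text should keep its ligatures intact, NFC is the safer choice; NFKC is lossy in that respect.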

#2


7  

These conversions don't always exist. The combination of U+0063 (c) with U+030C (combining caron) can be represented as a single character, for instance, but there's no precomposed character representing a lowercase 'w' with a caron (w̌).


Nevertheless, there are libraries that can perform this composition where possible. Look for a Unicode function called "NFC" (Normalization Form C, canonical composition). See, for instance: http://unicode-utils.rubyforge.org/classes/UnicodeUtils.html#M000015
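To make the "where possible" caveat concrete, here is a small sketch that counts code points after NFC. It assumes Ruby 2.2 or later for String#unicode_normalize, and the helper name code_points_after_nfc is hypothetical.

```ruby
# Hypothetical helper: normalize to NFC and return the resulting code points
# (assumes Ruby >= 2.2 for String#unicode_normalize).
def code_points_after_nfc(str)
  str.unicode_normalize(:nfc).codepoints
end

# c + combining caron has a precomposed form, so one code point remains.
p code_points_after_nfc("c\u030C").map { |cp| format('U+%04X', cp) }

# w + combining caron has no precomposed form, so both code points remain.
p code_points_after_nfc("w\u030C").map { |cp| format('U+%04X', cp) }
```

Code that assumes one code point per visible character after normalization will therefore still miscount strings like the second one.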

#3


0  

String#encode can be used for this since Ruby 1.9. UTF-8-MAC is a variant of NFD in which code points in the ranges U+2000 to U+2FFF, U+F900 to U+FAFF, and U+2F800 to U+2FAFF are not decomposed. See https://developer.apple.com/library/mac/qa/qa1173/_index.html for the details. UTF-8-HFS can also be used instead of UTF-8-MAC.

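# coding: utf-8

s = "\u010D"                     # LATIN SMALL LETTER C WITH CARON (precomposed)
s.encode!('UTF-8-MAC', 'UTF-8')  # transcode to UTF-8-MAC, which decomposes it
s.force_encoding('UTF-8')        # relabel the bytes as plain UTF-8 for comparison

p "\x63\xcc\x8c" == s            # compare with the decomposed byte sequence
p "\u0063" == s[0]               # first character: LATIN SMALL LETTER C
p "\u030C" == s[1]               # second character: COMBINING CARON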
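The snippet above goes from the composed form to the decomposed one. As a hedged sketch of the round trip back, the following decomposes with the same UTF-8-MAC transcoding and then recomposes with String#unicode_normalize(:nfc), which assumes Ruby 2.2 or later and is not part of this answer. The helper names decompose_mac and recompose are hypothetical.

```ruby
# Hypothetical helper: decompose via the UTF-8-MAC transcoder, as in the answer,
# returning a new string labeled as plain UTF-8.
def decompose_mac(str)
  str.encode('UTF-8-MAC', 'UTF-8').force_encoding('UTF-8')
end

# Hypothetical helper: recompose with the built-in normalizer (assumes Ruby >= 2.2).
def recompose(str)
  str.unicode_normalize(:nfc)
end

decomposed = decompose_mac("\u010D")

p decomposed.codepoints.map { |cp| format('U+%04X', cp) }  # code points after decomposition
p recompose(decomposed) == "\u010D"                        # back to the precomposed character
```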
