Some characters, such as the Unicode character 'LATIN SMALL LETTER C WITH CARON', can be encoded in UTF-8 as 0xC4 0x8D, but can also be represented with the two code points 'LATIN SMALL LETTER C' and 'COMBINING CARON', which encode as 0x63 0xCC 0x8C.
More info here: http://www.fileformat.info/info/unicode/char/10d/index.htm
I wonder if there is a library which can convert a 'LATIN SMALL LETTER C' + 'COMBINING CARON' into 'LATIN SMALL LETTER C WITH CARON'. Or is there a table containing these conversions?
3 solutions
#1
6
Generally, you use Unicode Normalization to do this.
UnicodeUtils.nfkc from the unicode_utils gem ( http://unicode-utils.rubyforge.org/) should get you the specific behavior you're asking for. Unicode Normalization Form KC applies a compatibility decomposition and then converts the string to a composed form where one is available, which is basically what your example asks for. Normalization Form C (usually abbreviated NFC), which applies a canonical decomposition followed by composition, also handles your example.
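As a minimal sketch of the composition described above, the snippet below uses String#unicode_normalize rather than the unicode_utils gem. This is an assumption on my part: the method is built into Ruby 2.2 and later and is not mentioned in the answers here, so on older Rubies the gem remains the route to take.

```ruby
# Assumes Ruby 2.2+, where String#unicode_normalize is built in.
decomposed = "c\u030C"                          # 'c' followed by COMBINING CARON
composed   = decomposed.unicode_normalize(:nfc) # compose where possible

p decomposed.length                             # 2 code points
p composed.length                               # 1 code point
p composed == "\u010D"                          # true
p composed.bytes.map { |b| format("0x%02X", b) } # ["0xC4", "0x8D"]
```

Passing :nfkc instead of :nfc gives the same result for this particular string.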
The question "How to replace the Unicode gem on Ruby 1.9?" has additional details.
In Ruby 1.8.7, you'd need to do gem install Unicode, which provides a similar function.
Edited to add: The main reason why you'll probably want Normalization Form KC instead of just Normalization Form C is that ligatures (characters that are squeezed together for historical or typographical reasons) will first be decomposed into their individual characters, which is sometimes desirable if you're doing lexicographic ordering or searching.
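To make the ligature point concrete, here is a small sketch comparing the two forms on U+FB01 (LATIN SMALL LIGATURE FI). It again assumes Ruby 2.2 or later and uses the built-in String#unicode_normalize, which is my substitution for the gem named above.

```ruby
# Assumes Ruby 2.2+, where String#unicode_normalize is built in.
ligature = "\uFB01"                          # LATIN SMALL LIGATURE FI

nfc  = ligature.unicode_normalize(:nfc)      # canonical: ligature is kept
nfkc = ligature.unicode_normalize(:nfkc)     # compatibility: ligature is split

p nfc == ligature                            # true
p nfkc == "fi"                               # true
p nfkc.length                                # 2
```

A search for "fi" would therefore match the NFKC string but not the NFC one.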
#2
7
These conversions don't always exist. The combination of U+0063 (c) with U+030C (combining caron) can be represented as a single character, for instance, but there's no precomposed character representing a lowercase 'w' with a caron (w̌).
Nevertheless, there exist libraries which can perform this composition where possible. Look for a Unicode function called "NFC" (Normalization Form C, canonical composition). See, for instance: http://unicode-utils.rubyforge.org/classes/UnicodeUtils.html#M000015
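The "where possible" caveat can be seen in a short sketch. It assumes Ruby 2.2 or later and the built-in String#unicode_normalize, which is not the library linked above, so treat it as an illustration rather than that library's API.

```ruby
# Assumes Ruby 2.2+, where String#unicode_normalize is built in.
c_caron = "c\u030C".unicode_normalize(:nfc)  # a precomposed form exists
w_caron = "w\u030C".unicode_normalize(:nfc)  # no precomposed form exists

p c_caron.length           # 1
p w_caron.length           # 2
p w_caron == "w\u030C"     # true: the string is left as two code points
```

Code that counts characters or compares strings after NFC should therefore still expect combining marks to appear.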
#3
0
String#encode can be used since Ruby 1.9 to go in the opposite direction, from the precomposed form to the decomposed one. UTF-8-MAC is a variant of NFD in which code points in the ranges U+2000 to U+2FFF, U+F900 to U+FAFF, and U+2F800 to U+2FAFF are not decomposed. See https://developer.apple.com/library/mac/qa/qa1173/_index.html for the details. UTF-8-HFS can also be used instead of UTF-8-MAC.
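# coding: utf-8
s = "\u010D"                     # LATIN SMALL LETTER C WITH CARON (precomposed)
s.encode!('UTF-8-MAC', 'UTF-8')  # decompose in place; s is now tagged UTF-8-MAC
s.force_encoding('UTF-8')        # relabel the same bytes as plain UTF-8
p "\x63\xcc\x8c" == s            # byte sequence for 'c' + COMBINING CARON
p "\u0063" == s[0]               # first code point: LATIN SMALL LETTER C
p "\u030C" == s[1]               # second code point: COMBINING CARON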
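For comparison, a sketch of the same decomposition with plain NFD, which has no excluded ranges. It assumes Ruby 2.2 or later, where String#unicode_normalize is built in; that method is my addition and is not part of the answer above.

```ruby
# Assumes Ruby 2.2+, where String#unicode_normalize is built in.
s   = "\u010D"                          # precomposed c with caron
nfd = s.unicode_normalize(:nfd)         # canonical decomposition

p nfd == "c\u030C"                      # true
p nfd.bytes == [0x63, 0xCC, 0x8C]       # true
p nfd.unicode_normalize(:nfc) == s      # true: NFC restores the precomposed form
```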