Matching CJK Extension B in Objective-C

Date: 2022-09-07 11:20:48

I'm having trouble matching CJK Extension B characters in an NSString.

Wikipedia, CJK Unified Ideographs Extension B:

CJK Unified Ideographs Extension B is a Unicode block containing rare and historic CJK ideographs for Chinese, Japanese, Korean, and Vietnamese.

The Unicode block spans U+20000 to U+2A6DF. I'm using the regex [\\ud840-\\ud868][\\udc00-\\udfff]|\\ud869[\\udc00-\\uded6] to match CJK Extension B characters.

Here is my code:

NSString *searchedString = @"𠀀"; // First character (U+20000)
NSString *pattern = @"[\\ud840-\\ud868][\\udc00-\\udfff]|\\ud869[\\udc00-\\uded6]";

NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:pattern options:NSRegularExpressionCaseInsensitive error:nil];
if ([regex numberOfMatchesInString:searchedString options:0 range:NSMakeRange(0, [searchedString length])] > 0) {
    NSLog(@"matches");
} else {
    NSLog(@"doesn't match");
}

Output : doesn't match

For example, if I try something simpler with a Hiragana character, it works:

NSString *searchedString = @"ひ";

NSString *pattern = @"[\\u3040-\\u309F]";

Output : matches

Any help would be much appreciated. Thanks.

1 Solution

#1


You can use the \Uhhhhhhhh notation to match Unicode characters outside the BMP (Basic Multilingual Plane).

According to the ICU regex docs:

\Uhhhhhhhh     Match the character with the hex value hhhhhhhh. Exactly eight hex digits must be provided, even though the largest Unicode code point is \U0010ffff.

So, use

NSString *pattern = @"[\\U00020000-\\U0002A6DF]+";

See the online Obj-C demo
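
For completeness, here is a minimal sketch of how that pattern might be dropped into the code from the question. The helper name ContainsCJKExtensionB is hypothetical (it is not from the original post), and the sketch assumes a plain Foundation command-line program. The likely reason the surrogate-based pattern fails is that ICU matches whole code points rather than individual UTF-16 units, so the code point range is written directly instead.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the original post): returns YES if the
// string contains at least one code point in the range U+20000..U+2A6DF.
static BOOL ContainsCJKExtensionB(NSString *string) {
    // "\\U" in Objective-C source reaches the regex engine as "\U",
    // which must be followed by exactly eight hex digits.
    NSString *pattern = @"[\\U00020000-\\U0002A6DF]+";
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:pattern
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return NO;
    }
    // The search range is measured in UTF-16 units, so a single
    // Extension B character contributes 2 to -length.
    NSRange fullRange = NSMakeRange(0, string.length);
    return [regex numberOfMatchesInString:string
                                  options:0
                                    range:fullRange] > 0;
}

int main(void) {
    @autoreleasepool {
        NSString *extB = @"\U00020000"; // First character of the block (U+20000)
        NSString *hiragana = @"ひ";      // U+3072, outside the block
        NSLog(@"%@", ContainsCJKExtensionB(extB) ? @"matches" : @"doesn't match");
        NSLog(@"%@", ContainsCJKExtensionB(hiragana) ? @"matches" : @"doesn't match");
    }
    return 0;
}
```

With this pattern the first string should be reported as a match and the Hiragana string should not. Passing a real NSError instead of nil also makes it easier to spot a malformed pattern.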
