How do I match only fully composed characters in a Unicode string in Perl?

Date: 2021-05-07 21:44:48

I'm looking for a way to match only fully composed characters in a Unicode string.

Is [:print:] dependent on locale in any regular expression implementation that supports this character class? For example, will it match the Japanese character 'あ', since that is not a control character, or does [:print:] always mean ASCII codes 0x20 to 0x7E?

Is there any character class, including in Perl REs, that can be used to match anything other than a control character? If [:print:] includes only characters in the ASCII range, I would assume [:cntrl:] does too.

5 answers

#1


6  

echo あ| perl -nle 'BEGIN{binmode STDIN,":utf8"} print"[$_]"; print /[[:print:]]/ ? "YES" : "NO"'

This mostly works, though it generates a warning about a wide character because STDOUT has no encoding layer. But it gives you the idea: you must be sure you're dealing with a real Unicode string (check utf8::is_utf8). Or just read perlunicode; the whole subject still makes my head spin.
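
As a rough sketch of the same check without the warning, the script below decodes the input bytes explicitly and also puts an encoding layer on STDOUT. The helper name is_printable_utf8 is made up for illustration, and the input is assumed to be UTF-8.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the original answer): decode raw UTF-8
# bytes into a character string, then test for a printable character.
sub is_printable_utf8 {
    my ($bytes) = @_;
    my $chars = decode('UTF-8', $bytes);    # bytes -> characters
    return $chars =~ /[[:print:]]/ ? 1 : 0;
}

# Encoding the output as well is what avoids the "Wide character" warning.
binmode STDOUT, ':encoding(UTF-8)';

my $bytes = "\xE3\x81\x82";                 # UTF-8 bytes of U+3042 (あ)
printf "[%s] %s\n", decode('UTF-8', $bytes),
    is_printable_utf8($bytes) ? 'YES' : 'NO';
```

For a one-liner, the -CS switch documented in perlrun is another way to put UTF-8 layers on the standard streams.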

#2


5  

I think you don't want or need locales for that, but rather Unicode. If you have decoded a text string, \w will match word characters in any language, \d matches not just 0..9 but every Unicode digit, and so on. In regexes you can query Unicode properties with \p{PropertyName}. Particularly interesting for you might be \p{Print}. Here's a list of all the available Unicode character properties.
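
To make that concrete, here is a small sketch that classifies single characters with \d, \w, \p{Print} and \p{Cntrl}. The classify helper and the sample characters are my own choices for illustration, not part of the answer; escapes are used so the script does not depend on the source file's encoding.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: label a one-character string using Unicode-aware
# classes. The order matters: digits are also word characters, and word
# characters are also printable.
sub classify {
    my ($char) = @_;
    return 'control'   if $char =~ /\A\p{Cntrl}\z/;
    return 'digit'     if $char =~ /\A\d\z/;
    return 'word'      if $char =~ /\A\w\z/;
    return 'printable' if $char =~ /\A\p{Print}\z/;
    return 'other';
}

binmode STDOUT, ':encoding(UTF-8)';

# U+3042 hiragana A, U+0663 Arabic-Indic digit three, "!", U+0007 BEL
for my $char ("\x{3042}", "\x{0663}", "!", "\x{07}") {
    printf "U+%04X %s\n", ord($char), classify($char);
}
```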

I wrote an article about the basics and subtleties of Unicode and Perl. It should give you a good idea of what to do so that perl recognizes your string as a sequence of characters, not just a sequence of bytes.

Update: with Unicode you don't get language-dependent behaviour, but instead sane defaults regardless of language. This may or may not be what you want, but for the distinction between printable and control characters I don't see why you'd need language-dependent behaviour.

#3


4  

\X matches a fully-composed character (sequence). Proof:

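#!/usr/bin/env perl
use 5.010;
use warnings;
use Encode qw(encode_utf8);

# Escapes keep the precomposed and decomposed forms distinguishable in source.
my @strings = (
    "\x{3042}",          # あ: a normal character
    "\x{3054}",          # ご: a precomposed character
    "\x{3053}\x{3099}",  # こ + combining voiced sound mark: a combining sequence
    "\x{3099}",          # the combining mark on its own
);
for my $string (@strings) {
    my $verdict = $string =~ /\A \X \z/msx ? 'ok' : 'nok';
    say encode_utf8("$verdict $string");
}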

The test data are: a normal character, a pre-combined character, a combining character sequence and a combining character (which "doesn't count" on its own, a simplification of Chapter 3 of Unicode).
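
A related sketch, assuming the goal is to treat a combining sequence as one user-perceived character: count \X matches and compare that with the code point count. The grapheme_count helper is a name invented here; NFD and NFC come from the core Unicode::Normalize module.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper: number of \X matches (grapheme clusters) in a string.
sub grapheme_count {
    my ($chars) = @_;
    my $count = () = $chars =~ /\X/g;
    return $count;
}

my $precomposed = "\x{3054}";          # ご as a single code point
my $decomposed  = NFD($precomposed);   # U+3053 followed by U+3099

for my $string ($precomposed, $decomposed) {
    printf "code points: %d, grapheme clusters: %d\n",
        length($string), grapheme_count($string);
}
```

Normalizing with NFC first is one way to get the precomposed form back when a single code point exists for the sequence.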

Substitute \X with [[:print:]] to see that Tanktalus' answer produces false matches for the last two cases.

#4


2  

Yes, those expressions are locale dependent.

#5


1  

You could always use the character class [^[:cntrl:]] to match non-control characters.
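
A minimal sketch of that idea, assuming the input is already a decoded character string; is_control_free is a hypothetical helper name.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: true if the string is non-empty and contains no
# control characters. \z (not $) is used so a trailing newline, which is
# itself a control character, is not silently accepted.
sub is_control_free {
    my ($chars) = @_;
    return $chars =~ /\A[^[:cntrl:]]+\z/ ? 1 : 0;
}

for my $string ("\x{3042}bc", "a\tb", "abc\n") {
    printf "%d characters: %s\n", length($string),
        is_control_free($string) ? 'no control characters' : 'has control characters';
}
```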
