Perl regex group does not display Unicode characters correctly

Date: 2022-10-02 12:17:02

If I use the following code, the regex group does not show the expected Unicode string. Can somebody explain whether I made a mistake, or whether this could be an intrinsic problem in Perl itself?

echo 'éá'|perl -ne 'if ( /(\P{L}+)/ ) { print $1; }'
�

Even if I take this explanation into account and add the UTF-8 encoding layers to perl, it still does not give me the string 'éá' for the regex group:

echo 'éá'|perl -CS -ne 'if ( /(\P{L}+)/ ) { print $1,$_; }'

éá

The output for the group seems to be empty apart from a newline character.

Any help is much appreciated.

1 solution

#1


In your input, éá are 2 Unicode letters. \P{L} is a construct matching any character other than a Unicode letter.

So switching to the opposite construct, \p{L}, which matches any Unicode letter, will fix your issue.

Use

/(\p{L}+)/
