Regexp doesn't work for specific special characters in Perl

Time: 2021-06-26 00:22:25

I can't get rid of the special characters ¤ and ❤ in a string:

$word = 'cɞi¤r$c❤u¨s';
$word =~ s/[^a-zöäåA-ZÖÄÅ]//g;
printf "$word\n";

On the second line I try to remove all non-alphabetic characters from the string $word. I would expect the word circus to be printed, but instead I get:

ci�rc�us

The öäå and ÖÄÅ in the expression are ordinary letters of the Swedish alphabet that I need to keep.

3 solutions

#1


11  

If the characters are in your source code, be sure to add use utf8; to the script. If they are read from a file, call binmode $FILEHANDLE, ':utf8' on the handle before reading.

Be sure to read perldoc perlunicode.
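
The file case can be sketched as follows. This is a minimal example, not the answerer's code: it assumes the input is UTF-8, uses an in-memory filehandle to stand in for a real file, and keep_letters is a hypothetical helper name.

```perl
use strict;
use warnings;
use utf8;                      # the character class below contains UTF-8 literals

# Hypothetical helper: strip everything except ASCII and Swedish letters.
sub keep_letters {
    my ($text) = @_;
    $text =~ s/[^a-zöäåA-ZÖÄÅ]//g;
    return $text;
}

# Stand-in for a file: an in-memory filehandle holding the raw UTF-8 bytes
# of the test string from the question.
my $bytes = "c\xC9\x9Ei\xC2\xA4r\$c\xE2\x9D\xA4u\xC2\xA8s\n";
open my $fh, '<', \$bytes or die "cannot open in-memory file: $!";
binmode $fh, ':utf8';          # decode bytes into characters while reading

my $line = <$fh>;
close $fh;
chomp $line;

binmode STDOUT, ':utf8';       # encode characters again on output
print keep_letters($line), "\n";
```

Without the binmode call on $fh, the line would arrive as individual bytes and the substitution would run into the same problem as in the question.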

#2


3  

Short answer: add use utf8; so that the string literals in your source code are interpreted as UTF-8. That covers both the test string and the regexp.

Long answer:

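#!/usr/bin/env perl

use strict;
use warnings;
use Encode;

my $word = 'cɞi¤r$c❤u¨s';

# Print the ordinal of every character Perl sees in the test string.
foreach my $char (split //, $word) {
    print ord($char) . Encode::encode_utf8(":$char ");
}

my $allowed_chars = 'a-zöäåA-ZÖÄÅ';

print "\n";

# Do the same for the characters of the allowed set.
foreach my $char (split //, $allowed_chars) {
    print ord($char) . Encode::encode_utf8(":$char ");
}

print "\n";

# Remove everything that is not in the allowed set.
$word =~ s/[^$allowed_chars]//g;

# print rather than printf: the string is data, not a format.
print Encode::encode_utf8("$word\n");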

Executing it without utf8:

$ perl utf8_regexp.pl
99:c 201:É 158: 105:i 194:Â 164:¤ 114:r 36:$ 99:c 226:â 157: 164:¤ 117:u 194:Â 168:¨ 115:s 
97:a 45:- 122:z 195:Ã 182:¶ 195:Ã 164:¤ 195:Ã 165:¥ 65:A 45:- 90:Z 195:Ã 150: 195:Ã 132: 195:Ã 133: 
ci¤rc¤us

Executing it with utf8:

$ perl -Mutf8 utf8_regexp.pl
99:c 606:ɞ 105:i 164:¤ 114:r 36:$ 99:c 10084:❤ 117:u 168:¨ 115:s 
97:a 45:- 122:z 246:ö 228:ä 229:å 65:A 45:- 90:Z 214:Ö 196:Ä 197:Å 
circus

Explanation:

The non-ASCII characters you type into your source code are each represented by more than one byte, because your source file is UTF-8 encoded. In a Latin-1 encoded file, characters such as ö, ä and å would each have been a single byte.

Without the utf8 pragma, Perl treats every byte of the source as a separate character, as the output of the two split loops shows. With the utf8 pragma, Perl decodes each multi-byte sequence into a single character according to the rules of the UTF-8 encoding.
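
The difference between the two views can be reproduced in isolation. The sketch below is an illustration, not part of the script above; it writes the UTF-8 bytes of ö with escapes so that it does not depend on the encoding of the file, and uses Encode::decode to do by hand what use utf8 does for source literals.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for the letter o with diaeresis (U+00F6).
my $bytes = "\xC3\xB6";

# Undecoded, Perl sees two separate one-byte characters.
my @as_bytes = map { ord } split //, $bytes;

# Decoded, Perl sees a single character.
my $chars    = decode('UTF-8', $bytes);
my @as_chars = map { ord } split //, $chars;

print "undecoded: @as_bytes\n";   # undecoded: 195 182
print "decoded:   @as_chars\n";   # decoded:   246
```

These are the same numbers that appear for ö in the two runs above.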

As you can see, some of the bytes that make up the Swedish characters happen to coincide with bytes that make up characters in your test string, and those bytes are kept. Specifically, ä is encoded in UTF-8 as the bytes 195 and 164. The byte 164 therefore ends up in the set of allowed characters, so the final byte of ¤ (194 164) and of ❤ (226 157 164) passes through.
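
To check the overlap directly, the byte sequences can be listed side by side. This is a small sketch; utf8_bytes is a hypothetical helper, and the characters are given as code points so the example does not rely on use utf8.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the UTF-8 byte values of a single code point.
sub utf8_bytes {
    my ($code_point) = @_;
    return map { ord } split //, encode('UTF-8', chr($code_point));
}

my @rows = (
    [ 'a with diaeresis (allowed)', 0xE4   ],
    [ 'currency sign (unwanted)',   0xA4   ],
    [ 'heart (unwanted)',           0x2764 ],
);

for my $row (@rows) {
    my ($name, $code_point) = @$row;
    my @bytes = utf8_bytes($code_point);
    print "$name: @bytes\n";
}
```

All three sequences end in byte 164, which is why that byte survives the substitution when the source is read as bytes.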

The solution is to tell Perl to treat the strings in your source code as UTF-8.

The encode_utf8 calls are there to avoid "Wide character" warnings when printing to the terminal. As always, you need to decode input and encode output according to the character encoding that the input source or output destination uses.
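
The decode-on-input, encode-on-output pattern can be sketched like this. It is an illustration under the assumption that both sides use UTF-8; clean_bytes is a hypothetical helper name, and the sample word is made up for the example.

```perl
use strict;
use warnings;
use utf8;                                   # for the literals in the regexp
use Encode qw(decode encode);

# Hypothetical helper: takes UTF-8 bytes, returns cleaned UTF-8 bytes.
sub clean_bytes {
    my ($input_bytes) = @_;
    my $text = decode('UTF-8', $input_bytes);   # bytes -> characters
    $text =~ s/[^a-zöäåA-ZÖÄÅ]//g;              # work on characters
    return encode('UTF-8', $text);              # characters -> bytes
}

# UTF-8 bytes for "smör¤gås": a Swedish word with a stray currency sign.
my $raw = "sm\xC3\xB6r\xC2\xA4g\xC3\xA5s";

# The result is already encoded, so it can go to STDOUT without a layer.
print clean_bytes($raw), "\n";
```

Keeping the regexp between the decode and encode steps means it only ever sees whole characters, never fragments of a multi-byte sequence.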

Hope this made it clearer.

#3


-7  

As pointed out by choroba, adding this at the beginning of the Perl script solves it:

use utf8;
binmode(STDOUT, ":utf8");

Here use utf8 makes Perl read the special characters in the regular expression correctly, and binmode(STDOUT, ":utf8") makes it output them correctly to the shell.
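
Put together with the code from the question, the fix might look like the sketch below. It assumes the script file itself is saved as UTF-8; the second word is a made-up example showing that the Swedish letters are kept.

```perl
use strict;
use warnings;
use utf8;                     # string and regexp literals below are UTF-8
binmode(STDOUT, ':utf8');     # encode characters as UTF-8 on output

my $word = 'cɞi¤r$c❤u¨s';
$word =~ s/[^a-zöäåA-ZÖÄÅ]//g;
print "$word\n";              # prints: circus

# Swedish letters survive, the stray symbols do not.
my $swedish = 'sm¤ör$gås';
$swedish =~ s/[^a-zöäåA-ZÖÄÅ]//g;
print "$swedish\n";
```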
