Using a regular expression to match any Chinese character in UTF-8 encoding

Time: 2022-02-04 00:18:42

For example, to match a string consisting of m to n Chinese characters, I could use:

[single Chinese character regular expression]{m,n}

Is there a regular expression for a single Chinese character that matches any Chinese character that exists?

4 solutions

#1


21  

The regex to match a Chinese (well, CJK) character is

\p{script=Han}

which can be abbreviated to simply

\p{Han}

This assumes that your regex compiler meets requirement RL1.2 Properties from UTS#18 Unicode Regular Expressions. Perl and Java 7 both meet that spec, but many others do not.
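
For the m-to-n case from the question, a minimal Java sketch might look like the following. It assumes Java 7 or later; the class name `HanRun` and the helper `hanRun` are made up for illustration, and the sample strings are written as Unicode escapes so the result does not depend on the source-file encoding.

```java
import java.util.regex.Pattern;

public class HanRun {
    // Builds a pattern for m to n consecutive characters of the Han script.
    static Pattern hanRun(int m, int n) {
        return Pattern.compile("\\p{script=Han}{" + m + "," + n + "}");
    }

    public static void main(String[] args) {
        Pattern p = hanRun(2, 4);
        // matches() requires the whole input to match the pattern.
        System.out.println(p.matcher("\u6C49\u5B57").matches());    // true
        System.out.println(p.matcher("\u6C49").matches());          // false (too short)
        System.out.println(p.matcher("\u6C49\u5B57abc").matches()); // false (non-Han tail)
    }
}
```

Because `matches()` anchors the pattern to the whole input, no explicit `^` and `$` are needed here.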

#2


6  

In Java,

\p{InCJK_UNIFIED_IDEOGRAPHS}{1,3}
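
Note that a Unicode block and a Unicode script are not the same set of characters. The Java sketch below (class and method names are illustrative, not from any library) compares the block property used above with the Han script property on three code points: U+4E2D from the basic block, U+3400 from Extension A, and U+20000 from Extension B.

```java
import java.util.regex.Pattern;

public class BlockVsScript {
    static final Pattern BLOCK = Pattern.compile("\\p{InCJK_UNIFIED_IDEOGRAPHS}");
    static final Pattern SCRIPT = Pattern.compile("\\p{IsHan}");

    // Converts a code point to a String (supplementary code points need two chars).
    static String cp(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    static boolean inBlock(int codePoint) {
        return BLOCK.matcher(cp(codePoint)).matches();
    }

    static boolean inScript(int codePoint) {
        return SCRIPT.matcher(cp(codePoint)).matches();
    }

    public static void main(String[] args) {
        int[] samples = {0x4E2D, 0x3400, 0x20000};
        for (int c : samples) {
            System.out.printf("U+%04X block=%b script=%b%n", c, inBlock(c), inScript(c));
        }
    }
}
```

Only U+4E2D is in the CJK Unified Ideographs block, while all three belong to the Han script, so the block-based pattern misses the extension ranges.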

#3


0  

Is there a regular expression for a single Chinese character that matches any Chinese character that exists?

Recommendation

To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++ that is backwards compatible with Flex. RE/flex supports Unicode and works with Bison to build lexers and parsers.

You can write Unicode patterns (and UTF-8 regular expressions) in RE/flex specifications such as:

%option flex unicode
%%
[肖晗]   { printf ("xiaohan/2\n"); }
%%

Use global %option unicode to enable Unicode. You can also use a local modifier (?u:) to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):

%option flex
%%
(?u:[肖晗])   { printf ("xiaohan/2\n"); }
(?u:\p{Han})  { printf ("Han character %s\n", yytext); }
.             { printf ("8-bit character %d\n", yytext[0]); }
%%

Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls are cleaner IMHO, and include wide char operations.

Background

In plain old Flex I ended up defining ugly UTF-8 patterns to capture ASCII letters and UTF-8 encoded letters for a compiler project that required support for Unicode identifiers id:

digit           [0-9]
alpha           ([a-zA-Z_\xA8\xAA\xAD\xAF\xB2\xB5\xB7\xB8\xB9\xBA\xBC\xBD\xBE]|[\xC0-\xFF][\x80-\xBF]*|\\u([0-9a-fA-F]{4}))
id              ({alpha})({alpha}|{digit})*            

The alpha pattern supports ASCII letters, underscore, and Unicode code points that are used in identifiers (\p{L} etc.). To keep the pattern manageable it permits more Unicode code points than strictly necessary, so it trades accuracy for compactness, and in some cases it accepts overlong UTF-8 sequences that are not valid UTF-8. If you are considering this approach, then be wary of these problems and of the safety concerns. It is better to use a Unicode-capable scanner generator instead, such as RE/flex.
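
To give an idea of what hand-written byte-level patterns involve, here is a rough Java sketch that turns each character of a string into its UTF-8 bytes written as \xHH escapes, joined with `|`. The class and method names are hypothetical, and the output is only meant to illustrate the shape of such a pattern, not to be a complete Flex rule.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Alternatives {
    // Returns a byte-level alternation such as \xE8\x82\x96|\xE6\x99\x97,
    // with one alternative per code point of the input string.
    static String toByteAlternation(String chars) {
        StringBuilder sb = new StringBuilder();
        chars.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append('|');
            }
            byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            for (byte b : utf8) {
                sb.append(String.format("\\x%02X", b & 0xFF));
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+8096 and U+6657 are the two characters used in the RE/flex examples.
        System.out.println(toByteAlternation("\u8096\u6657"));
        // prints \xE8\x82\x96|\xE6\x99\x97
    }
}
```

Each character becomes a three-byte sequence, which is why a plain bracket list cannot express it in a byte-oriented scanner.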

Safety

When using UTF-8 directly in Flex patterns, there are several concerns:

  1. Encoding your own UTF-8 patterns in Flex for matching any Unicode character is error-prone. Patterns should be restricted to characters in the valid Unicode range only. Valid Unicode characters cover the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF. The range U+D800 to U+DFFF is reserved for UTF-16 surrogates, which are not valid characters on their own. When using a tool to convert a Unicode range to UTF-8, make sure to exclude these invalid code points.

  2. Patterns should reject overlong and other invalid byte sequences. Invalid UTF-8 should not be silently accepted.

  3. Catching lexical input errors in your lexer requires a special . (dot) that matches both valid and invalid Unicode, including UTF-8 overruns and invalid byte sequences, so that you can produce an error message saying the input is rejected. If you use dot as a "catch-all-else" to produce an error message, but your dot does not match invalid Unicode, then your lexer will fail ("scanner is jammed") or will ECHO rubbish characters to the output via the Flex "default rule".

  4. Your scanner should recognize a UTF BOM (Unicode Byte Order Mark) in the input to switch to UTF-8, UTF-16 (LE or BE), or UTF-32 (LE or BE).

  5. As you point out, patterns such as [unicode characters] do not work at all with Flex, because each UTF-8 character in a bracket list is a multibyte sequence. Flex then matches any one of the individual bytes rather than the whole UTF-8 character.

See also invalid UTF encodings in the RE/flex user guide.
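
Points 1 and 2 can be checked outside a scanner as well. As a hedged illustration in Java (not part of Flex or RE/flex; the class and method names are made up), a strict decoder from the standard library can be used to reject malformed input before it reaches the lexer:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Returns true only if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] han = {(byte) 0xE6, (byte) 0xB1, (byte) 0x89};       // U+6C49
        byte[] overlong = {(byte) 0xC0, (byte) 0x80};               // overlong NUL
        byte[] surrogate = {(byte) 0xED, (byte) 0xA0, (byte) 0x80}; // encoded U+D800
        byte[] truncated = {(byte) 0xE6, (byte) 0xB1};              // last byte missing
        System.out.println(isValidUtf8(han));       // true
        System.out.println(isValidUtf8(overlong));  // false
        System.out.println(isValidUtf8(surrogate)); // false
        System.out.println(isValidUtf8(truncated)); // false
    }
}
```

The decoder is set to report errors instead of silently replacing bad bytes, which matches the requirement that invalid UTF-8 should not be silently accepted.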

#4


-2  

In Java 7 and up, the format should be: "\p{IsHan}"
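
As a small usage sketch of this form (the class name `HanRuns` and the sample text are only illustrative), the pattern can be combined with `Matcher.find()` to pull runs of Han characters out of mixed text:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HanRuns {
    static final Pattern HAN_RUN = Pattern.compile("\\p{IsHan}+");

    // Collects every maximal run of Han characters found in the text.
    static List<String> hanRuns(String text) {
        List<String> runs = new ArrayList<>();
        Matcher m = HAN_RUN.matcher(text);
        while (m.find()) {
            runs.add(m.group());
        }
        return runs;
    }

    public static void main(String[] args) {
        // Two Han runs of two characters each, separated by ASCII text.
        List<String> runs = hanRuns("abc\u6C49\u5B57 123 \u4E2D\u6587!");
        System.out.println(runs.size()); // 2
    }
}
```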
