What does \x mean in PHP PCRE?

From the manual:

After \x, up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, \x{...} is allowed, where the contents of the braces is a string of hexadecimal digits. It is interpreted as a UTF-8 character whose code number is the given hexadecimal number. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.

So what does this mean?

The code point of "ä" is E4 while the UTF-8 representation is C3A4, but neither of those matches:

$t = 'ä'; // same as "\xC3\xA4";

preg_match('/\\xC3A4/u', $t); // doesn't match
preg_match('/\\x00E4/u', $t); // doesn't match

With the curly braces it does match when I give the code point:

preg_match('/\\x{00E4}/u', $t); // matches

1 Answer

#1

The syntax is a way to specify a character by value:

\xAB specifies a code-point in the range 0-FF.

\x{ABCD} specifies a code-point with any number of hex digits, so it also covers values above FF.
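A minimal sketch of the two forms side by side. It assumes PHP 7.0 or later (for the "\u{...}" string escape, used here so the source file's encoding does not matter); the variable names are illustrative only.

```php
<?php
// Subject: U+00E4 ("ä"), i.e. the UTF-8 bytes C3 A4.
$subject = "\u{E4}";

// Two-digit form: in UTF-8 mode (/u), \xE4 means code point U+00E4.
$short = preg_match('/^\xE4$/u', $subject);

// Braced form: the same code point, written with leading zeros.
$braced = preg_match('/^\x{00E4}$/u', $subject);

// Braced form with a code point that does not fit in two hex digits.
$euro = preg_match('/^\x{20AC}$/u', "\u{20AC}"); // U+20AC, the euro sign

var_dump($short, $braced, $euro); // int(1) for all three
```

Note that the two-digit form works for "ä" only because its code point (E4) fits in two hex digits; the pattern refers to the code point, not to the UTF-8 bytes.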

The indicated wording from the manual is a bit confusing, perhaps in an attempt to be precise. Character values 128-255 (and beyond, up to 7FF) are encoded as 2 bytes in UTF-8. Thus, a Unicode regular expression will match 7-bit clean ASCII but will not match other encodings/codepages (e.g. CP437) that use values in that range. The manual is, in a roundabout way, saying that a Unicode regular expression is only suitable for correctly encoded input. However:
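To illustrate the point about incorrectly encoded input, here is a small sketch that feeds a single ISO-8859-1 byte to the same pattern with and without the /u modifier. The variable names are made up for the example.

```php
<?php
// "ä" as a single ISO-8859-1 byte; on its own this is not valid UTF-8.
$latin1 = "\xE4";

// Without /u, pattern and subject are treated as bytes, so \xE4 matches byte E4.
$byteMode = preg_match('/\xE4/', $latin1);

// With /u, the subject must be valid UTF-8; otherwise preg_match() returns false.
$utf8Mode = preg_match('/\xE4/u', $latin1);
$error = preg_last_error();

var_dump($byteMode);                      // int(1)
var_dump($utf8Mode);                      // bool(false)
var_dump($error === PREG_BAD_UTF8_ERROR); // bool(true)
```

A return value of false (as opposed to 0) signals an error rather than "no match", so it is worth checking preg_last_error() when a /u pattern unexpectedly fails.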

It doesn't mean that \xABCD is parsed as \x{ABCD} (one character). It is parsed as \xAB (one character) and then CD (two characters)¹. The braces address this parsing ambiguity issue:

After \x, up to two hexadecimal digits are read .. In UTF-8 mode, \x{...} is allowed ..
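The difference between the braced and unbraced spellings can be sketched as follows, again assuming PHP 7.0+ for the "\u{...}" string escape. The subjects are chosen for illustration only.

```php
<?php
// Without braces: \xC3 (U+00C3, "Ã") followed by the literal characters "A4".
$twoDigits = preg_match('/^\xC3A4$/u', "\u{C3}A4");

// With braces: the single code point U+C3A4.
$braces = preg_match('/^\x{C3A4}$/u', "\u{C3A4}");

// The unbraced pattern does not match the single code point U+C3A4.
$mismatch = preg_match('/^\xC3A4$/u', "\u{C3A4}");

var_dump($twoDigits, $braces, $mismatch); // int(1), int(1), int(0)
```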

Some other languages use \u instead of \x for the longer form.

¹ Consider that this matches:

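preg_match('/\xC3A4/u', "\xC3\x83" . "A4"); // matches: U+00C3 (UTF-8 bytes C3 83) followed by the text "A4"; a lone "\xC3" byte would be invalid UTF-8 under /u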
