From the manual:
After \x, up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, \x{...} is allowed, where the contents of the braces is a string of hexadecimal digits. It is interpreted as a UTF-8 character whose code number is the given hexadecimal number. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.
So what does this mean?
The code point of "ä" is E4 while the UTF-8 representation is C3A4, but neither of those matches:
$t = 'ä'; // same as "\xC3\xA4";
preg_match('/\\xC3A4/u', $t); // doesn't match
preg_match('/\\x00E4/u', $t); // doesn't match
With the curly braces it does match when I give the code point:
preg_match('/\\x{00E4}/u', $t); // matches
1 answer
The syntax is a way to specify a character by value:
- \xAB specifies a code point in the range 0-FF.
- \x{ABCD} specifies a code point in the range 0-FFFF; the braces also accept more hex digits for code points above FFFF.
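As a rough illustration of the short form, the sketch below compares \xE4 in UTF-8 mode and in byte mode. The variable names ($utf8, $latin1, $results) are made up for this example, and it assumes a PHP build with the usual PCRE extension.

```php
<?php
// "ä" is U+00E4; UTF-8 stores it as the two bytes C3 A4.
$utf8   = "\xC3\xA4"; // UTF-8 encoded "ä"
$latin1 = "\xE4";     // the same letter as a single ISO-8859-1 byte

$results = [
    // Hex dump of the UTF-8 string.
    'utf8_bytes'       => bin2hex($utf8),
    // With /u, \xE4 names the code point U+00E4, so it matches "ä".
    'u_mode_codepoint' => preg_match('/\xE4/u', $utf8),
    // Without /u, the pattern works on raw bytes: C3 then A4.
    'byte_mode_bytes'  => preg_match('/\xC3\xA4/', $utf8),
    // Without /u, a single E4 byte matches the Latin-1 string.
    'byte_mode_latin1' => preg_match('/\xE4/', $latin1),
    // With /u, the Latin-1 string is not valid UTF-8, so matching fails.
    'u_mode_latin1'    => preg_match('/\xE4/u', $latin1),
];
$lastError = preg_last_error();

var_dump($results, $lastError === PREG_BAD_UTF8_ERROR);
```

Note that the question's second attempt, \x00E4, is read as \x00 followed by the literal text E4, which is why it fails while \xE4 would not.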
The wording quoted from the manual is a bit confusing, perhaps in an attempt to be precise. Character values 128-255 (and others, up to 7FF) are encoded as two bytes in UTF-8. Thus, a Unicode regular expression will match 7-bit clean ASCII but will not match other encodings/code pages (e.g. CP437) that use values in that range. The manual is, in a roundabout way, saying that a Unicode regular expression is only suitable for correctly encoded input. However:
It doesn't mean that \xABCD is parsed as \x{ABCD} (one character). It is parsed as \xAB (one character) followed by CD (two characters) [1]. The braces address this parsing ambiguity:
After \x, up to two hexadecimal digits are read .. In UTF-8 mode, \x{...} is allowed ..
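A small sketch of that parsing rule, using the euro sign (U+20AC) as a four-digit code point. The variable names are hypothetical, and the example assumes PHP with the standard PCRE extension.

```php
<?php
$euro = "\xE2\x82\xAC"; // UTF-8 encoding of U+20AC

// Braced form: all four hex digits belong to one code point.
$braced = preg_match('/^\x{20AC}$/u', $euro);

// Unbraced form: only "20" is consumed by \x, so the pattern means
// a space (U+0020) followed by the literal letters "AC".
$unbraced = preg_match('/^\x20AC$/u', $euro);
$literal  = preg_match('/^\x20AC$/u', ' AC');

var_dump($braced, $unbraced, $literal);
```

Only the braced pattern matches the euro sign; the unbraced one matches the three-character string " AC" instead.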
Some other languages use \u instead of \x for the longer form.
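PHP itself has a \u{...} escape, but only in double-quoted string literals (PHP 7.0 and later), not in the regex syntax. The sketch below, with made-up variable names, shows that such a literal produces the UTF-8 bytes that the braced \x form matches.

```php
<?php
// "\u{E4}" is expanded by PHP (7.0+) into the UTF-8 bytes for U+00E4.
$fromEscape = "\u{E4}";

// The string literal escape yields the same bytes as writing them by hand.
$sameBytes = ($fromEscape === "\xC3\xA4");

// In the pattern, the code point is still written with \x{...}.
$matched = preg_match('/^\x{E4}$/u', $fromEscape);

var_dump($sameBytes, $matched);
```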
[1] Consider that this matches:
preg_match('/\xC3A4/u', "\xC3\x83" . "A4"); // "\xC3\x83" is U+00C3 in UTF-8; a bare "\xC3" is not valid UTF-8, so /u would fail on it