Why does \w match non-English characters in a Mac OS X PHP environment?

I found that "\w" matches Chinese characters in my Mac OS X PHP environment, but the same code does not work on Linux.

php -r "echo preg_match('/^\w+$/','人1234', \$m).chr(10); var_dump(\$m);"

Result on Mac OS X 11.11.3 with PHP 5.6.18 (cli) and PHP 5.4.45 (cli):

1
array(1) {
  [0] =>
  string(7) "人1234"
}

Result on CentOS 6 with PHP 5.6.18 (cli) and PHP 5.2.17p1 (cli):

0
array(0) {
}

The PHP manual says:

The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place. For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

I guess this has something to do with the PCRE library. Could anybody explain why?

1 solution

#1

Yes, it is all about how PCRE is compiled alongside PHP:

pcre *pcre_compile(const char *pattern,
      int options,
      const char **errptr,
      int *erroffset,
      const unsigned char *tableptr);

This function is responsible for compiling regular expressions into their internal form. Its options argument is a bit field that can include PCRE_UCP (UCP = Unicode Character Properties), which makes \w, \d and similar tokens use Unicode properties. It seems that PHP's PCRE on your Mac OS X machine is compiled with this flag on.

There is also a special modifier, (*UCP), which you can place at the start of a pattern: even if your PCRE is not compiled with the PCRE_UCP flag set, you can enable the same option at runtime.

For example, /(*UCP)\w+/ matches Unicode characters as well.
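
As a rough illustration, the PHP sketch below runs the question's subject string against the plain pattern and two Unicode-aware variants. It assumes a PHP build whose PCRE has Unicode property support, and the helper name matches_word is made up for this example. The u modifier is not mentioned above; it is PHP's usual switch for UTF-8 mode, and in such builds it should also make \w Unicode-aware.

```php
<?php
// Hypothetical helper: true when preg_match() reports exactly one match.
function matches_word($pattern, $subject)
{
    return preg_match($pattern, $subject) === 1;
}

$subject = '人1234';

// Plain \w: the outcome depends on the character tables and locale in use,
// which is the difference observed between the two machines in the question.
var_dump(matches_word('/^\w+$/', $subject));

// (*UCP) has to be the very first item in the pattern.
var_dump(matches_word('/(*UCP)^\w+$/', $subject));

// u modifier: pattern and subject are treated as UTF-8.
var_dump(matches_word('/^\w+$/u', $subject));
```

Only the last variant treats the subject as UTF-8 characters rather than single bytes, so it is probably the safest choice when the input is known to be UTF-8.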

From the PCRE website:

PCRE handles caseless matching, and determines whether characters are letters, digits, or whatever, by reference to a set of tables, indexed by character code point.

When running in UTF-8 mode, or in the 16- or 32-bit libraries, this applies only to characters with code points less than 256. By default, higher-valued code points never match escapes such as \w or \d.

However, if PCRE is built with Unicode property support, all characters can be tested with \p and \P, or, alternatively, the PCRE_UCP option can be set when a pattern is compiled; this causes \w and friends to use Unicode property support instead of the built-in tables.

The use of locales with Unicode is discouraged. If you are handling characters with code points greater than 128, you should either use Unicode support, or use locales, but not try to mix the two.
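
To avoid depending on what \w means on a given machine, the \p route mentioned in the quotation can be spelled out explicitly. This is a minimal sketch, assuming PCRE was built with Unicode property support and that the input is valid UTF-8; the function name is_unicode_word is hypothetical.

```php
<?php
// Hypothetical helper: accepts only Unicode letters (\p{L}),
// Unicode numbers (\p{N}) and the underscore.
function is_unicode_word($subject)
{
    // The u modifier makes PCRE read pattern and subject as UTF-8.
    // preg_match() returns false on invalid UTF-8, which also fails the === 1 check.
    return preg_match('/^[\p{L}\p{N}_]+$/u', $subject) === 1;
}

var_dump(is_unicode_word('人1234'));  // bool(true)
var_dump(is_unicode_word('abc-def')); // bool(false), "-" is not a letter or number
```

Because the character class names the Unicode properties directly, the result should not change with the locale or with the default character tables.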
