Splitting a string into an array based on Unicode character ranges in PHP

Date: 2022-12-22 21:41:02

Sorry for the ambiguous subject. What I'm looking for is a way to split a string with Cyrillic characters that may go like

«Добрый день!» - сказал он, потянувшись…

into an array that goes like

[0] => «
[1] => Добрый␠
[2] => день!»␠-␠
[3] => сказал␠
[4] => он,␠
[5] => потянувшись…

So essentially I'm looking for a break to occur at the border between any character and a Cyrillic character (the [а-я] range), but only when we go from a non-Cyrillic character to a Cyrillic one, not vice versa. I've seen examples that successfully solve this for punctuation characters and the Latin alphabet with

preg_split('/([^.:!?]+[.:!?]+)/', 'hello:there.everyone!so.how?are:you', NULL, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );

but my attempts to repurpose it into something different have so far failed:

preg_split ('/(?<=[^а-я])/ius', $text, NULL, PREG_SPLIT_NO_EMPTY);

It almost works, but it also splits on regular characters such as spaces and punctuation marks, which is not what I want. Clearly there's something wrong with my regex. How should I modify it to get the result shown in the example above?
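
To make the problem concrete, here is a small sketch that runs the attempt above against the sample string. The variable names are only illustrative, and -1 is passed as the limit, which means "no limit".

```php
<?php
// Sketch: why a lookbehind on its own splits too often.
$text = '«Добрый день!» - сказал он, потянувшись…';

// Pattern from the attempt above: split after every non-Cyrillic character.
$pieces = preg_split('/(?<=[^а-я])/ius', $text, -1, PREG_SPLIT_NO_EMPTY);

// Nothing in the pattern looks at the character that follows the split point,
// so "!", "»", "-" and the spaces between them are cut apart as well.
print_r($pieces);
```

The result has noticeably more than the six desired elements, because every space and punctuation mark becomes a split point regardless of what comes next.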

4 solutions

#1


1  

You also have to check, with a lookahead, that the next character is a Cyrillic one. This code will do the job:

$t = preg_split('/(?<=[^а-я])(?=[а-я]+)/ius', $text, -1, PREG_SPLIT_NO_EMPTY);

It gives this output:

Array
(
    [0] => «
    [1] => Добрый 
    [2] => день!» - 
    [3] => сказал 
    [4] => он, 
    [5] => потянувшись…
)

Here you can try it.
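
One caveat worth checking, sketched below under the assumption that the input may contain the letter "ё": it sits outside the а-я range (U+0451, while а-я covers U+0430 to U+044F), so the pattern above treats it as a non-Cyrillic character. The sample string and variable names are made up for illustration.

```php
<?php
// Assumption: the input may contain "ё"/"Ё", which are outside the а-я range.
$text = 'зелёная ёлка';

// Pattern from this answer: "ё" counts as non-Cyrillic, so words get cut after it.
$narrow = preg_split('/(?<=[^а-я])(?=[а-я]+)/ius', $text, -1, PREG_SPLIT_NO_EMPTY);

// Variant with "ё" added to both classes (the /i flag also covers "Ё").
$wide = preg_split('/(?<=[^а-яё])(?=[а-яё])/iu', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($narrow); // "зелё", "ная ё", "лка"
print_r($wide);   // "зелёная ", "ёлка"
```

If the text can contain other Cyrillic letters outside the basic Russian alphabet, a script property such as \p{Cyrillic} is probably the more robust choice.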

#2


2  

Use the following regex solution:

$s = "«Добрый день!» - сказал он, потянувшись…";
$res = preg_split('/\b(\p{Cyrillic}+\W*)/u', $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($res);
// Array(
//   [0] => «
//   [1] => Добрый 
//   [2] => день!» - 
//   [3] => сказал 
//   [4] => он, 
//   [5] => потянувшись…
//)

See the PHP demo

Details:

  • \b(\p{Cyrillic}+\W*) - matches and captures a whole Cyrillic word together with zero or more non-word characters after it
  • The pattern is wrapped in capturing parentheses, so PREG_SPLIT_DELIM_CAPTURE pushes the captured values into the resulting array
  • PREG_SPLIT_NO_EMPTY discards empty values from the array
  • The /u modifier makes \b (word boundary) and \W Unicode-aware and allows the regex to process Unicode strings.
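
The role of the two flags described in the bullet points above is easier to see side by side. This is just a sketch that reuses the pattern and sample string from this answer; the variable names $plain and $captured are mine.

```php
<?php
$s = '«Добрый день!» - сказал он, потянувшись…';
$pattern = '/\b(\p{Cyrillic}+\W*)/u';

// No flags: the matched words are treated as delimiters and dropped,
// leaving "«" followed by empty strings.
$plain = preg_split($pattern, $s);

// DELIM_CAPTURE alone keeps the captured words, but the empty strings remain.
$captured = preg_split($pattern, $s, -1, PREG_SPLIT_DELIM_CAPTURE);

// Both flags together give the six pieces shown in the answer.
$res = preg_split($pattern, $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

printf("%d / %d / %d\n", count($plain), count($captured), count($res));
```

In other words, the regex here does the matching rather than describing a zero-width boundary, and the flags turn the matches back into output.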

#3


2  

How about splitting at an initial \b word boundary with the u modifier?

$res = preg_split('/\b(?=\w)(?!^)/u', $str);

The lookahead ensures that \b is followed by a word character. (?!^) prevents an empty match at the start of the string.

See this demo at eval.in
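
Note that \w is not limited to Cyrillic, so this splits before every word. A rough comparison on a hypothetical mixed-script string (the input and variable names are made up for illustration):

```php
<?php
// Hypothetical mixed input: Cyrillic, Latin and digits.
$str = 'Привет, world 42';

// Pattern from this answer: splits before every word, whatever its script.
$anyWord = preg_split('/\b(?=\w)(?!^)/u', $str);

// Cyrillic-only boundary (pattern from answer #1) for comparison.
$cyrillicOnly = preg_split('/(?<=[^а-я])(?=[а-я]+)/ius', $str, -1, PREG_SPLIT_NO_EMPTY);

print_r($anyWord);      // "Привет, ", "world ", "42"
print_r($cyrillicOnly); // the whole string as a single element
```

For purely Cyrillic text such as the sample in the question, both approaches should give the same pieces.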

#4


0  

Try this regex: [\x{0400}-\x{04FF}]*[^\x{0400}-\x{04FF}]* (use it with the u modifier so that the \x{...} code points are accepted). All Unicode characters from U+0400 to U+04FF are considered Cyrillic. It should match exactly what you want. You can also replace \x{0400}-\x{04FF} with \p{Cyrillic}, as suggested in another answer.

These are all the characters in that range:
ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ
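
The answer gives only the pattern, so the following is a guess at how it might be used: as a matching pattern with preg_match_all rather than with preg_split. The filtering step and the variable names are my own additions.

```php
<?php
$s = '«Добрый день!» - сказал он, потянувшись…';

// The u modifier is needed for \x{0400}-style code points.
preg_match_all('/[\x{0400}-\x{04FF}]*[^\x{0400}-\x{04FF}]*/u', $s, $m);

// Both parts of the pattern are optional, so an empty match can appear
// (typically at the end of the string); drop those.
$parts = array_values(array_filter($m[0], static function (string $p): bool {
    return $p !== '';
}));

print_r($parts);
```

Each match is a run of Cyrillic characters followed by a run of non-Cyrillic ones, which is why the leading "«" ends up as an element of its own.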
