Regular expressions - splitting a string into an array based on punctuation/spaces

Time: 2022-07-03 21:38:11

I need a way to split a string into several different parts based on the presence of punctuation marks or spaces.

What I mean by this is that every word should be split into its own array element. Furthermore, punctuation at the start or at the end of a word should also be put into its own array element.

E.g. I need to be able to turn the string "Hello, Harry Potter. I'm Tom Riddle." into

array(
    "Hello",
    ", ",
    "Harry",
    "Potter",
    ". ",
    "I'm",
    "Tom",
    "Riddle",
    ". "
)

So punctuation in the middle of words (e.g. an apostrophe mid-word) should not cause a separation. Edit: to clarify the desired behaviour, I'm, didn't, etc. should remain one word, but hello!, "okay, etc. should be separated from the punctuation mark at the start or end.

Also, the punctuation marks which I want to be included in the search are:

  • . (full stop/period)
  • ? (question mark)
  • ! (exclamation mark)
  • , (comma)
  • ; (semicolon)
  • : (colon)
  • - (hyphen/dash)
  • ( (opening bracket)
  • ) (closing bracket)
  • { (opening curly brace)
  • } (closing curly brace)
  • [ (opening square bracket)
  • ] (closing square bracket)
  • ' (single quotation mark)
  • " (double quotation mark)
  • … (ellipsis)

The closest I have found to the result I need is this:

preg_split('/(\s|[\.,\/])/', $string, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

However, the problems with this are:

  • Punctuation mid-word counts as normal punctuation.

  • The array element containing the punctuation mark does not contain the adjacent space as well. Edit: Sorry for the vagueness; by this, I meant that I want each punctuation element to include the space that comes after or before the punctuation mark. E.g. a comma would be ", " (space after), but an opening bracket would be " (" (space before).

  • When I add the rest of the punctuation marks I need (preg_split("/(\s|[\.?!,;:-(){}[]'\"…\/])/",) I get an error. I'm pretty sure that this error is due to an unescaped character, so I ran the whole thing through preg_quote, which returned \.\?\!,;\:\-\(\)\{\}\[\]'"…, but this still gives the error: Parse error: syntax error, unexpected '…' (T_STRING), expecting ',' or ')' in [...][...] on line 5

My understanding of regex is fairly limited, but after looking at the PHP docs I gather that the code above splits the string at each whitespace character it encounters, or every time it encounters a full stop, comma, or slash. (Correct me if I'm wrong there?) As I understood it, adding the rest of the characters within the square brackets would make it split the string at any of those characters as well(?). Since this isn't working, I suppose I have some fundamental misunderstanding about how this works, so an explanation would be greatly appreciated.

3 solutions

#1


1  

Do you really want all word-internal punctuation to stay attached? Also it looks like you want to tokenize each punctuation character separately (but attach nearby whitespace), which is most of the work. If you really do, this should do it. Comes with a test string to show it at work.

$string = "Hello, it's me-me-it's-me!!! o... (a friend?)";
print_r( preg_split("/(\w\S+\w)|(\w+)|(\s*\.{3}\s*)|(\s*[^\w\s]\s*)|\s+/", $string, 
        -1, PREG_SPLIT_NO_EMPTY|PREG_SPLIT_DELIM_CAPTURE) );

Output:

Array
(
    [0] => Hello
    [1] => ,
    [2] => it's
    [3] => me-me-it's-me
    [4] => !
    [5] => !
    [6] => !
    [7] => o
    [8] => ... 
    [9] => (
    [10] => a
    [11] => friend
    [12] => ?
    [13] => )
)

This is how it works:

  1. (\w\S+\w) Capture any word of 3+ characters, allowing embedded non-word characters.

  2. (\w+) Capture any remaining word (to catch short words).

  3. (\s*\.{3}\s*) Capture an ellipsis ..., together with any surrounding space.

  4. (\s*[^\w\s]\s*) Capture any other non-word, non-space character individually, but attach any nearby spaces.

  5. \s+ Any other spaces (i.e., between words) split the string, but are not captured.

If you want to be selective about what can be inside a word, replace the \S+ in the first alternative with a list of what you want to allow, e.g., [\w'-]+ to allow apostrophes and hyphens only.
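
To make the difference concrete, here is a minimal sketch that runs the same pattern twice, once with \S+ and once with [\w'-]+ in the first alternative. The tokenize() function, its $inner parameter and the sample sentence are made up for this illustration; only the pattern itself comes from the answer.

```php
<?php
// Sketch only: tokenize() and $inner are hypothetical names for this example.
function tokenize(string $text, string $inner): array
{
    // $inner is the sub-pattern allowed between the first and last word character.
    $pattern = '/(\w' . $inner . '\w)|(\w+)|(\s*\.{3}\s*)|(\s*[^\w\s]\s*)|\s+/';
    $parts = preg_split($pattern, $text, -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
    // preg_split() returns false on failure; fall back to an empty array.
    return $parts === false ? [] : $parts;
}

$text = "It's a well-known site: example.com!";

// Permissive: any run of non-space characters may sit inside a word.
print_r(tokenize($text, '\S+'));

// Restrictive: only word characters, apostrophes and hyphens may sit inside a word.
print_r(tokenize($text, "[\\w'-]+"));
```

With \S+ the token example.com stays in one piece; with [\w'-]+ it is split into example, . and com, while It's and well-known stay whole in both runs.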

#2


4  

This will do it; however, the output is slightly different because you included ' as a character to split on, so "I'm" will be split:

$result = preg_split('/(\.\.\.\s?|[-.?!,;:(){}\[\]\'"]\s?)|\s/',
                     $string, -1, PREG_SPLIT_DELIM_CAPTURE|PREG_SPLIT_NO_EMPTY);

It might be simplified, but I just included the ellipses ... with an optional space OR all your other characters with an optional space OR a space.
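
To see what this pattern produces, here is a minimal sketch that runs it on the sample sentence from the question. The variable names are illustrative, and -1 is passed as the limit, which means "no limit".

```php
<?php
$string = "Hello, Harry Potter. I'm Tom Riddle.";

// Group 1 captures an ellipsis or one punctuation mark, each with an optional
// trailing whitespace character; a bare whitespace character splits the string
// without being kept.
$pattern = '/(\.\.\.\s?|[-.?!,;:(){}\[\]\'"]\s?)|\s/';

$result = preg_split($pattern, $string, -1,
                     PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

print_r($result);
```

On this input the tokens are Hello, ", ", Harry, Potter, ". ", I, ', m, Tom, Riddle and a final "." with no space. I'm is split at the apostrophe, and only a space after a mark is attached, never one before it.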

You need to escape the dots . outside of the character class [], escape the [ and ] inside the character class and - needs to be escaped or come first or last so as not to denote a range. Obviously you need to escape the quote that you use to contain the pattern, in this case the single '.
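
These escaping rules can be checked directly. The following is a minimal sketch; compiles() is a hypothetical helper written for this example, not a PHP function.

```php
<?php
// Hypothetical helper for this sketch: true when PCRE accepts the pattern.
function compiles(string $pattern): bool
{
    // @ suppresses the compilation warning; preg_match() returns false on a bad pattern.
    return @preg_match($pattern, '') !== false;
}

// ":-(" is read as a range from ":" to "(", which is out of order, so this fails.
var_dump(compiles('/[;:-(]/'));

// With the hyphen first it is a literal character and the class compiles.
var_dump(compiles('/[-;:(]/'));

// The full class: hyphen first, [ and ] escaped, and \' because the PHP string
// is single-quoted.
$class = '[-.?!,;:(){}\[\]\'"]';

// Every ASCII mark from the question's list should match the class.
$marks = str_split('.?!,;:-(){}[]\'"');
foreach ($marks as $mark) {
    echo $mark, ' => ', preg_match('/^' . $class . '$/', $mark), "\n";
}
```

The first check fails and the second succeeds, which is the range problem with - described above. The Unicode ellipsis … is left out of this sketch on purpose.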

You didn't specify whether a space is required on either side of the punctuation and it isn't clear if this "Punctuation mid-word counts as normal punctuation" means it should or shouldn't count.

#3


0  

In general you could use the pattern

word character+[all your punctuation characters here]+word character(*SKIP)(*FAIL)

For example:

\w[\[\].?\"\']\w(*SKIP)(*FAIL)|[\[\].?\"\']

See a demo on regex101.com.
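
As a rough sketch of how this pattern could be used with preg_split: the first alternative matches a punctuation mark sandwiched between two word characters and then discards it with (*SKIP)(*FAIL), so only the remaining marks act as delimiters. The character lists, the added capture group, the optional trailing \s and the sample sentence are assumptions made for this example, not part of the answer.

```php
<?php
// Nowdoc, so quotes and backslashes in the pattern need no PHP-level escaping.
$pattern = <<<'REGEX'
~\w[-.?!,;:'"]\w(*SKIP)(*FAIL)|([-.?!,;:(){}\[\]'"]\s?)|\s~
REGEX;

$string = <<<'TEXT'
Hello, "okay" he said. I'm Tom (didn't you know?)
TEXT;

// Captured delimiters (punctuation plus an optional trailing space) are kept;
// plain whitespace splits the string and is dropped.
$tokens = preg_split($pattern, $string, -1,
                     PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

print_r($tokens);
```

Here I'm and didn't stay whole, while the quotes, brackets, comma, full stop and question mark become separate elements. One caveat with this shape: the first alternative consumes the word characters on both sides, so in a run such as it's-me the s is already used up by 's and the hyphen is then treated as a delimiter.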
