Collapse and Capture a Repeating Pattern in a Single Regex Expression

Time: 2022-05-18 11:40:33

I keep bumping into situations where I need to capture a number of tokens from a string and after countless tries I couldn't find a way to simplify the process.

So let's say the text is:

start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end

This example has 9 items inside, but say it could have between 3 and 10 items.

Ideally, I'd like something like this:

start:(?:(\w+)-?){3,10}:end

It is nice and clean, but it only captures the last match.
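
The "last match only" behavior is easy to reproduce. The following is a minimal sketch in Python (chosen only for illustration; the question itself is about PCRE), using the standard re module, which also keeps just the final value of a repeated group. The variable names are made up for this example.

```python
import re

text = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"

# The group (\w+) is repeated by {3,10}, but the match object has only
# one slot for it, so each repetition overwrites the previous capture.
m = re.match(r"start:(?:(\w+)-?){3,10}:end", text)

last_item = m.group(1)
print(last_item)   # something
print(m.groups())  # ('something',)
```

The whole string is validated, yet only the final token survives in group 1, which is exactly the limitation described above.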

I usually use something like this in simple situations:

start:(\w+)-(\w+)-(\w+)-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?-?(\w+)?:end

That is 3 mandatory groups and another 7 optional ones because of the limit of 10, but this doesn't look 'nice', and it would be a pain to write and maintain if the limit were 100 and the items were more complex.

And the best I could do so far:

start:(\w+)-((?1))-((?1))-?((?1))?-?((?1))?-?((?1))?-?((?1))?-?((?1))?:end

This is shorter, especially if the items are complex, but it is still long.

Has anyone managed to make this work as a single regex-only solution, without programming?

I'm mostly interested in how this can be done in PCRE, but other flavors would be OK too.

Update:

The purpose is to validate a match and capture the individual tokens inside match 0 by regex alone, without any OS, software, or programming-language limitation.

Update 2 (bounty):

With @nhahtdh's help I got to the RegExp below by using \G:

(?:start:(?=(?:[\w]+(?:-|(?=:end))){3,10}:end)|(?!^)\G-)([\w]+)

This is even shorter, and it can be written without repeating code.

I'm also interested in the ECMA flavor and, as it doesn't support \G, I'm wondering if there's another way, especially without using the /g modifier.

5 solutions

#1


34  

Read this first!

This post is meant to show what is possible rather than to endorse the "everything regex" approach to problems. The author wrote 3-4 variations, each with subtle bugs that were tricky to detect, before reaching the current solution.

For your specific example, there are other, better solutions that are more maintainable, such as matching the whole run and then splitting the match along the delimiters.

This post deals with your specific example. I really doubt a full generalization is possible, but the idea behind it is reusable for similar cases.

Summary

  • .NET supports capturing a repeating pattern with the CaptureCollection class.
  • For languages that support \G and look-behind, we may be able to construct a regex that works with a global matching function. It is not easy to write it completely correctly, and it is easy to write a subtly buggy regex.
  • For languages without \G and look-behind support: it is possible to emulate \G with ^, by chomping the input string after each single match. (Not covered in this answer.)
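
The last bullet can be sketched in Python, whose built-in re module has no \G. This is only an illustration under stated assumptions, not part of the original answer: instead of physically chomping the string and using ^, it passes a start position to Pattern.match, which anchors the match at that position and so plays the role of \G. The names HEAD, NEXT and capture_items are hypothetical.

```python
import re

# First branch of the regex: prefix, look-ahead that checks 3..10 items
# and the :end suffix, then the first captured item.
HEAD = re.compile(r"start:(?=\w+(?:-\w+){2,9}:end)(\w+)")
# Second branch: a delimiter followed by the next item.
NEXT = re.compile(r"-(\w+)")


def capture_items(text):
    """Return the items of the first valid start:...:end run, or []."""
    head = HEAD.search(text)
    if head is None:
        return []
    items = [head.group(1)]
    pos = head.end()
    while True:
        # match(text, pos) only succeeds exactly at pos, which emulates
        # "continue where the previous match ended" (the job of \G).
        m = NEXT.match(text, pos)
        if m is None:
            break
        items.append(m.group(1))
        pos = m.end()
    return items


print(capture_items("xx start:a-b-c:end yy"))  # ['a', 'b', 'c']
print(capture_items("start:a-b:end"))          # []
```

The loop is of course "programming", so it does not meet the regex-only requirement of the question; it only shows what \G contributes in the engines that have it.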

Solution

This solution assumes the regex engine supports the \G match boundary, look-ahead (?=pattern), and look-behind (?<=pattern). The Java, Perl, PCRE, .NET, and Ruby regex flavors support all of these features.

However, you can go with your own regex in .NET, since .NET supports capturing all instances matched by a repeated capturing group, via the CaptureCollection class.

For your case, it can be done in one regex, with the use of \G match boundary, and look-ahead to constrain the number of repetitions:

(?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end)

The construction is \w+- repeated, then \w+:end.

(?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)

The construction is \w+ for the first item, then -\w+ repeated. (Thanks to ka ᵠ for the suggestion.) The correctness of this construction is simpler to reason about, since there are fewer alternations.

The \G match boundary is especially useful when you need to do tokenization, where you need to make sure the engine does not skip ahead and match stuff that should have been invalid.

Explanation

Let us break down the regex:

(?:
  start:(?=\w+(?:-\w+){2,9}:end)
    |
  (?<=-)\G
)
(\w+)
(?:-|:end)

The easiest part to recognize is (\w+) in the line before last, which is the word that you want to capture.

The last line is also quite easy to recognize: the word to be matched may be followed by - or :end.

I allow the regex to freely start matching anywhere in the string. In other words, start:...:end can appear anywhere in the string, and any number of times; the regex will simply match all the words. You only need to process the returned array to work out which start:...:end run each matched token came from.

As for the first group, its first branch checks for the presence of the string start:, and the look-ahead that follows checks that the number of words is within the specified limit and that the run ends with :end. Otherwise, the second branch checks that the character just before the current position is a -, and continues from the end of the previous match.

For the other construction:

(?:
  start:(?=\w+(?:-\w+){2,9}:end)
    |
  (?!^)\G-
)
(\w+)

Everything is almost the same, except that we match start:\w+ first, before matching the repetitions of the form -\w+. This is in contrast to the first construction, where we match start:\w+- first, and then the repeated instances of \w+- (or \w+:end for the last repetition).

It is quite tricky to make this regex work for matching in the middle of the string:

  • We need to check the number of words between start: and :end (as part of the requirement of the original regex).

  • \G also matches the beginning of the string! (?!^) is needed to prevent this behavior. Without it, the regex may produce a match when there isn't any start:.

    For the first construction, the look-behind (?<=-) already prevents this case ((?!^) is implied by (?<=-)).

  • For the first construction (?:start:(?=\w+(?:-\w+){2,9}:end)|(?<=-)\G)(\w+)(?:-|:end), we need to make sure that we don't match anything funny after :end. The look-behind is for that purpose: it prevents any garbage after :end from matching.

    The second construction doesn't run into this problem, since we will get stuck at the : (of :end) after we have matched all the tokens in between.

Validation Version

If you want to validate that the input string follows the format (no extra stuff in front or behind) and also extract the data, you can add anchors as such:

(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G-)(\w+)
(?:^start:(?=\w+(?:-\w+){2,9}:end$)|(?!^)\G)(\w+)(?:-|:end)

(The look-behind is also not needed here, but we still need (?!^) to prevent \G from matching at the start of the string.)

Construction

For the general problem of capturing all instances of a repetition, I don't think there exists a general way to modify the regex. One example of a "hard" (or impossible?) case to convert is when a repetition has to backtrack one or more iterations to fulfill a certain condition in order to match.

When the original regex describes the whole input string (validation type), it is usually easier to convert than a regex that tries to match from the middle of the string (matching type). However, you can always do a match with the original regex first, which converts a matching-type problem back into a validation-type problem.

We build such a regex by going through these steps:

  • Write a regex that covers the part before the repetition (e.g. start:). Let us call this the prefix regex.
  • Match and capture the first instance (e.g. (\w+)).
    (At this point, the first instance and delimiter should have been matched.)
  • Add \G as an alternation. Usually you also need to prevent it from matching the start of the string.
  • Add the delimiter (if any) (e.g. -).
    (After this step, the rest of the tokens should also have been matched, except maybe the last.)
  • Add the part that covers what comes after the repetition, if necessary (e.g. :end). Let us call this the suffix regex (whether we add it to the construction doesn't matter).
  • Now the hard part. You need to check that:
    • There is no other way to start a match, apart from the prefix regex. Take note of the \G branch.
    • There is no way to start any match after the suffix regex has been matched. Take note of how the \G branch starts a match.
    • For the first construction, if you mix the suffix regex (e.g. :end) with the delimiter (e.g. -) in an alternation, make sure you don't end up allowing the suffix regex as a delimiter.
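
As a rough illustration of the mechanical steps above (not of the final "hard part", which still has to be checked by hand), the second construction can be assembled from its pieces. The helper below is a hypothetical sketch: its name and parameters are invented for this example, it assumes every argument is already a valid regex fragment, and it only builds the pattern text. Python's own re module cannot run the result because it lacks \G; the string is meant for an engine such as PCRE.

```python
def build_repeat_capture_regex(prefix, item, delimiter, suffix,
                               min_items, max_items):
    """Assemble a \\G-based pattern from already-escaped regex fragments."""
    # Look-ahead placed right after the prefix: it checks the number of
    # items and the suffix up front, without consuming anything.
    lookahead = ("(?=" + item + "(?:" + delimiter + item + ")"
                 + "{" + str(min_items - 1) + "," + str(max_items - 1) + "}"
                 + suffix + ")")
    # Branch 1 starts a run (prefix + look-ahead); branch 2 continues it
    # (\G + delimiter), with (?!^) keeping \G off the start of the string.
    return ("(?:" + prefix + lookahead + r"|(?!^)\G" + delimiter + ")"
            + "(" + item + ")")


pattern = build_repeat_capture_regex("start:", r"\w+", "-", ":end", 3, 10)
print(pattern)  # (?:start:(?=\w+(?:-\w+){2,9}:end)|(?!^)\G-)(\w+)
```

For the example in the question this reproduces the second construction shown earlier in this answer.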

#2


6  

Although it might theoretically be possible to write a single expression, it's a lot more practical to match the outer boundaries first and then perform a split on the inner part.

In ECMAScript I would write it like this:

'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end'
    .match(/^start:([\w-]+):end$/)[1] // match the inner part
    .split('-') // split inner part (this could be a split regex as well)

In PHP:

$txt = 'start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end';
if (preg_match('/^start:([\w-]+):end$/', $txt, $matches)) {
    print_r(explode('-', $matches[1]));
}
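
The same match-then-split idea can be sketched in Python, with the 3 to 10 item limit from the question checked after the split. This is only an illustrative variant, not part of the original answer; the function name extract_items and its parameters are assumptions made for this example, and the inner pattern is tightened to \w+(?:-\w+)* so that empty items such as a--b are rejected.

```python
import re


def extract_items(text, min_items=3, max_items=10):
    """Return the list of items, or None if text is not a valid run."""
    # Step 1: validate the outer boundaries and grab the inner part.
    m = re.fullmatch(r"start:(\w+(?:-\w+)*):end", text)
    if m is None:
        return None
    # Step 2: split the inner part along the delimiter.
    items = m.group(1).split("-")
    # Step 3: enforce the item-count limit outside the regex.
    if not (min_items <= len(items) <= max_items):
        return None
    return items


print(extract_items("start:a-b-c:end"))  # ['a', 'b', 'c']
print(extract_items("start:a-b:end"))    # None
```

Keeping the count check outside the regex makes the limit a plain parameter rather than something baked into the pattern.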

#3


1  

Of course you can use the regex in this quoted string.

"(?<a>\\w+)-(?<b>\\w+)-(?:(?<c>\\w+)" \
"(?:-(?<d>\\w+)(?:-(?<e>\\w+)(?:-(?<f>\\w+)" \
"(?:-(?<g>\\w+)(?:-(?<h>\\w+)(?:-(?<i>\\w+)" \
"(?:-(?<j>\\w+))?" \
")?)?)?" \
")?)?)?" \
")"

Is it a good idea? No, I don't think so.

#4


0  

Not sure you can do it in that way, but you can use the global flag to find all of the words between the colons, see:

http://regex101.com/r/gK0lX1

You'd have to validate the number of groups yourself though. Without the global flag you're only getting a single match, not all matches - change {3,10} to {1,5} and you get the result 'sir' instead.

import re

s = "start:test-test-lorem-ipsum-sir-doloret-etc-etc-something:end"
print(re.findall(r"(\b\w+?\b)(?:-|:end)", s))

produces

['test', 'test', 'lorem', 'ipsum', 'sir', 'doloret', 'etc', 'etc', 'something']

#5


0  

When you combine:

  1. Your observation: any kind of repetition of a single capture group overwrites the previous capture, so only the last capture of the capture group is returned.
  2. The knowledge: any kind of capturing based on the parts, instead of the whole, makes it impossible to set a limit on the number of times the regex engine will repeat. The limit would have to be metadata (not regex).
  3. The requirement that the answer cannot involve programming (looping), nor simply copy-pasting capture groups as you've done in your question.

It can be deduced that it cannot be done.

Update: There are some regex engines for which point 1 is not necessarily true. In that case, the regex you have indicated, start:(?:(\w+)-?){3,10}:end, will do the job.
