RegEx to capture an optional group in the middle of a block of input

Time: 2023-02-08 11:13:20

I'm stuck on a RegEx problem that's seemingly very simple and yet I can't get it working.

Suppose I have input like this:

Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%
Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%
Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%

There are many repeating blocks in the input and in each block I want to capture some things that are always there (%interestingbit% and %anotherinterestingbit%), but there is also a bit of text that may or may not occur in-between them (OPTIONAL_THING) and I want to capture it if it's there.

A RegEx like this matches only the blocks that contain OPTIONAL_THING (and the named capture works):

%interestingbit%.+?((?<OptionalCapture>OPTIONAL_THING)).+?%anotherinterestingbit%

So it seems like it's just a matter of making the whole group optional, right? That's what I tried:

%interestingbit%.+?((?<OptionalCapture>OPTIONAL_THING))?.+?%anotherinterestingbit%

But I find that although this matches all 3 blocks, the named capture (OptionalCapture) is empty in all of them! How do I get this to work?

Note that there can be a lot of text within each block, including newlines, which is why I put in ".+?" rather than something more specific. I'm using .NET regular expressions, testing with The Regulator.

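A minimal reproduction of the behaviour described above, sketched in Python rather than .NET. Two things here are assumptions and not part of the original question: Python spells named groups as (?P<name>...), and re.DOTALL stands in for RegexOptions.Singleline so that "." can cross newlines. The helper name optional_captures is made up for illustration. The likely cause of the empty capture is that the first lazy .+? consumes a single character, the optional group is then tried at that position, where OPTIONAL_THING does not start, so it is skipped, and the second .+? swallows everything up to %anotherinterestingbit%, OPTIONAL_THING included.

```python
import re

# The three sample blocks from the question, one per line.
SAMPLE = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
)

# The second attempt from the question, with the whole group made optional.
# (?P<name>...) is Python's spelling of the .NET (?<name>...) named group.
ATTEMPT = re.compile(
    r"%interestingbit%.+?((?P<OptionalCapture>OPTIONAL_THING))?.+?%anotherinterestingbit%",
    re.DOTALL,  # let "." match newlines, since blocks may span lines
)


def optional_captures(pattern, text):
    """Return the OptionalCapture group of every match (None if the group did not participate)."""
    return [m.group("OptionalCapture") for m in pattern.finditer(text)]


# All three blocks match, but the optional group never takes part.
print(optional_captures(ATTEMPT, SAMPLE))  # [None, None, None]
```

Python reports a group that did not participate as None; in .NET the same situation shows up as a Group whose Success is false and whose Value is the empty string, which matches the "empty in all of them" symptom.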

3 answers

#1


My thoughts are along similar lines to Niko's idea. However, I would suggest placing the 2nd .+? inside the optional group instead of the first, as follows:

%interestingbit%.+?(?:(?<optionalCapture>OPTIONAL_THING).+?)?%anotherinterestingbit%

This avoids unnecessary backtracking. If the first .+? is inside the optional group and OPTIONAL_THING does not exist in the search string, the regex won't know this until it gets to the end of the string. It will then need to backtrack, perhaps quite a bit, to match %anotherinterestingbit%, which as you said will always exist.

Also, since OPTIONAL_THING, when it exists, will always be before %anotherinterestingbit%, then the text after it is effectively optional as well and fits more naturally into the optional group.

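To check the pattern from this answer against the three sample blocks in the question, here is a small sketch. It is written in Python, not .NET, so the (?P<name>...) group syntax and the re.DOTALL flag (standing in for RegexOptions.Singleline) are assumptions of the port, and optional_captures is a hypothetical helper name.

```python
import re

# The three sample blocks from the question, one per line.
SAMPLE = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
)

# The second .+? lives inside the optional non-capturing group, so at every
# position the engine first tries OPTIONAL_THING, then the closing marker,
# and only then lets the first .+? grow by one character.
PATTERN = re.compile(
    r"%interestingbit%.+?(?:(?P<optionalCapture>OPTIONAL_THING).+?)?%anotherinterestingbit%",
    re.DOTALL,  # let "." match newlines, since blocks may span lines
)


def optional_captures(text):
    """Return the optionalCapture group of every match (None when absent)."""
    return [m.group("optionalCapture") for m in PATTERN.finditer(text)]


# One result per block; only the middle block contains OPTIONAL_THING.
print(optional_captures(SAMPLE))  # [None, 'OPTIONAL_THING', None]
```

Because the closing marker is tried at every step, a block without OPTIONAL_THING ends at its own %anotherinterestingbit% and cannot run on into the next block.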

#2


Why do you have the extra set of parentheses?

Try this:

%interestingbit%.+?(?<OptionalCapture>OPTIONAL_THING)?.+?%anotherinterestingbit%

Or maybe this will work:

%interestingbit%.+?(?<OptionalCapture>OPTIONAL_THING|).+?%anotherinterestingbit%

In this example, the group captures OPTIONAL_THING, or nothing.

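A quick check of both suggestions, sketched in Python rather than .NET (the (?P<name>...) syntax, the re.DOTALL flag and the helper name optional_captures are assumptions of this port). For these constructs the two engines should backtrack the same way, and the result suggests that neither variant picks up OPTIONAL_THING: the lazy .+? in front still lets the group be tried one character after %interestingbit%, where it can only match nothing.

```python
import re

# The three sample blocks from the question, one per line.
SAMPLE = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
)

# First suggestion: drop the extra parentheses, keep the group optional.
WITHOUT_EXTRA_PARENS = re.compile(
    r"%interestingbit%.+?(?P<OptionalCapture>OPTIONAL_THING)?.+?%anotherinterestingbit%",
    re.DOTALL,
)

# Second suggestion: the group matches OPTIONAL_THING or the empty string.
EMPTY_ALTERNATIVE = re.compile(
    r"%interestingbit%.+?(?P<OptionalCapture>OPTIONAL_THING|).+?%anotherinterestingbit%",
    re.DOTALL,
)


def optional_captures(pattern, text):
    """Return the OptionalCapture group of every match in text."""
    return [m.group("OptionalCapture") for m in pattern.finditer(text)]


# The optional group is skipped in every block.
print(optional_captures(WITHOUT_EXTRA_PARENS, SAMPLE))  # [None, None, None]

# The group always takes the empty alternative, even in the middle block.
print(optional_captures(EMPTY_ALTERNATIVE, SAMPLE))  # ['', '', '']
```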

#3


Try this:

%interestingbit%(?:(.+)(?<optionalCapture>OPTIONAL_THING))?(.+?)%anotherinterestingbit%

First there's a non-capturing group which matches .+OPTIONAL_THING or nothing. If a match is found, there's the named group inside, which captures OPTIONAL_THING for you. The rest is captured with .+?%anotherinterestingbit%.

[edit]: I added a couple of parentheses for additional capture groups, so now the captured groups match the following:

  • $1 : text before OPTIONAL_THING, or nothing

  • ${optionalCapture} : OPTIONAL_THING, or nothing (by number this is $3 in .NET, which numbers named groups after the unnamed ones)

  • $2 : text after OPTIONAL_THING, or, if OPTIONAL_THING is not found, the full text between %interestingbit% and %anotherinterestingbit%

Are these the three matches you're looking for?

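A sketch of what the groups end up holding, in Python rather than .NET. The (?P<name>...) syntax, the re.DOTALL flag (standing in for RegexOptions.Singleline) and the names groups_for and match_count are assumptions made for illustration. One caveat it brings out: with "." allowed to cross newlines, the greedy (.+) can reach an OPTIONAL_THING that belongs to a later block.

```python
import re

BLOCK_WITHOUT = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%"
)
BLOCK_WITH = (
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%"
)

# Python numbers groups strictly left to right, so the named group is also
# group 2 here. .NET numbers unnamed groups first and named groups after them.
PATTERN = re.compile(
    r"%interestingbit%(?:(.+)(?P<optionalCapture>OPTIONAL_THING))?(.+?)%anotherinterestingbit%",
    re.DOTALL,  # let "." match newlines, since blocks may span lines
)


def groups_for(text):
    """Return (text before, optional capture, text after) for the first match."""
    m = PATTERN.search(text)
    return (m.group(1), m.group("optionalCapture"), m.group(3))


# A single block behaves as the answer describes.
print(groups_for(BLOCK_WITH))
# (' lots of random text ', 'OPTIONAL_THING', ' lots and lots more ')
print(groups_for(BLOCK_WITHOUT))
# (None, None, ' lots of random text lots and lots more ')

# Three blocks in one string: the greedy (.+) in the first block runs ahead to
# the OPTIONAL_THING of the second block, so the two are merged into one match.
ALL_BLOCKS = "\n".join([BLOCK_WITHOUT, BLOCK_WITH, BLOCK_WITHOUT])
match_count = len(list(PATTERN.finditer(ALL_BLOCKS)))
print(match_count)  # 2
```

If the blocks have to be matched one by one, making that first quantifier lazy is not enough on its own, because the optional group can still look past the current block's %anotherinterestingbit%.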
