I'm stuck on a RegEx problem that's seemingly very simple and yet I can't get it working.
Suppose I have input like this:
Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%
Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%
Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%
There are many repeating blocks in the input. In each block I want to capture some things that are always there (%interestingbit% and %anotherinterestingbit%), but there is also a bit of text that may or may not occur between them (OPTIONAL_THING), and I want to capture it if it's there.
A RegEx like this matches only blocks with OPTIONAL_THING in them (and the named capture works):
%interestingbit%.+?((?<OptionalCapture>OPTIONAL_THING)).+?%anotherinterestingbit%
So it seems like it's just a matter of making the whole group optional, right? That's what I tried:
%interestingbit%.+?((?<OptionalCapture>OPTIONAL_THING))?.+?%anotherinterestingbit%
But I find that although this matches all 3 blocks, the named capture (OptionalCapture) is empty in all of them! How do I get this to work?
Note that there can be a lot of text within each block, including newlines, which is why I put in ".+?" rather than something more specific. I'm using .NET regular expressions, testing with The Regulator.
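For reference, the behaviour described above can be reproduced outside The Regulator. This is only a sketch in Python rather than .NET, so the named group is spelled (?P<name>...) instead of (?<name>...), and re.DOTALL is assumed to play the role of .NET's RegexOptions.Singleline (the question does not say which options were set). The sample text is the three-block input from above.

```python
import re

# Sample input from the question: only the middle block contains OPTIONAL_THING.
text = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
)

# The second pattern from the question, with Python's named-group spelling.
attempt = re.compile(
    r"%interestingbit%.+?((?P<OptionalCapture>OPTIONAL_THING))?.+?%anotherinterestingbit%",
    re.DOTALL,  # assumption: "." must also match newlines
)

# One entry per matched block; None means the group did not take part in the match.
captures = [m.group("OptionalCapture") for m in attempt.finditer(text)]
print(captures)  # [None, None, None]
```

In this sketch all three blocks match and the group is never filled. The first .+? is lazy, so it stops after a single character; the optional group is tried at that one position, fails, and is skipped; the second .+? then consumes everything up to %anotherinterestingbit%, OPTIONAL_THING included. Because that already produces a match, the engine has no reason to go back and retry the group further along.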
3 Solutions
#1
My thoughts are along similar lines to Niko's idea. However, I would suggest placing the 2nd .+? inside the optional group instead of the first, as follows:
%interestingbit%.+?(?:(?<optionalCapture>OPTIONAL_THING).+?)?%anotherinterestingbit%
This avoids unnecessary backtracking. If the first .+? is inside the optional group and OPTIONAL_THING does not exist in the search string, the regex won't know this until it gets to the end of the string. It will then need to backtrack, perhaps quite a bit, to match %anotherinterestingbit%, which as you said will always exist.
Also, since OPTIONAL_THING, when it exists, will always be before %anotherinterestingbit%, then the text after it is effectively optional as well and fits more naturally into the optional group.
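A quick way to check this pattern is to run it over the sample input from the question. The sketch below uses Python instead of .NET, so the named group is written (?P<name>...) rather than (?<name>...), and re.DOTALL is assumed as the counterpart of RegexOptions.Singleline so that "." can cross newlines.

```python
import re

# Sample input from the question: only the middle block contains OPTIONAL_THING.
text = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
)

# The second .+? lives inside the optional group, so after the first .+? the
# engine must see either OPTIONAL_THING or %anotherinterestingbit% next.
pattern = re.compile(
    r"%interestingbit%.+?(?:(?P<optionalCapture>OPTIONAL_THING).+?)?%anotherinterestingbit%",
    re.DOTALL,  # assumption: "." must also match newlines
)

# One entry per matched block; None means the optional group was skipped.
captures = [m.group("optionalCapture") for m in pattern.finditer(text)]
print(captures)  # [None, 'OPTIONAL_THING', None]
```

The difference from the original attempt is that nothing after the first .+? can swallow OPTIONAL_THING any more: at each position the engine tries the optional group first and the closing %anotherinterestingbit% second, so whichever of the two appears first in the block decides the outcome.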
#2
Why do you have the extra set of parentheses?
Try this:
%interestingbit%.+?(?<OptionalCapture>OPTIONAL_THING)?.+?%anotherinterestingbit%
Or maybe this will work:
%interestingbit%.+?(?<OptionalCapture>OPTIONAL_THING|).+?%anotherinterestingbit%
In this example, the group captures OPTIONAL_THING, or nothing.
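Both variants are easy to try against the sample input from the question. The sketch below is in Python rather than .NET, so the named group is spelled (?P<name>...), and re.DOTALL is an assumption standing in for RegexOptions.Singleline. Under those assumptions neither variant seems to pick up OPTIONAL_THING: the group is only tried one character after %interestingbit%, and the trailing .+? consumes the rest.

```python
import re

# Sample input from the question: only the middle block contains OPTIONAL_THING.
text = (
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text OPTIONAL_THING lots and lots more %anotherinterestingbit%\n"
    "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%\n"
)

# Variant 1: extra parentheses removed, group made optional with "?".
plain_optional = re.compile(
    r"%interestingbit%.+?(?P<OptionalCapture>OPTIONAL_THING)?.+?%anotherinterestingbit%",
    re.DOTALL,
)

# Variant 2: group with an empty alternative instead of "?".
empty_alternative = re.compile(
    r"%interestingbit%.+?(?P<OptionalCapture>OPTIONAL_THING|).+?%anotherinterestingbit%",
    re.DOTALL,
)

first = [m.group("OptionalCapture") for m in plain_optional.finditer(text)]
second = [m.group("OptionalCapture") for m in empty_alternative.finditer(text)]

print(first)   # [None, None, None]  (group skipped in every block)
print(second)  # ['', '', '']        (group matched the empty alternative)
```

So on this input the extra parentheses do not appear to be the cause: the optional group needs to sit where the only other thing the engine can match next is the closing %anotherinterestingbit%.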
#3
Try this:
%interestingbit%(?:(.+)(?<optionalCapture>OPTIONAL_THING))?(.+?)%anotherinterestingbit%
First there's a non-capturing group which matches .+OPTIONAL_THING or nothing. If a match is found, there's the named group inside, which captures OPTIONAL_THING for you. The rest is captured with .+?%anotherinterestingbit%.
[edit]: I added a couple of parentheses for additional capture groups, so now the captured groups match the following:
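- $1 : text before OPTIONAL_THING, or nothing
- $2 or $optionalCapture : OPTIONAL_THING, or nothing
- $3 : text after OPTIONAL_THING, or if OPTIONAL_THING is not found, the full text between %interestingbit% and %anotherinterestingbit%
(The numbers above count groups left to right. By default .NET numbers named groups after the unnamed ones, so there the text after OPTIONAL_THING is $2 and the named group is $3 or ${optionalCapture}.)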
Are these the three matches you're looking for?
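The groups described in this answer can be inspected with a short sketch. It is written in Python rather than .NET, so the named group is spelled (?P<name>...) and groups are numbered strictly left to right; re.DOTALL is an assumption standing in for RegexOptions.Singleline. The variable names are illustrative only. The pattern is applied to one block at a time first, and then to the whole input, because the greedy (.+) may behave differently there.

```python
import re

# The pattern from this answer, with Python's named-group spelling.
pattern = re.compile(
    r"%interestingbit%(?:(.+)(?P<optionalCapture>OPTIONAL_THING))?(.+?)%anotherinterestingbit%",
    re.DOTALL,  # assumption: "." must also match newlines
)

block_plain = "Some text %interestingbit% lots of random text lots and lots more %anotherinterestingbit%"
block_with_optional = (
    "Some text %interestingbit% lots of random text OPTIONAL_THING"
    " lots and lots more %anotherinterestingbit%"
)

# One block at a time: (text before, OPTIONAL_THING, text after).
groups_plain = pattern.search(block_plain).groups()
groups_with_optional = pattern.search(block_with_optional).groups()
print(groups_plain)          # (None, None, ' lots of random text lots and lots more ')
print(groups_with_optional)  # (' lots of random text ', 'OPTIONAL_THING', ' lots and lots more ')

# Whole input in one string: count how many %interestingbit% markers each match contains.
whole_input = "\n".join([block_plain, block_with_optional, block_plain])
openers_per_match = [
    m.group(0).count("%interestingbit%") for m in pattern.finditer(whole_input)
]
print(openers_per_match)  # [2, 1]
```

On a single block the groups come out as listed. On the combined input, however, the greedy (.+) starting in the first block reaches forward to the OPTIONAL_THING in the second block, so the first two blocks are reported as one match. If blocks must be matched individually, that is worth checking against real data before relying on this pattern.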