Regex for finding at most n consecutive patterns

Let's say our pattern is a regex for capital letters (though it could be something more complex than a search for capitals).

To find at least n consecutive patterns (in this case, the pattern we are looking for is simply a capital letter), we can do this:

(Using Ruby)

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"

at_least_2_capitals = somestring.scan(/[A-Z][A-Z]+/)
=> ["ABC", "XYZ"]
at_least_3_capitals = somestring.scan(/[A-Z]{3}[A-Z]*/)
=> ["ABC", "XYZ"]
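
Both patterns above have the same shape, so "at least n" can be written with a counted quantifier. A minimal sketch, assuming a helper named `at_least_n_capitals` (the name is illustrative and not part of the original question):

```ruby
# Hypothetical helper (the name is an assumption): find runs of at least n capitals.
def at_least_n_capitals(str, n)
  # {n,} means "n or more repetitions" of the preceding character class.
  str.scan(/[A-Z]{#{n},}/)
end

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"
p at_least_n_capitals(somestring, 2)
p at_least_n_capitals(somestring, 3)
```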

However, how do I search for at most n consecutive patterns? For example, at most one consecutive capital letter:

matches = somestring.scan(/ ??? /)
=> [" deFgHij kLmN pQrS ", " abcdEf"]

Detailed strategy

I read that I need to negate the "at least" regex: turn it into a DFA, complement the accept states (optionally converting it back to an NFA, though we can leave it as it is), and then write the result as a regex. If we think of encountering our pattern as receiving a '1' and not encountering it as receiving a '0', we can draw a simple DFA diagram (for n = 1, where we want at most one of our pattern):

[Image: DFA diagram]

Specifically, I was wondering how this becomes a regex. More generally, I hope to learn how to express "at most" with a regex, as my regex skills feel stunted with "at least" alone.
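
For reference, the complemented DFA for n = 1 can be simulated directly. This is only a sketch under assumptions: the state names and the method name are made up for illustration, and the function decides whether a whole string is accepted rather than searching for substrings.

```ruby
# States (names are assumptions for this sketch):
#   :none - the last character was not a capital (also the start state)
#   :one  - exactly one capital in a row so far
#   :dead - two or more capitals in a row have been seen
# The "at least 2 in a row" DFA accepts only :dead; the complement accepts the rest.
def at_most_one_consecutive_capital?(str)
  state = :none
  str.each_char do |ch|
    capital = ch.match?(/[A-Z]/)
    state =
      case state
      when :none then capital ? :one : :none
      when :one  then capital ? :dead : :none
      else :dead # once dead, always dead
      end
  end
  state != :dead
end

p at_most_one_consecutive_capital?(" deFgHij kLmN pQrS ")
p at_most_one_consecutive_capital?("pQrS XYZ")
```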

Trip Hazards - not quite the right solution in spirit

Note that this question is not a duplicate of this post, as using the accepted methodology there would give:

somestring.scan(/[A-Z]{2}[A-Z]*(.*)[A-Z]{2}[A-Z]*/)
=> [[" deFgHij kLmN pQrS X"]]

This is not what the DFA describes, and not just because it misses the second sought match. More importantly, it includes the 'X', which it should not: 'X' is followed by another capital, and from the DFA we see that a capital followed by another capital does not lead to an accept state.

You could suggest

somestring.split(/[A-Z]{2}[A-Z]*/)
=> ["", " deFgHij kLmN pQrS ", " abcdEf"]

(Thanks to Rubber Duck)

but I still want to know how to find at most n occurrences using regex alone. (For knowledge!)

3 Solutions

#1

Why your attempt does not work

There are a few problems with your current attempt.

The reason that X is part of the match is that .* is greedy and consumes as much as possible - hence, leaving only the required two capital letters to be matched by the trailing bit. This could be fixed with a non-greedy quantifier.
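
To make the difference concrete, here is a small sketch that runs the attempt from the question next to a non-greedy variant (the variable names `greedy` and `lazy` are only for illustration):

```ruby
somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"

# Greedy: .* runs to the end and backtracks only as far as it must,
# so the trailing [A-Z]{2} ends up matching "YZ" and "X" stays in the capture.
greedy = somestring.scan(/[A-Z]{2}[A-Z]*(.*)[A-Z]{2}[A-Z]*/)

# Non-greedy: .*? stops as soon as two capitals follow, so "X" is left out.
lazy = somestring.scan(/[A-Z]{2}[A-Z]*(.*?)[A-Z]{2}[A-Z]*/)

p greedy
p lazy
```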

The reason why you don't get the second match is twofold. First, you require two trailing capital letters to be there, but instead there is the end of the string. Second, matches cannot overlap. The first match includes at least two trailing capital letters, but the second would need to match these again at the start which is not possible.

There are more hidden problems: try an input with four consecutive capital letters - it can give you an empty match (provided you use the non-greedy quantifier - the greedy one has even worse problems).
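
A quick way to see the empty match, assuming the non-greedy variant of the question's pattern and a made-up input of four capitals:

```ruby
four_caps = "ABCD"

# The leading [A-Z]{2} takes "AB", the lazy group captures nothing,
# and the trailing [A-Z]{2} takes "CD", so the capture is an empty string.
result = four_caps.scan(/[A-Z]{2}[A-Z]*(.*?)[A-Z]{2}[A-Z]*/)
p result
```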

Fixing all of these with the current approach is hard (I tried and failed - check the edit history of this post if you want to see my attempt until I decided to scrap this approach altogether). So let's try something else!

Looking for another solution

What is it that we want to match? Disregarding the edge cases, where the match starts at the beginning of the string or ends at the end of the string, we want to match:

(non-caps) 1 cap (non-caps) 1 cap (non-caps) ....

This is ideal for Jeffrey Friedl's unrolling-the-loop technique, which looks like

[^A-Z]+(?:[A-Z][^A-Z]+)*
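
As a rough check, this is what the unrolled pattern gives on the string from the question, plus a made-up string (`edge`, an assumption for this sketch) whose single capitals sit at the very start and end:

```ruby
unrolled = /[^A-Z]+(?:[A-Z][^A-Z]+)*/

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"
inner = somestring.scan(unrolled)
p inner

# A lone capital at the start or end of the string is not picked up yet,
# because every capital must be surrounded by non-capitals.
edge = "Ab cdE"
edge_matches = edge.scan(unrolled)
p edge_matches
```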

Now what about the edge cases? We can phrase them like this:

We want to allow a single capital letter at the beginning of the match, only if it's at the beginning of the string.

We want to allow a single capital letter at the end of the match, only if it's at the end of the string.

To add these to our pattern, we simply group a capital letter with the appropriate anchor and mark both together as optional:

(?:^[A-Z])?[^A-Z]+(?:[A-Z][^A-Z]+)*(?:[A-Z]$)?

Now it's really working. And even better, we don't need capturing any more!
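
A short sketch of the anchored pattern in Ruby, using the question's string and a made-up string (`edge`) with single capitals at both ends. One caveat worth stating: in Ruby, `^` and `$` match at line boundaries, so for multi-line input `\A` and `\z` would be the string-level anchors.

```ruby
anchored = /(?:^[A-Z])?[^A-Z]+(?:[A-Z][^A-Z]+)*(?:[A-Z]$)?/

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"
main_matches = somestring.scan(anchored)
p main_matches

# The lone capitals at the start and end are now part of the match.
edge = "Ab cdE"
edge_matches = edge.scan(anchored)
p edge_matches
```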

Generalizing the solution

This solution is easily generalized to the case of "at most n consecutive capital letters", by changing each [A-Z] to [A-Z]{1,n} and thereby allowing up to n capital letters where there is only one allowed so far.
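
One way to sketch that generalization is to build the regex from n. The helper name `at_most_n_capitals_regex` and the second sample string are assumptions made for this example:

```ruby
# Hypothetical helper: the anchored, unrolled pattern with every [A-Z]
# replaced by [A-Z]{1,n}.
def at_most_n_capitals_regex(n)
  caps = "[A-Z]{1,#{n}}"
  /(?:^#{caps})?[^A-Z]+(?:#{caps}[^A-Z]+)*(?:#{caps}$)?/
end

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"
p somestring.scan(at_most_n_capitals_regex(1))

# With n = 2 the pair "FG" is allowed inside a match, while "ABC" and "XYZ" are not.
other = "ABC deFGhi XYZ jk"
p other.scan(at_most_n_capitals_regex(2))
```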

See the demo for n = 2.

#2

tl;dr

To match words containing at most N consecutive PATTERNs, use the regex

/\b(?:\w(?:(?<!PATTERN)|(?!(?:PATTERN){N,})))+\b/

For example, to match words containing at most 1 consecutive capital letter,

/\b(?:\w(?:(?<![A-Z])|(?!(?:[A-Z]){1,})))+\b/

This works for multi-character patterns too.

Clarification Needed

I'm afraid your examples may cause confusion. Let's add a few words:

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf mixedCaps mixeDCaps mIxedCaps mIxeDCaps T TT t tt"
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Now, rerunning your at-least-2-capitals regex returns

at_least_2_capitals = somestring.scan(/[A-Z][A-Z]+/)
=> ["ABC", "XYZ", "DC", "DC", "TT"]

Note how complete words are not captured! Are you sure this is what you wanted? I ask, of course, because in your latter examples, your at-most-1-capital regex returns complete words, instead of just the capital letters being captured.

Solution

Here's the solution either way.

First, for matching just patterns (and not entire words, as consistent with your initial examples), here's a regex for at-most-N-PATTERNs:

/(?<!PATTERN)(?!(?:PATTERN){N+1,})(?:PATTERN)+/

For example, the at-most-1-capitals regex would be

/(?<![A-Z])(?!(?:[A-Z]){2,})(?:[A-Z])+/

and returns

=> ["F", "H", "L", "N", "Q", "S", "E", "C", "I", "C", "I", "T"]

To further exemplify, the at-most-2-capitals regex returns

=> ["F", "H", "L", "N", "Q", "S", "E", "C", "DC", "I", "C", "I", "DC", "T", "TT"]
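
The template can also be parameterised on N. A minimal sketch for capital letters, assuming a helper named `capital_runs_of_at_most` (not from the original answer):

```ruby
# Hypothetical helper: runs of capitals whose total length is at most n.
def capital_runs_of_at_most(str, n)
  # (?<![A-Z])         the run must start here, not in the middle of a longer run
  # (?![A-Z]{n+1,})    the run starting here must not be longer than n
  str.scan(/(?<![A-Z])(?![A-Z]{#{n + 1},})[A-Z]+/)
end

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf mixedCaps mixeDCaps mIxedCaps mIxeDCaps T TT t tt"
p capital_runs_of_at_most(somestring, 1)
p capital_runs_of_at_most(somestring, 2)
```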

Finally, if you wanted to match entire words that contained at most a certain number of consecutive patterns, then here's a whole different approach:

/\b(?:\w(?:(?<![A-Z])|(?![A-Z]{1,})))+\b/

This returns

["deFgHij", "kLmN", "pQrS", "abcdEf", "mixedCaps", "mIxedCaps", "T", "t", "tt"]

The general form is

/\b(?:\w(?:(?<!PATTERN)|(?!(?:PATTERN){N,})))+\b/
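
As a sketch of how the general form might be filled in from Ruby, assuming a helper named `words_with_at_most` that takes PATTERN as a regex source string (both the name and the signature are illustrative):

```ruby
# Hypothetical helper: whole words in which PATTERN never occurs
# more than the allowed number of times in a row.
def words_with_at_most(str, pattern, n)
  # Each word character must either not end a PATTERN,
  # or not be followed by n more PATTERNs.
  str.scan(/\b(?:\w(?:(?<!#{pattern})|(?!(?:#{pattern}){#{n},})))+\b/)
end

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf mixedCaps mixeDCaps mIxedCaps mIxeDCaps T TT t tt"
p words_with_at_most(somestring, "[A-Z]", 1)
p words_with_at_most(somestring, "[A-Z]", 2)
```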

You can see all these examples at http://ideone.com/hImmZr.

#3

To find "at most" with a regex, you use the suffix {1,n} (possibly preceded by a negative lookbehind and followed by a negative lookahead), so it seems that what you want is:

irb(main):006:0> somestring.scan(/[A-Z]{1,2}/)
=> ["AB", "C", "F", "H", "L", "N", "Q", "S", "XY", "Z", "E"]

irb(main):007:0> somestring.scan(/(?<![A-Z])[A-Z]{1,2}(?![A-Z])/)
=> ["F", "H", "L", "N", "Q", "S", "E"]

EDIT: if the OP still wants "the longest strings not including more than two consecutive uppercase letters", they can use:

irb(main):025:0> somestring.scan(/[^A-Z]+(?:[A-Z]{1,2}[^A-Z]+)*/)
=> [" deFgHij kLmN pQrS ", " abcdEf"]

(but that regex may miss short runs of capitals at the beginning and the end of the string)

It seems that

irb(main):026:0> somestring.split(/[A-Z]{3,}/)
=> ["", " deFgHij kLmN pQrS ", " abcdEf"]

would be better for that.
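
The split approach generalizes in the obvious way: split on runs longer than n. A minimal sketch, assuming a helper named `split_on_long_capital_runs` (the name and the choice to drop empty pieces are not from the original answer):

```ruby
# Hypothetical helper: pieces of the string between runs of more than n capitals.
def split_on_long_capital_runs(str, n)
  # A run of n + 1 or more capitals acts as the separator;
  # the empty leading piece is discarded.
  str.split(/[A-Z]{#{n + 1},}/).reject(&:empty?)
end

somestring = "ABC deFgHij kLmN pQrS XYZ abcdEf"
p split_on_long_capital_runs(somestring, 1)
p split_on_long_capital_runs(somestring, 2)
```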

