Regex matching and limiting character classes

I'm not sure if this is possible using Regex but I'd like to be able to limit the number of underscores allowed based on a different character. This is to limit crazy wildcard queries to a search engine written in Java.

The starting characters would be alphanumeric. But I basically want a match if there are more underscores than preceding characters. So

BA_ would be fine but BA___ would match the regex and would get kicked out of the query parser.

Is that possible using Regex?

3 solutions

#1

Yes, you can do it. This pattern succeeds only if there are fewer underscores than letters (you can adapt it to the characters you want):

^(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?$

(As Pshemo points out, anchors are not needed if you use the matches() method. I wrote them to illustrate that this pattern must be bounded by some means, with lookarounds for example.)

negated version:

^(?:[A-Z](?=[A-Z]*(\\1?+_)))*\\1?_*$

The idea is to repeat a capture group that contains a backreference to itself plus an underscore. At each repetition, the capture group grows. ^(?:[A-Z](?=[A-Z]*+(\\1?+_)))*+ will match all letters that have a corresponding underscore. You only need to add [A-Z]+ to be sure that there are more letters, and to finish your pattern with \\1?, which contains all the underscores (I make it optional, in case there is no underscore at all).

Note that if you replace [A-Z]+ with [A-Z]{n} in the first pattern, you can set the exact difference between the number of letters and the number of underscores.
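
For reference, here is a minimal sketch of how the first pattern might be wired into a Java check. The class name UnderscoreLimit and the method isAcceptable are hypothetical, and [A-Z] is kept from the answer even though the question mentions alphanumeric characters. Since matches() is used, the anchors are left out.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
class UnderscoreLimit {
    // The first pattern from the answer as a Java string literal
    // (hence the doubled backslashes). matches() anchors implicitly.
    private static final Pattern FEWER_UNDERSCORES =
            Pattern.compile("(?:[A-Z](?=[A-Z]*(\\1?+_)))*+[A-Z]+\\1?");

    /** True when the term has strictly fewer trailing underscores than leading letters. */
    static boolean isAcceptable(String term) {
        return FEWER_UNDERSCORES.matcher(term).matches();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"BA_", "BA___", "ABC"}) {
            System.out.println(term + " -> " + isAcceptable(term));
        }
    }
}
```

A query parser could call isAcceptable on each term and reject the ones for which it returns false, such as BA___.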

To give a better idea, I will try to describe step by step how it works with the string ABC-- (since it's impossible to put underscores in bold, I use hyphens instead):

 In the non-capturing group, the first letter is found 
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 let's enter the lookahead (keep in mind that all in the lookahead is only
 a check and not a part of the match result.)
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 the first capturing group is encountered for the first time and its content is not
 defined. This is the reason why an optional quantifier is used, to avoid making
 the lookahead fail. Consequence: \1?+ doesn't match anything new.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 the first hyphen is matched. Once the capture group closed, the first capture
    group is now defined and contains one hyphen. 
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 The lookahead succeeds, let's repeat the non-capturing group.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 The second letter is found
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 We enter the lookahead
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 but now, things are different. The capture group was defined before and
 contains a hyphen, this is why \1?+ will match the first hyphen.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 the literal hyphen matches the second hyphen in the string. And now the
 capture group 1 contains the two hyphens. The lookahead succeeds.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 We repeat the non-capturing group one more time.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 In the lookahead. There are no more letters, but that is not a problem, since
 the * quantifier is used.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 \1?+ now matches two hyphens.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 but there is no hyphen left in the string for the literal hyphen, and the regex
 engine cannot backtrack since \1?+ has a possessive quantifier.
 The lookahead fails, and so does the third repetition of the non-capturing group!
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 [A-Z]+ ensures that there is at least one more letter.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

 We match the end of the string with the backreference to capture group 1 that
 contains the two hyphens. Note that the fact that this backreference is optional
 allows the string to not have hyphens at all. 
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$
 This is the end of the string. The pattern succeeds.
ABC--        ^(?:[A-Z](?=[A-Z]*(\1?+-)))*+[A-Z]+\1?$

Note: The possessive quantifier on the non-capturing group is needed to avoid false results. (Without it you can observe a strange behavior, which can be useful.)

Example: ABC--- and the pattern ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$ (without the possessive quantifier):

 The non-capturing group is repeated three times and `ABC` are matched:
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
 Note that at this step the first capturing group contains ---
 But after the non-capturing group, there are no more letters to match for [A-Z]+
 and the regex engine must backtrack.
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$

Question: How many hyphens are in the capture group now?
Answer: Still three!

If the repeated non-capturing group gives a letter back, the capture group still contains three hyphens (its content as of the last time the regex engine went through it). This is counter-intuitive, but logical.

 Then the letter C is found:
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
 And the three hyphens
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
 The pattern succeeds
ABC---     ^(?:[A-Z](?=[A-Z]*(\1?+-)))*[A-Z]+\1?$
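
To try the two variants side by side, a small Java sketch such as the following could be used. The class name PossessiveDemo is hypothetical, and the patterns are the hyphen versions from the walk-through written as Java string literals. The possessive variant is expected to reject ABC---, while the answer reports that the non-possessive variant accepts it; the program only prints what the engine returns.

```java
import java.util.regex.Pattern;

// Illustrative class name; both patterns are the hyphen variants from the walk-through.
class PossessiveDemo {
    // Possessive repetition of the non-capturing group (*+): no letter is given back.
    static final Pattern POSSESSIVE =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*+[A-Z]+\\1?$");
    // Plain greedy repetition (*): the group may give a letter back on backtracking.
    static final Pattern GREEDY =
            Pattern.compile("^(?:[A-Z](?=[A-Z]*(\\1?+-)))*[A-Z]+\\1?$");

    static boolean accepts(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"ABC--", "ABC---"}) {
            System.out.println(s + " possessive=" + accepts(POSSESSIVE, s)
                    + " greedy=" + accepts(GREEDY, s));
        }
    }
}
```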

Robby Pond asked me in the comments how to find strings that have more underscores than letters (where a letter is anything that is not an underscore). The best way is obviously to count the underscores and compare with the string length. As for a full regex solution, it is not possible to build such a pattern in Java, since the pattern needs the recursion feature. For example, you can do it with PHP:

$pattern = <<<'EOD'
~
 (?(DEFINE)
     (?<neutral> (?: _ \g<neutral>?+ [A-Z] | [A-Z] \g<neutral>?+ _ )+ )
 )

 \A (?: \g<neutral> | _ )+ \z
~x
EOD;

var_dump(preg_match($pattern, '____ABC_DEF___'));
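
The counting approach mentioned above needs no regex at all. A minimal Java sketch, assuming that every character other than an underscore counts as a letter (the class and method names are hypothetical):

```java
// Hypothetical helper; counts underscores and compares with the rest of the string.
class UnderscoreCount {
    /** True when the string contains more underscores than other characters. */
    static boolean hasMoreUnderscoresThanOthers(String s) {
        long underscores = s.chars().filter(c -> c == '_').count();
        long others = s.length() - underscores;
        return underscores > others;
    }

    public static void main(String[] args) {
        // Same sample string as in the PHP example.
        System.out.println(hasMoreUnderscoresThanOthers("____ABC_DEF___"));
    }
}
```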

#2

It's not possible in a single regular expression.

i) Logic needs to be implemented to get the number of characters before the underscores (a regular expression should be written to capture the characters before the underscores).

ii) Then validate the result: (number of characters - 1) = number of underscores that follow (using a regular expression that returns the run of underscores following the characters).
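
The two steps above could be sketched in Java roughly as follows. This is only an illustration: the class name, the method name and the [A-Za-z0-9] character class are assumptions, and the comparison uses the rule from the question (reject when there are more underscores than preceding characters) rather than a definitive threshold.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: split the term with a regex, then compare lengths in code.
class TwoStepCheck {
    // Group 1: leading alphanumerics; group 2: the underscores that follow.
    private static final Pattern PARTS = Pattern.compile("([A-Za-z0-9]+)(_*)");

    /** True when the term has more underscores than preceding characters. */
    static boolean tooManyWildcards(String term) {
        Matcher m = PARTS.matcher(term);
        if (!m.matches()) {
            // Assumption: terms of any other shape are left to the rest of the parser.
            return false;
        }
        return m.group(2).length() > m.group(1).length();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"BA_", "BA___"}) {
            System.out.println(term + " -> " + tooManyWildcards(term));
        }
    }
}
```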

#3

Edit: Dang! I just noticed that you need this for Java. Anyway, I'll leave it here in case someone from the .NET world stumbles upon this post.

You can use balancing groups if you are using .NET:

^(?:(?<letter>[^_])|(?<-letter>_))*(?(letter)(?=)|(?!))$

The .NET regex engine keeps every capture made by a group. In other flavors a captured group only ever contains its last match, but in .NET all previous matches are kept in a capture collection for your use. The .NET engine can also push to and pop from the stack of a group's captures using the (?<group-name>...) and (?<-group-name>...) constructs. These two handy constructs can be used to match pairs of parentheses, etc.

In the above regex, the engine starts from the start of the string and tries to match anything other than "_". This of course can be changed to whatever works for you (e.g. [A-Za-z]). The alternation basically means: match either [^_] or _, and in doing so either push to or pop from the captured group.

The latter part of the regex is a conditional (?(group-name)true|false). It basically says: if the group still exists (more pushes than pops), then apply the true section, and if not, apply the false section. The easiest way to make the pattern match is to use an empty positive lookahead, (?=), and the easiest way to make it fail is (?!), which is an empty negative lookahead.
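
Java has no balancing groups, but the push and pop bookkeeping described above can be mimicked with a plain counter. The following is a rough sketch under that assumption, not a translation of the .NET engine's behavior in every edge case; the class and method names are hypothetical.

```java
// Hypothetical emulation of the push/pop logic of the .NET pattern above.
class BalancingCounter {
    /**
     * Each non-underscore pushes, each underscore pops, a pop on an empty
     * stack fails, and at least one push must remain at the end.
     */
    static boolean matchesLikeBalancingGroup(String s) {
        int depth = 0; // stands in for the "letter" capture stack
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) != '_') {
                depth++;          // (?<letter>[^_]) pushes
            } else if (depth == 0) {
                return false;     // (?<-letter>_) cannot pop an empty stack
            } else {
                depth--;          // (?<-letter>_) pops
            }
        }
        return depth > 0;         // (?(letter)(?=)|(?!)) requires a leftover push
    }

    public static void main(String[] args) {
        for (String s : new String[] {"BA_", "BA___"}) {
            System.out.println(s + " -> " + matchesLikeBalancingGroup(s));
        }
    }
}
```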
