Is it possible to create a regular expression with a variable number of groups?
After running this for instance...
Pattern p = Pattern.compile("ab([cd])*ef");
Matcher m = p.matcher("abcddcef");
m.matches();
... I would like to have something like
- m.group(1) = "c"
- m.group(2) = "d"
- m.group(3) = "d"
- m.group(4) = "c"
(Background: I'm parsing some lines of data, and one of the "fields" is repeating. I would like to avoid a matcher.find loop for these fields.)
As pointed out by @Tim Pietzcker in the comments, perl6 and .NET have this feature.
6 answers
#1
18
According to the documentation, Java regular expressions can't do this:
The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.
(emphasis added)
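A minimal sketch of what the quoted paragraph means for the pattern in the question. The class name LastCaptureDemo and the helper lastCapture are made up for illustration; only java.util.regex calls are used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LastCaptureDemo {
    // Returns what group(1) holds after a full match, or null if the input does not match.
    static String lastCapture(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("ab([cd])*ef").matcher("abcddcef");
        System.out.println(m.matches());    // true
        System.out.println(m.groupCount()); // 1: the group count is fixed by the pattern
        // Only the last iteration of ([cd])* is kept.
        System.out.println(lastCapture("ab([cd])*ef", "abcddcef")); // c
    }
}
```

The earlier iterations ("c", "d", "d") were matched but are no longer reachable through the Matcher once matching has finished.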
#2
3
Pattern p = Pattern.compile("ab(?:(c)|(d))*ef");
Matcher m = p.matcher("abcdef");
m.matches();
should do what you want.
EDIT:
@aioobe, I understand now. You want to be able to do something like the grammar
A ::== <Foo> <Bars> <Baz>
Foo ::== "foo"
Baz ::== "baz"
Bars ::== <Bar> <Bars>
| ε
Bar ::== "A"
| "B"
and pull out all the individual matches of Bar.
No, there is no way to do that using java.util.regex. You can recurse and use a regex on the match of Bars, or use a parser generator like ANTLR and attach a side-effect to Bar.
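The "use a regex on the match of Bars" idea could be sketched roughly as follows, for the toy grammar above. The class name TwoStepMatch, the method bars and the two pattern constants are hypothetical names chosen for this example. Note that the inner step is still a find loop, so this only hides the loop in a helper rather than avoiding it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TwoStepMatch {
    // Outer pattern: validates the whole input and captures the entire run of Bars.
    private static final Pattern OUTER = Pattern.compile("foo((?:A|B)*)baz");
    // Inner pattern: a single Bar.
    private static final Pattern BAR = Pattern.compile("A|B");

    // Returns every individual Bar, or an empty list if the input is not a valid A.
    static List<String> bars(String input) {
        List<String> result = new ArrayList<>();
        Matcher outer = OUTER.matcher(input);
        if (!outer.matches()) {
            return result;
        }
        // Second pass: run the inner pattern over the captured Bars substring.
        Matcher inner = BAR.matcher(outer.group(1));
        while (inner.find()) {
            result.add(inner.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bars("fooABBAbaz")); // [A, B, B, A]
    }
}
```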
#3
3
You can use split to get the fields you need into an array and loop through that.
http://download.oracle.com/javase/1,5.0/docs/api/java/lang/String.html#split(java.lang.String)
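A rough sketch of the split approach. The question does not say how the fields are delimited, so the comma-separated line format below is purely an assumption, as are the names SplitFields and repeatingFields.

```java
import java.util.Arrays;

public class SplitFields {
    // Assumed (hypothetical) line format: first,repeating...,last
    static String[] repeatingFields(String line) {
        String[] parts = line.split(",");
        if (parts.length < 2) {
            return new String[0]; // nothing between a first and a last field
        }
        // Drop the fixed first and last fields; what remains is the repeating part.
        return Arrays.copyOfRange(parts, 1, parts.length - 1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(repeatingFields("ab,c,d,d,c,ef"))); // [c, d, d, c]
    }
}
```

Unlike a regex match, split does not validate the line, so any format checks would have to be done separately.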
#4
2
I have not used java regex, but for many languages the answer is: No.
Capturing groups seem to be created when the regex is parsed, and filled when it matches the string. The expression (a)|(b)(c) has three capturing groups, but only one or two of them can be filled by any given match. (a)* has just one group; the engine leaves the last match in the group after matching.
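For Java specifically, this can be checked with a small sketch like the one below (GroupCountDemo and groups are names invented for the example). groupCount() depends only on the pattern, and groups that did not take part in the match come back as null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupCountDemo {
    // Returns groups 1..groupCount() after a full match; unfilled groups are null.
    static List<String> groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<String> out = new ArrayList<>();
        if (m.matches()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                out.add(m.group(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(groups("(a)|(b)(c)", "bc")); // [null, b, c]
        System.out.println(groups("(a)|(b)(c)", "a"));  // [a, null, null]
        System.out.println(groups("(a)*", "aaa"));      // [a]
    }
}
```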
#5
0
I would think that backtracking inhibits this behavior; consider the effect of /([\S\s])/ accumulating its groups over something like the Bible. Even if it could be done, the output would be hard to interpret, because the groups would lose their positional meaning. It's better to run a separate regex for that kind of item globally and collect the matches into an array.
#6
0
I have just had a very similar problem, and managed to get a "variable number of groups" by combining a while loop with resetting the matcher.
int i=0;
String m1=null, m2=null;
while(matcher.find(i) && (m1=matcher.group(1))!=null && (m2=matcher.group(2))!=null)
{
// do work on two found groups
i=matcher.end();
}
But this is for my problem (with two repeating
Pattern pattern = Pattern.compile("(?<=^ab[cd]{0,100})[cd](?=[cd]{0,100}ef$)");
Matcher matcher = pattern.matcher("abcddcef");
int i=0;
String res=null;
while(matcher.find(i) && (res=matcher.group())!=null)
{
System.out.println(res);
i=matcher.end();
}
You lose the ability to specify an arbitrary length of repetition with * or + inside the look-behind, because look-behind in java.util.regex needs a bounded maximum length (look-ahead does not have this restriction); hence the {0,100}.