Regular expression with a variable number of groups?

Date: 2021-08-29 20:14:45

Is it possible to create a regular expression with a variable number of groups?

After running this for instance...

Pattern p = Pattern.compile("ab([cd])*ef");
Matcher m = p.matcher("abcddcef");
m.matches();

... I would like to have something like

  • m.group(1) = "c"
  • m.group(2) = "d"
  • m.group(3) = "d"
  • m.group(4) = "c"

(Background: I'm parsing some lines of data, and one of the "fields" is repeating. I would like to avoid a matcher.find loop for these fields.)

As pointed out by @Tim Pietzcker in the comments, perl6 and .NET have this feature.

6 answers

#1


18  

According to the documentation, Java regular expressions can't do this:

The captured input associated with a group is always the subsequence that the group most recently matched. If a group is evaluated a second time because of quantification then its previously-captured value, if any, will be retained if the second evaluation fails. Matching the string "aba" against the expression (a(b)?)+, for example, leaves group two set to "b". All captured input is discarded at the beginning of each match.

(emphasis added)
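
The quoted behaviour can be checked directly. The following is a minimal sketch rather than part of the original answer; the class name `LastCaptureDemo` and the helper `groupAfterMatch` are made up for illustration. It runs the pattern from the question and the `(a(b)?)+` example from the documentation and reports what a group holds once the match has finished.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not from the original thread.
class LastCaptureDemo {
    // Returns the content of the given group after a full match,
    // or null if the input does not match at all.
    static String groupAfterMatch(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        Matcher m = Pattern.compile("ab([cd])*ef").matcher("abcddcef");
        if (m.matches()) {
            // The pattern declares one group, however often it repeats.
            System.out.println("groupCount = " + m.groupCount());
            // Only the last iteration of the repeated group survives.
            System.out.println("group(1)   = " + m.group(1));
        }
        // The example from the documentation: group 2 keeps "b".
        System.out.println(groupAfterMatch("(a(b)?)+", "aba", 2));
    }
}
```

The group count is fixed when the pattern is compiled, so repeating a group overwrites its capture instead of creating new groups.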

#2


3  

Pattern p = Pattern.compile("ab(?:(c)|(d))*ef");
Matcher m = p.matcher("abcdef");
m.matches();

should do what you want.

EDIT:

@aioobe, I understand now. You want to be able to do something like the grammar

A    ::== <Foo> <Bars> <Baz>
Foo  ::== "foo"
Baz  ::== "baz"
Bars ::== <Bar> <Bars>
        | ε
Bar  ::== "A"
        | "B"

and pull out all the individual matches of Bar.

No, there is no way to do that using java.util.regex. You can recurse and use a regex on the match of Bars or use a parser generator like ANTLR and attach a side-effect to Bar.
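
The "use a regex on the match of Bars" suggestion might look like the sketch below. It is only an illustration built on the grammar above: the class name `TwoPassBars`, the method `bars` and the sample input are assumptions, not code from the answer. The outer pattern validates the whole input and captures the `Bars` span as a single group; a second matcher then walks over that span to collect each `Bar`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical two-pass sketch for the grammar A ::= Foo Bars Baz.
class TwoPassBars {
    // Outer pattern: group 1 captures the entire (possibly empty) Bars span.
    private static final Pattern WHOLE = Pattern.compile("foo((?:A|B)*)baz");
    // Inner pattern: a single Bar.
    private static final Pattern BAR = Pattern.compile("A|B");

    // Returns every Bar in order, or an empty list if the input is not an A.
    static List<String> bars(String input) {
        List<String> result = new ArrayList<>();
        Matcher whole = WHOLE.matcher(input);
        if (!whole.matches()) {
            return result;
        }
        Matcher bar = BAR.matcher(whole.group(1));
        while (bar.find()) {
            result.add(bar.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(bars("fooABBAbaz")); // prints [A, B, B, A]
    }
}
```

This still needs a find loop, but it is confined to the repeating field, and the outer pattern keeps validating the overall line shape.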

#3


3  

You can use split to get the fields you need into an array and loop through that.

http://download.oracle.com/javase/1,5.0/docs/api/java/lang/String.html#split(java.lang.String)
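
As a rough sketch of the split approach: the question does not say how the repeating fields are delimited, so the example below assumes a hypothetical comma-separated variant of the input (`abc,d,d,cef`). The class name `SplitFields` and the method `fields` are likewise invented for illustration. One group captures the whole repeating section, and `String.split` breaks it into an array.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; the comma delimiter is an assumption, not from the question.
class SplitFields {
    // Group 1 captures the whole repeating section, e.g. "c,d,d,c".
    private static final Pattern LINE = Pattern.compile("ab([cd](?:,[cd])*)ef");

    // Returns the individual fields, or an empty array if the line does not match.
    static String[] fields(String line) {
        Matcher m = LINE.matcher(line);
        return m.matches() ? m.group(1).split(",") : new String[0];
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("abc,d,d,cef"))); // prints [c, d, d, c]
    }
}
```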

#4


2  

I have not used java regex, but for many languages the answer is: No.

Capturing groups seem to be created when the regex is parsed, and filled when it matches the string. The expression (a)|(b)(c) has three capturing groups, but only one or two of them can be filled in any given match. (a)* has just one group; the engine leaves the last match in that group after matching.
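
For Java specifically, this can be observed with a small sketch (the class name `GroupCountDemo` and the helper `describe` are hypothetical). It lists every group of a pattern after a match; groups that took no part in the match come back as `null`.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not from the original thread.
class GroupCountDemo {
    // Lists each group as "index=value" after a full match.
    static String describe(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        StringBuilder sb = new StringBuilder();
        for (int g = 1; g <= m.groupCount(); g++) {
            if (g > 1) {
                sb.append(", ");
            }
            // group(g) is null when the group did not participate.
            sb.append(g).append('=').append(m.group(g));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("(a)|(b)(c)", "a"));  // prints 1=a, 2=null, 3=null
        System.out.println(describe("(a)|(b)(c)", "bc")); // prints 1=null, 2=b, 3=c
        System.out.println(describe("(a)*", "aaa"));      // prints 1=a
    }
}
```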

#5


0  

I would think that backtracking inhibits this behavior. Consider the effect of /([\S\s])/ accumulating a new group for every match on something like the Bible. Even if it could be done, the output would be unknowable, because the groups would lose their positional meaning. It's better to run a separate regex for that kind of item globally and have the results deposited into an array.

#6


0  

I have just had a very similar problem, and managed to get a "variable number of groups" by combining a while loop with resetting the matcher.

    int i=0;
    String m1=null, m2=null;

    while(matcher.find(i) && (m1=matcher.group(1))!=null && (m2=matcher.group(2))!=null)
    {
        // do work on two found groups
        i=matcher.end();
    }

But this is for my problem (with two repeating

    Pattern pattern = Pattern.compile("(?<=^ab[cd]{0,100})[cd](?=[cd]{0,100}ef$)");
    Matcher matcher = pattern.matcher("abcddcef");
    int i=0;
    String res=null;

    while(matcher.find(i) && (res=matcher.group())!=null)
    {
        System.out.println(res);
        i=matcher.end();
    }

You lose the ability to specify an arbitrary repetition length with * or + inside the look-behind, because Java requires a look-behind to have a bounded maximum length (look-ahead has no such restriction).

