Need a regexp to find a substring between two tokens

Time: 2020-11-28 19:21:09

I suspect this has already been answered somewhere, but I can't find it, so...

I need to extract a string from between two tokens in a larger string, where the second token will probably appear again later in the string. In pseudocode:

myString = "A=abc;B=def_3%^123+-;C=123;"  ;

myB = getInnerString(myString, "B=", ";" )  ;

method getInnerString(inStr, startToken, endToken){
   return inStr.replace( EXPRESSION, "$1");
}

So, when I run this using the expression ".+B=(.+);.+", I get "def_3%^123+-;C=123;", presumably because it looks for the LAST instance of ';' in the string rather than stopping at the first one it comes to.

I've tried using a lookahead, (?=), to find that first ';', but it gives me the same result.

I can't seem to find a regExp reference that explains how one can specify the "NEXT" token rather than the one at the end.

Any and all help greatly appreciated.

3 answers

#1


7  

Your pattern is greedy because the quantifier in the group is not followed by a ?. Try this:

".+B=(.+?);.+" 

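To see the difference the ? makes, here is a minimal sketch in Java. It is an illustration only: the class name GreedyVsReluctant and the helper firstGroup are made up for this example, and it uses the shorter patterns B=(.+); and B=(.+?); with find() rather than the full replace expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsReluctant {
    // Hypothetical helper: returns group 1 of the first match, or null if nothing matches.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "A=abc;B=def_3%^123+-;C=123;";

        // Greedy: .+ takes as much as it can, so the literal ';' matches the last semicolon.
        System.out.println(firstGroup("B=(.+);", s));   // def_3%^123+-;C=123

        // Reluctant: .+? takes as little as it can, so the literal ';' matches the first semicolon.
        System.out.println(firstGroup("B=(.+?);", s));  // def_3%^123+-
    }
}
```
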
#2


5  

Try this:

B=([^;]+);

The class [^;] matches any character except a semicolon, so [^;]+ cannot run past one. The group therefore captures everything between B= and the first ; that follows it.
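A quick way to try the negated character class, again as a sketch in Java. The class name NegatedClassDemo and the method valueOfB are hypothetical names chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // [^;]+ can never consume a ';', so the capture cannot extend past the first one.
    private static final Pattern B_VALUE = Pattern.compile("B=([^;]+);");

    // Hypothetical helper: returns the captured value, or null when the pattern does not match.
    static String valueOfB(String input) {
        Matcher m = B_VALUE.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(valueOfB("A=abc;B=def_3%^123+-;C=123;")); // def_3%^123+-

        // '+' requires at least one character, so an empty value is not matched at all.
        System.out.println(valueOfB("A=abc;B=;C=123;"));             // null
    }
}
```

If an empty value should count as a match, [^;]* in place of [^;]+ would allow it.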

#3


2  

(This is a continuation of the conversation from the comments to Evan's answer.)

Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
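To check which part of the string each piece of the regex ends up with, one option is to put the two outer .+ parts in capturing groups of their own. This is only an illustrative sketch: the class name MatchTrace, the parts helper, and the two extra groups are additions for the demonstration, not part of the regex being discussed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchTrace {
    // Same shape as ".+B=(.+?);.+", with the leading and trailing .+ captured as groups 1 and 3.
    private static final Pattern TRACE = Pattern.compile("(.+)B=(.+?);(.+)");

    // Hypothetical helper: returns the three captured pieces, or null if the whole string does not match.
    static String[] parts(String input) {
        Matcher m = TRACE.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] p = parts("A=abc;B=def_3%^123+-;C=123;");
        System.out.println("leading .+  : " + p[0]); // A=abc;
        System.out.println("(.+?)       : " + p[1]); // def_3%^123+-
        System.out.println("trailing .+ : " + p[2]); // C=123;
    }
}
```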

All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):

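import java.util.regex.Matcher;
import java.util.regex.Pattern;

String s = "A=abc;B=def_3%^123+-;C=123;";

// The reluctant .*? stops at the first ';' after "B=".
Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
  System.out.println(m.group(1)); // prints def_3%^123+-
}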
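The same find-based approach could be wrapped up as the getInnerString method from the question. What follows is a sketch under assumptions, not the asker's actual code: the class name InnerString is invented, returning null on no match is my choice, and Pattern.quote() is used so that tokens containing regex metacharacters are treated literally.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InnerString {
    // Returns the text between the first startToken and the next endToken, or null if there is none.
    static String getInnerString(String inStr, String startToken, String endToken) {
        // Pattern.quote makes each token a literal, so '.', '+', '[' and so on are not regex syntax.
        Pattern p = Pattern.compile(
                Pattern.quote(startToken) + "(.*?)" + Pattern.quote(endToken));
        Matcher m = p.matcher(inStr);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String myString = "A=abc;B=def_3%^123+-;C=123;";
        System.out.println(getInnerString(myString, "B=", ";"));     // def_3%^123+-
        System.out.println(getInnerString(myString, "Z=", ";"));     // null
        System.out.println(getInnerString("x[[a.b]]y", "[[", "]]")); // a.b
    }
}
```

By default . does not match line terminators, so a value that can span several lines would need Pattern.DOTALL passed to Pattern.compile().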

Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:

print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;

...so we content ourselves with hacks like this:

System.out.println("A=abc;B=def_3%^123+-;C=123;"
    .replaceFirst(".+B=(.*?);.+", "$1"));

Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.
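One of those trade-offs is easy to show: when the regex does not match, replaceFirst() returns the input unchanged, so the caller cannot tell a failed match from a real value. The sketch below uses a made-up class name, ReplaceTradeOff, and a hypothetical helper, viaReplace, around the same replaceFirst() call.

```java
public class ReplaceTradeOff {
    // Hypothetical wrapper around the replace-based hack shown above.
    static String viaReplace(String input) {
        return input.replaceFirst(".+B=(.*?);.+", "$1");
    }

    public static void main(String[] args) {
        // Normal case: the whole string is replaced by the captured group.
        System.out.println(viaReplace("A=abc;B=def_3%^123+-;C=123;")); // def_3%^123+-

        // No "B=" at all: nothing matches, so the input comes back untouched.
        System.out.println(viaReplace("A=abc;C=123;"));                // A=abc;C=123;

        // "B=" is the first field: the leading .+ needs at least one character, so no match.
        System.out.println(viaReplace("B=def;C=123;"));                // B=def;C=123;

        // "B=" is the last field: the trailing .+ needs at least one character, so no match.
        System.out.println(viaReplace("A=abc;B=def;"));                // A=abc;B=def;
    }
}
```

A find()-based lookup for B=(.*?); has none of these edge cases, because it does not need to consume the text on either side of the value.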
