Regex non-capturing group is capturing

Time: 2021-09-29 22:33:22

I have this regex

(?:\<a[^*]href="(http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?)>

The point of this regex is to capture the closing bracket ('>') of every opening anchor tag whose href starts with "http://" or ends with ".pdf".

The regex works; however, it is also capturing the first part of the anchor, which I absolutely need NOT to capture.

In the following samples, every line matches except the second (which is fine), but only the last bracket should be captured, and that is not the case.

<a href="http://blabla">omg</a>
<a href="blabla">omg</a>
<a href="http://blabla.pdf">omg</a>
<a href="/blabla.pdf">omg</a>

For example, if we take the first match, which is:

<a href="http://blabla">

I only want to capture the last bracket (the one I surrounded with parentheses):

<a href="http://blabla"(>)

So why is the non-capturing group capturing? And how can I grab only the last bracket of the anchor?

Even if I streamline my regex to the following, it still doesn't work:

(?:\<a[^*]href="http://[^"]+"+[^>]*)(>)

Thank you,

5 solutions

#1


3  

You're conflating two distinct concepts: capturing and consuming. Regexes normally consume whatever they match; that's just how they work. Additionally, most regex flavors let you use capturing groups to pluck out specific parts of the overall match. (The overall match is often referred to as the zeroth capturing group, but that's just a figure of speech.)
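
To make the distinction concrete, here is a minimal sketch in Python (an assumption; the question does not name a language) that runs the asker's original pattern against the first sample. The overall match contains everything the regex consumed, while group 1 holds only what the inner, ordinary parentheses matched. The outer (?:...) contributes no group at all.

```python
import re

# The asker's original pattern, unchanged. The outer (?:...) is non-capturing;
# the inner (http://...|...\.pdf) is an ordinary capturing group.
pattern = re.compile(r'(?:\<a[^*]href="(http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?)>')

m = pattern.search('<a href="http://blabla">omg</a>')

whole = m.group(0)       # everything the regex consumed
first = m.group(1)       # only what capturing group 1 matched
count = pattern.groups   # number of capturing groups in the pattern

print(whole)   # <a href="http://blabla">
print(first)   # http://blabla
print(count)   # 1
```

So the non-capturing group is not capturing anything; the start of the tag shows up only because it is part of the overall match.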

It sounds like you're trying to match a whole <A> tag, but only consume the final >. That's not possible in most regex flavors, JavaScript included. But if you're using Perl or PHP, you could use \K to spoof the match start position:

(?i)<a\s+[^>]*?href="http://[^"]+"[^>]*\K>

And in .NET you could use a lookbehind (which, like a lookahead, matches without consuming):

(?i)(?<=<a\s+[^>]*?href="http://[^"]+"[^>]*)>

Of the other flavors that support lookbehinds, most place restrictions on them that render them unusable for this task.
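
The restriction can be seen in Python's built-in re module, used here purely as an illustration since the answer does not mention Python. The sketch below assumes a variable-width lookbehind in the style of the .NET pattern above; Python refuses to compile it. A common workaround in such flavors is to capture the part of the tag before the bracket and write it back in the replacement. The rel="external" attribute is a hypothetical placeholder, not something from the question.

```python
import re

# A variable-width lookbehind in the style of the .NET pattern above.
variable_lookbehind = r'(?i)(?<=<a\s+[^>]*?href="http://[^"]+"[^>]*)>'

try:
    re.compile(variable_lookbehind)
    lookbehind_ok = True
except re.error:
    # Python's re only accepts fixed-width lookbehinds.
    lookbehind_ok = False

# Workaround: consume the whole tag, but capture everything before the
# final '>' in group 1 so the replacement can put it back untouched.
prefix_then_bracket = re.compile(r'(?i)(<a\s+[^>]*?href="http://[^"]+"[^>]*)>')

html = '<a href="http://blabla">omg</a>'
# Hypothetical edit: insert an attribute just before the closing bracket.
rewritten = prefix_then_bracket.sub(r'\1 rel="external">', html)

print(lookbehind_ok)  # False
print(rewritten)      # <a href="http://blabla" rel="external">omg</a>
```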

#2


4  

Rewrite your regex as:

(?:\<a[^*]href="(?:http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?)(>)
   non capture __^^                                    ^ ^
                                             capture __|_|

As Tony Lukasavage said, there is an unnecessary non-capture group, and, moreover, there is no need to escape <, so it becomes:

  <a[^*]href="(?:http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?(>)
non capture __^^                                    ^ ^
                                          capture __|_|
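
As a quick check, the rewritten pattern can be run against the four samples from the question. This is a sketch in Python (an assumed language; the original post does not say which engine is used). Group 1 is the closing bracket, and start(1) gives its position in the input, which is usually what is needed to act on that bracket alone.

```python
import re

# The second pattern from this answer: the only capturing group is (>).
pattern = re.compile(r'<a[^*]href="(?:http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?(>)')

samples = [
    '<a href="http://blabla">omg</a>',
    '<a href="blabla">omg</a>',
    '<a href="http://blabla.pdf">omg</a>',
    '<a href="/blabla.pdf">omg</a>',
]

results = []
for s in samples:
    m = pattern.search(s)
    # Record the captured text and where it sits in the input, or None.
    results.append((m.group(1), m.start(1)) if m else None)

for s, r in zip(samples, results):
    print(s, '->', r)
```

The second sample yields None, as the question expects; the other three each yield '>' together with its index.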

#3


2  

If I'm understanding correctly that you want to match just the greater-than sign (>) that's part of the closing anchor tag, this should do it:

\<a[^*]href="(http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?(>)
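
Note that this version keeps the href alternation as a capturing group, so the bracket ends up in group 2 rather than group 1. A small Python sketch (the language is an assumption) shows the numbering:

```python
import re

# Pattern from this answer: group 1 is the href value, group 2 is the bracket.
pattern = re.compile(r'\<a[^*]href="(http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?(>)')

m = pattern.search('<a href="/blabla.pdf">omg</a>')

href_value = m.group(1)    # the matched href contents
bracket = m.group(2)       # the closing bracket of the opening tag
bracket_span = m.span(2)   # (start, end) of that bracket in the input

print(href_value)    # /blabla.pdf
print(bracket)       # >
print(bracket_span)  # (21, 22)
```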

#4


1  

If I'm understanding your request correctly...

\<a[^*]href="(?:http://[^"]+?|[^"]+?\.pdf)"+?[^>]*?(>)

#5


0  

Your parentheses are around the tag itself and the href's contents, so that's what will be captured. If you need to capture the closing >, then put the parentheses around it.
