
Time: 2021-04-09 22:25:02

I want to match all text following >, and optionally match links on the same line:


preg_match('#(href="([^"]*))?.*>(.*)#', '<a href="world.html">Hello', $m);

Input examples:


<a href="#catch-me" style="nice">Capture this text
This text should be ignored <a href="#me-too">Other text to capture
<p>This line has no link, but should be matched anyway.

Expected result:


[2] => world.html
[3] => Hello

Actual result:


[2] => 
[3] => Hello

It works if I remove the question mark, but then the link obviously isn't optional anymore.


Why is this happening and how do I fix it?


1 Solution



When dealing with optional subpatterns that are followed by .*, one must be very careful.


The point is that the .* after an optional pattern will almost always "take" the text the optional subpattern was meant to match. Your regex would work for a string like href="world.html">Hello, but not when other characters precede href.


Look: when you try your regex against <a href="world.html">Hello, the (href="([^"]*))? group, which can match an empty string (it does not fail when it meets a non-matching character), matches the empty position before < at the very start. Then .* comes into play: it matches everything up to the end of the line and starts backtracking until it finds the last >. After that, (.*) captures the rest of the line into Group 3, so Group 2 stays empty.
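To make the backtracking description above concrete, here is a minimal sketch that runs the pattern from the question with and without the question mark. The variable names $optional and $required are only illustrative, and PREG_OFFSET_CAPTURE is added purely to show where each match starts.

```php
<?php
// Sketch only: reproduces the behaviour described above.
$subject = '<a href="world.html">Hello';

// Optional group: the overall match succeeds at offset 0, where the
// group matches the empty string, so the engine never retries at href.
preg_match('#(href="([^"]*))?.*>(.*)#', $subject, $optional, PREG_OFFSET_CAPTURE);

// Mandatory group: the match cannot succeed until the engine reaches
// the offset where href=" actually begins.
preg_match('#(href="([^"]*)).*>(.*)#', $subject, $required, PREG_OFFSET_CAPTURE);

printf("optional: match starts at %d, group 2 = '%s'\n", $optional[0][1], $optional[2][0]);
printf("required: match starts at %d, group 2 = '%s'\n", $required[0][1], $required[2][0]);
```

With the optional group the match begins at the very first character, which is why Group 2 comes back empty even though the link is present later in the string.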


So you could potentially match your values with the regex (href="([^"]*))?(?:(?!href=")[^>])*>(.*), which contains the tempered greedy token (?:(?!href=")[^>])* (it matches any characters other than > but cannot step over an href=" sequence), or split the task into 2 operations (yes, that is preferable):


1) Grab all the links
2) Check for the optional values.
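Both options can be sketched roughly as follows, using the three input lines from the question. The function names extractWithTemperedToken and extractInTwoSteps are hypothetical, and the two-step version is only one possible reading of the steps above (the answer does not spell them out): it takes the text after the last > first, then looks for an href separately.

```php
<?php
// Option 1 (hypothetical helper): the single regex with the tempered greedy token.
function extractWithTemperedToken(string $line): ?array
{
    $pattern = '#(href="([^"]*))?(?:(?!href=")[^>])*>(.*)#';
    if (!preg_match($pattern, $line, $m)) {
        return null;
    }
    // An unmatched optional group is reported as an empty string,
    // so group 1 tells us whether an href was found at all.
    return ['href' => $m[1] !== '' ? $m[2] : null, 'text' => $m[3]];
}

// Option 2 (hypothetical helper, assumed reading of the two operations).
function extractInTwoSteps(string $line): ?array
{
    // Mandatory part: the text after the last > on the line.
    if (!preg_match('#.*>(.*)#', $line, $text)) {
        return null;
    }
    // Optional part: a separate search for the link.
    $href = preg_match('#href="([^"]*)"#', $line, $link) ? $link[1] : null;
    return ['href' => $href, 'text' => $text[1]];
}

$lines = [
    '<a href="#catch-me" style="nice">Capture this text',
    'This text should be ignored <a href="#me-too">Other text to capture',
    '<p>This line has no link, but should be matched anyway.',
];

foreach ($lines as $line) {
    var_export(extractWithTemperedToken($line));
    echo "\n";
    var_export(extractInTwoSteps($line));
    echo "\n";
}
```

One caveat with the single regex: if a > appears on the same line before the link, the match can still succeed early with the optional group empty. The two-step version does not have this problem, because the link is searched for independently.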




