I want to match all text following >, and optionally match links on the same line:
preg_match('#(href="([^"]*))?.*>(.*)#', '<a href="world.html">Hello', $m);
print_r($m);
Input examples:
<a href="#catch-me" style="nice">Capture this text
This text should be ignored <a href="#me-too">Other text to capture
<p>This line has no link, but should be matched anyway.
Expected result:
[2] => world.html
[3] => Hello
Actual result:
[2] =>
[3] => Hello
It works if I remove the question mark, but then the link obviously isn't optional anymore.
Why is this happening and how do I fix it?
1 solution
When dealing with optional subpatterns that are followed by .*, one must be very careful.
The point is that the .* after an optional pattern will almost always "take" the text that the optional subpattern was meant to match. Your regex would work for a string like href="world.html">Hello, but not if the href is preceded by other characters.
Look: when you try your regex against <a href="world.html">Hello, the (href="([^"]*))? part, which can match an empty string (it does not fail when it meets a non-matching character), matches the empty position before < at the very beginning. Then .* comes into play: it consumes everything up to the end of the line and starts backtracking until it finds the last >. Finally, (.*) captures the rest of the line into Group 3. Since the overall match already succeeded from the first position, the engine never retries the optional group at href, and Group 2 stays empty.
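To make the difference concrete, here is a small sketch that runs the pattern from the question against both strings discussed above. The variable names are only illustrative.

```php
<?php
// The pattern from the question, unchanged.
$pattern = '#(href="([^"]*))?.*>(.*)#';

// Case 1: other characters precede href, so the optional group
// matches the empty string at offset 0 and .* swallows the link.
preg_match($pattern, '<a href="world.html">Hello', $withPrefix);

// Case 2: the subject starts with href=", so the optional group
// can match right away, before .* gets a chance to run.
preg_match($pattern, 'href="world.html">Hello', $noPrefix);

var_dump($withPrefix[2]); // string(0) ""
var_dump($noPrefix[2]);   // string(10) "world.html"
```

In both cases Group 3 holds Hello. Only Group 2 differs.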
So, potentially, you could match your values with the regex (href="([^"]*))?(?:(?!href=")[^>])*>(.*), where (?:(?!href=")[^>])* is a tempered greedy token (it matches any characters other than >, but stops before an href=" sequence), or split the task into 2 operations (yes, that is preferable):
1) Grab all the links
2) Check for the optional values.
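Both routes can be sketched against the three input lines from the question. The single-pass loop uses the tempered pattern exactly as given above. The helper extractLinkAndText() is a hypothetical name and only one possible reading of the two-step idea, assuming the wanted text is whatever follows the first > on the line.

```php
<?php
// Pattern suggested above: the tempered token refuses to step over href=".
$tempered = '#(href="([^"]*))?(?:(?!href=")[^>])*>(.*)#';

$lines = [
    '<a href="#catch-me" style="nice">Capture this text',
    'This text should be ignored <a href="#me-too">Other text to capture',
    '<p>This line has no link, but should be matched anyway.',
];

// Route A: one pattern per line.
$singlePass = [];
foreach ($lines as $line) {
    if (preg_match($tempered, $line, $m)) {
        // Group 2 is an empty string when the line has no href.
        $singlePass[] = ['link' => $m[2], 'text' => $m[3]];
    }
}

// Route B (hypothetical helper): two simple patterns instead of one.
function extractLinkAndText(string $line): ?array
{
    // Step 1: the text after the first ">" is required.
    if (!preg_match('#>(.*)#', $line, $text)) {
        return null;
    }
    // Step 2: the link is optional, so a failed match just leaves it empty.
    $link = preg_match('#href="([^"]*)#', $line, $href) ? $href[1] : '';
    return ['link' => $link, 'text' => $text[1]];
}

$twoStep = array_map('extractLinkAndText', $lines);

print_r($singlePass);
print_r($twoStep);
```

On these inputs both routes give the same link and text for every line. On lines with several tags before the link, the single pattern may stop at an earlier > and miss the href, which is one reason the two-step version is easier to reason about.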