The question mark in my regex does not work as expected

Time: 2021-04-09 22:25:02

I want to match all text following >, and optionally match links on the same line:

preg_match('#(href="([^"]*))?.*>(.*)#', '<a href="world.html">Hello', $m);
print_r($m);

Input examples:

<a href="#catch-me" style="nice">Capture this text
This text should be ignored <a href="#me-too">Other text to capture
<p>This line has no link, but should be matched anyway.

Expected result:

[2] => world.html
[3] => Hello

Actual result:

[2] => 
[3] => Hello

It works if I remove the question mark, but then the link obviously isn't optional anymore.

Why is this happening and how do I fix it?

1 solution

#1


Be very careful with optional subpatterns that are followed by .*.

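The point is that an optional group can match the empty string, so the regex engine accepts that empty match at the first position it tries, and the .* that follows then consumes the text the group was supposed to capture. Your regex works for a string that starts with the link, such as href="world.html">Hello, but not when other characters precede the link.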

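Look: when you run your regex against <a href="world.html">Hello, the engine starts at the position before the initial <. The group (href="([^"]*))? cannot match href=" there, but because it is optional it matches the empty string instead of failing. Then .* comes into play, matches everything up to the end of the line, and backtracks until the > that follows it can match, which is the last > on the line. Finally, (.*) captures the rest of the line into Group 3. Since the overall match already succeeds at position 0, the engine never retries from the position where href=" begins, and Group 2 stays empty.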
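The behaviour described above can be reproduced with a small script. This is only a sketch: the pattern is the one from the question, while the helper name show_groups is made up for illustration.

```php
<?php
// Pattern from the question: optional href group followed by a greedy .*
$pattern = '#(href="([^"]*))?.*>(.*)#';

// Hypothetical helper (not part of the question): returns groups 2 and 3,
// using null when a group is missing from the match array.
function show_groups(string $pattern, string $subject): array
{
    preg_match($pattern, $subject, $m);
    return ['link' => $m[2] ?? null, 'text' => $m[3] ?? null];
}

// The subject starts with href=", so the optional group can match
// the link right at position 0.
$startsWithLink = show_groups($pattern, 'href="world.html">Hello');

// The subject starts with "<a ", so the optional group matches the empty
// string at position 0 and .* consumes the link instead.
$precededByTag = show_groups($pattern, '<a href="world.html">Hello');

print_r($startsWithLink); // link is "world.html", text is "Hello"
print_r($precededByTag);  // link is an empty string, text is "Hello"
```

The second call reproduces the "Actual result" from the question: the match succeeds, but Group 2 is empty.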

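So, potentially, you could match your values with the regex (href="([^"]*))?(?:(?!href=")[^>])*>(.*), which replaces .* with the tempered greedy token (?:(?!href=")[^>])*. That token matches any run of characters other than > but stops wherever an href=" sequence begins. On a line like <a href="world.html">Hello, an attempt that starts before the link therefore fails, and the engine moves on to the position where the optional group can actually match. Alternatively, split the task into 2 operations (yes, that is preferable):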

1) Grab all the links
2) Check for the optional values.
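Both options can be sketched in PHP as follows. This is a minimal illustration, not a definitive implementation: the function names match_tempered and match_two_steps are hypothetical, and the two-step version is only one possible reading of the steps above (split the line at the last >, then look for an optional href in the part before it).

```php
<?php
// Option 1: a single regex with a tempered greedy token in place of .*
function match_tempered(string $line): ?array
{
    $pattern = '#(href="([^"]*))?(?:(?!href=")[^>])*>(.*)#';
    if (!preg_match($pattern, $line, $m)) {
        return null; // no ">" on the line
    }
    // Group 2 is an empty string when the line has no link
    // (an empty href="" is treated the same way in this sketch).
    return ['link' => $m[2] !== '' ? $m[2] : null, 'text' => $m[3]];
}

// Option 2: two separate operations (assumed interpretation).
function match_two_steps(string $line): ?array
{
    // Step 1: split the line at the last ">" (the greedy .* backtracks to it).
    if (!preg_match('#^(.*)>(.*)$#', $line, $m)) {
        return null;
    }
    // Step 2: look for an optional href in the part before that ">".
    $link = preg_match('#href="([^"]*)"#', $m[1], $h) ? $h[1] : null;
    return ['link' => $link, 'text' => $m[2]];
}

// Input examples from the question.
$lines = [
    '<a href="#catch-me" style="nice">Capture this text',
    'This text should be ignored <a href="#me-too">Other text to capture',
    '<p>This line has no link, but should be matched anyway.',
];

foreach ($lines as $line) {
    print_r(match_tempered($line));
    print_r(match_two_steps($line));
}
```

On the three input examples from the question, both functions return the same link and text. They can differ on other input, for example when the captured text itself contains a > or when a line holds several href attributes.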
