Can someone explain how to design a regex that works whether or not there are spaces before/after HTML tags?

Time: 2021-07-07 08:20:18

I need to use a regex rather than a parser to extract values from an HTML/XML page, but in Rubular I can't get the regex <span class='street-address'> (?<Street>.*) to capture 2346 21st Ave NE from the following text (spaced exactly like this):

<span class='street-address'>
2346 21st Ave NE
</span>

Also, the regex I have only works if I collapse the text onto one line and there are spaces after the opening HTML tag and before the closing HTML tag. If I change the regex to drop those spaces, then tags that are spaced out are skipped. I want the regex to be as flexible as possible.

How can I construct a regex that works whether or not there are spaces or line breaks after/before the HTML tags?

1 solution

#1


As almost every answer about XHTML and regex will tell you, you should not use regex to parse HTML unless you know exactly what HTML content is involved. I would use an HTML parser instead.
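For comparison, here is a minimal sketch of the parser route using Python's standard-library html.parser. The class name StreetAddressParser and the helper extract_street_addresses are made up for this illustration, and the nesting logic only counts span tags, so treat it as a starting point rather than a robust extractor.

```python
from html.parser import HTMLParser


class StreetAddressParser(HTMLParser):
    """Collects the text inside <span class='street-address'> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0       # > 0 while we are inside a matching span
        self._chunks = []     # text pieces of the span currently being read
        self.addresses = []   # finished, whitespace-trimmed results

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1  # nested span inside the one we are reading
        elif "street-address" in (dict(attrs).get("class") or "").split():
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.addresses.append("".join(self._chunks).strip())
                self._chunks = []

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_street_addresses(html):
    """Hypothetical helper: return every street-address span's trimmed text."""
    parser = StreetAddressParser()
    parser.feed(html)
    parser.close()
    return parser.addresses


print(extract_street_addresses(
    "<span class='street-address'>\n2346 21st Ave NE\n</span>"
))
```

Because the text is trimmed with strip(), spaces or line breaks around the tags do not matter here.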

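You just need the s flag (single-line mode, in which . also matches line breaks) together with a lazy quantifier. Note that Rubular uses Ruby's regex engine, where this mode is enabled with the m option instead.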

<span class='street-address'>(?<Street>.*?)<\/span>
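To see what the flag and the lazy quantifier each contribute, here is a small sketch in Python (an assumption; the question itself uses Rubular). Python spells named groups (?P<Street>...) and does not need the / escaped, so the pattern is adapted accordingly. The s flag is re.DOTALL there.

```python
import re

HTML = "<span class='street-address'>\n2346 21st Ave NE\n</span>"

# Adapted to Python syntax: (?P<Street>...) instead of (?<Street>...).
LAZY = r"<span class='street-address'>(?P<Street>.*?)</span>"
GREEDY = r"<span class='street-address'>(?P<Street>.*)</span>"

# Without the flag, "." cannot cross the line breaks, so nothing matches.
no_flag = re.search(LAZY, HTML)

# With DOTALL (the "s" flag), "." matches line breaks as well.
dotall = re.search(LAZY, HTML, re.DOTALL)
street = dotall.group("Street").strip()  # strip() drops the surrounding "\n"

# With a second span on the page, a greedy ".*" runs to the LAST </span>.
two_spans = HTML + "<span>other</span>"
greedy = re.search(GREEDY, two_spans, re.DOTALL).group("Street")
lazy = re.search(LAZY, two_spans, re.DOTALL).group("Street")

print(no_flag)
print(repr(street))
print(repr(greedy))
print(repr(lazy))
```

The captured group still contains the line breaks around the address, which is why the sketch calls strip() on it.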


You can also set the s flag inline, like this:

(?s)<span class='street-address'>(?<Street>.*?)<\/span>
  ^--- here
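The inline form is handy when the pattern is the only thing you can pass along, for example when it lives in a config file. A short Python sketch (Python is an assumption here, and the named group is again written as (?P<Street>...)):

```python
import re

HTML = "<span class='street-address'>\n2346 21st Ave NE\n</span>"

# (?s) at the very start of the pattern switches on "dot matches newline"
# for the whole pattern, so no flags argument is needed.
INLINE = r"(?s)<span class='street-address'>(?P<Street>.*?)</span>"

match = re.search(INLINE, HTML)
street = match.group("Street").strip()

# The compiled pattern reports the flag as set.
has_dotall = bool(re.compile(INLINE).flags & re.DOTALL)

print(repr(street), has_dotall)
```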

On the other hand, if you don't want to use regex flags, you can use a well-known trick: combine two complementary classes, such as [\s\S], like this:

<span class='street-address'>(?<Street>[\s\S]*?)<\/span>
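Here is a flag-free sketch in Python (an assumption, with the group written as (?P<Street>...)). The \s* on each side of the group is my own addition, not part of the pattern above: it lets the regex absorb any spaces or line breaks next to the tags so the captured value comes back already trimmed. The helper name extract_street is hypothetical.

```python
import re

# [\s\S] matches any character, line breaks included, without any flag.
# Assumption: the \s* around the group is added here to trim the capture.
PATTERN = re.compile(
    r"<span class='street-address'>\s*(?P<Street>[\s\S]*?)\s*</span>"
)


def extract_street(html):
    """Hypothetical helper: return the trimmed street text, or None."""
    match = PATTERN.search(html)
    return match.group("Street") if match else None


SAMPLES = [
    "<span class='street-address'>\n2346 21st Ave NE\n</span>",  # line breaks
    "<span class='street-address'>2346 21st Ave NE</span>",      # no spacing
    "<span class='street-address'> 2346 21st Ave NE </span>",    # spaces
]

for sample in SAMPLES:
    print(repr(extract_street(sample)))
```

All three spacings should yield the same captured text, which is the behaviour the question asks for.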

For reference, this is what the trick means:

\s     --> matches whitespace (spaces, tabs, line breaks)
\S     --> matches non whitespace (same as: [^\s])
[\s\S] --> matches whitespace or non whitespace (so... everything)

You can use this trick with any pair of complementary classes, like:

[\s\S] whitespace or non whitespace
[\w\W] word or non word
[\d\D] digit or non digit
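
Note that [\b\B] does not belong on this list: \b and \B are zero-width assertions, not character classes, and inside brackets \b usually means a backspace character (or the class is rejected outright), depending on the regex flavor.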
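A quick way to check these classes is to run them against a string containing a line break. The sketch below uses Python's re module (an assumption; other flavors may differ, especially for \b inside brackets). It also probes [\b\B] without assuming the outcome, since whether that class compiles depends on the engine and version.

```python
import re

TEXT = "a\nb"
CLASSES = [r"[\s\S]", r"[\w\W]", r"[\d\D]"]

# Each complementary pair matches every character, "\n" included, no flags.
matches_everything = {
    cls: re.fullmatch(cls + "+", TEXT) is not None for cls in CLASSES
}

# Inside a character class, Python treats \b as a backspace character,
# not as a word boundary.
backspace_matches = re.fullmatch(r"[\b]", "\b") is not None
letter_matches = re.fullmatch(r"[\b]", "a") is not None

# Assumption: recent CPython versions reject \B inside a class; the
# try/except records what this interpreter actually does.
try:
    re.compile(r"[\b\B]")
    boundary_class_compiles = True
except re.error:
    boundary_class_compiles = False

print(matches_everything)
print(backspace_matches, letter_matches, boundary_class_compiles)
```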
