
Time: 2021-07-07 08:20:18

I need to use a regex instead of a parser to extract values from an HTML/XML page, but in Rubular I can't get the regex <span class='street-address'> (?<Street>.*) to capture 2346 21st Ave NE from the following text (spaced exactly like this).


<span class='street-address'>
2346 21st Ave NE

Also, the regex I have only works if I condense the text onto one line and there are spaces after the opening HTML tag and before the closing HTML tag. If I change the regex to drop those spaces, then tags followed or preceded by spaces are skipped. I want the regex to be as flexible as possible.


How can I construct a regex that works regardless of whether there are spaces or line breaks after/before the HTML tags?


1 solution


As almost every answer about XHTML and regex points out, you should not use a regex to parse HTML unless you know exactly what HTML content is involved. I would use an HTML parser instead.
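For comparison, here is a minimal sketch of the parser route, using Python's standard-library html.parser module. The class name StreetAddressParser is hypothetical, and the sample string assumes the question's snippet is followed by a closing </span>, which the quoted text does not show.

```python
from html.parser import HTMLParser


class StreetAddressParser(HTMLParser):
    """Collects the text inside <span class='street-address'> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0       # > 0 while inside a matching span
        self._chunks = []     # text pieces of the span being read
        self.addresses = []   # one stripped string per matching span

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            self._depth += 1  # nested span inside the target span
        elif "street-address" in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.addresses.append("".join(self._chunks).strip())

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Assumed sample: the question's snippet plus a closing </span>.
html = "<span class='street-address'>\n2346 21st Ave NE\n</span>"

parser = StreetAddressParser()
parser.feed(html)
parser.close()
print(parser.addresses)
```

Because the parser hands over text nodes directly, spaces and line breaks around the address only need a strip() call and the pattern never has to anticipate them.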


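If you do go with a regex, you just need to enable the s flag (single-line mode, where the dot also matches line breaks) and use a lazy quantifier. Note that in Ruby, which Rubular runs, the equivalent option is m.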


<span class='street-address'>(?<Street>.*?)<\/span>
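As a rough illustration of what the flag changes, the sketch below runs the same pattern with and without it in Python. Two assumptions: Python's re module spells named groups as (?P<Street>...) rather than (?<Street>...), and the sample string adds a closing </span> to the question's snippet.

```python
import re

# Assumed sample: the question's snippet plus a closing </span>.
html = "<span class='street-address'>\n2346 21st Ave NE\n</span>"

# Python writes named groups as (?P<name>...) instead of (?<name>...).
pattern = r"<span class='street-address'>(?P<Street>.*?)</span>"

without_flag = re.search(pattern, html)           # "." stops at "\n", so no match
with_flag = re.search(pattern, html, re.DOTALL)   # "." also matches "\n"

print(without_flag)
print(repr(with_flag.group("Street")))
print(with_flag.group("Street").strip())
```

The captured group still contains the surrounding line breaks, so a final strip() (or its equivalent in your language) is needed to get the bare address.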


You can also use the inline s flag like this:


(?s)<span class='street-address'>(?<Street>.*?)<\/span>
 ^--- here
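To address the "space or no space" part of the question, one option is to combine the inline flag with \s* on both sides of the group. The \s* additions are a suggestion that goes beyond the pattern above, and the sketch uses Python's (?P<Street>...) spelling for the named group. The sample strings are assumed variations of the question's snippet.

```python
import re

# (?s) lets "." match line breaks; \s* swallows optional whitespace
# after the opening tag and before the closing tag.
pattern = re.compile(
    r"(?s)<span class='street-address'>\s*(?P<Street>.*?)\s*</span>"
)

# Assumed variations of the question's snippet.
samples = [
    "<span class='street-address'>2346 21st Ave NE</span>",
    "<span class='street-address'> 2346 21st Ave NE </span>",
    "<span class='street-address'>\n2346 21st Ave NE\n</span>",
]

streets = [pattern.search(s).group("Street") for s in samples]
print(streets)
```

With this shape the capture group holds the trimmed address whether the markup is condensed, padded with spaces, or split across lines.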

On the other hand, if you don't want to use regex flags, you can use a well-known trick: combine two complementary character classes, such as [\s\S], like this:


<span class='street-address'>(?<Street>[\s\S]*?)<\/span>
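The flag-free variant can be tried the same way. The sketch below also shows why the lazy quantifier matters once a page holds more than one span. It uses Python's (?P<Street>...) spelling, and the second address ("10 Main St") is made up for illustration.

```python
import re

# Two spans; the second address is a hypothetical example.
html = (
    "<span class='street-address'>\n2346 21st Ave NE\n</span>\n"
    "<span class='street-address'> 10 Main St </span>"
)

lazy = r"<span class='street-address'>(?P<Street>[\s\S]*?)</span>"
greedy = r"<span class='street-address'>(?P<Street>[\s\S]*)</span>"

# No flags needed: [\s\S] matches line breaks on its own.
lazy_hits = [m.group("Street").strip() for m in re.finditer(lazy, html)]
greedy_hits = [m.group("Street") for m in re.finditer(greedy, html)]

print(lazy_hits)          # one entry per span
print(len(greedy_hits))   # a single match running to the last </span>
```

The greedy version runs from the first opening tag to the last closing tag, so it returns one oversized match instead of one match per span.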

For reference, here is what this trick means:


\s     --> matches whitespace (spaces, tabs, line breaks)
\S     --> matches non whitespace (same as: [^\s])
[\s\S] --> matches whitespace or non whitespace (so... everything)

You can use this trick with any pair of complementary shorthand classes, for example:


[\s\S] whitespace or non whitespace
[\w\W] word or non word
[\d\D] digit or non digit
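
Note that [\b\B] does not belong in this list: \b and \B are zero-width assertions rather than character classes, and inside brackets \b usually means a backspace character.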
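A quick way to check which pairs really match everything is to count matches against a string that mixes character types. The sketch below does that in Python; the test string is arbitrary. It also shows that, in Python at least, \b inside brackets is a backspace character and not a word boundary, so the [\b\B] pair should not be relied on.

```python
import re

text = "a1 \n_"   # letter, digit, space, line break, underscore

# Each complementary pair should match every character, line break included.
counts = {
    cls: len(re.findall(cls, text))
    for cls in (r"[\s\S]", r"[\w\W]", r"[\d\D]")
}
print(counts)

# A plain dot without the DOTALL flag skips the line break.
dot_count = len(re.findall(r".", text))
print(dot_count, "of", len(text))

# Inside a character class, \b is the backspace character (0x08).
backspace_match = re.fullmatch(r"[\b]", "\x08")
print(backspace_match is not None)
```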

