I need to use a regex instead of a parser to lift attributes from an HTML/XML page, but I can't make the regex
<span class='street-address'> (?<Street>.*)
lift 2346 21st Ave NE from the following text (spaced exactly like that) in Rubular:
<span class='street-address'>
2346 21st Ave NE
</span>
Also, the regex I have only works if I condense the text onto one line and there are spaces after the first HTML tag and before the last HTML tag. If I change the regex to drop those spaces, then HTML tags that do have the spaces are skipped. I want to make the regex as flexible as possible.
How can I construct a regex that works regardless of whether there are spaces or line breaks after the opening tag and before the closing tag?
1 solution
#1
As almost every answer about XHTML and regex points out, you should not use a regex to parse HTML unless you know exactly what HTML content is involved. I would use an HTML parser instead.
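For comparison, here is a minimal sketch of the parser route, assuming Python and its standard-library html.parser module (the question does not name a language, so both are assumptions). The class name StreetAddressParser is made up for illustration, and nested spans inside the address are not handled.

```python
from html.parser import HTMLParser


class StreetAddressParser(HTMLParser):
    """Collects the text inside <span class='street-address'> elements."""

    def __init__(self):
        super().__init__()
        self._inside = False   # True while reading a matching span
        self._chunks = []      # text pieces of the span being read
        self.streets = []      # finished, whitespace-trimmed results

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "span" and dict(attrs).get("class") == "street-address":
            self._inside = True
            self._chunks = []

    def handle_data(self, data):
        if self._inside:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._inside:
            self.streets.append("".join(self._chunks).strip())
            self._inside = False


html = "<span class='street-address'>\n  2346 21st Ave NE\n</span>"
parser = StreetAddressParser()
parser.feed(html)
parser.close()
print(parser.streets)  # ['2346 21st Ave NE']
```

Because the parser hands over the text between the tags as data, spaces and line breaks around the address only need a strip() call.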
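If you do use a regex, you just have to turn on the s (single-line) flag, so that the dot also matches line breaks, and use a lazy quantifier. In Ruby, which Rubular uses, the equivalent option is m.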
<span class='street-address'>(?<Street>.*?)<\/span>
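As a rough illustration of what the flag changes, the sketch below assumes Python's re module, where the s flag is spelled re.DOTALL and named groups are written (?P<Street>...) rather than (?<Street>...). Both spellings are specific to that assumed flavor.

```python
import re

html = "<span class='street-address'>\n2346 21st Ave NE\n</span>"
regex_text = r"<span class='street-address'>(?P<Street>.*?)</span>"

# Without the flag, "." stops at the line break, so nothing matches.
no_flag = re.search(regex_text, html)
print(no_flag)  # None

# With DOTALL (the "s" flag), "." also matches "\n".
match = re.search(regex_text, html, re.DOTALL)
street_raw = match.group("Street")
print(repr(street_raw))  # '\n2346 21st Ave NE\n'

# The capture still includes the surrounding line breaks; trim them.
street = street_raw.strip()
print(street)  # 2346 21st Ave NE
```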
You can also use the inline s flag, like this:
(?s)<span class='street-address'>(?<Street>.*?)<\/span>
^--- here
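To also cover the "space or no space" part of the question, one option (my addition, not part of the pattern above) is to put \s* on both sides of the group so the capture comes back already trimmed. A sketch, again assuming Python's re and its (?P<Street>...) spelling:

```python
import re

# (?s) turns on dot-matches-newline inline; \s* absorbs any spaces or
# line breaks after the opening tag and before the closing tag.
pattern = re.compile(
    r"(?s)<span class='street-address'>\s*(?P<Street>.*?)\s*</span>"
)

samples = [
    "<span class='street-address'>2346 21st Ave NE</span>",      # condensed
    "<span class='street-address'> 2346 21st Ave NE </span>",    # spaces
    "<span class='street-address'>\n2346 21st Ave NE\n</span>",  # line breaks
]

streets = [pattern.search(sample).group("Street") for sample in samples]
print(streets)
```

All three spacings yield the same captured text, with no leading or trailing whitespace.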
On the other hand, if you don't want to use regex flags, you can use a well-known trick: combine two opposite sets, such as [\s\S], like this:
<span class='street-address'>(?<Street>[\s\S]*?)<\/span>
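The lazy quantifier matters as soon as a page holds more than one address. The sketch below assumes Python's re; the second address, 100 Main St, is invented sample data and not from the question.

```python
import re

page = (
    "<span class='street-address'>\n2346 21st Ave NE\n</span>\n"
    "<span class='street-address'>\n100 Main St\n</span>\n"  # made-up second entry
)

# No flags needed: [\s\S] matches any character, line breaks included.
lazy = re.compile(r"<span class='street-address'>([\s\S]*?)</span>")
greedy = re.compile(r"<span class='street-address'>([\s\S]*)</span>")

lazy_hits = [hit.strip() for hit in lazy.findall(page)]
print(lazy_hits)  # ['2346 21st Ave NE', '100 Main St']

# The greedy version runs on to the LAST </span>, so it returns a single
# capture that swallows everything between the two addresses.
greedy_hits = greedy.findall(page)
print(len(greedy_hits))  # 1
```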
For reference, this is what the trick means:
\s --> matches whitespace (spaces, tabs, line breaks).
\S --> matches non whitespace (same as: [^\s])
[\s\S] --> matches whitespace or non whitespace (so... everything)
You can use this trick with any pair of complementary shorthand classes, like:
[\s\S] whitespace or non whitespace
[\w\W] word or non word
[\d\D] digit or non digit
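Note that [\b\B] does not belong in this list: \b and \B are zero-width assertions, not character sets, and inside brackets \b usually means a backspace character instead.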
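A quick check of these sets, assuming Python's re (other flavors may treat \b inside brackets differently): each complementary pair matches every character of a short test string, line break included, while [\b] only matches a literal backspace.

```python
import re

text = "a 1\n"  # letter, space, digit, line break

# Each pair of opposite classes should match all four characters.
counts = {
    cls: len(re.findall(cls, text))
    for cls in (r"[\s\S]", r"[\w\W]", r"[\d\D]")
}
print(counts)

# In Python's re, \b inside a character class is the backspace character
# (0x08), not a word boundary.
backspace_hit = re.fullmatch(r"[\b]", "\x08") is not None
boundary_hit = re.search(r"[\b]", "word boundary") is not None
print(backspace_hit, boundary_hit)  # True False
```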