I've run into this problem several times before when trying to do some HTML scraping with PHP and the preg_* functions.
Most of the time I have to capture structures like this:
<!-- comment -->
<tag1>lorem ipsum</tag1>
<p>just more text with several html tags in it, sometimes CDATA encapsulated…</p>
<!-- /comment -->
In particular I want something like this:
/<tag1>(.*?)<\/tag1>\n\n<p>(.*?)<\/p>/mi
but the \n\n doesn't look like it would work.
Is there a general line-break switch?
3 solutions
#1
I think you could replace the \n\n with (\r?\n){2}; this way you match the CRLF pair instead of just the LF char.
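A minimal sketch of that suggestion, assuming the input uses CRLF line endings. The sample string and variable names are made up for illustration. The group is written as non-capturing, (?:...), so the paragraph text stays in the second capture, and the s modifier is added so that . can also span line breaks inside the tags:

```php
<?php
// Hypothetical sample input with Windows-style (CRLF) line endings.
$html = "<tag1>lorem ipsum</tag1>\r\n\r\n<p>just more text</p>";

// (?:\r?\n){2} matches two consecutive line breaks, each either LF or CRLF.
// i = case-insensitive, s = let . match newlines inside the captured text.
$pattern = '/<tag1>(.*?)<\/tag1>(?:\r?\n){2}<p>(.*?)<\/p>/is';

if (preg_match($pattern, $html, $m)) {
    echo $m[1], "\n"; // lorem ipsum
    echo $m[2], "\n"; // just more text
}

// PCRE also provides \R, which matches any newline sequence (LF, CR or CRLF).
$altPattern = '/<tag1>(.*?)<\/tag1>\R{2}<p>(.*?)<\/p>/is';
$altMatched = preg_match($altPattern, $html);
```

With the capturing form (\r?\n){2} exactly as written in the answer, the line break itself becomes group 2 and the paragraph text moves to group 3.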
#2
Are you sure you want to parse HTML using regexps? HTML isn't a regular language, and there are too many corner cases.
I would investigate some form of HTML parser (perhaps this one?), and then identify the pattern you're interested in via the HTML data structure it returns.
#3
Or you could look at the DOM extension for PHP. It has functions to load HTML from a string or a file. You can then use the PHP DOM methods to traverse the DOM and find the data you are interested in.
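A rough sketch of that approach, assuming the markup looks like the structure in the question. The sample string, the wrapping div and the XPath expressions are illustrative assumptions; only DOMDocument, DOMXPath and the libxml error functions are part of PHP itself:

```php
<?php
// Hypothetical sample markup modelled on the question's structure.
$html = '<div><tag1>lorem ipsum</tag1>

<p>just more <b>text</b> with tags</p></div>';

$doc = new DOMDocument();
// Collect parser warnings (e.g. about the unknown <tag1>) instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// The first <tag1> element, and the first <p> that follows it as a sibling.
$tag  = $xpath->query('//tag1')->item(0);
$para = $xpath->query('//tag1/following-sibling::p[1]')->item(0);

echo $tag->textContent, "\n";
echo $para->textContent, "\n";
```

Because the parser builds a tree, the number and style of line breaks between the two elements no longer matters, and nested tags inside the paragraph are handled without extra pattern work.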