How to extract data from a web page with regular expressions?

I am writing a curl script to collect information about some sex offenders. I have developed the script so that it picks up links like the one given below:


http://criminaljustice.state.ny.us/cgi/internet/nsor/... (snipped URL)


Now, when we follow this link, I want to get the information under all the fields on this page, such as Offender Id, last name, etc., into my own variables. I am very weak at regex, which is why I am here. Or is there another way?


Can anybody help me do that?


3 Solutions

#1

phpQuery is very nice for screen-scraping in PHP. It lets you access the DOM using the same methods jQuery has.


#2

You don't want regexes (see "Can you provide some examples of why it is hard to parse XML and HTML with a regex?"). Look for an HTML parser for PHP instead; see this answer to "Can you provide an example of parsing HTML with your favorite parser?"
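As a rough illustration of the parser approach, here is a minimal sketch in Python (not PHP) using only the standard library's `html.parser`. The sample markup, the `FieldParser` class, and the `extract_fields` helper are all assumptions made for this example; the real page layout is not shown in this thread, so the cell-pairing logic would need adjusting to the actual HTML.

```python
from html.parser import HTMLParser

# Hypothetical markup: a label cell ending in ":" followed by a value cell.
SAMPLE_HTML = """
<table>
  <tr><td>Offender Id:</td><td>&nbsp;12345</td></tr>
  <tr><td>Last Name:</td><td>&nbsp;DOE</td></tr>
</table>
"""


class FieldParser(HTMLParser):
    """Collect the text content of every <td> cell, in document order."""

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; arrives as U+00A0.
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.cells[-1] += data


def extract_fields(html):
    """Pair each 'Label:' cell with the cell that follows it."""
    parser = FieldParser()
    parser.feed(html)
    parser.close()
    # Normalise non-breaking spaces and trim surrounding whitespace.
    cells = [c.replace("\xa0", " ").strip() for c in parser.cells]
    fields = {}
    for label, value in zip(cells, cells[1:]):
        if label.endswith(":"):
            fields[label[:-1]] = value
    return fields


fields = extract_fields(SAMPLE_HTML)
print(fields)
```

Because the parser works on cells rather than raw text, it keeps working if whitespace or line breaks in the page change, which is the main advantage over a regex.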


#3

I tend to agree with the previous poster about RegEx not being the right tool for the job. If you just want a quick and dirty expression, here goes:


Offender Id:.*
.*&nbsp;[0-9]*

NOTE: You must include the newline in this expression. Also note that this is very fragile: it will break if the source that you are parsing changes much at all.
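To show how the expression might be applied, here is a minimal Python sketch. It assumes the label and the value sit on consecutive lines and that the page source contains a literal `&nbsp;` entity before the number; the sample markup and the `extract_offender_id` helper are hypothetical, and a capture group has been added to the expression so the digits can be pulled out.

```python
import re

# Hypothetical markup; the real page source is not shown in this thread.
SAMPLE_HTML = """<tr>
<td>Offender Id:</td>
<td>&nbsp;12345</td>
</tr>"""

# Same shape as the expression above, with an explicit \n for the newline
# and a capture group around the digits. By default "." does not match a
# newline, so each ".*" stays on its own line.
OFFENDER_ID_RE = re.compile(r"Offender Id:.*\n.*&nbsp;([0-9]+)")


def extract_offender_id(html):
    """Return the captured id as a string, or None if nothing matches."""
    match = OFFENDER_ID_RE.search(html)
    return match.group(1) if match else None


offender_id = extract_offender_id(SAMPLE_HTML)
print(offender_id)
```

If the label and value ever end up on the same line, or an extra line is inserted between them, the pattern stops matching and the helper returns `None`, which is the fragility mentioned in the note.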

