Possible Duplicate:
How to parse and process HTML with PHP?
How do I go about pulling specific content from a given live online HTML page?
For example: http://www.gumtree.com/p/for-sale/ovation-semi-acoustic-guitar/93991967
I want to retrieve only the text description, the path to the main image, and the price. In other words, I want to retrieve the content of specific divs in an HTML page, possibly identified by specific IDs or classes.
Pseudocode:
$page = load_html_contents('http://www.gumtr..');
$price = getPrice($page);
$description = getDescription($page);
$title = getTitle($page);
Please note that I do not intend to steal any content from Gumtree, or anywhere else for that matter; I am just providing an example.
3 Solutions
#1
1
The tutorial "Easy web scraping with PHP" recommended by robotrobert is a good place to start; I have left several comments on it. For better performance, use cURL, which among other things handles HTTP headers, SSL, cookies, and proxies. Cookies in particular are something you must pay attention to.
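To make the cURL suggestion concrete, here is a minimal sketch of fetching a page with a cookie jar. The helper names `build_curl_options` and `fetch_page`, the user-agent string, and the timeout values are assumptions made for illustration; they do not come from the tutorial, and the sketch requires PHP's cURL extension.

```php
<?php
// Hypothetical helper: collect the cURL options for a simple GET request.
function build_curl_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 10,          // seconds to wait for the connection (assumed value)
        CURLOPT_TIMEOUT        => 30,          // seconds allowed for the whole transfer (assumed value)
        CURLOPT_COOKIEJAR      => $cookieFile, // received cookies are written here when the handle is freed
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies saved by earlier requests are sent from here
        CURLOPT_USERAGENT      => 'ExampleScraper/0.1', // assumed value
    ];
}

// Hypothetical helper: fetch a page, returning its HTML or null on failure.
function fetch_page(string $url, string $cookieFile): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, build_curl_options($url, $cookieFile));
    $html = curl_exec($ch);
    // The handle is released when $ch goes out of scope at the end of the function.
    return $html === false ? null : $html;
}
```

A call such as `fetch_page($url, '/tmp/cookies.txt')` would then return the page source, and reusing the same cookie file across calls keeps any session cookies the site sets.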
I also just found "HTML Parsing and Screen Scraping with the Simple HTML DOM Library". It is more advanced and makes page parsing easier and faster by using a DOM parser instead of regular expressions, which are hard to master and resource-hungry. I recommend this last one 100%.
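As a rough illustration of the DOM-parser approach, the sketch below uses PHP's built-in `DOMDocument` and `DOMXPath` classes rather than the Simple HTML DOM Library itself. The sample markup, its ids and classes, and the helper `first_text` are invented for the example and are not Gumtree's actual page structure.

```php
<?php
// Sample markup standing in for a downloaded page; ids and classes are invented.
$html = <<<HTML
<html><body>
  <h1 id="ad-title">Ovation semi-acoustic guitar</h1>
  <span id="price">GBP 250</span>
  <div id="description"><p>Good condition, comes with a hard case.</p></div>
  <img class="main-image" src="/images/guitar.jpg" alt="guitar">
</body></html>
HTML;

// Hypothetical helper: trimmed text of the first node matching an XPath query, or null.
function first_text(DOMXPath $xpath, string $query): ?string
{
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect parser warnings quietly
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($doc);

$price       = first_text($xpath, '//*[@id="price"]');
$description = first_text($xpath, '//*[@id="description"]');
$imagePath   = first_text($xpath, '//img[contains(@class, "main-image")]/@src');

echo $price, "\n", $description, "\n", $imagePath, "\n";
```

For a real page, the XPath queries would need to be adapted to whatever ids or classes the target site actually uses.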
#2
2
First of all, what you want to do is called web scraping. Basically, you load the HTML content into a variable and then use regular expressions to search for specific IDs and so on. Search for "web scraping".
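The regular-expression approach described here could look roughly like the following sketch. The helper name `text_by_id` is hypothetical, and the pattern assumes the target element does not contain a nested tag of the same name, which is one reason regular expressions are fragile for HTML.

```php
<?php
// Hypothetical helper: text inside the first element carrying the given id, or null.
// Assumes the element contains no nested tag of the same name.
function text_by_id(string $html, string $id): ?string
{
    // <tag ... id="ID" ...> inner content </tag>, case-insensitive, dot matches newlines.
    $pattern = '~<(\w+)[^>]*\sid=["\']' . preg_quote($id, '~') . '["\'][^>]*>(.*?)</\1>~is';
    if (preg_match($pattern, $html, $m) !== 1) {
        return null;
    }
    // Drop inner tags first, then decode entities such as &amp;.
    return trim(html_entity_decode(strip_tags($m[2])));
}

// Invented sample markup for the demonstration.
$sample = '<div id="price"><b>GBP 250</b></div><p id="description">Nice &amp; clean</p>';
echo text_by_id($sample, 'price'), "\n";
echo text_by_id($sample, 'description'), "\n";
```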
Here is a basic tutorial.
This book should be useful too.
#3
2
Something like this would be a good starting point if you wanted tabular output:
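// $url is assumed to hold the address of the page to scrape.
$raw = file_get_contents($url) or die('could not fetch page');
// Strip tabs, newlines, double spaces, and <br/> tags so the patterns below match across rows.
$newlines = array("\t", "\n", "\r", "\x20\x20", "\0", "\x0B", "<br/>");
$content = str_replace($newlines, "", html_entity_decode($raw));
// Placeholder markers: replace them with the markup that opens and closes the table.
$start = strpos($content, '<some id> ');
$end = strpos($content, '</ending id>');
$table = substr($content, $start, $end - $start);
// Ungreedy (U) match of every table row.
preg_match_all("|<tr(.*)</tr>|U", $table, $rows);
foreach ($rows[0] as $row) {
    // Skip header rows.
    if (strpos($row, '<th') === false) {
        // array to vars
        preg_match_all("|<td(.*)</td>|U", $row, $cells);
        $var1 = strip_tags($cells[0][0]);
        $var2 = strip_tags($cells[0][1]);
        // etc. for the remaining cells
    }
}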