I'm trying to scrape a webpage, and I want to grab both the text and all the HTML tags inside a div tag.
The webpage looks like this:
<div class="class">
<p>A little paragraph</p>
<a href="#"><img src="/test.jpg"/></a>
<p>Another paragraph</p>
<ul>
<li>1</li>
<li>2</li>
</ul>
</div>
Using cURL I have managed to extract all the text, but the tags are missing.
My code:
$content = $xpath->query('//div[@class="class"]');
3 Answers
#1
It's pretty easy:
<?php
$html = '
<div class="class">
<p>A little paragraph</p>
<a href="#"><img src="/test.jpg"/></a>
<p>Another paragraph</p>
<ul>
<li>1</li>
<li>2</li>
</ul>
</div>';
$dom = new DOMDocument();
@$dom->loadHTML($html); # "@" silences warnings about imperfect markup
$xpath = new DOMXPath($dom);
$masterNode = $xpath->query('//div[@class="class"]'); # Returns a DOMNodeList
# Now, from the master node, we pick what we want.
# $masterNode->item(0) is the context node for the "p" query.
$paragraphNodes = $xpath->query('p', $masterNode->item(0));
foreach ($paragraphNodes as $paragraphElement) {
    print $paragraphElement->nodeValue . "\n";
}
The above code returns:
A little paragraph
Another paragraph
And here is a runnable sample: http://3v4l.org/9CYCs
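The second argument to query() is what makes the 'p' expression above relative to the div. As a rough sketch of how the context node changes the result (the shortened markup and the variable names here are made up for illustration, not taken from the question):

```php
<?php
// Illustrative markup: one <p> inside the div, one outside it.
$html = '<div class="class"><p>A little paragraph</p><ul><li>1</li><li>2</li></ul></div>'
      . '<p>Outside the div</p>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$div = $xpath->query('//div[@class="class"]')->item(0);

// 'p'      : only <p> elements that are direct children of the context node
// './/li'  : <li> elements anywhere below the context node
// '//p'    : absolute path, so the context node is ignored
$direct   = $xpath->query('p', $div);
$nested   = $xpath->query('.//li', $div);
$absolute = $xpath->query('//p', $div);

print $direct->length . "\n";   // 1
print $nested->length . "\n";   // 2
print $absolute->length . "\n"; // 2
```

So if you want nested elements of the div, start the expression with './/' rather than '//'.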
Grabbing all child nodes from the div
<?php
// ...
$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
This query selects all child elements of div.class (it returns a DOMNodeList instance):
$allChildNodesFromDiv = $xpath->query('//div[@class="class"]/*');
# Do something with the child nodes (see DOMElement)
foreach ($allChildNodesFromDiv as $nodeElement) {
    # Do something with $nodeElement, for instance:
    print $nodeElement->nodeName;
    print $nodeElement->nodeValue;
    // ...
}
Note that a DOMNodeList is a list of DOMNode objects; with the query above, every item is a DOMElement.
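nodeValue only gives you the text, which is why the tags go missing. To keep the tags as well, one option is to serialize each child node with DOMDocument::saveHTML(), which accepts a node argument. This is a minimal sketch under that assumption; $innerHtml is just a name picked for the example, and the exact whitespace of the serialized markup may differ from the input:

```php
<?php
$html = '
<div class="class">
<p>A little paragraph</p>
<a href="#"><img src="/test.jpg"/></a>
<p>Another paragraph</p>
<ul>
<li>1</li>
<li>2</li>
</ul>
</div>';

libxml_use_internal_errors(true); // buffer parser complaints instead of printing warnings
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@class="class"]')->item(0);

// Serialize every child of the div (elements and text nodes alike)
// and join the pieces to get the div's inner HTML.
$innerHtml = '';
foreach ($div->childNodes as $child) {
    $innerHtml .= $dom->saveHTML($child);
}

print trim($innerHtml) . "\n";
```

Calling $dom->saveHTML($div) instead would give you the same markup including the surrounding div tag.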
Related doc links:
- DOMXPath::query
- The DOMNodeList class
- The DOMElement class
#2
For crawling I would recommend using php_query (link below). It provides jQuery-like selectors for working with pages. HTML pages are not necessarily well-formed XML unless they are XHTML.
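The point about HTML not being proper XML is easy to see with PHP's built-in DOM extension (this sketch does not use php_query). The markup below is deliberately sloppy and made up for the example: loadXML() rejects it, while loadHTML() repairs it.

```php
<?php
// Deliberately sloppy markup: neither <p> is ever closed.
$html = '<div class="class"><p>One<p>Two</div>';

libxml_use_internal_errors(true); // buffer parser errors instead of emitting warnings

// A strict XML parser refuses the document.
$asXml = new DOMDocument();
$xmlOk = $asXml->loadXML($html);
print ($xmlOk ? "XML: parsed" : "XML: rejected") . "\n";

// The HTML parser closes the paragraphs for us.
$asHtml = new DOMDocument();
$asHtml->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($asHtml);
$paragraphs = $xpath->query('//div[@class="class"]/p');
foreach ($paragraphs as $p) {
    print $p->nodeValue . "\n";
}
```

Any library you pick for scraping needs this kind of lenient parsing, because real pages are rarely valid XHTML.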
#3
Use PHP to do it easily.
$all_data = file_get_contents("link of the url");
Now use PHP regular expressions, explode(), implode(), etc. to extract the data you want.
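A rough sketch of what that could look like. The inline string stands in for the downloaded page (an assumption so the example runs without a network call), and both variants only work when the target div contains no nested div, since they stop at the first closing div tag:

```php
<?php
// Stand-in for: $all_data = file_get_contents("link of the url");
$all_data = '<html><body><div class="class">
<p>A little paragraph</p>
<ul><li>1</li><li>2</li></ul>
</div></body></html>';

// Regular expression: non-greedy match up to the first </div>;
// the "s" modifier lets "." match newlines too.
$inner = null;
if (preg_match('~<div class="class">(.*?)</div>~s', $all_data, $matches)) {
    $inner = trim($matches[1]);
}

// explode(): cut everything before the opening tag, then everything after the closing tag.
$afterOpen = explode('<div class="class">', $all_data, 2)[1];
$innerViaExplode = trim(explode('</div>', $afterOpen, 2)[0]);

print $inner . "\n";
```

This keeps the tags, but it is fragile: any change in attribute order, quoting or nesting breaks the pattern, so a DOM parser is usually the safer choice.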