I'm trying to scrape a webpage and I want to grab the text and all HTML tags inside a div tag.

The webpage looks like this:

<div class="class">
  <p>A little paragraph</p>
  <a href="#"><img src="/test.jpg"/></a>
  <p>Another paragraph</p>
  <ul>
    <li>1</li>
    <li>2</li>
  </ul>
</div>

Using cURL I have succeeded in extracting all text but the tags are absent.

My code:

$content = $xpath->query('//div[@class="class"]');

3 solutions

#1

It's pretty easy:

<?php

$html = '
<div class="class">
  <p>A little paragraph</p>
  <a href="#"><img src="/test.jpg"/></a>
  <p>Another paragraph</p>
  <ul>
    <li>1</li>
    <li>2</li>
  </ul>
</div>';

$dom = new DOMDocument();
@$dom->loadHTML($html); # "@" silences libxml warnings about imperfect markup
$xpath = new DOMXPath($dom);
$masterNode = $xpath->query('//div[@class="class"]'); # Returns a DOMNodeList

# From the master node we pick what we want.
# $masterNode->item(0) is the context node for the relative "p" query.
$paragraphNodes = $xpath->query('p', $masterNode->item(0));

foreach ($paragraphNodes as $paragraphElement) {
    print $paragraphElement->nodeValue . "\n";
}

The above code returns:

 A little paragraph
 Another paragraph

And here is a runnable sample: http://3v4l.org/9CYCs

Grabbing all child nodes from the div

<?php
// ...
$dom = new DomDocument();
@$dom->loadHTML($html);
$xpath = new DOMXpath($dom);

Here are all the child elements of div.class (the query returns a DOMNodeList instance):

$allChildNodesFromDiv = $xpath->query('//div[@class="class"]/*');
# Do something with the child nodes (see DOMElement)
foreach ($allChildNodesFromDiv as $nodeElement) {
    # For instance, print each element's tag name and text content:
    print $nodeElement->nodeName . "\n";
    print $nodeElement->nodeValue . "\n";
    // ...
}

Note that a DOMNodeList is a list of DOMNode objects; because the query above selects with /*, every item here is a DOMElement.
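The snippets above print only text, because nodeValue drops the markup. One way to keep the tags, offered as a minimal sketch rather than the answerer's own method, is to serialize nodes with DOMDocument::saveHTML(). The variable names $div, $outerHtml and $innerHtml are illustrative assumptions, and the sample markup is copied from the question.

```php
<?php
// Sample markup copied from the question.
$html = '<div class="class">
  <p>A little paragraph</p>
  <a href="#"><img src="/test.jpg"/></a>
  <p>Another paragraph</p>
  <ul>
    <li>1</li>
    <li>2</li>
  </ul>
</div>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$div = $xpath->query('//div[@class="class"]')->item(0);
if ($div === null) {
    exit("No matching div found\n");
}

// Outer HTML: the div itself plus everything inside it.
$outerHtml = $dom->saveHTML($div);

// Inner HTML: serialize every child node (elements and text nodes) and join them.
$innerHtml = '';
foreach ($div->childNodes as $child) {
    $innerHtml .= $dom->saveHTML($child);
}

print $innerHtml . "\n";
```

Iterating over childNodes rather than the /* query also keeps any text that sits directly inside the div, which an element-only query would skip.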

Related doc links:

#2

For crawling, I would recommend phpQuery (link below). It provides jQuery-like selectors for querying pages. Keep in mind that HTML pages are not necessarily well-formed XML unless they are XHTML.

https://code.google.com/p/phpquery/
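To illustrate the point about HTML not being well-formed XML, here is a small sketch that uses PHP's built-in DOM extension rather than phpQuery. The fragment in $markup is a made-up example, and the variable names are assumptions for illustration only.

```php
<?php
// Hypothetical fragment: <br> is never closed, which HTML allows but XML does not.
$markup = '<div class="class"><p>First line<br>Second line</p></div>';

libxml_use_internal_errors(true);   // collect parser errors instead of emitting warnings

$asXml = new DOMDocument();
$xmlOk = $asXml->loadXML($markup);      // strict XML parser

$asHtml = new DOMDocument();
$htmlOk = $asHtml->loadHTML($markup);   // lenient HTML parser

libxml_clear_errors();

print 'Parsed as XML: ' . var_export($xmlOk, true) . "\n";
print 'Parsed as HTML: ' . var_export($htmlOk, true) . "\n";
```

This is why scraped pages are normally loaded with an HTML-aware parser, whichever library ends up doing the selecting.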

#3

Use PHP to do it easily.

$all_data = file_get_contents("link of the url");

Now use PHP regular expressions, explode(), implode(), and similar string functions to extract the data you want.
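As a rough sketch of the regular-expression route, assuming the target div contains no nested div elements (the pattern stops at the first closing tag, so nested divs would truncate the result). The inline string stands in for the output of file_get_contents(), and $innerHtml is an illustrative name.

```php
<?php
// Stand-in for the fetched page; in practice this would come from file_get_contents().
$all_data = '<div class="class">
  <p>A little paragraph</p>
  <a href="#"><img src="/test.jpg"/></a>
  <p>Another paragraph</p>
  <ul>
    <li>1</li>
    <li>2</li>
  </ul>
</div>';

$innerHtml = null;

// Non-greedy match of everything between the opening tag and the first </div>.
// The "s" modifier lets "." match newlines as well.
if (preg_match('~<div class="class">(.*?)</div>~s', $all_data, $matches)) {
    $innerHtml = trim($matches[1]);
    print $innerHtml . "\n";
}
```

A pattern like this is fragile against attribute reordering, extra whitespace, or nested markup, so a DOM parser is usually the safer choice for anything beyond a fixed, known page layout.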

PHP XPath: How do I get a div's contents and HTML tags?
