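Python Web Scraping (10): Using PyQuery

Introduction

pyquery lets you query and manipulate XML documents with jQuery-style syntax, and its API closely mirrors jQuery's. It is built on lxml, which keeps XML and HTML processing fast.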

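Initialization

There are four ways to initialize a PyQuery object.

(1) From a string

from pyquery import PyQuery as pq
doc = pq("<html></html>")

pq accepts HTML source directly as its argument. doc now plays the role of the $ function in jQuery.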

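(2) From lxml.etree

from lxml import etree
from pyquery import PyQuery as pq
doc = pq(etree.fromstring("<html></html>"))

You can also parse the markup with lxml first and pass the resulting element to pq. Note that etree.fromstring is a strict XML parser and raises an error on malformed input. To have incomplete or sloppy HTML repaired into a complete, well-formed tree, use lxml's HTML parser instead (for example etree.HTML or lxml.html.fromstring).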

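(3) From a URL

from pyquery import PyQuery as pq
doc = pq('http://www.baidu.com')

This works as if you had requested the page yourself, much like fetching the link with urllib2 (urllib.request in Python 3), and gives you the parsed HTML.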

(4) From a file

from pyquery import PyQuery as pq
doc = pq(filename='hello.html')

You can pass the path of a local file through the filename argument.

Quick start

We will use a local file as the example. Create a file named hello.html with the following content:

<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

Then write the following program:

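from pyquery import PyQuery as pq
doc = pq(filename='hello.html')
print(doc.html())    # inner HTML of the root <div>
print(type(doc))
li = doc('li')       # select every <li> with a CSS selector
print(type(li))
print(li.text())     # text of all matched elements, joined by spaces

Output: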

    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>

<class 'pyquery.pyquery.PyQuery'>
<class 'pyquery.pyquery.PyQuery'>
first item second item third item fourth item fifth item

Attribute operations

PyQuery operations follow jQuery syntax almost exactly.

from pyquery import PyQuery as pq
p = pq('<p id="hello" class="hello"></p>')('p')
print(p.attr("id"))            # read an attribute
print(p.attr("id", "plop"))    # set it; the call returns the selection itself
print(p.attr("id", "hello"))

Output:

hello
<p id="plop" class="hello"/>
<p id="hello" class="hello"/>

Classes and inline styles can be changed in the same way:

from pyquery import PyQuery as pq
p = pq('<p id="hello" class="hello"></p>')('p')
print(p.addClass('beauty'))                      # add a class
print(p.removeClass('hello'))                    # remove a class
print(p.css('font-size', '16px'))                # set one style property
print(p.css({'background-color': 'yellow'}))     # set properties from a dict

Output:

<p id="hello" class="hello beauty"/>
<p id="hello" class="beauty"/>
<p id="hello" class="beauty" style="font-size: 16px"/>
<p id="hello" class="beauty" style="font-size: 16px; background-color: yellow"/>

DOM operations

from pyquery import PyQuery as pq
p = pq('<p id="hello" class="hello"></p>')('p')
# append() adds content at the end of the element, prepend() at the start
print(p.append(' check out <a href="http://reddit.com/r/python"><span>reddit</span></a>'))
print(p.prepend('Oh yes!'))
d = pq('<div class="wrap"><div id="test"><a href="http://cuiqingcai.com">Germy</a></div></div>')
# insert p as the first child of the element with id "test"
p.prependTo(d('#test'))
print(p)
print(d)
# empty() removes all children of d
d.empty()
print(d)

Output:

<p id="hello" class="hello"> check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>
<p id="hello" class="hello">Oh yes! check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>
<p id="hello" class="hello">Oh yes! check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p>
<div class="wrap"><div id="test"><p id="hello" class="hello">Oh yes! check out <a href="http://reddit.com/r/python"><span>reddit</span></a></p><a href="http://cuiqingcai.com">Germy</a></div></div>
<div class="wrap"/>

Traversal

To traverse a selection, use the items() method, which yields each matched element as a PyQuery object, or pass a callback such as a lambda to each().

from pyquery import PyQuery as pq
doc = pq(filename='hello.html')
lis = doc('li')
# items() yields each matched <li> wrapped as a PyQuery object
for li in lis.items():
    print(li.html())
# each() calls the callback with (index, element) and returns the selection
print(lis.each(lambda i, e: e))

Output:

first item
<a href="link2.html">second item</a>
<a href="link3.html"><span class="bold">third item</span></a>
<a href="link4.html">fourth item</a>
<a href="link5.html">fifth item</a>
<li class="item-0">first item</li>
 <li class="item-1"><a href="link2.html">second item</a></li>
 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
 <li class="item-1 active"><a href="link4.html">fourth item</a></li>
 <li class="item-0"><a href="link5.html">fifth item</a></li>

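Web requests

PyQuery can also request web pages by itself and turns the downloaded source into a PyQuery object.

from pyquery import PyQuery as pq
# GET request with a custom header
print(pq('http://cuiqingcai.com/', headers={'user-agent': 'pyquery'}))
# POST request with form data
print(pq('http://httpbin.org/post', {'foo': 'bar'}, method='post', verify=True))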
