Xsoup 0.2.0


Xsoup download: https://github.com/code4craft/xsoup

http://www.oschina.net/question/tag/xsoup?show=hot

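HtmlCleaner showed some problems in use. The main ones are that it does not pinpoint XPath errors accurately, and that its rather awkward code structure makes it hard to customize. Xsoup was implemented in response. Xsoup runs more than twice as fast as HtmlCleaner.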

Xsoup now supports the XPath syntax most commonly used in crawlers. The table below lists what is supported:

Name	Expression	Support
nodename	nodename	yes
immediate parent	/	yes
parent	//	yes
attribute	[@key=value]	yes
nth child	tag[n]	yes
attribute	/@key	yes
wildcard in tagname	/*	yes
wildcard in attribute	/[@*]	yes
function	function()	part
or	a | b	yes since 0.2.0
parent in path	. or ..	no
predicates	price>35	no
predicates logic	@class=a or @class=b	yes since 0.2.0

Xsoup also defines several convenient XPath functions. Note, however, that these functions are not part of standard XPath.

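Expression	Description	XPath 1.0
text(n)	the nth direct text child node; 0 means all of them	text() only
allText()	all direct and indirect text child nodes	not supported
tidyText()	all direct and indirect text child nodes, with some tags replaced by line breaks so the plain text reads more cleanly	not supported
html()	inner HTML, excluding the tag itself	not supported
outerHtml()	HTML including the tag itself	not supported
regex(@attr,expr,group)	@attr and group are both optional; group defaults to 0	not supported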
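
To make the "group defaults to 0" remark concrete, here is a minimal sketch of how capture-group numbering works in plain java.util.regex. It does not call Xsoup; firstMatch is a hypothetical helper, and the assumption that Xsoup's regex() numbers groups the same way is not confirmed by these notes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Xsoup: returns the requested capture group
// of the first match of expr in input, or null when nothing matches.
final class RegexGroupSketch {
    static String firstMatch(String input, String expr, int group) {
        Matcher m = Pattern.compile(expr).matcher(input);
        return m.find() ? m.group(group) : null;
    }

    public static void main(String[] args) {
        String href = "https://github.com/code4craft/xsoup";
        String expr = "github\\.com/(\\w+)/(\\w+)";
        // Group 0 is the whole match.
        System.out.println(firstMatch(href, expr, 0)); // github.com/code4craft/xsoup
        // Group 1 is the first parenthesised sub-expression.
        System.out.println(firstMatch(href, expr, 1)); // code4craft
    }
}
```

With group 0 the whole matched substring comes back, so omitting the group argument is the natural choice when the pattern has no parentheses.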

XPath syntax supported in Xsoup 0.2.0:

1. contains:
//div[contains(@id,'test')]

2. Logical operators (and/or) in predicates, #4:

//div[@id='test' or @class='test']
//div[@id='test' and @class='test']
//div[@id='test' and @class='test' or @id='test1']
//div[@id='test' and (@class='test' or @id='test1')]
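
The contains() and and/or predicates above are ordinary XPath 1.0, so their meaning can be checked without Xsoup. The sketch below uses the JDK's built-in javax.xml.xpath engine (not Xsoup) on a small sample document invented for this illustration; count is a hypothetical helper. It mainly shows that "and" binds tighter than "or", which is why the last two expressions differ.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathPredicateSketch {
    // Sample markup invented for this illustration.
    static final String XML =
        "<root>"
        + "<div id='test' class='test'>A</div>"
        + "<div id='test' class='other'>B</div>"
        + "<div id='test1' class='other'>C</div>"
        + "<div id='mytest2' class='test'>D</div>"
        + "</root>";

    // Counts the nodes selected by a standard XPath 1.0 expression.
    static int count(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(XML)),
                          XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//div[contains(@id,'test')]"));                            // 4
        System.out.println(count("//div[@id='test' or @class='test']"));                     // 3
        System.out.println(count("//div[@id='test' and @class='test']"));                    // 1
        // Parsed as (id='test' and class='test') or id='test1': matches A and C.
        System.out.println(count("//div[@id='test' and @class='test' or @id='test1']"));     // 2
        // Parentheses change the grouping: only A matches.
        System.out.println(count("//div[@id='test' and (@class='test' or @id='test1')]"));   // 1
    }
}
```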

3. Union ("or") of whole XPath expressions, #6:

//div[@id='test']/text() | //div[@class='test']/div/text()
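
As a sketch of what the union operator returns, the same expression can be run through the JDK's javax.xml.xpath engine (standard XPath 1.0, not Xsoup). The sample markup and the texts helper are assumptions made for this illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathUnionSketch {
    // Sample markup invented for this illustration.
    static final String XML =
        "<root>"
        + "<div id='test'>first</div>"
        + "<div class='test'><div>second</div></div>"
        + "<div id='ignored'>third</div>"
        + "</root>";

    // Returns the string value of every node selected by the expression.
    static List<String> texts(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(XML)),
                          XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getNodeValue());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The union collects the text nodes selected by either path.
        System.out.println(
            texts("//div[@id='test']/text() | //div[@class='test']/div/text()"));
    }
}
```

The result holds the two text nodes "first" and "second"; the third div is selected by neither path and is left out.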

4. This release is API-compatible with Xsoup 0.1.0. Users of WebMagic 0.3.0 or later can use the new syntax simply by adding the dependency to their project:

<dependency>
<groupId>us.codecraft</groupId>
<artifactId>xsoup</artifactId>
<version>0.2.0</version>
</dependency>

5. Jsoup cannot find the <td> elements under a <tr>: http://www.oschina.net/question/1271820_131887

Workaround: once you have the <td></td> fragment, wrap it in <table></table> before parsing.
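
A minimal sketch of that workaround, assuming the fragment is held as a plain string. wrapInTable is a hypothetical helper, not a Jsoup or Xsoup call; the wrapped string is what would then be handed to the HTML parser in place of the bare fragment.

```java
final class TableWrapSketch {
    // Hypothetical helper: puts a bare row or cell fragment back into table
    // context so that an HTML parser does not discard the <tr>/<td> tags.
    static String wrapInTable(String fragment) {
        return "<table>" + fragment + "</table>";
    }

    public static void main(String[] args) {
        String fragment = "<tr><td>a</td><td>b</td></tr>";
        System.out.println(wrapInTable(fragment));
        // <table><tr><td>a</td><td>b</td></tr></table>
    }
}
```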

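6. Summary:

CSS: use nth-child(n) to select the nth element and nth-last-child(n) to select the nth element from the end. XPath: selecting the nth element with attr[n] was removed by Huang Yihua.
Xsoup integrates CSS/Jsoup. Its XPath functions: text(n), allText(), tidyText() (adds line breaks), html() (excludes the tag itself), outerHtml() (includes the tag itself), regex(@attr,expr,group) (@attr and group are optional).
Xsoup integrates CSS/Jsoup. Its XPath syntax: tag[n], function(), a|b, @class=a or @class=b.
Not supported by Xsoup: . or .. in a path, and predicates such as price>35.
XPath selects text with text(); CSS selects text with innerHtml, text, or allText, for example: css(String selector, "text").toString()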

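7. Fuzzy attribute matching in XPath:

All attribute selectors are written in a form very close to XPath (every attribute starts with the @ symbol).
    E[@foo] an E element that has a foo attribute
    E[@foo=bar] an E element whose foo attribute equals bar
    E[@foo^=bar] an E element whose foo attribute starts with the string "bar"
    E[@foo$=bar] an E element whose foo attribute ends with the string "bar"
    E[@foo*=bar] an E element whose foo attribute contains the string "bar"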
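
The four operators map directly onto ordinary string tests. The sketch below spells that mapping out in plain Java; matches is a hypothetical helper written for this illustration and is not how Xsoup or Jsoup implement the selectors.

```java
final class AttrMatchSketch {
    // value is the attribute's value, or null when the attribute is absent;
    // op is one of "=", "^=", "$=", "*=".
    static boolean matches(String value, String op, String expected) {
        if (value == null) {
            return false; // in this sketch, an absent attribute never matches
        }
        switch (op) {
            case "=":  return value.equals(expected);     // exact value
            case "^=": return value.startsWith(expected); // prefix
            case "$=": return value.endsWith(expected);   // suffix
            case "*=": return value.contains(expected);   // substring
            default:
                throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("barbaz", "^=", "bar")); // true
        System.out.println(matches("bazbar", "$=", "bar")); // true
        System.out.println(matches("xbarx", "*=", "bar"));  // true
        System.out.println(matches("xbarx", "=", "bar"));   // false
    }
}
```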

8. Distinguishing a <tr> with no attributes from <tr class='time'>:

tr[@class!='time']
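
The expression above is written for Xsoup. For comparison, here is a sketch using the JDK's built-in javax.xml.xpath engine (standard XPath 1.0, not Xsoup). Under the standard rules, @class!='time' is false for a <tr> that has no class attribute at all, so not(@class='time') or not(@class) is the portable way to catch the attribute-less row. Whether Xsoup follows the same rule is not confirmed by these notes. The sample table and the count helper are assumptions made for this illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathNotEqualSketch {
    // Sample markup invented for this illustration.
    static final String XML =
        "<table>"
        + "<tr><td>plain</td></tr>"
        + "<tr class='time'><td>10:00</td></tr>"
        + "<tr class='date'><td>Mon</td></tr>"
        + "</table>";

    // Counts the nodes selected by a standard XPath 1.0 expression.
    static int count(String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(expression, new InputSource(new StringReader(XML)),
                          XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Only the class='date' row: a missing attribute never satisfies "!=".
        System.out.println(count("//tr[@class!='time']"));      // 1
        // The plain row and the class='date' row.
        System.out.println(count("//tr[not(@class='time')]"));  // 2
        // Only the row without any class attribute.
        System.out.println(count("//tr[not(@class)]"));         // 1
    }
}
```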

9. Other XPath notes

html.xpath("/a[@href]/@href") and html.xpath("/a/@href"): the former selects only tags that have an href attribute, while the latter does not require the tag to have one.

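10. Other XPath notes, part 2

html.xpath("//div[@class='tBorderTop_box']").all(); matches both class='tBorderTop_box' and class='tBorderTop_box bt'; html.xpath("//div[@class$='tBorderTop_box']").all(

In XPath, an attribute value that contains spaces must be wrapped in parentheses, otherwise an error occurs. Conversely, in CSS a value that contains spaces must not be wrapped in parentheses, otherwise an error occurs.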

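11. Per-request headers

When each request to the same site needs different headers, add the headers to each request: modify the Request class so that its headers override the global Site headers.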
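
The override rule can be sketched with two plain maps. This is an illustration only: effectiveHeaders is a hypothetical helper, not WebMagic code, and the note above does not say how the modified Request class stores its headers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class HeaderMergeSketch {
    // Starts from the site-wide headers, then lets per-request headers win.
    // Keys are compared exactly here; real HTTP header names are
    // case-insensitive, which this sketch does not handle.
    static Map<String, String> effectiveHeaders(Map<String, String> siteHeaders,
                                                Map<String, String> requestHeaders) {
        Map<String, String> merged = new LinkedHashMap<>(siteHeaders);
        merged.putAll(requestHeaders);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> site = new LinkedHashMap<>();
        site.put("User-Agent", "crawler/1.0");
        site.put("Referer", "http://example.com/");

        Map<String, String> request = new LinkedHashMap<>();
        request.put("Referer", "http://example.com/list?page=2");

        // User-Agent comes from the site; Referer is overridden by the request.
        System.out.println(effectiveHeaders(site, request));
    }
}
```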

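12. All three timeouts use the same parameter; if it is set too large, crawling becomes much slower.
// Apache HttpClient request configuration; site is the WebMagic Site object.
RequestConfig requestConfig = RequestConfig.custom()
        .setConnectionRequestTimeout(site.getTimeOut())
        .setSocketTimeout(site.getTimeOut())
        .setConnectTimeout(site.getTimeOut())
        .build();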
