I. Basic Syntax
1. Get a Nokogiri object directly from a string:

    require 'nokogiri'
    html_doc = Nokogiri::HTML("<html><body><h1>Mr. Belvedere Fan Club</h1></body></html>")
    xml_doc = Nokogiri::XML("<root><aliens><alien><name>Alf</name></alien></aliens></root>")

Here html_doc and xml_doc are Nokogiri document objects.
2. You can also get a Nokogiri object from a file handle:

    f = File.open("blossom.xml")
    doc = Nokogiri::XML(f)
    f.close

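3. You can also fetch a document straight from a website. On Ruby 3.0 and later, Kernel#open no longer accepts URLs, so call URI.open from open-uri instead:

    require 'nokogiri'
    require 'open-uri'
    doc = Nokogiri::HTML(URI.open("http://www.xxx.com/"))
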
II. XML Parsing Example
Common ways to extract fields from an XML/HTML file:
Suppose there is a file named shows.xml with the following content:

    <root>
      <sitcoms>
        <sitcom>
          <name>Married with Children</name>
          <characters>
            <character>Al Bundy</character>
            <character>Bud Bundy</character>
            <character>Marcy Darcy</character>
          </characters>
        </sitcom>
        <sitcom>
          <name>Perfect Strangers</name>
          <characters>
            <character>Larry Appleton</character>
            <character>Balki Bartokomous</character>
          </characters>
        </sitcom>
      </sitcoms>
      <dramas>
        <drama>
          <name>The A-Team</name>
          <characters>
            <character>John "Hannibal" Smith</character>
            <character>Templeton "Face" Peck</character>
            <character>"B.A." Baracus</character>
            <character>"Howling Mad" Murdock</character>
          </characters>
        </drama>
      </dramas>
    </root>

To find every character tag, you can do this:

    @doc = Nokogiri::XML(File.open("shows.xml"))
    @doc.xpath("//character")

Both xpath and css return a NodeSet, an array-like list whose elements are the nodes in the document that match the query.
To get only the character nodes inside the dramas node:

    @doc.xpath("//dramas//character")

The css method is often more readable:

    names = @doc.css("sitcoms name")
    # => NodeSet containing <name>Married with Children</name> and <name>Perfect Strangers</name>

When you know the query has a single result and want that node itself rather than a list, use at_xpath or at_css, which return the first match:

    @doc.css("dramas name").first  # => the node <name>The A-Team</name>
    @doc.at_css("dramas name")     # => the node <name>The A-Team</name>

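III. Namespaces
Namespaces become important when a document contains tags that share a name but come from different sources.
For example, take this parts.xml file: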

    <parts>
      <!-- Alice's Auto Parts Store -->
      <inventory xmlns="http://alicesautoparts.com/">
        <tire>all weather</tire>
        <tire>studded</tire>
        <tire>extra wide</tire>
      </inventory>
      <!-- Bob's Bike Shop -->
      <inventory xmlns="http://bobsbikes.com/">
        <tire>street</tire>
        <tire>mountain</tire>
      </inventory>
    </parts>

Each namespace is identified by a unique URL, which you can use to tell the two kinds of tire tag apart:

    @doc = Nokogiri::XML(File.read("parts.xml"))
    car_tires = @doc.xpath('//car:tire', 'car' => 'http://alicesautoparts.com/')
    bike_tires = @doc.xpath('//bike:tire', 'bike' => 'http://bobsbikes.com/')

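To make namespaces easier to work with, Nokogiri automatically registers any namespaces it finds declared on the root node.
Thanks to this convention you do not have to pass the namespace URL yourself, which cuts down on code. (In parts.xml the namespaces are declared on the inventory nodes, not the root, which is why they had to be supplied explicitly.)
For example, take this atom.xml file: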

    <feed xmlns="http://www.w3.org/2005/Atom">
      <title>Example Feed</title>
      <link href="http://example.org/"/>
      <updated>2003-12-13T18:30:02Z</updated>
      <author>
        <name>John Doe</name>
      </author>
      <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>
      <entry>
        <title>Atom-Powered Robots Run Amok</title>
        <link href="http://example.org/2003/12/13/atom03"/>
        <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
        <updated>2003-12-13T18:30:02Z</updated>
        <summary>Some text.</summary>
      </entry>
    </feed>

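Following the convention above, the default namespace is already registered under the prefix xmlns, so there is no need to bind it by hand:

    @doc = Nokogiri::XML(File.read("atom.xml"))
    @doc.xpath('//xmlns:title')
    # => NodeSet containing <title>Example Feed</title> and <title>Atom-Powered Robots Run Amok</title>
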
The equivalent with css:

    @doc.css('xmlns|title')

And when using css, if the namespace prefix is xmlns, you can leave the prefix out altogether:

    @doc.css('title')
