I have a question about Nokogiri. I need to get the HTML elements from a page and the XPath for each one, but I can't figure out how to do it with Nokogiri. The HTML is arbitrary, because I have to parse several pages from different websites.
2 solutions
#1
If you are asking how to search for a node, you may use either CSS or XPath expressions, like so:
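require 'nokogiri'
require 'open-uri'

# On Ruby 3.0+ a bare open(...) no longer fetches URLs; URI.open does.
doc = Nokogiri::HTML(URI.open("http://slashdot.com/"))
# First <h1> matched by a CSS selector
node_found_by_css = doc.css("h1").first
# First <h1> under <body>, matched by an XPath expression
node_found_by_xpath = doc.xpath("/html/body//h1").first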
If you are asking how to retrieve the canonical XPath expression for a node once you've found it, you may use Node#path, like so:
puts node_found_by_css.path # => "/html/body/div[3]/div[1]/div[1]/h1"
#2
If you are asking how to get the XPath for each HTML element in a page, then the following should help. It opens and parses a page, then prints the XPath of every node it visits. Note that traverse yields all nodes, including text nodes and the document itself, so check node.element? if you only want elements.
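require 'nokogiri'
require 'open-uri'

# On Ruby 3.0+ a bare open(...) no longer fetches URLs; URI.open does.
doc = Nokogiri::HTML(URI.open("http://slashdot.com/"))
# Walk the whole tree and print the XPath of each node
doc.traverse { |node| puts node.path }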