Nokogiri and finding elements by name

Date: 2021-07-07 23:31:35

I am parsing an XML file using Nokogiri with the following snippet:

doc.xpath('//root').each do |root|
  puts "# ROOT found"
  root.xpath('//page').each do |page|
    puts "## PAGE found / #{page['id']} / #{page['name']} / #{page['width']} / #{page['height']}"
    page.children.each do |content|
      ...
    end
  end
end

How can I parse through all elements in the page element? There are three different elements: image, text and video. How can I make a case statement for each element?

2 Answers

#1


10  

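Honestly, you look pretty close to me. Switch on each child's name. Two small changes are worth making: './/page' keeps the inner search relative to the current root node, and element_children skips the whitespace text nodes between tags, whose name is also "text" and which would otherwise hit the 'text' branch:

doc.xpath('//root').each do |root|
  puts "# ROOT found"
  # './/page' searches below this root; a bare '//page' restarts at the document root
  root.xpath('.//page').each do |page|
    puts "## PAGE found / #{page['id']} / #{page['name']} / #{page['width']} / #{page['height']}"
    # element_children returns only element nodes, not whitespace text nodes
    page.element_children.each do |child|
      case child.name
      when 'image'
        do_image_stuff
      when 'text'
        do_text_stuff
      when 'video'
        do_video_stuff
      end
    end
  end
end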

#2


5  

Both Nokogiri's CSS and XPath accessors allow multiple tags to be specified, which can be useful for this sort of problem. Rather than walking through every tag inside the document's page tag, you can search for just the tags you want. Given this sample document:

require 'nokogiri'

doc = Nokogiri::XML('
  <xml>
  <body>
  <image>image</image>
  <text>text</text>
  <video>video</video>
  <other>other</other>
  <image>image</image>
  <text>text</text>
  <video>video</video>
  <other>other</other>
  </body>
  </xml>')

This is a search using CSS:

doc.search('image, text, video').each do |node|
  case node.name
  when 'image'
    puts node.text
  when 'text'
    puts node.text
  when 'video'
    puts node.text
  else
    puts 'should never get here'
  end
end

# >> image
# >> image
# >> text
# >> text
# >> video
# >> video

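Notice that it returns the tags grouped in the order the CSS selector lists them. Newer Nokogiri releases may instead return them in document order, so check the output on your version. If you need the order of the tags in the document, you can use XPath: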

doc.search('//image | //text | //video').each do |node|
  puts node.text
end

# >> image
# >> text
# >> video
# >> image
# >> text
# >> video

In either case, the program should run faster because all the searching occurs in libxml2, which hands back only the nodes you need for Ruby to process.

If you need to restrict the search to within a <page> tag, you can first find the page node, then search underneath it:

doc.at('page').search('image, text, video').each do |node|
  ...
end

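or, using XPath (the leading dot keeps each path relative to the page node; a bare '//' would search the whole document again):

doc.at('//page').search('.//image | .//text | .//video').each do |node|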
  ...
end
