Parsing/reading a large XML file with minimal memory usage

I have a very large XML file (300mb) of the following format:

<data>
 <point>
  <id><![CDATA[1371308]]></id>
  <time><![CDATA[15:36]]></time>
 </point>
 <point>
  <id><![CDATA[1371308]]></id>
  <time><![CDATA[15:36]]></time>
 </point>
 <point>
  <id><![CDATA[1371308]]></id>
  <time><![CDATA[15:36]]></time>
 </point>
</data>

Now I need to read it and iterate through the point nodes doing something for each. Currently I'm doing it with Nokogiri like this:

require 'nokogiri'
xmlfeed = Nokogiri::XML(open("large_file.xml"))
xmlfeed.xpath("./data/point").each do |item|
  save_id(item.xpath("./id").text)
end

However, that's not very efficient, since it parses the whole file up front and hence creates a huge memory footprint (several GB).

Is there a way to do this in chunks instead? Might be called streaming if I'm not mistaken?

EDIT

The suggested answer using Nokogiri's SAX parser might be okay, but it gets very messy when there are several nodes within each point that I need to extract content from and process differently. Instead of returning a huge array of entries for later processing, I would much prefer to access one point at a time, process it, and then move on to the next, "forgetting" the previous one.

3 Answers

#1

Given this little-known (but AWESOME) gist using Nokogiri's Reader interface, you should be able to do this:

Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
  inside_element 'point' do
    for_element 'id' do puts "ID: #{inner_xml}" end
    for_element 'time' do puts "Time: #{inner_xml}" end
  end
end

Someone should make this a gem, perhaps me ;)

#2

Use Nokogiri::XML::SAX::Parser (event-driven parser) and Nokogiri::XML::SAX::Document:

require 'nokogiri'

class IDCollector < Nokogiri::XML::SAX::Document
  attr_reader :ids

  def initialize
    @ids = []
    @inside_id = false
  end

  def start_element(name, attrs = [])
    # NOTE: This is simplified. To pick out only `/data/point/id` elements
    # correctly, keep a stack of element names (push in start_element,
    # pop in end_element).
    @inside_id = true if name == 'id'
  end

  def end_element(name)
    @inside_id = false if name == 'id'
  end

  # Called for each CDATA section, e.g. <![CDATA[1371308]]>.
  def cdata_block(string)
    @ids << string if @inside_id
  end
end

collector = IDCollector.new
parser = Nokogiri::XML::SAX::Parser.new(collector)
parser.parse(File.open('large_file.xml'))
p collector.ids # => ["1371308", "1371308", "1371308"]

According to the documentation,

Nokogiri::XML::SAX::Parser: is a SAX style parser that reads its input as it deems necessary.

You can also use Nokogiri::XML::SAX::PushParser if you need more control over the file input.

#3

If you use JRuby, you can take advantage of vtd-xml, which has the most memory-efficient in-memory model, 3 to 5 times more efficient than DOM.

http://vtd-xml.sf.net