How do I use Nokogiri to write a large XML file?

Date: 2021-03-13 09:17:47

I have a Rails application that uses delayed_job in a reporting feature to run some very large reports. One of these generates a massive XML file, and it can take literally days with the bad old way the code is written. Having seen impressive benchmarks on the internet, I thought Nokogiri could afford us some nontrivial performance gains.

However, the only examples I can find involve using the Nokogiri Builder to create an xml object, then using .to_xml to write the whole thing. But there isn't enough memory in my zip code to handle that for a file of this size.

So can I use Nokogiri to stream or write this data out to file?

1 Solution

#1


4  

Nokogiri's Builder is designed to work in memory: you build a DOM, and it is converted to XML only when you serialize it. It's easy to use, but there are trade-offs, and holding the whole document in memory is one of them.

You might want to look into using Erubis to generate the XML. Rather than gathering all the data up front and keeping the logic in a controller, as we'd do with Rails, you can save memory by putting the logic in the template and having it iterate over your data, which should help with the resource demands.

If you need the XML in a file, you can get it there with shell redirection:

erubis options templatefile.erb > xmlfile

This is a very simple example, but it shows you could easily define a template to generate XML:

<% 
asdf = (1..5).to_a 
%>
<xml>
  <element>
<% asdf.each do |i| %>
    <subelement><%= i %></subelement>
<% end %>
  </element>
</xml>

which, when I run erubis test.erb, outputs:

<xml>
  <element>
    <subelement>1</subelement>
    <subelement>2</subelement>
    <subelement>3</subelement>
    <subelement>4</subelement>
    <subelement>5</subelement>
  </element>
</xml>
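
If you'd rather render the template from inside Ruby (say, from a delayed_job worker) than from the command line, here is a minimal sketch of the same idea. It swaps in ERB from the standard library rather than Erubis, and the method name render_report and the variable values are hypothetical. Note that the rendered result here is still one string, so this only shows how to drive the template, not a streaming approach:

```ruby
require "erb"

# Render the sample template with ERB (stdlib), used here in place of Erubis.
# "values" is a hypothetical stand-in for the real report data.
def render_report(values)
  template = <<~XML
    <xml>
      <element>
    <%- values.each do |i| -%>
        <subelement><%= i %></subelement>
    <%- end -%>
      </element>
    </xml>
  XML

  # trim_mode "-" lets <%- and -%> swallow the whitespace and newline
  # around statement-only lines, so the loop leaves no blank lines behind.
  ERB.new(template, trim_mode: "-").result_with_hash(values: values)
end
```

Writing the file is then a matter of something like File.write("xmlfile", render_report(1..5)).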

EDIT:

The string concatenation was taking forever...

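Yes, it can, simply because of garbage collection. You don't show any code for how you're building your strings, but Ruby works better when you use << to append one string to another than when you use +. Each + allocates a brand-new string and leaves the old ones for the garbage collector, while << appends to the existing string in place.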
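
To make the difference concrete, here is a small sketch that builds the same fragment both ways. The method names and the sample data are made up for illustration, since I don't know what your real code looks like:

```ruby
# Slower on large outputs: every + allocates a new String that holds a copy
# of everything built so far, and the previous copy becomes garbage.
def build_with_plus(rows)
  xml = ""
  rows.each { |i| xml = xml + "<subelement>#{i}</subelement>\n" }
  xml
end

# Faster: << mutates the receiver in place, so there are no intermediate
# copies of the growing string for the garbage collector to clean up.
def build_with_append(rows)
  xml = String.new
  rows.each { |i| xml << "<subelement>#{i}</subelement>\n" }
  xml
end
```

Both methods return the same text; only the amount of allocation along the way differs.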

It might also work better not to keep everything in a string at all, but to write it to disk immediately, appending to an open file as you go.
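
Something along these lines is what I have in mind. It is only a sketch in plain Ruby: write_report, the element names, and the shape of records are all assumptions, and records could be any enumerable (for example, rows fetched in batches) so that only one record is held at a time:

```ruby
# Stream each record to an open file instead of accumulating one big string.
# "path" and "records" are hypothetical; element names mirror the sample above.
def write_report(path, records)
  File.open(path, "w") do |file|
    file << "<xml>\n  <element>\n"
    records.each do |value|
      # No XML library is escaping for us here, so escape &, < and > by hand.
      text = value.to_s.encode(xml: :text)
      file << "    <subelement>" << text << "</subelement>\n"
    end
    file << "  </element>\n</xml>\n"
  end
end
```

Because nothing is retained between iterations, memory use stays roughly flat no matter how many records are written.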

Again, without code examples I'm shooting in the dark about what you might be doing or why things run slow.
