XML library optimized for large XML files under memory constraints

Time: 2022-08-04 17:00:24

I need to handle big XML files, but I only want to make a relatively small set of changes to them. I also want the program to adhere to strict memory constraints: it must never use more than, say, 300 MB of RAM.

Is there a library that allows me not to keep all the DOM in memory, and parse the XML on the go, while I traverse the DOM?

I know you can do that with a callback-based approach, but I don't want that. I want to have my cake and eat it too: I want to use the DOM API but parse each element lazily, so that existing code that uses the DOM API won't have to change.

There are two possible approaches I thought of for this problem:

  1. Parse the XML lazily, so that each call to getChildren() parses the next bit of XML.
  2. Parse the entire XML tree, but cache whatever you're not using right now on disk.

Either approach is acceptable. Is there an existing solution?

I'm looking for a native solution, but I'd be interested in hearing about libraries in other languages.

3 answers

#1


2  

It sounds like what you want is something similar to the Streaming API for XML (StAX).

While it does not use the standard DOM API, it is similar in principle to your "getChildren()" approach. It does not have the memory overheads of the DOM approach, nor the complexity of the callback (SAX) approach.
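
To make the pull-style idea concrete, here is a minimal sketch in Python (chosen only for brevity; it is not StAX itself). It uses the standard library's xml.dom.pulldom, where the caller pulls events in a loop and can expand a single element into a small DOM subtree on demand. The element names "item" and "flag" and the function name are made up for illustration.

```python
import io
from xml.dom import pulldom


def ids_of_flagged_items(stream):
    """Pull events one at a time and build a DOM only for each <item>.

    "item" and "flag" are hypothetical element names used for this sketch.
    """
    found = []
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "item":
            # Reads ahead to the matching end tag and attaches the children,
            # so ordinary DOM calls work on this one subtree.
            events.expandNode(node)
            if node.getElementsByTagName("flag"):
                found.append(node.getAttribute("id"))
    return found


if __name__ == "__main__":
    sample = io.BytesIO(
        b'<root><item id="a"><flag/></item><item id="b"/></root>'
    )
    print(ids_of_flagged_items(sample))
```

Only the expanded subtree is handed to DOM-style code at any one time, which is roughly the "lazy getChildren()" behaviour asked for, although code that walks the whole document through the DOM API would still need to be restructured around the event loop.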

There are a number of implementations linked on the Wikipedia page for StAX, most of which are for Java, but there are a couple for C++ too: Ambiera irrXML and Llamagraphics LlamaXML.


edit: Since you mention "small changes" to the document, if you don't need to use the document contents for anything else, you might also consider Streaming Transformations for XML (STX) (described in this XML.com introduction to STX). STX is to XSLT something like what SAX/StAX is to DOM.

#2


2  

"I want to use the DOM API, but to parse each element lazily, so that existing code that uses the DOM API won't have to change."

You want a streaming DOM-style API? Such a thing generally does not exist, and for good reason: it would be difficult if not impossible to make it actually work.

XML is generally intended to be read one-way: from front to back. What you're suggesting would require being able to random-access an XML file.

I suppose you could build a table of elements, with file offsets pointing to where each element is in the file. But at that point, you've already read and parsed the file, more or less. Unless most of your data is in text elements (which is entirely possible), you may as well be using a DOM.
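
For what it's worth, the offset-table idea could look something like the following Python sketch. It is only an illustration under stated assumptions: it relies on the CurrentByteIndex attribute of the standard xml.parsers.expat parser, it assumes end tags are written exactly as "</tag>" (no self-closing form, no extra whitespace), and it assumes the indexed elements need no namespace declarations from their ancestors. The function names index_elements and load_element are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from xml.parsers import expat


def index_elements(f, tag):
    """One streaming pass over binary file f; returns (start, end) byte spans."""
    parser = expat.ParserCreate()
    starts, spans = [], []

    def on_start(name, attrs):
        if name == tag:
            starts.append(parser.CurrentByteIndex)  # offset of the '<'

    def on_end(name):
        if name == tag:
            # CurrentByteIndex is the offset of '</'; assumes a plain '</tag>'.
            end = parser.CurrentByteIndex + len(f"</{tag}>".encode("utf-8"))
            spans.append((starts.pop(), end))

    parser.StartElementHandler = on_start
    parser.EndElementHandler = on_end
    parser.ParseFile(f)
    return spans


def load_element(f, span):
    """Seek to one indexed element and parse just that fragment."""
    start, end = span
    f.seek(start)
    return ET.fromstring(f.read(end - start))


if __name__ == "__main__":
    f = io.BytesIO(b'<root><item id="1">a</item><item id="2">bb</item></root>')
    spans = index_elements(f, "item")
    print(load_element(f, spans[1]).get("id"))
```

Note that the indexing pass still tokenizes the whole file, which is the point made above: the saving is in memory (only offsets are kept), not in parsing work.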

Really, you would be much better off just rewriting your existing code to use an xmlReader or SAX-style API.

#3


1  

How to do streaming transformations is a big, open, unsolved problem. There are numerous partial solutions, depending on what restrictions you are prepared to accept. Current releases of Saxon-EE, for example, have the capability to do some XSLT transformations in a streaming fashion: see http://www.saxonica.com/html/documentation/sourcedocs/streaming.html. Also, as already mentioned, there is STX (though implementations are not especially mature).

Your title suggests you want to write the transformation in C++. That's severely limiting, because it pretty well means the programmer has to cope with the complexities rather than leaving it to the transformation engine. You can of course hand-code streaming transformations using SAX-like or StAX-like parser APIs, but both are hard work, and each case will need to be approached from scratch.
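
As a rough idea of what a hand-coded streaming transformation with a SAX-like API involves, here is a small Python sketch (Python rather than C++ purely to keep it short). It chains a filter between the parser and a serializer using the standard xml.sax.saxutils classes XMLFilterBase and XMLGenerator, and renames one element while copying everything else through. The class name RenameElement, the function name rename_elements, and the element names are hypothetical.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class RenameElement(XMLFilterBase):
    """Pass every SAX event through unchanged, except for one element name."""

    def __init__(self, parent, old, new):
        super().__init__(parent)
        self._old, self._new = old, new

    def _map(self, name):
        return self._new if name == self._old else name

    def startElement(self, name, attrs):
        super().startElement(self._map(name), attrs)

    def endElement(self, name):
        super().endElement(self._map(name))


def rename_elements(src, dst, old, new):
    """Stream src to dst, renaming <old> to <new>; no tree is ever built."""
    reader = RenameElement(xml.sax.make_parser(), old, new)
    reader.setContentHandler(XMLGenerator(dst, encoding="utf-8"))
    reader.parse(src)


if __name__ == "__main__":
    src = io.BytesIO(b'<a><old x="1">t</old></a>')
    dst = io.BytesIO()
    rename_elements(src, dst, "old", "new")
    print(dst.getvalue().decode("utf-8"))
```

This works well for a local change like a rename. Anything that depends on content appearing later in the document needs buffering or a second pass, which is where the "each case from scratch" effort comes in.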

Google for "streaming XML transformation"
