DOM vs. SAX XML parsing for large files

Date: 2023-01-15 09:41:15

Background:

I have a large OWL (Web Ontology Language) file (approximately 125 MB, or 1.5 million lines) that I would like to parse into a set of tab-delimited values. I have been researching the SAX and DOM XML parsers and found the following:

  • SAX reads the document node by node, so the whole document is never held in memory.

  • DOM loads the whole document into memory at once, but carries a very large memory overhead.

SAX vs DOM for large files:

As far as I understand it,

  • If I use SAX, I would have to iterate through all 1.5 million lines, node by node.

  • If I use DOM, I would pay a large up-front overhead, but results would then be returned quickly.

Problem:

I need to be able to run this parser multiple times on similar files of the same length.

Therefore, which parser should I use?


Bonus points: does anyone know of any good parsers for JavaScript? I realize many are made for Java, but I am much more comfortable with JavaScript.

3 answers

#1


5  

Meet StAX

Just like SAX, StAX follows a streaming programming model for parsing XML. It is, however, a cross between DOM's bidirectional read/write support and ease of use on the one hand and SAX's CPU and memory efficiency on the other.

SAX is read-only and does push parsing, which forces you to handle events and errors right there and then, while the input is being parsed. StAX, on the other hand, is a pull parser that lets the client call methods on the parser when needed. This also means that the application can read multiple XML files simultaneously.
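To make the pull model concrete, here is a minimal sketch using the StAX cursor API that ships with the JDK (`javax.xml.stream`). Because the application asks for each event itself, it can alternate between two readers in one loop. The class name `PullDemo`, the helper `nextStartTag`, and the tiny inline documents are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class PullDemo {
    /** Advances the reader to the next start tag; returns its name, or null at end of input. */
    private static String nextStartTag(XMLStreamReader reader) throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                return reader.getLocalName();
            }
        }
        return null;
    }

    /** Reads two documents in alternation and records the start tags in the order seen. */
    static List<String> interleave(String xmlA, String xmlB) {
        List<String> seen = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader a = factory.createXMLStreamReader(new StringReader(xmlA));
            XMLStreamReader b = factory.createXMLStreamReader(new StringReader(xmlB));
            // The application, not the parser, decides which reader to advance next.
            String nameA = nextStartTag(a);
            String nameB = nextStartTag(b);
            while (nameA != null || nameB != null) {
                if (nameA != null) {
                    seen.add("A:" + nameA);
                    nameA = nextStartTag(a);
                }
                if (nameB != null) {
                    seen.add("B:" + nameB);
                    nameB = nextStartTag(b);
                }
            }
            a.close();
            b.close();
        } catch (XMLStreamException e) {
            // Wrapped so callers of this sketch do not need a checked exception.
            throw new IllegalArgumentException("malformed XML", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(interleave("<a><x/></a>", "<b/>"));
    }
}
```

With a push parser such as SAX, the same interleaving would need two threads or buffered events, since each `parse` call runs to completion before returning.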

JAXP API comparison

╔══════════════════════════════════════╦═════════════════════════╦═════════════════════════╦═══════════════════════╦═══════════════════════════╗
║          JAXP API Property           ║          StAX           ║           SAX           ║          DOM          ║           TrAX            ║
╠══════════════════════════════════════╬═════════════════════════╬═════════════════════════╬═══════════════════════╬═══════════════════════════╣
║ API Style                            ║ Pull events; streaming  ║ Push events; streaming  ║ In memory tree based  ║ XSLT Rule based templates ║
║ Ease of Use                          ║ High                    ║ Medium                  ║ High                  ║ Medium                    ║
║ XPath Capability                     ║ No                      ║ No                      ║ Yes                   ║ Yes                       ║
║ CPU and Memory Utilization           ║ Good                    ║ Good                    ║ Depends               ║ Depends                   ║
║ Forward Only                         ║ Yes                     ║ Yes                     ║ No                    ║ No                        ║
║ Reading                              ║ Yes                     ║ Yes                     ║ Yes                   ║ Yes                       ║
║ Writing                              ║ Yes                     ║ No                      ║ Yes                   ║ Yes                       ║
║ Create, Read, Update, Delete (CRUD)  ║ No                      ║ No                      ║ Yes                   ║ No                        ║
╚══════════════════════════════════════╩═════════════════════════╩═════════════════════════╩═══════════════════════╩═══════════════════════════╝

Reference:
Does StAX Belong in Your XML Toolbox?


StAX is a "pull" type of API. As discussed, there are Cursor and Event Iterator APIs. There are both reading and writing sides of the API. It is more developer friendly than SAX. StAX, like SAX, does not require an entire document to be held in memory. However, unlike SAX, an entire document need not be read. Portions can be skipped. This may result in even improved performance over SAX.

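As a rough illustration of "portions can be skipped", the sketch below uses the Event Iterator API (`XMLEventReader`) to collect the text of `label` elements while fast-forwarding over any subtree with a given name. The element names and the class `SkipDemo` are hypothetical. Note that the parser still has to scan the skipped bytes; what the application saves is its own processing of those events.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

final class SkipDemo {
    /** Consumes events up to and including the end tag matching the start tag just read. */
    private static void skipSubtree(XMLEventReader events) throws XMLStreamException {
        int depth = 1;
        while (depth > 0 && events.hasNext()) {
            XMLEvent event = events.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                depth--;
            }
        }
    }

    /** Returns the text of every label element that is not inside a skipName subtree. */
    static List<String> labelsOutside(String xml, String skipName) {
        List<String> labels = new ArrayList<>();
        try {
            XMLEventReader events =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
            while (events.hasNext()) {
                XMLEvent event = events.nextEvent();
                if (!event.isStartElement()) {
                    continue;
                }
                String name = event.asStartElement().getName().getLocalPart();
                if (name.equals(skipName)) {
                    skipSubtree(events);            // ignore this whole branch
                } else if (name.equals("label")) {
                    labels.add(events.getElementText());
                }
            }
            events.close();
        } catch (XMLStreamException e) {
            // Wrapped so callers of this sketch do not need a checked exception.
            throw new IllegalArgumentException("malformed XML", e);
        }
        return labels;
    }

    public static void main(String[] args) {
        String xml = "<root><label>A</label><meta><label>hidden</label></meta><label>B</label></root>";
        System.out.println(labelsOutside(xml, "meta"));
    }
}
```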

#2


2  

You want SAX, most likely.

DOM is not necessarily faster; it might well be slower, if it works at all, and, as you say, you would need to hold a LOT in memory, probably needlessly.
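For a sense of what the SAX route looks like, here is a minimal sketch that turns a simplified input into tab-delimited rows with the JDK's built-in SAX parser. The `Class`/`label` element names and the `id` attribute are invented stand-ins, not real OWL vocabulary, and a real 125 MB file would be passed as a file or stream rather than a string. State between callbacks has to live in fields of the handler.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class SaxToTsv {
    /** Produces one "id<TAB>label" row per label element found inside a Class element. */
    static List<String> toRows(String xml) {
        List<String> rows = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private String currentId;     // id of the Class element being read
            private StringBuilder text;   // non-null only while inside a label element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if (qName.equals("Class")) {
                    currentId = attrs.getValue("id");
                } else if (qName.equals("label")) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so accumulate it.
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (qName.equals("label")) {
                    rows.add(currentId + "\t" + text);
                    text = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrapped so callers of this sketch do not need checked exceptions.
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = "<Ontology><Class id=\"c1\"><label>Cat</label></Class></Ontology>";
        toRows(xml).forEach(System.out::println);
    }
}
```

Only the current row's data is held at any time, so memory use stays flat regardless of file size.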

#3


2  

OWL XML syntax is reasonably flat, but contains lots of cross-references.


If you need to resolve the cross-references, then a streaming approach (like SAX or StAX) isn't feasible; you will need to build a data structure in memory that holds the whole tree. If you're going to use an in-memory tree, don't use DOM; use one of the more modern models such as JDOM2 or XOM - they are more efficient and more usable.
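The sketch below shows why cross-references push you towards an in-memory structure: a `subClassOf` reference may point at a class that only appears later in the file. To stay dependency-free it uses the JDK's built-in DOM, which is exactly the model this answer advises against for real use; with JDOM2 or XOM the shape of the code would be similar. The element and attribute names are hypothetical, not actual OWL vocabulary.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class CrossRefDemo {
    /** Produces one "childLabel<TAB>parentLabel" row per subClassOf reference. */
    static List<String> childParentRows(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList classes = doc.getElementsByTagName("Class");

            // Pass 1 over the tree: index every class by its id.
            Map<String, Element> byId = new HashMap<>();
            for (int i = 0; i < classes.getLength(); i++) {
                Element cls = (Element) classes.item(i);
                byId.put(cls.getAttribute("id"), cls);
            }

            // Pass 2: resolve each reference, wherever its target sits in the file.
            List<String> rows = new ArrayList<>();
            for (int i = 0; i < classes.getLength(); i++) {
                Element cls = (Element) classes.item(i);
                NodeList parents = cls.getElementsByTagName("subClassOf");
                for (int j = 0; j < parents.getLength(); j++) {
                    String ref = ((Element) parents.item(j)).getAttribute("ref");
                    Element target = byId.get(ref);
                    // Fall back to the raw id when the target is missing.
                    String parentLabel = (target == null) ? ref : target.getAttribute("label");
                    rows.add(cls.getAttribute("label") + "\t" + parentLabel);
                }
            }
            return rows;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Wrapped so callers of this sketch do not need checked exceptions.
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<Ontology>"
            + "<Class id=\"c2\" label=\"Cat\"><subClassOf ref=\"c1\"/></Class>"
            + "<Class id=\"c1\" label=\"Animal\"/>"
            + "</Ontology>";
        childParentRows(xml).forEach(System.out::println);
    }
}
```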

If a streaming approach is feasible - that is, if there's a very direct correspondence between your input and output - then StAX is easier to work with than SAX, because you can save the current state in variables on the Java stack rather than needing complex data structures to maintain state between calls.
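The point about keeping state on the stack can be sketched as follows: each nesting level of the document gets its own method, and whatever that level needs to remember is an ordinary local variable. As before, `Class`, `label`, and `id` are invented names rather than real OWL vocabulary, and `StaxToTsv` is a hypothetical class.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxToTsv {
    /** Produces one "id<TAB>label" row per Class element. */
    static List<String> toRows(String xml) {
        List<String> rows = new ArrayList<>();
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals("Class")) {
                    rows.add(readClass(reader));
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped so callers of this sketch do not need a checked exception.
            throw new IllegalArgumentException("malformed XML", e);
        }
        return rows;
    }

    /** Called with the reader on a Class start tag; returns with it on the matching end tag. */
    private static String readClass(XMLStreamReader reader) throws XMLStreamException {
        // The "current class" state is just two local variables.
        String id = reader.getAttributeValue(null, "id");
        String label = "";
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equals("label")) {
                label = reader.getElementText();
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && reader.getLocalName().equals("Class")) {
                break;
            }
        }
        return id + "\t" + label;
    }

    public static void main(String[] args) {
        String xml = "<Ontology><Class id=\"c1\"><label>Cat</label></Class></Ontology>";
        toRows(xml).forEach(System.out::println);
    }
}
```

An equivalent SAX handler would need fields such as "current id" and "inside label" that persist across callbacks, which is the bookkeeping this style avoids.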

However, there's an alternative: you could write the whole thing in streaming XSLT 3.0. To be honest, this is bleeding edge and your learning time would probably be a lot greater, and it's not open-source, but you might well end up with a solution in 10 lines of code rather than 300.

There are other streaming technologies I haven't tried, like XStream.

