Reading a single XML document from a stream using dom4j

Date: 2022-10-16 21:25:40

I'm trying to read a single XML document from a stream at a time using dom4j, process it, then proceed to the next document on the stream. Unfortunately, dom4j's SAXReader (using JAXP under the covers) keeps reading and chokes on the following document element.

Is there a way to get the SAXReader to stop reading the stream once it finds the end of the document element? Is there a better way to accomplish this?

6 answers

#1


1  

I was able to get this to work with some gymnastics using some internal JAXP classes:

  • Create a custom scanner, a subclass of XMLNSDocumentScannerImpl.
    • Inside the custom scanner, create a custom driver, an implementation of XMLNSDocumentScannerImpl.Driver, that returns END_DOCUMENT when it sees a declaration or an element. Get the ScannedEntity from fElementScanner.getCurrentEntity(). If the entity has a PushbackReader, push back the remaining unread characters in the entity buffer onto the reader.
    • In the constructor, replace the fTrailingMiscDriver with an instance of this custom driver.

  • Create a custom configuration class, a subclass of XIncludeAwareParserConfiguration, that replaces the stock DOCUMENT_SCANNER with an instance of this custom scanner in its constructor.

  • Install an instance of this custom configuration class as the "com.sun.org.apache.xerces.internal.xni.parser.XMLParserConfiguration" property so it will be instantiated when dom4j's SAXReader class tries to create a JAXP XMLReader.

  • When passing a Reader to dom4j's SAXReader.read() method, supply a PushbackReader with a buffer size considerably larger than the one-character default. At least 8192 should be enough to support the default buffer size of the XMLEntityManager inside JAXP's copy of Apache Xerces2.

This isn't the cleanest solution, as it involves subclassing internal JAXP classes, but it does work.

#2


0  

Most likely, you don't want to have more than one document in the same stream at the same time. I don't think that the SAXReader is smart enough to stop when it gets to the end of the first document. Why is it necessary to have multiple documents in the same stream like this?

#3


0  

I think you'd have to add an adapter, something that wraps the stream and returns end of file when it sees the beginning of the next document. As far as I know, the parsers as written will go until the end of the file or an error... and seeing another <?xml version="1.0"?> would certainly be an error.
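
One way such an adapter could look is sketched below. It is only an illustration under stated assumptions: the names NextDeclReader, nextDocument() and NextDeclDemo are made up for this example, every document is assumed to start with an XML declaration, and the boundary check is naive (a literal "<?xml " inside a comment or CDATA section would be mistaken for a new document). The sketch uses the JDK's own DocumentBuilder so that it runs without extra libraries.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

/** Hypothetical adapter: reports EOF just before the next "<?xml " declaration. */
class NextDeclReader extends Reader {
    private static final String REST = "?xml ";   // what follows '<' in a declaration
    private final PushbackReader in;
    private boolean atDocStart = true;             // the first '<' belongs to this document
    private boolean ended = false;

    NextDeclReader(Reader source) {
        // Room for the lookahead characters plus the '<' itself.
        in = new PushbackReader(source, REST.length() + 1);
    }

    /** Skips whitespace and re-arms the reader; returns false once the stream is exhausted. */
    boolean nextDocument() throws IOException {
        int c;
        do { c = in.read(); } while (c != -1 && Character.isWhitespace(c));
        if (c == -1) return false;
        in.unread(c);
        atDocStart = true;
        ended = false;
        return true;
    }

    /** Peeks at the characters after a '<' without consuming them. */
    private boolean declarationFollows() throws IOException {
        char[] look = new char[REST.length()];
        int n = 0;
        while (n < look.length) {
            int r = in.read(look, n, look.length - n);
            if (r == -1) break;
            n += r;
        }
        in.unread(look, 0, n);
        return n == look.length && REST.equals(new String(look));
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (ended) return -1;
        int n = 0;
        while (n < len) {
            int c = in.read();
            if (c == -1) { ended = true; break; }
            if (c == '<' && !atDocStart && declarationFollows()) {
                in.unread(c);             // leave the declaration for the next document
                ended = true;
                break;
            }
            buf[off + n++] = (char) c;
            atDocStart = false;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    @Override
    public void close() { /* no-op so the parser cannot close the shared stream */ }
}

class NextDeclDemo {
    /** Parses every document in the stream and returns the root element names. */
    static List<String> rootNames(String stream) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        NextDeclReader reader = new NextDeclReader(new StringReader(stream));
        List<String> names = new ArrayList<>();
        while (reader.nextDocument()) {
            names.add(builder.parse(new InputSource(reader)).getDocumentElement().getTagName());
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootNames("<?xml version=\"1.0\"?><a/>\n<?xml version=\"1.0\"?><b>x</b>"));
    }
}
```

With dom4j, the same reader could presumably be handed to SAXReader.read(Reader) in place of the DocumentBuilder call.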

#4


0  

Assuming you are responsible for placing documents into the stream in the first place, it should be easy to delimit the documents in some fashion. For example:

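import java.io.BufferedWriter;
import java.io.IOException;

// Any value that is invalid for an XML character will do.
static final char DOC_TERMINATOR = 4;

// Writes one document followed by the terminator character.
static void addDocumentToStream(BufferedWriter streamOut, char[] xmlData) throws IOException
{
  streamOut.write(xmlData);
  streamOut.write(DOC_TERMINATOR);
}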

Then, when reading from the stream, read into an array until DOC_TERMINATOR is encountered.

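static char[] getNextDocument(BufferedReader streamIn) throws IOException
{
  StringBuilder buffer = new StringBuilder();
  int character;

  while (true)
  {
    character = streamIn.read();
    // Stop at the terminator, or at end of stream (-1) so the loop cannot spin forever.
    if (character == DOC_TERMINATOR || character == -1)
      break;

    // Cast to char; appending the int would append its decimal digits instead.
    buffer.append((char) character);
  }
  return buffer.toString().toCharArray();
}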

Since 4 is an invalid XML character value, you won't encounter it except where you explicitly add it, which lets you split the documents. Now just wrap the resulting char array for input into SAX and you're good to go.

...
  XMLReader xmlReader = XMLReaderFactory.createXMLReader();
...
  while (true)
  {
    char[] xmlDoc = getNextDocument(streamIn);

    if (xmlDoc.length == 0)
      break;

    InputSource saxInputSource = new InputSource(new CharArrayReader(xmlDoc));
    xmlReader.parse(saxInputSource);
  }
...

Note that the loop terminates when it gets a doc of length 0. This means that you should either add a second DOC_TERMINATOR after the last document, or have getNextDocument() detect the end of the stream.

#5


0  

I have done this before by wrapping the base reader with another reader of my own creation that had very simple parsing capability. Assuming you know the closing tag for the document, the wrapper simply scans for a match, e.g. for "</MyDocument>". When it detects that, it returns EOF. The wrapper can be made adaptive by parsing out the first opening tag and returning EOF on the matching closing tag. I found it was not necessary to actually track the nesting level of the closing tag, since none of my documents used the document tag within itself, so the first occurrence of the closing tag was guaranteed to end the document.

As I recall, one of the tricks was to have the wrapper block close(), since the DOM reader closes the input source.

So, given Reader input, your code might look like:

SubdocReader sdr=new SubdocReader(input);
while(!sdr.eof()) {
    sdr.next();
    // read doc here using DOM
    // then process document
    }
input.close();

The eof() method returns true if EOF is encountered. The next() method flags the reader to stop returning -1 for read().
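
The SubdocReader class itself is not shown above, so here is a minimal sketch of how it might be written. Everything beyond the eof(), next() and no-op close() behaviour described in this answer is an assumption: the constructor taking the closing tag, the whitespace skipping inside eof(), and the SubdocDemo helper are hypothetical, and the JDK's DocumentBuilder stands in for the DOM reader so the example runs without extra libraries.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical SubdocReader: reports EOF after each occurrence of a known closing tag. */
class SubdocReader extends Reader {
    private static final int NONE = -2;   // marker for "no lookahead character"
    private final Reader in;              // underlying multi-document stream
    private final String closingTag;      // e.g. "</MyDocument>"
    private int matched = 0;              // characters of closingTag matched so far
    private boolean docEnded = true;      // read() returns -1 while this is true
    private int pending = NONE;           // one-character lookahead filled by eof()

    SubdocReader(Reader in, String closingTag) {
        this.in = in;
        this.closingTag = closingTag;
    }

    /** Between documents: skips whitespace and reports whether the stream is exhausted. */
    boolean eof() throws IOException {
        while (docEnded && pending == NONE) {
            int c = in.read();
            if (c == -1 || !Character.isWhitespace(c)) pending = c;
        }
        return pending == -1;
    }

    /** Re-arms the reader so that read() delivers the next document. */
    void next() {
        docEnded = false;
        matched = 0;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (docEnded) return -1;          // pretend EOF until next() is called
        int n = 0;
        while (n < len && !docEnded) {
            int c = (pending != NONE) ? pending : in.read();
            pending = NONE;
            if (c == -1) { pending = -1; docEnded = true; break; }
            buf[off + n++] = (char) c;
            // Naive matcher; sufficient because '<' occurs only at the start of the tag.
            if (c == closingTag.charAt(matched)) matched++;
            else matched = (c == closingTag.charAt(0)) ? 1 : 0;
            if (matched == closingTag.length()) docEnded = true;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    @Override
    public void close() { /* deliberately a no-op: the parser closes its input */ }
}

class SubdocDemo {
    /** Parses each document in the stream and returns the text of its root element. */
    static List<String> rootTexts(String stream, String closingTag) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        SubdocReader sdr = new SubdocReader(new StringReader(stream), closingTag);
        List<String> texts = new ArrayList<>();
        while (!sdr.eof()) {
            sdr.next();
            Document doc = builder.parse(new InputSource(sdr));
            texts.add(doc.getDocumentElement().getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String stream = "<?xml version=\"1.0\"?><msg>one</msg>\n<?xml version=\"1.0\"?><msg>two</msg>\n";
        System.out.println(rootTexts(stream, "</msg>"));
    }
}
```

Having eof() skip whitespace between documents is a design choice of this sketch: it keeps trailing newlines after the last document from being mistaken for one more document, and it ensures each document starts directly at its XML declaration.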

Hopefully this points you in a useful direction.

-- Kiwi.

#6


0  

I would read the input stream into an internal buffer. Depending on the expected total stream size I would either read the entire stream and then parse it or detect the boundary between one xml and the next (look for

The only real difference then between handling a stream with one xml and a stream with multiple xmls is the buffer and split logic.
