Can JAXB parse large XML files in chunks?

Date: 2021-12-16 21:33:24

I need to parse potentially large XML files. Their schema is already provided to me in several XSD files, so XML binding is strongly preferred. I'd like to know whether I can use JAXB to parse a file in chunks and, if so, how.

3 answers

#1


Because code matters, here is a PartialUnmarshaller that reads a big file in chunks. It can be used like this: new PartialUnmarshaller<YourClass>(stream, YourClass.class)

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Unmarshaller;
import javax.xml.stream.*;
import java.io.InputStream;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import static javax.xml.stream.XMLStreamConstants.*;

public class PartialUnmarshaller<T> {
    XMLStreamReader reader;
    Class<T> clazz;
    Unmarshaller unmarshaller;

    public PartialUnmarshaller(InputStream stream, Class<T> clazz) throws XMLStreamException, FactoryConfigurationError, JAXBException {
        this.clazz = clazz;
        this.unmarshaller = JAXBContext.newInstance(clazz).createUnmarshaller();
        this.reader = XMLInputFactory.newInstance().createXMLStreamReader(stream);

        /* ignore headers */
        skipElements(START_DOCUMENT, DTD);
        /* ignore root element */
        reader.nextTag();
        /* if there's no tag, ignore root element's end */
        skipElements(END_ELEMENT);
    }

    public T next() throws XMLStreamException, JAXBException {
        if (!hasNext())
            throw new NoSuchElementException();

        T value = unmarshaller.unmarshal(reader, clazz).getValue();

        skipElements(CHARACTERS, END_ELEMENT);
        return value;
    }

    public boolean hasNext() throws XMLStreamException {
        return reader.hasNext();
    }

    public void close() throws XMLStreamException {
        reader.close();
    }

    void skipElements(int... elements) throws XMLStreamException {
        int eventType = reader.getEventType();

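        // box the int varargs so that contains() can be used on the event types
        List<Integer> types = IntStream.of(elements).boxed().collect(Collectors.toList());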
        while (types.contains(eventType))
            eventType = reader.next();
    }
}
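
To make the constructor's cursor handling easier to follow, here is a minimal sketch that repeats the same three steps with plain StAX and reports where the reader ends up. The class name CursorPositionDemo and the method firstChunkName are made up for this illustration, and the sketch assumes there are no comments or processing instructions before the root element.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;

import static javax.xml.stream.XMLStreamConstants.*;

public class CursorPositionDemo {

    /** Name of the element the cursor rests on after the constructor's steps, or null if the root is empty. */
    static String firstChunkName(String xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            // skip START_DOCUMENT and DTD events: the cursor lands on the root START_ELEMENT
            while (reader.getEventType() == START_DOCUMENT || reader.getEventType() == DTD) {
                reader.next();
            }
            // step past the root: first child START_ELEMENT, or the root's END_ELEMENT if it is empty
            reader.nextTag();
            // an empty root leaves the cursor on END_ELEMENT; step over it
            while (reader.getEventType() == END_ELEMENT) {
                reader.next();
            }
            return reader.getEventType() == START_ELEMENT ? reader.getLocalName() : null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(firstChunkName("<orders><order id=\"1\"/><order id=\"2\"/></orders>"));
        System.out.println(firstChunkName("<orders/>"));
    }
}
```

In the first case the reader is parked on the first order element, which is the position from which unmarshal(reader, clazz) can bind one chunk. In the second case it has run to the end of the document, so hasNext() would report false.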

#2


This is detailed in the user guide. The JAXB download from http://jaxb.java.net/ includes an example of how to parse one chunk at a time.


When a document is large, it's usually because there are repetitive parts in it. Perhaps it's a purchase order with a long list of line items, or an XML log file with a large number of log entries.

This kind of XML is suitable for chunk processing. The main idea is to use the StAX API, run a loop, and unmarshal individual chunks separately. Your program acts on a single chunk and then throws it away. This way you keep at most one chunk in memory at a time, which allows you to process large documents.
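
The loop described above might look roughly like the following sketch. It uses only StAX so that it runs without JAXB on the classpath: the hypothetical readChunk helper stands in for data binding and simply flattens a chunk's simple child elements into a map. The names ChunkLoop, forEachChunk and readChunk are invented for this example, and chunks are assumed to contain only text-valued children.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

import static javax.xml.stream.XMLStreamConstants.*;

public class ChunkLoop {

    /** Calls handler once per chunkName element and returns how many chunks were seen. */
    static int forEachChunk(Reader xml, String chunkName, Consumer<Map<String, String>> handler)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        int count = 0;
        try {
            while (reader.hasNext()) {
                if (reader.next() == START_ELEMENT && chunkName.equals(reader.getLocalName())) {
                    // one chunk is built, handed over, and then becomes garbage
                    handler.accept(readChunk(reader));
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    /** Stand-in for data binding: flattens the chunk's simple child elements into a map. */
    private static Map<String, String> readChunk(XMLStreamReader reader) throws XMLStreamException {
        Map<String, String> fields = new LinkedHashMap<>();
        while (reader.nextTag() == START_ELEMENT) {
            String name = reader.getLocalName();
            fields.put(name, reader.getElementText()); // leaves the cursor on the child's END_ELEMENT
        }
        return fields; // the cursor is now on the chunk's END_ELEMENT
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<log>"
                + "<entry><level>INFO</level><msg>started</msg></entry>"
                + "<entry><level>WARN</level><msg>disk low</msg></entry>"
                + "</log>";
        int n = forEachChunk(new StringReader(xml), "entry", entry -> System.out.println(entry));
        System.out.println(n + " chunks processed");
    }
}
```

With JAXB, readChunk would be replaced by a call such as unmarshaller.unmarshal(reader, SomeType.class). Note that JAXB leaves the reader just past the chunk's end tag rather than on it, so the loop would then need to inspect the current event before calling next() again, much as the PartialUnmarshaller in answer #1 does.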

See the streaming-unmarshalling example and the partial-unmarshalling example in the JAXB RI distribution for more about how to do this. The streaming-unmarshalling example has the advantage that it can handle chunks at an arbitrary nesting level, but it requires you to deal with the push model: the JAXB unmarshaller will "push" new chunks to you, and you need to process them right there.

In contrast, the partial-unmarshalling example works in a pull model (which usually makes the processing easier), but this approach has some limitations when databinding portions of the document other than the repeated part.
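
To illustrate what "push" means here, the following sketch uses a plain SAX handler rather than the JAXB sample itself: the parser drives the processing and calls back into your code whenever an item is complete, whereas in a pull loop your code asks the reader for the next chunk. PushModelDemo and pushItems are hypothetical names, and the sketch assumes item elements contain only text and are not nested inside each other.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;
import java.util.function.Consumer;

public class PushModelDemo {

    /** Push model: the parser drives, and the callback receives each item's text as soon as it completes. */
    static void pushItems(String xml, String itemName, Consumer<String> callback) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that localName is populated
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            private StringBuilder text; // non-null only while inside an item element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (itemName.equals(localName)) {
                    text = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (text != null) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (itemName.equals(localName)) {
                    callback.accept(text.toString()); // handed over immediately, then forgotten
                    text = null;
                }
            }
        });
    }

    public static void main(String[] args) throws Exception {
        pushItems("<items><item>a</item><item>b</item></items>", "item",
                item -> System.out.println("got " + item));
    }
}
```

The caller never sees a loop: all per-item work has to happen inside the callback, which is the trade-off the paragraph above describes.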

#3


Yves Amsellem's answer is pretty good, but it only works if all elements are of exactly the same type. Otherwise the unmarshal call will throw an exception, and because the reader will already have consumed the bytes, you will be unable to recover. Instead, we should follow Skaffman's advice and look at the sample from the JAXB jar.

To explain how it works:


  1. Create a JAXB unmarshaller.

  2. Add a listener to the unmarshaller for intercepting the appropriate elements. This is done by "hacking" the ArrayList to ensure the elements are not stored in memory after being unmarshalled.

  3. Create a SAX parser. This is where the streaming happens.

  4. Use the unmarshaller to generate a handler for the SAX parser.

  5. Stream!

I modified the solution to be generic*. However, it required some reflection. If this is not OK, please look at the code samples in the JAXB jars.


ArrayListAddInterceptor.java

import java.lang.reflect.Field;
import java.util.ArrayList;

public class ArrayListAddInterceptor<T> extends ArrayList<T> {
    private static final long serialVersionUID = 1L;

    private AddInterceptor<T> interceptor;

    public ArrayListAddInterceptor(AddInterceptor<T> interceptor) {
        this.interceptor = interceptor;
    }

    @Override
    public boolean add(T t) {
        interceptor.intercept(t);
        return false;
    }

    public static interface AddInterceptor<T> {
        public void intercept(T t);
    }

    public static void apply(AddInterceptor<?> interceptor, Object o, String property) {
        try {
            Field field = o.getClass().getDeclaredField(property);
            field.setAccessible(true);
            field.set(o, new ArrayListAddInterceptor(interceptor));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

}
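
To see the list "hack" in isolation, here is a small self-contained variation that can be run without JAXB. It is only a sketch under stated assumptions: ForwardingList uses java.util.function.Consumer in place of the AddInterceptor interface above, and Orders is a hypothetical stand-in for a generated JAXB class with a list-valued field.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class InterceptDemo {

    /** A list that forwards every added element to a callback instead of storing it. */
    static class ForwardingList<T> extends ArrayList<T> {
        private static final long serialVersionUID = 1L;
        private final Consumer<T> sink;

        ForwardingList(Consumer<T> sink) {
            this.sink = sink;
        }

        @Override
        public boolean add(T element) {
            sink.accept(element);
            return false; // nothing was stored, so the list did not change
        }
    }

    /** Hypothetical stand-in for a generated JAXB class with a list-valued field. */
    static class Orders {
        private List<String> order = new ArrayList<>();

        List<String> getOrder() {
            return order;
        }
    }

    /** Replaces the named list field on target with a ForwardingList, via reflection. */
    static <T> void install(Object target, String fieldName, Consumer<T> sink)
            throws ReflectiveOperationException {
        Field field = target.getClass().getDeclaredField(fieldName);
        field.setAccessible(true);
        field.set(target, new ForwardingList<>(sink));
    }

    public static void main(String[] args) throws ReflectiveOperationException {
        Orders orders = new Orders();
        List<String> seen = new ArrayList<>();
        Consumer<String> sink = seen::add;
        install(orders, "order", sink);

        orders.getOrder().add("first");
        orders.getOrder().add("second");

        System.out.println(seen);                     // prints [first, second]
        System.out.println(orders.getOrder().size()); // prints 0
    }
}
```

Every element passes through the callback, yet the list itself stays empty. That is what keeps memory use flat when the unmarshaller adds one PurchaseOrder after another to the intercepted list.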

Main.java

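import java.io.File;
import java.io.FileInputStream;
import java.util.List;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// PurchaseOrder and PurchaseOrders are assumed to be the classes generated from primer.xsd
import primer.PurchaseOrder;
import primer.PurchaseOrders;

public class Main {
    public void parsePurchaseOrders(ArrayListAddInterceptor.AddInterceptor<PurchaseOrder> interceptor, List<File> files) {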
        try {
            // create JAXBContext for the primer.xsd
            JAXBContext context = JAXBContext.newInstance("primer");

            Unmarshaller unmarshaller = context.createUnmarshaller();

            // install the callback on all PurchaseOrders instances
            unmarshaller.setListener(new Unmarshaller.Listener() {
                public void beforeUnmarshal(Object target, Object parent) {
                    if (target instanceof PurchaseOrders) {
                        ArrayListAddInterceptor.apply(interceptor, target, "purchaseOrder");
                    }
                }
            });

            // create a new XML parser
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setContentHandler(unmarshaller.getUnmarshallerHandler());

            for (File file : files) {
                reader.parse(new InputSource(new FileInputStream(file)));
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

*This code has not been tested and is for illustrative purposes only.

