How do I validate a large XML file against an XSD schema?

Time: 2022-09-08 17:15:34

I need to validate a large XML file with limited memory usage. Every piece of code I have found so far fails with an out-of-memory error.

Methods I tried:

    // method 1
    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setValidating(false);
    factory.setNamespaceAware(true);

    SchemaFactory schemaFactory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
    factory.setSchema(schemaFactory.newSchema(new Source[] {new StreamSource(Thread.currentThread().getContextClassLoader().getResource("xmlresource/XSD_final2.xsd").getFile())}));
    SAXParser parser = factory.newSAXParser();
    XMLReader reader = parser.getXMLReader();
    reader.setErrorHandler(new SimpleErrorHandler());
    reader.parse(new InputSource(inputXml));

    // method 2
    XMLValidationSchemaFactory sf = XMLValidationSchemaFactory.newInstance(XMLValidationSchema.SCHEMA_ID_W3C_SCHEMA);
    XMLValidationSchema vs = sf.createSchema(Thread.currentThread().getContextClassLoader().getResource("xmlresource/XSD_final2.xsd"));
    XMLStreamReader2 sr = (XMLStreamReader2) XMLInputFactory2.newInstance().createXMLStreamReader(new FileInputStream(inputXml));
    sr.validateAgainst(vs);
    try {
        while (sr.hasNext()) {
            sr.next();
        }
        System.out.println("Validated ok!");
    } catch (XMLValidationException ve) {
        System.err.println("Validation problem: " + ve);
        isValid = false;
    }
    sr.close();

    // method 3
    SchemaFactory factory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
    String fileName = Thread.currentThread().getContextClassLoader().getResource("xmlresource/XSD_final2.xsd").getFile();

    Schema schema = factory.newSchema(new File(fileName));
    Validator validator = schema.newValidator();

    // create a source from a file
    StreamSource source = new StreamSource(new File(inputXml));

    // check input
    validator.validate(source);

I get an OutOfMemoryError every time.

EDIT

With XOM:

    SAXParserFactory factory = SAXParserFactory.newInstance();
    factory.setValidating(false);
    factory.setNamespaceAware(true);

    SchemaFactory schemaFactory = SchemaFactory.newInstance("http://www.w3.org/2001/XMLSchema");
    factory.setSchema(schemaFactory.newSchema(new Source[] {new StreamSource(Thread.currentThread().getContextClassLoader().getResource("xmlresource/XSD_final2.xsd").getFile())}));
    SAXParser parser = factory.newSAXParser();
    XMLReader reader = parser.getXMLReader();
    reader.setErrorHandler(new SimpleErrorHandler());

    Builder builder = new Builder(reader);
    builder.build(new FileInputStream(new File(inputXml)));

Memory usage is still very high: a 15 MB XML file needs about 250 MB of heap. Stack trace:

Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:535)
at java.lang.StringBuffer.append(StringBuffer.java:322)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.handleCharacters(XMLSchemaValidator.java:1574)
at com.sun.org.apache.xerces.internal.impl.xs.XMLSchemaValidator.characters(XMLSchemaValidator.java:789)
at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:441)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:835)
at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:764)
at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
at nu.xom.Builder.build(Unknown Source)
at nu.xom.Builder.build(Unknown Source)

EDIT: My XML contains a large Base64 string.

2 answers

#1


3  

Look at Marco Tedone's article on XML unmarshalling. Based on his conclusions, I would recommend StAX for low memory consumption:

    XMLInputFactory xmlInputFactory = XMLInputFactory.newInstance();
    XMLStreamReader xmlStreamReader = xmlInputFactory.createXMLStreamReader(fileInputStream);
    Validator validator = schema.newValidator();
    validator.validate(new StAXSource(xmlStreamReader));
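The snippet above leaves `schema` and `fileInputStream` undefined. A self-contained sketch of the same idea might look like the following. The class name `StaxValidationSketch`, the sample schema, and the sample documents are illustrative assumptions, not taken from the question. Whether this lowers heap use for a document dominated by one very large text node is not certain: the stack trace in the question shows the Xerces schema validator appending character data to a `StringBuffer`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.stax.StAXSource;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class StaxValidationSketch {

    // Hypothetical schema for illustration: one element holding Base64 data.
    public static final String SAMPLE_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='doc'><xs:complexType><xs:sequence>"
        + "<xs:element name='payload' type='xs:base64Binary'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";

    /** Returns true if the XML stream validates against the XSD, false on a validation error. */
    public static boolean isValid(String xsd, InputStream xml) {
        try {
            // Compile the schema once; a Schema object can be reused across documents.
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));

            // The reader pulls events from the stream instead of building a tree.
            XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
            try {
                schema.newValidator().validate(new StAXSource(reader));
                return true;
            } catch (SAXException validationProblem) {
                // With no ErrorHandler set, the validator throws on the first error.
                return false;
            } finally {
                reader.close();
            }
        } catch (SAXException | XMLStreamException | IOException e) {
            throw new IllegalStateException("Could not set up or run validation", e);
        }
    }

    private static InputStream stream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(isValid(SAMPLE_XSD, stream("<doc><payload>SGVsbG8=</payload></doc>")));
        System.out.println(isValid(SAMPLE_XSD, stream("<doc><other/></doc>")));
    }
}
```

For a real file, the `InputStream` would be a `FileInputStream` over the XML file, as in the answer.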

#2


0  

It's possible that the memory is being used for the schema, not the source document. You haven't said anything about the schema. Some schemas can use very large amounts of memory, for example if the content model has large finite values of minOccurs or maxOccurs. At what point does the out-of-memory error occur?
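One rough way to answer that question is to run schema compilation and document validation as separate steps and see which one fails. The sketch below assumes a hypothetical helper class `OomPhaseProbe` with made-up sample inputs. Catching `OutOfMemoryError` is used here only as a diagnostic aid, not as something to do in production code.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

public class OomPhaseProbe {

    public enum Phase { SCHEMA_COMPILATION, DOCUMENT_VALIDATION, NONE }

    // Hypothetical inputs for illustration; note the finite maxOccurs in the content model.
    public static final String SAMPLE_XSD =
          "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='list'><xs:complexType><xs:sequence>"
        + "<xs:element name='item' type='xs:string' maxOccurs='3'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";
    public static final String SAMPLE_XML = "<list><item>a</item><item>b</item></list>";

    /** Reports which phase ran out of heap, or NONE if both phases completed. */
    public static Phase findFailingPhase(String xsd, String xml) {
        Schema schema;
        try {
            // Phase 1: compile the schema without touching the document.
            schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (OutOfMemoryError e) {
            return Phase.SCHEMA_COMPILATION;
        } catch (SAXException e) {
            throw new IllegalArgumentException("Schema is not usable", e);
        }
        try {
            // Phase 2: validate the document against the already compiled schema.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return Phase.NONE;
        } catch (OutOfMemoryError e) {
            return Phase.DOCUMENT_VALIDATION;
        } catch (SAXException | IOException e) {
            throw new IllegalArgumentException("Document failed for a reason other than memory", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(findFailingPhase(SAMPLE_XSD, SAMPLE_XML));
    }
}
```

If the failure shows up in the first phase, the schema itself is the likely culprit; if it shows up in the second, the document content is the more likely cause.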
