Parsing large XML files in Java

I have large XML files (~1 GB) with this structure:

<?xml version="1.0" encoding="UTF-8"?>
<GenoExchange xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm.nih.gov/SNP/geno" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/geno ftp://ftp.ncbi.nlm.nih.gov/snp/specs/genoex_1_5.xsd" dbSNPBuildNo="146" reportId="MT" reportType="chromosome">
    <Population popId="638" handle="TSC-CSHL" locPopId="TSC_42_AA">
        <popClass self="NORTH AMERICA"/>
    </Population>
    <SnpInfo rsId="1041870" observed="C/T">
        <SnpLoc genomicAssembly="107:GRCh38.p2" geneId="4512" geneSymbol="COX1" chrom="MT" start="6150" locType="2" rsOrientToChrom="fwd" contigAllele="T" contig="NC_012920:1"/>
        <SsInfo ssId="1508548" locSnpId="TSC0349089" ssOrientToRs="fwd">
            <ByPop popId="1303" sampleSize="184">
                <AlleleFreq allele="T" freq="1"/>
                <AlleleFreq allele="C" freq="0"/>
            </ByPop>
        </SsInfo>
    </SnpInfo>
    <SnpInfo rsId="1029293" observed="C/T">
        <SnpLoc genomicAssembly="107:GRCh38.p2" geneId="4512" geneSymbol="COX1" chrom="MT" start="6307" locType="2" rsOrientToChrom="fwd" contigAllele="C" contig="NC_012920:1"/>
        <SsInfo ssId="1494519" locSnpId="TSC0254145" ssOrientToRs="fwd">
            <ByPop popId="639" sampleSize="82">
                <AlleleFreq allele="T" freq="0"/>
                <AlleleFreq allele="C" freq="1"/>
            </ByPop>
            <ByPop popId="1303" sampleSize="184">
                <AlleleFreq allele="T" freq="0"/>
                <AlleleFreq allele="C" freq="1"/>
            </ByPop>
        </SsInfo>
    </SnpInfo>

I want to find a specific rsId, for example rsId="1029293", and extract all the information inside that node. I don't want to process the whole file: I only want to find that ID, extract its information, and stop iterating. From what I have read, a SAX or StAX parser is the better choice. I'm using SAX; this is my code:

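import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class UserHandler extends DefaultHandler {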

   String rsID = null;
   String i = "1029293";       

   @Override
   public void startElement(String uri, 
      String localName, String qName, Attributes attributes) throws SAXException {

      if (qName.equalsIgnoreCase("SnpInfo")) { 
         rsID = attributes.getValue("rsId"); 
          //System.out.println("value: " + rsID);
      }
      if((i).equals(rsID) &&
         qName.equalsIgnoreCase("SnpInfo")){
         System.out.println("Start Element: " + qName + " " + rsID);
      }      

      if ((i).equals(rsID) && qName.equalsIgnoreCase("SsInfo")) {
          String a = attributes.getValue("ssId");
          System.out.println("SSID: " + a);
      }

      if ((i).equals(rsID) && qName.equalsIgnoreCase("ByPop")) {
          String p = attributes.getValue("popId");
          System.out.println("POPID: " + p);
      } 
      if ((i).equals(rsID) && qName.equalsIgnoreCase("AlleleFreq")) {
          String p = attributes.getValue("allele");
          String f = attributes.getValue("freq"); 
          System.out.println("ALLELE: " + p + " FREQ: " + f);
      }  
      if ((i).equals(rsID) && qName.equalsIgnoreCase("GTypeFreq")) {
          String p = attributes.getValue("gtype");
          String f = attributes.getValue("freq"); 
          System.out.println("GTYPE: " + p + " FREQ: " + f);
      }  
   }

   @Override
   public void endElement(String uri, 
      String localName, String qName) throws SAXException {
      if (qName.equalsIgnoreCase("SnpInfo") && (i).equals(rsID)) {
         System.out.println("End Element: " + qName);
      }
   }
}
public class XMLParser {

    public static void main(String argv[]) {
        try {   
            InputStream fileStream = new FileInputStream("/home/xml/gt_chr10.xml.gz");
            InputStream gzipStream = new GZIPInputStream(fileStream);
            SAXParserFactory factory = SAXParserFactory.newInstance();
            SAXParser saxParser = factory.newSAXParser();
            UserHandler userhandler = new UserHandler();
            saxParser.parse(gzipStream, userhandler);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

My problem is that my code searches the whole file for the ID, which takes more than 2 minutes each time. I can't have code that takes that long. Is there a better approach?

5 answers

#1

Using StAX gives you more control when parsing XML, since you actively pull events from the stream. You can pull the next event, handle it, and once you have found your data, simply terminate the loop (using a flag, or even a return statement if you must).

InputStream in = ...
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLEventReader eventReader = factory.createXMLEventReader(in);

boolean found = false;
while (!found && eventReader.hasNext()) {
    XMLEvent event = eventReader.nextEvent();
    switch (event.getEventType()) {
    case XMLStreamConstants.START_ELEMENT:
        // your logic here 
        // once you found your element, you can terminate the loop 
        found = true;
        break;
    case XMLStreamConstants.END_ELEMENT:
        // your logic here
        break;
    }
}

(Exception and resource handling omitted for brevity.)
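
To make the "your logic here" parts concrete, here is a minimal sketch of the same pull loop applied to the question's data. The class name `StaxRsIdSearch`, the `find` and `attr` helpers, the embedded `SAMPLE` document and the output format are all assumptions made for illustration; only the element and attribute names come from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.Attribute;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class; the name and output format are illustrative only.
final class StaxRsIdSearch {

    // Tiny stand-in for the real file, so the sketch runs without external data.
    static final String SAMPLE =
        "<GenoExchange xmlns=\"http://www.ncbi.nlm.nih.gov/SNP/geno\">"
        + "<SnpInfo rsId=\"1041870\" observed=\"C/T\">"
        + "<SsInfo ssId=\"1508548\"><ByPop popId=\"1303\" sampleSize=\"184\">"
        + "<AlleleFreq allele=\"T\" freq=\"1\"/></ByPop></SsInfo></SnpInfo>"
        + "<SnpInfo rsId=\"1029293\" observed=\"C/T\">"
        + "<SsInfo ssId=\"1494519\"><ByPop popId=\"639\" sampleSize=\"82\">"
        + "<AlleleFreq allele=\"C\" freq=\"1\"/></ByPop></SsInfo></SnpInfo>"
        + "</GenoExchange>";

    /** Returns one line per reported element of the matching SnpInfo, or an empty list. */
    static List<String> find(InputStream in, String wantedRsId) throws XMLStreamException {
        List<String> lines = new ArrayList<>();
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        try {
            boolean inside = false; // true while we are within the wanted SnpInfo
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    StartElement start = event.asStartElement();
                    String name = start.getName().getLocalPart();
                    if (!inside) {
                        // For non-matching records only this one check runs.
                        if (name.equals("SnpInfo") && wantedRsId.equals(attr(start, "rsId"))) {
                            inside = true;
                            lines.add("SnpInfo rsId=" + wantedRsId);
                        }
                    } else if (name.equals("SsInfo")) {
                        lines.add("SsInfo ssId=" + attr(start, "ssId"));
                    } else if (name.equals("ByPop")) {
                        lines.add("ByPop popId=" + attr(start, "popId"));
                    } else if (name.equals("AlleleFreq")) {
                        lines.add("AlleleFreq allele=" + attr(start, "allele")
                                + " freq=" + attr(start, "freq"));
                    }
                } else if (inside && event.isEndElement()
                        && event.asEndElement().getName().getLocalPart().equals("SnpInfo")) {
                    break; // record complete: no further events are pulled
                }
            }
        } finally {
            reader.close();
        }
        return lines;
    }

    // Unprefixed attributes have no namespace, so a local-name QName is enough.
    private static String attr(StartElement e, String name) {
        Attribute a = e.getAttributeByName(new QName(name));
        return a == null ? null : a.getValue();
    }

    public static void main(String[] args) throws XMLStreamException {
        InputStream in = new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8));
        find(in, "1029293").forEach(System.out::println);
    }
}
```

For the real file, the stream passed to `find` would be the `GZIPInputStream` from the question. Stopping early only saves the work after the match: everything before the wanted record still has to be decompressed and scanned, so a record near the end of the file costs almost a full pass.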

On a side note, you will gain some performance by combining your repeated if ((i).equals(rsID) && ... checks into a single outer check, with the detail checks in nested ifs:

if ((i).equals(rsID)) {
    if(qName.equalsIgnoreCase("GTypeFreq")) {
       ...
    }
}

#2

You can throw an exception in your end-element handler to tell the parser to abort parsing (http://www.ibm.com/developerworks/library/x-tipsaxstop/):

   @Override
   public void endElement(String uri, 
      String localName, String qName) throws SAXException {
      if (qName.equalsIgnoreCase("SnpInfo") && (i).equals(rsID)) {
         System.out.println("End Element: " + qName);
         // Abort parsing; the caller must catch this and treat it as "found"
         throw new SAXException("Element found.");
      }
   }
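
The handler above only throws; the calling side also has to tell this deliberate abort apart from a genuine parse error. A minimal sketch of both halves follows, assuming a hypothetical `SaxEarlyExit` class with a `found` flag on the handler and an embedded `SAMPLE` document (none of these names come from the answer):

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical wrapper class; names and output format are illustrative only.
final class SaxEarlyExit {

    // Tiny stand-in for the real file, so the sketch runs without external data.
    static final String SAMPLE =
        "<GenoExchange xmlns=\"http://www.ncbi.nlm.nih.gov/SNP/geno\">"
        + "<SnpInfo rsId=\"1041870\" observed=\"C/T\">"
        + "<SsInfo ssId=\"1508548\"><ByPop popId=\"1303\" sampleSize=\"184\">"
        + "<AlleleFreq allele=\"T\" freq=\"1\"/></ByPop></SsInfo></SnpInfo>"
        + "<SnpInfo rsId=\"1029293\" observed=\"C/T\">"
        + "<SsInfo ssId=\"1494519\"><ByPop popId=\"639\" sampleSize=\"82\">"
        + "<AlleleFreq allele=\"C\" freq=\"1\"/></ByPop></SsInfo></SnpInfo>"
        + "</GenoExchange>";

    static final class Handler extends DefaultHandler {
        private final String wantedRsId;
        private boolean inside;  // true while we are within the wanted SnpInfo
        boolean found;           // set just before the abort exception is thrown
        final List<String> lines = new ArrayList<>();

        Handler(String wantedRsId) {
            this.wantedRsId = wantedRsId;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("SnpInfo")) {
                inside = wantedRsId.equals(atts.getValue("rsId"));
            }
            if (!inside) {
                return; // single cheap check for every non-matching record
            }
            switch (qName) {
                case "SnpInfo":
                    lines.add("SnpInfo rsId=" + atts.getValue("rsId"));
                    break;
                case "SsInfo":
                    lines.add("SsInfo ssId=" + atts.getValue("ssId"));
                    break;
                case "ByPop":
                    lines.add("ByPop popId=" + atts.getValue("popId"));
                    break;
                case "AlleleFreq":
                    lines.add("AlleleFreq allele=" + atts.getValue("allele")
                            + " freq=" + atts.getValue("freq"));
                    break;
                default:
                    break;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (inside && qName.equals("SnpInfo")) {
                found = true;
                throw new SAXException("Element found."); // deliberate abort
            }
        }
    }

    static List<String> find(InputStream in, String wantedRsId) throws Exception {
        Handler handler = new Handler(wantedRsId);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (SAXException e) {
            if (!handler.found) {
                throw e; // a real parse error, not our abort signal
            }
        }
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8));
        find(in, "1029293").forEach(System.out::println);
    }
}
```

Checking a flag on the handler, rather than the exception's type or message, keeps the sketch independent of how a particular parser wraps or rethrows exceptions raised inside callbacks.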

#3

The only way to avoid parsing the whole file every time you run this is to put the data in an XML database. Parsing a 1 GB file is going to take about a minute, more or less, depending on the speed of your machine and what processing you do on each node.

A streamed XSLT 3.0 solution is simply:

<xsl:transform version="3.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
     xpath-default-namespace="http://www.ncbi.nlm.nih.gov/SNP/geno">
  <xsl:template name="xsl:initial-template">
    <xsl:stream href="input.xml">
       <xsl:copy-of select="/GenoExchange/SnpInfo[@rsId='1041870'][1]"/>
    </xsl:stream>
  </xsl:template>
</xsl:transform>

No need to write all that pesky SAX or StAX code.

I put the "[1]" predicate in to allow the processor to abandon the search when it has found the first hit.

#4

The best approach is to use VTD-XML with XPath. A 1 GB XML file takes about 1.5 GB of heap space and under 10 seconds on a 3 to 4 year old Intel processor; see the code example below. One more thing: if you want to eliminate parsing entirely, you can create a VTD+XML file so that any subsequent query can directly access the VTD index portion, which could easily triple or quadruple your app's performance.

import com.ximpleware.*;

public class simpleXpathSearch {
    public static void main(String[] s) throws VTDException, java.io.UnsupportedEncodingException, java.io.IOException {
        VTDGen vg = new VTDGen();
        vg.setLCLevel(5);
        // Second argument false: parse without namespace awareness
        if (!vg.parseFile("input.xml", false))
            return;
        VTDNav vn = vg.getNav();
        AutoPilot ap = new AutoPilot(vn);
        // XPath names are case-sensitive: the attribute in the file is rsId
        ap.selectXPath("/*/*[@rsId='1029293']");
        int i = 0;
        while ((i = ap.evalXPath()) != -1) {
            // your code logic here
        }
    }
}

#5

//Main class

public static void main(String[] args) {
    SAXReader.read();
}

//SAXReader

public static void read(){
    try {
        XMLReader processor = XMLReaderFactory.createXMLReader();
        processor.setContentHandler(new SAXController());
        processor.parse(new InputSource("MyXML.xml"));
    } catch (SAXException | IOException e) {
        System.err.println(e.getMessage());
    }
}

//SAXController

// The SAXController extends DefaultHandler

private int tab = 0;

private void tabulation() {
    for (int i=0; i<tab; i++)
        System.out.print("  ");
}

@Override
public void startDocument() {
    tabulation();
    System.out.println("Starting XML Document");
    tab++;
}

@Override
public void endDocument() {
    tab--;
    tabulation();
    System.out.println("Ending XML Document");
}

@Override
public void startElement(String uri, String localName, String qName, Attributes attributes)
        throws SAXException {
    tabulation();
    System.out.print(localName);
    if (attributes.getLength()>0) {
        for (int i=0; i<attributes.getLength(); i++) {
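            // Leading space keeps each attribute separate from the element name and from the previous attribute
            System.out.print(" " + attributes.getLocalName(i) + ": " + attributes.getValue(i));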
        }
    }
    System.out.println();
    tab++;
}

@Override
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    tab--;
    tabulation();
    System.out.println(localName);
}

@Override
public void characters(char[] ch, int start, int length)
        throws SAXException {
    String content= new String(ch, start, length);
    content= content.replaceAll("[\t\n]", "").trim();
    if (!content.equals("")) {
        tabulation();
        System.out.println(content);
    }
}
