I'm sure this has been discussed at length or answered before, but I need a bit more information on the best approach for my situation.
Problem:
We have some large XML data (anywhere from 100 KB to 5 MB) which we need to inflate into Java objects. The issue is that the data doesn't map onto an object very well at all, so we need to pull only certain parts of the data out and create the objects. Given that, solutions such as JAXB or XStream really aren't appropriate.
So, we need to pull XML data out and get it into Java objects as efficiently as possible.
Possible Solutions:
The way I see it, we have 3 possible solutions:
- SAX parsing
- DOM parsing
- XSLT
We can load the XML into any JAXP implementation and pull the data out using one of the above methods.
Question(s)
I have a few questions/concerns:
- How does XSLT work under the hood? Is it just a DOM parser? I ask because XSLT seems like a good way to go, but I don't really want to consider it if it won't give us better performance than DOM.
- What are some popular libraries that provide DOM, XSLT, and SAX XML parsers?
- In your experience, what are the reasons for picking DOM, SAX, or XSLT? Does the ease of use of DOM or XSLT outweigh the performance improvements SAX offers?
- Are there any benchmarks out there? The ones I've found are old (as in, 8 years old), so some recent benchmarks would be appreciated.
- Are there any other solutions besides those outlined above that I could be missing?
Edit:
A few clarifications... You can use XSLT to directly inject values into a Java object. It is normally used to transform XML into some other XML, but I'm talking about calling a Java method from within the XSLT to inject the value.
I'm still not clear on how an XSLT processor works exactly. How does it feed the XML into the XSLT code you write?
5 Answers
#1
3
Use XSLT to transform the large XML files into a local domain model that is mapped to Java objects with JAXB.
Start with the XML libraries built into JDK 5+ (unless you absolutely need XSLT 2.0, in which case use Saxon).
Don't focus on the relative performance of SAX and DOM. Focus on learning how to write XPath expressions and use XSLT, and worry about performance later, if and only if you find it to be a problem.
The Eclipse XML editors are decent, but if you can afford it, spring for Oxygen XML, which will let you do XPath evaluation in real time.
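As a rough illustration of the first step, here is a minimal sketch that uses only the JDK's built-in javax.xml.transform API to flatten a nested document into a small local model. The stylesheet, the element names (feed, record, party, customer) and the class name FlattenWithXslt are all made up for the example; binding the flattened result with JAXB is left out.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FlattenWithXslt {
    // Hypothetical stylesheet: pulls two values out of a nested feed
    // and emits a small, flat <customer> document.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<customer>"
      + "<name><xsl:value-of select='/feed/record/party/fullName'/></name>"
      + "<city><xsl:value-of select='/feed/record/party/address/city'/></city>"
      + "</customer>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to the input XML and returns the result as a string.
    static String flatten(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String input = "<feed><record><party><fullName>Ada Lovelace</fullName>"
                + "<address><city>London</city></address></party></record></feed>";
        System.out.println(flatten(input));
    }
}
```

In real use the Transformer (or a compiled Templates object) would normally be created once and reused across files rather than rebuilt per call.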
#2
2
We had a similar situation and I just threw together some XPath code that parsed the stuff I needed.
It was amazingly quick even on 100k+ XML files. We went as low-tech as possible. We handle around 1000 files a day of that size and parsing time is very low. We have no memory issues, leaks, etc.
We wrote a quick prototype in Groovy (if my memory is accurate); the proof of concept took me about 10 minutes.
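The original code isn't shown here, so the following is only a sketch of what such low-tech XPath extraction might look like in plain Java with the JDK's javax.xml.xpath API. The document layout, the XPath expressions and the Customer class are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExtract {
    // Hypothetical target object holding only the fields we care about.
    static class Customer {
        final String name;
        final String city;

        Customer(String name, String city) {
            this.name = name;
            this.city = city;
        }
    }

    // Parses the XML once into a DOM tree, then pulls individual values with XPath.
    static Customer extract(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        String name = xpath.evaluate("/feed/record/party/fullName", doc);
        String city = xpath.evaluate("/feed/record/party/address/city", doc);
        return new Customer(name, city);
    }

    public static void main(String[] args) throws Exception {
        String input = "<feed><record><party><fullName>Ada Lovelace</fullName>"
                + "<address><city>London</city></address></party></record></feed>";
        Customer c = extract(input);
        System.out.println(c.name + " / " + c.city);
    }
}
```

Note that this builds a full DOM tree first, so memory use grows with document size; for the file sizes discussed here that is usually acceptable.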
#3
2
JAXB, the Java Architecture for XML Binding, might be what you want. You use it to inflate an XML document into a Java object graph made up of "Java content objects". These content objects are instances of classes generated by JAXB to match the XML document's schema.
But if you already have a set of Java classes, or don't yet have a schema for the document, JAXB probably isn't the best way to go. I'd suggest doing a SAX parse and building up your Java objects during the parse. Alternatively, you could try a DOM parse and then walk the resulting Document tree to pull out the parts of interest (maybe with XPath), but 5 MB of XML might turn into 50 MB of DOM tree objects in Java.
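To make the SAX suggestion concrete, here is a minimal sketch of a handler that builds objects while the parse is running, using the JDK's SAXParser. The item and title element names, the id attribute and the Item class are hypothetical; a real handler would track whichever elements matter in your data.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxExtract {
    // Hypothetical target object.
    static class Item {
        String id;
        String title;
    }

    static class ItemHandler extends DefaultHandler {
        final List<Item> items = new ArrayList<>();
        private Item current;        // item being built, or null outside <item>
        private StringBuilder text;  // non-null only while inside a <title> we want

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals("item")) {
                current = new Item();
                current.id = atts.getValue("id");
            } else if (qName.equals("title") && current != null) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // characters() may be called several times per text node, so accumulate.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("title") && text != null) {
                current.title = text.toString();
                text = null;
            } else if (qName.equals("item")) {
                items.add(current);
                current = null;
            }
        }
    }

    // Streams through the XML once; everything outside <item> is ignored.
    static List<Item> parse(String xml) throws Exception {
        ItemHandler handler = new ItemHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.items;
    }

    public static void main(String[] args) throws Exception {
        String input = "<catalog><noise>skip me</noise>"
                + "<item id='a1'><title>First</title></item>"
                + "<item id='b2'><title>Second</title></item></catalog>";
        for (Item item : parse(input)) {
            System.out.println(item.id + ": " + item.title);
        }
    }
}
```

The handler keeps only the item under construction and the finished list, so memory use depends on how much you extract rather than on the size of the input.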
#4
1
DOM, SAX and XSLT are different animals.
DOM parsing loads the entire document into memory, which would work for 100 KB to 5 MB (very small by today's standards).
SAX is a stream parser which reads the XML and delivers events to your code for each tag.
XSLT is a system for transforming one XML tree into another. Even if you wrote a transform that converts the input to a more suitable format, you'd still have to write something using DOM or SAX to convert it into Java objects.
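For comparison with the event-driven style, a DOM-based version might look like the sketch below: the whole document is parsed into a tree first, and the code then walks that tree. The element names (item, title) and the id attribute are again assumptions for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomExtract {
    // Loads the entire document into memory, then walks the tree for <item> elements.
    static List<String> titles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        NodeList items = doc.getElementsByTagName("item");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            NodeList titleNodes = item.getElementsByTagName("title");
            if (titleNodes.getLength() > 0) {
                result.add(item.getAttribute("id") + ": " + titleNodes.item(0).getTextContent());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String input = "<catalog><item id='a1'><title>First</title></item>"
                + "<item id='b2'><title>Second</title></item></catalog>";
        System.out.println(titles(input));
    }
}
```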
#5
1
You can use the @XmlPath extension in EclipseLink JAXB (MOXy) to easily handle this use case. For a detailed example see:
- http://bdoughan.blogspot.com/2010/09/xpath-based-mapping-geocode-example.html
Sample Code:
package blog.geocode;

import javax.xml.bind.annotation.XmlRootElement;
import javax.xml.bind.annotation.XmlType;
import org.eclipse.persistence.oxm.annotations.XmlPath;

@XmlRootElement(name="kml")
@XmlType(propOrder={"country", "state", "city", "street", "postalCode"})
public class Address {

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:Thoroughfare/ns:ThoroughfareName/text()")
    private String street;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:LocalityName/text()")
    private String city;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:AdministrativeAreaName/text()")
    private String state;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:CountryNameCode/text()")
    private String country;

    @XmlPath("Response/Placemark/ns:AddressDetails/ns:Country/ns:AdministrativeArea/ns:SubAdministrativeArea/ns:Locality/ns:PostalCode/ns:PostalCodeNumber/text()")
    private String postalCode;

}