This question already has an answer here:
- How to parse invalid (bad / not well-formed) XML? 4 answers
I have a process that parses an XML file with JDOM and XPath, as shown below:
private static SAXBuilder builder = null;
private static Document doc = null;
private static XPath xpathInstance = null;
builder = new SAXBuilder();
Text list = null;
try {
    doc = builder.build(new StringReader(xmldocument));
} catch (JDOMException e) {
    throw new Exception(e);
}
try {
    xpathInstance = XPath.newInstance("//book[author='Neal Stephenson']/title/text()");
    list = (Text) xpathInstance.selectSingleNode(doc);
} catch (JDOMException e) {
    throw new Exception(e);
}
The above works fine. The XPath expressions are stored in a properties file, so they can be changed at any time. Now I have to process more XML files that come from a legacy system, which only sends XML files in chunks of 4000 bytes. The existing processing reads the 4000-byte chunks and stores them in an Oracle database, one chunk per row. (Changing the legacy system, or the processing that stores the chunks as rows, is out of the question.)
I can build the complete, valid XML document by extracting all the rows that belong to a specific XML document and merging them, and then use the existing processing (shown above) to parse it.
The thing is, though, that the data I need to extract from the XML document will always be in the first 4000 bytes. This chunk is of course not a well-formed XML document, because it is incomplete, but it contains all the data I need. I can't parse just that one chunk, because the JDOM builder will reject it.
I am wondering whether I can parse the malformed XML chunk without having to merge all the parts (of which there can be quite a few) to get a well-formed XML document. This would save me several trips to the database to check whether a chunk is available, and I wouldn't have to merge hundreds of chunks just to be able to use the first 4000 bytes.
I know I could probably use Java's string functions to extract the relevant data, but is this possible using a parser or even XPath? Or do they both expect the XML document to be well-formed before they can parse it?
1 answer
#1
5
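You could try jsoup to parse the malformed XML. By definition, XML must be well-formed; strictly speaking, input that is not well-formed is not XML at all, which is why a conforming parser such as the one behind JDOM's SAXBuilder rejects it. jsoup is lenient and closes any tags that are left open.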
UPDATE - example:
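import org.jsoup.nodes.Attribute;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.parser.Parser;
import org.jsoup.parser.Tag;

public class JsoupFragmentExample {
    public static void main(String[] args) {
        // The fragment is deliberately left unterminated: <test> and <author> are never closed.
        // It is parsed leniently in the context of a dummy <p> element with an empty base URI.
        for (Node node : Parser.parseFragment("<test><author name=\"Vlad\"><book name=\"SO\"/>",
                new Element(Tag.valueOf("p"), ""),
                "")) {
            print(node, 0);
        }
    }

    // Prints the node name and its attributes, then recurses into the children,
    // indenting each nesting level by four more spaces.
    public static void print(Node node, int offset) {
        for (int i = 0; i < offset; i++) {
            System.out.print(" ");
        }
        System.out.print(node.nodeName());
        for (Attribute attribute : node.attributes()) {
            System.out.print(", ");
            System.out.print(attribute.getKey() + "=" + attribute.getValue());
        }
        System.out.println();
        for (Node child : node.childNodes()) {
            print(child, offset + 4);
        }
    }
}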
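If adding a library is not an option, a streaming parser from the JDK may also be worth a try. The sketch below is not a drop-in replacement for the XPath setup in the question. It uses StAX (`javax.xml.stream`), which reports events as it reads and only raises a well-formedness error when it actually reaches the broken part, so data that sits before the truncation point can still be read. The class name `TruncatedChunkReader` and the method `findTitleByAuthor` are hypothetical; the element names `book`, `author` and `title` are taken from the XPath expression in the question. As a simplification, it accepts any `title` seen after a `book` start tag and does not check that it is a direct child.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TruncatedChunkReader {

    /**
     * Returns the title of the first book that has an author equal to
     * wantedAuthor, or null if the chunk ends (or breaks) before one is found.
     */
    static String findTitleByAuthor(String chunk, String wantedAuthor) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The chunks come from another system, so DTDs and external entities are switched off.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        boolean authorMatched = false; // has the current book a matching author?
        String title = null;           // title of the current book, if seen yet
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(chunk));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String name = reader.getLocalName();
                if ("book".equals(name)) {
                    // A new book starts: forget what was collected for the previous one.
                    authorMatched = false;
                    title = null;
                } else if ("author".equals(name)) {
                    authorMatched |= wantedAuthor.equals(reader.getElementText());
                } else if ("title".equals(name)) {
                    title = reader.getElementText();
                }
                // Return as soon as both pieces are known, before the parser
                // gets anywhere near the truncated end of the chunk.
                if (authorMatched && title != null) {
                    return title;
                }
            }
        } catch (XMLStreamException e) {
            // The truncation point was reached before a match was found.
        }
        return null;
    }

    public static void main(String[] args) {
        // A chunk that is cut off in the middle of a tag, as a 4000-byte slice might be.
        String chunk = "<library><book><author>Neal Stephenson</author>"
                + "<title>Snow Crash</title></book><book><auth";
        System.out.println(findTitleByAuthor(chunk, "Neal Stephenson"));
    }
}
```

The trade-off is that the lookup is written in code rather than as an XPath expression, so expressions kept in a properties file cannot be reused as they are.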