Is there a way to remove the comments from a huge XML file (>200 MB) that is parsed by vtd-xml?
I need to remove both the comments before the root element:
<!-- comment -->
<rootElement>
.
.
.
</rootElement>
and the comments within it:
<rootElement>
<book>
<!-- comment -->
</book>
</rootElement>
The best solution would be one that uses XPath. I tried
//comment()
which works with DOM but not with vtd-xml.
Here is my code for selecting the comments:
String xPath = "//comment()";
XMLModifier xm = new XMLModifier();
VTDGen vg = new VTDGen();
if (vg.parseFile(fnIn, true)) {
    VTDNav vn = vg.getNav();
    xm.bind(vn);
    nodeXpath(xPath, vn);
}

private void nodeXpath(String xPath, VTDNav vn) throws Exception {
    int result;
    AutoPilot ap = new AutoPilot();
    ap.selectXPath(xPath);
    ap.bind(vn);
    while ((result = ap.evalXPath()) != -1) {
        int p = vn.getText();
        if (p != -1) {
            System.out.println(vn.getText() + ", " + vn.toString(p));
        }
    }
}
But nothing is printed to the screen.
Is there a way to do that with vtd-xml?
Thanks for your help.
1 Solution
#1
0
You mentioned that your code prints nothing to the screen... not even commas? I wouldn't necessarily expect it to print anything from getText(), since the documentation for getText() seems to indicate that it returns "the type character data or CDATA", which I don't think includes the content of a comment. (Thank you, @vtd-xml-author, for confirming that.)
A good test would be to print something in every iteration of your while loop, before the p = vn.getText() call, so you'll know whether it's finding the comments at all.
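A minimal sketch of that check, assuming a vtd-xml release whose XPath engine accepts the comment() node test. The class name CommentProbe, the helper countComments, and the in-memory sample document are hypothetical and only there to keep the example self-contained:

```java
import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;
import java.nio.charset.StandardCharsets;

class CommentProbe {
    // Hypothetical helper: logs every node matched by //comment() and returns how many there were.
    static int countComments(byte[] xml) throws Exception {
        VTDGen vg = new VTDGen();
        vg.setDoc(xml);
        vg.parse(true); // true = namespace-aware, as in the question
        VTDNav vn = vg.getNav();

        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("//comment()");

        int count = 0;
        int i;
        while ((i = ap.evalXPath()) != -1) {
            // Print unconditionally, before any getText() call, so every match is visible.
            System.out.println("token " + i + ", type " + vn.getTokenType(i) + ": " + vn.toString(i));
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<!-- top --><rootElement><book><!-- inner --></book></rootElement>";
        int n = countComments(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(n + " comment(s) matched");
    }
}
```

If this prints nothing inside the loop, the XPath itself is not matching in your version; if it prints the comments, the problem is only the getText() check.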
If it is finding the comments, I think you'll want to call xm.removeToken(result) on each one.
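Here is a rough sketch of how that removal could look, again assuming the //comment() expression matches in your vtd-xml version. CommentStripper and stripComments are hypothetical names, and the example works on an in-memory byte array rather than your 200 MB file:

```java
import com.ximpleware.AutoPilot;
import com.ximpleware.VTDGen;
import com.ximpleware.VTDNav;
import com.ximpleware.XMLModifier;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class CommentStripper {
    // Hypothetical helper: returns a copy of the document with every matched comment removed.
    static byte[] stripComments(byte[] xml) throws Exception {
        VTDGen vg = new VTDGen();
        vg.setDoc(xml);
        vg.parse(true);
        VTDNav vn = vg.getNav();

        XMLModifier xm = new XMLModifier(vn);
        AutoPilot ap = new AutoPilot(vn);
        ap.selectXPath("//comment()");

        int i;
        while ((i = ap.evalXPath()) != -1) {
            // i is the index of the comment token; removals are recorded here
            // and only applied when the modified document is written out.
            xm.removeToken(i);
        }

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        xm.output(out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String sample = "<rootElement><book><!-- comment -->text</book></rootElement>";
        byte[] cleaned = stripComments(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(cleaned, StandardCharsets.UTF_8));
    }
}
```

For a file on disk you would presumably keep vg.parseFile(fnIn, true) from your own code and write the result with xm.output(fnOut), where fnOut is a hypothetical output path. If the XPath turns out not to match, another option might be to loop over the token indexes from 0 to vn.getTokenCount() and call removeToken on those whose vn.getTokenType(i) equals VTDNav.TOKEN_COMMENT.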