I am parsing an XML document in UTF-8 encoding with Java using VTD-XML.
A small excerpt looks like:
<literal>????</literal>
<literal>????</literal>
<literal>????</literal>
I want to iterate through each literal and print it out to the console. However, what I get is:
¢
I am correctly navigating to each element. The way that I get the text value is by calling:
private static String toNormalizedString(String name, int val, final VTDNav vn) throws NavException {
    String strValue = null;
    if (val != -1) {
        strValue = vn.toNormalizedString(val);
    }
    return strValue;
}
I've also tried vn.getXPathStringVal(); however, it yields the same results.
I know that each of the literals above isn't just a string of length one. Rather, each seems to be a single Unicode "character" composed of two Java chars. I am able to correctly parse and output the kanji characters if their length is just one.
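To illustrate what "composed of two characters" means here, below is a minimal sketch using only the standard library. The code point U+20BB7 is an arbitrary example of a supplementary CJK ideograph, not one of the literals from my file, and the class name is made up for illustration.

```java
final class SurrogatePairDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+20BB7 lies above U+FFFF, so Java stores it as a surrogate pair.
        String s = new String(Character.toChars(0x20BB7));
        System.out.println(codeUnits(s));                           // 2
        System.out.println(codePoints(s));                          // 1
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // 20bb7
    }
}
```

So a length of two with a single code point is the expected in-memory form for such a character; the problem is that the string I get back from the parser does not contain a valid pair.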
My question is - how can I correctly parse and output these characters using VTD-XML? Is there a way to get the underlying bytes of the text between the literal tags so that I can parse the bytes myself?
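For what it's worth, if I could get a byte offset and length for the text token, decoding the slice myself would look roughly like the sketch below. I believe VTDNav exposes getTokenOffset(int), getTokenLength(int) and getXML().getBytes() for this, but that is an assumption to check against the VTD-XML Javadoc. The sketch does not call VTD-XML at all; the offset and length are computed by hand, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class Utf8SliceDemo {
    // Decode len bytes of doc, starting at offset, as UTF-8.
    static String decode(byte[] doc, int offset, int len) {
        return new String(doc, offset, len, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the raw document bytes; U+20BB7 is an example character.
        String ch = new String(Character.toChars(0x20BB7));
        byte[] doc = ("<literal>" + ch + "</literal>").getBytes(StandardCharsets.UTF_8);

        int offset = "<literal>".length(); // the tag is ASCII: one byte per char
        int len = 4;                       // a supplementary character is 4 bytes in UTF-8

        String text = decode(doc, offset, len);
        System.out.println(text.codePointAt(0) == 0x20BB7); // true
    }
}
```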
EDIT
Code to process each line of the XML - converting it to a byte array and then back to a String.
try (BufferedReader br = new BufferedReader(new FileReader("res/sample.xml"))) {
    String line;
    while ((line = br.readLine()) != null) {
        byte[] myBytes = null;
        try {
            myBytes = line.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
            System.exit(-1);
        }
        System.out.println(new String(myBytes));
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
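One caveat about the round trip above: FileReader and new String(byte[]) both fall back to the platform default charset, so the test only proves something if that default happens to be UTF-8. A sketch of the same round trip with the charset named at every step might look like this. The class and method names are made up, and whether the console can actually render the characters is a separate matter that the code does not control.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

final class Utf8RoundTrip {
    // Read the file as UTF-8 and round-trip each line through a byte array.
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
                // Decode with the same charset used to encode.
                lines.add(new String(bytes, StandardCharsets.UTF_8));
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Encode console output as UTF-8 instead of the platform default.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        for (String line : readLines(Paths.get("res/sample.xml"))) {
            out.println(line);
        }
    }
}
```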
1 solution
#1
You are probably trying to get a string containing characters at or above 0x10000, i.e. supplementary characters outside the BMP. That bug is known and is in the process of being addressed... I will notify you once the fix is out. This question may be identical to this one... Map supplementary Unicode characters to BMP (if possible)