For .doc files, there is a method to get each character in a paragraph:
CharacterRun charrun = paragraph.getCharacterRun(k++);
and I can then use those character runs to inspect their attributes, for example:
if (charrun.isBold()) System.out.print(charrun.text());
or something like that. With .docx files, however, there seems to be no character run method that reads the text piece by piece like that. I tried:
XWPFParagraph item = paragraph.get(i);
List<XWPFRun> charrun = item.getRuns();
I found that a character run in XWPF does not give you a single character; it returns strings of seemingly random length from the document:
XWPFRun temp = charrun.get(0);
System.out.println(temp.getText(0));
This code does not return the first character of the paragraph.
So how can I fix this?
1 Solution
#1
Assuming you want to iterate over all the (main) paragraphs in a Word document (excluding tables, headers and the like), then over the character runs in each paragraph, then over the text of each run one character at a time, you'd want to do something like:
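import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

XWPFDocument doc = new XWPFDocument(OPCPackage.open("myfile.docx"));
for (XWPFParagraph paragraph : doc.getParagraphs()) {
    int pos = 0; // character position within the current paragraph
    for (XWPFRun run : paragraph.getRuns()) {
        // Walk the run's text one character at a time
        for (char c : run.text().toCharArray()) {
            System.out.println("The character at " + pos + " is " + c);
            pos++;
        }
    }
}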
That will iterate over each character, with things like tabs and newlines represented as their character equivalents (elements such as w:tab will be converted).
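To tie this back to the isBold() check in the question, here is a minimal sketch of how per-character positions and per-run formatting fit together. It deliberately does not use POI: StyledRun is a hypothetical stand-in for a run, and with the real library its values would come from run.text() and run.isBold(). The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class BoldCharacters {
    // Hypothetical stand-in for a run: its text plus one formatting flag.
    // With POI these values would come from run.text() and run.isBold().
    static final class StyledRun {
        final String text;
        final boolean bold;

        StyledRun(String text, boolean bold) {
            this.text = text;
            this.bold = bold;
        }
    }

    // Returns the paragraph-relative position of every bold character.
    static List<Integer> boldPositions(List<StyledRun> runs) {
        List<Integer> positions = new ArrayList<>();
        int pos = 0;
        for (StyledRun run : runs) {
            // Formatting is per run, so every character in the run shares it.
            for (int i = 0; i < run.text.length(); i++) {
                if (run.bold) {
                    positions.add(pos);
                }
                pos++;
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        List<StyledRun> runs = List.of(
                new StyledRun("Hello ", false),
                new StyledRun("bold", true),
                new StyledRun(" world", false));
        System.out.println(boldPositions(runs)); // prints [6, 7, 8, 9]
    }
}
```

The position counter keeps running across runs, so the result is the same no matter how the paragraph's text happens to be split into runs.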
For HWPF, the way of getting the paragraphs, and of getting the runs from a paragraph, is similar but not identical, so there is no common interface for that part. XWPFRun and HWPF's CharacterRun do share a common interface, though, so the code that works on a run can be reused.
Note that all text in a given character run shares the same style and formatting information. Because of the strange ways that Word works, two adjacent runs may also share the same styles without Word having merged them.
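If the unmerged runs get in the way, one option is to coalesce adjacent runs with identical formatting before processing them. The following is only a sketch under stated assumptions: StyledRun is a hypothetical stand-in for a run (not a POI class), and it compares just two flags, bold and italic, whereas a real run carries many more properties that would need comparing.

```java
import java.util.ArrayList;
import java.util.List;

class RunMerger {
    // Hypothetical stand-in for a run: text plus two formatting flags.
    static final class StyledRun {
        final String text;
        final boolean bold;
        final boolean italic;

        StyledRun(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }

        boolean sameFormatting(StyledRun other) {
            return bold == other.bold && italic == other.italic;
        }
    }

    // Joins neighbouring runs whose formatting flags are identical.
    static List<StyledRun> mergeAdjacent(List<StyledRun> runs) {
        List<StyledRun> merged = new ArrayList<>();
        for (StyledRun run : runs) {
            int lastIndex = merged.size() - 1;
            if (lastIndex >= 0 && merged.get(lastIndex).sameFormatting(run)) {
                StyledRun last = merged.get(lastIndex);
                // Same formatting as the previous run: append the text to it.
                merged.set(lastIndex,
                        new StyledRun(last.text + run.text, last.bold, last.italic));
            } else {
                merged.add(run);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<StyledRun> runs = List.of(
                new StyledRun("Hel", true, false),
                new StyledRun("lo", true, false),
                new StyledRun(" world", false, false));
        for (StyledRun run : mergeAdjacent(runs)) {
            System.out.println(run.bold + ": \"" + run.text + "\"");
        }
        // prints:
        // true: "Hello"
        // false: " world"
    }
}
```

The merge builds a new list and leaves the input runs untouched, which keeps it safe to apply to data read from a document you do not intend to modify.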