I use Apache PDFBox to parse text from a PDF file. I am trying to get the line that comes after a specific line.
PDDocument document = PDDocument.load(new File("my.pdf"));
if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println("Text from pdf:" + text);
} else {
    log.info("File is encrypted!");
}
document.close();
Sample:
Sentence 1, nth line of file
Needed line
Sentence 3, n+2th line of file
I tried reading all the lines of the file into an array, but that approach is unreliable because I cannot filter for a specific text. The second solution has the same problem, which is why I am looking for a PDFBox-based solution.
Solution 1:
String[] lines = myString.split(System.getProperty("line.separator"));
Solution 2:
String neededline = (String) FileUtils.readLines(file).get("n+2th")
1 Answer
In fact, the source code of the PDFTextStripper class uses the exact same line ending as you do, so your first attempt is about as close to correct as you can get with PDFBox.
You see, the PDFTextStripper getText method calls the writeText method, which writes to an output buffer line by line with the writeString method, using the same line separator you have already tried to split on. The result returned from getText is the buffer's toString().
Therefore, given a well-formatted PDF, it seems the question you are really asking is how to filter an array for specific text. Here are some ideas:
First, capture the lines in an array, as you said.
import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {
    static String[] lines;

    public static void main(String[] args) throws Exception {
        PDDocument document = PDDocument.load(new File("my2.pdf"));
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);
        lines = text.split(System.getProperty("line.separator"));
        document.close();
    }
}
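Splitting on the platform line separator assumes the extracted text uses exactly that separator. As a slightly more forgiving sketch (assuming Java 8 or later, where the regex `\R` matches any line break), the split could be written as below. The class and method names are hypothetical, and the sample string only stands in for the text returned by stripper.getText(document).

```java
import java.util.Arrays;

final class LineSplitter {
    // Splits text on any line break (\n, \r\n, \r, ...),
    // not only on the platform default separator.
    static String[] splitLines(String text) {
        return text.split("\\R");
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by stripper.getText(document).
        String text = "Sentence 1\r\nNeeded line\nSentence 3";
        System.out.println(Arrays.toString(splitLines(text)));
        // prints: [Sentence 1, Needed line, Sentence 3]
    }
}
```

This keeps the same array-of-lines approach and only changes how the text is split.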
Here's a method that returns the complete line at any (zero-based) line index:
// returns the full line at index n
static String getLine(int n) {
    return lines[n];
}
Here's a linear search method that looks for a string match and returns the index of the first line where it is found, or -1 if there is no match.
// searches all lines for the first line index containing `filter`
static int getLineNumberWithFilter(String filter) {
    int n = 0;
    for (String line : lines) {
        if (line.indexOf(filter) != -1) {
            return n;
        }
        n++;
    }
    return -1;
}
With the above, you can print a line by its index:
System.out.println(getLine(8)); // line 8 for example
Or print the entire line that contains your matched search (check for -1 first if the text might be absent):
System.out.println(lines[getLineNumberWithFilter("Cat dog mouse")]);
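Since the question asks for the line after a specific line, the same linear search can return the following line instead of the matching one. This is only a sketch: the class name LineAfterMatch and the method lineAfter are hypothetical, and the sample array stands in for the lines split from the PDF text.

```java
import java.util.Optional;

final class LineAfterMatch {
    // Returns the line directly after the first line containing `filter`,
    // or Optional.empty() if nothing matches or the match is the last line.
    static Optional<String> lineAfter(String[] lines, String filter) {
        for (int i = 0; i < lines.length - 1; i++) {
            if (lines[i].contains(filter)) {
                return Optional.of(lines[i + 1]);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String[] lines = {
            "Sentence 1, nth line of file",
            "Needed line",
            "Sentence 3, n+2th line of file"
        };
        System.out.println(lineAfter(lines, "Sentence 1").orElse("(not found)"));
        // prints: Needed line
    }
}
```

Returning an Optional avoids the ArrayIndexOutOfBoundsException you would get from indexing the array with -1 when the text is not found.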
This is all pretty straightforward, but it works only under the assumption that the text can be split into lines by the line separator. If the solution is not as simple as the ideas above, I believe the source of your problem may not be your use of PDFBox but rather the PDF you are trying to text-mine.
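If the PDF itself is the culprit, one possibility (an assumption on my part, not something the question confirms) is that the extracted lines contain irregular spacing, so an exact substring match fails. A minimal sketch that collapses whitespace before comparing, with hypothetical names LooseMatch, normalize and containsLoosely:

```java
final class LooseMatch {
    // Collapses every run of whitespace into a single space and trims the ends.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // True if `line` contains `filter` once both are whitespace-normalized.
    static boolean containsLoosely(String line, String filter) {
        return normalize(line).contains(normalize(filter));
    }

    public static void main(String[] args) {
        // Extra spaces and a tab, as extracted text sometimes has.
        System.out.println(containsLoosely("Cat   dog\tmouse ", "Cat dog mouse"));
        // prints: true
    }
}
```

A check like this could replace the indexOf test in the search method if plain matching turns out to be too strict.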
Here's a link to a tutorial that also does what you are trying to do:
https://www.tutorialkart.com/pdfbox/extract-text-line-by-line-from-pdf/
Again, same approach...