
Time: 2022-09-13 10:23:48

I use Apache PDFBox to parse text from a PDF file. I am trying to get the line that follows a specific line.


PDDocument document = PDDocument.load(new File("my.pdf"));
if (!document.isEncrypted()) {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);
    System.out.println("Text from pdf:" + text);
} else {
    log.info("File is encrypted!");
}
document.close(); // release the file handle when done


Sentence 1, nth line of file


Needed line

Sentence 3, n+2th line of file


I tried reading all the lines of the file into an array, but that is unreliable because I cannot filter for a specific text. The second solution has the same problem, which is why I am looking for a PDFBox-based solution. Solution 1:


String[] lines = myString.split(System.getProperty("line.separator"));

Solution 2:

// n = index of the known line; UTF-8 is assumed here
String neededLine = FileUtils.readLines(file, StandardCharsets.UTF_8).get(n + 2);

1 Solution



In fact, the source code of the PDFTextStripper class uses the same line separator that you split on, System.getProperty("line.separator"), by default, so your first attempt is about as close to correct as you can get with PDFBox.


The PDFTextStripper getText method calls the writeText method, which writes to an output buffer line by line using writeString and ends each line with that same line separator. The value getText returns is simply buffer.toString().
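The buffer pattern described here can be imitated with the standard library alone. The sketch below is not PDFBox's code; `BufferRoundTrip`, `join`, and `splitLines` are hypothetical names that only illustrate why splitting on `line.separator` recovers lines that were written with it.

```java
import java.io.StringWriter;
import java.util.Arrays;
import java.util.regex.Pattern;

// Stdlib-only imitation of "write lines to a buffer, return buffer.toString()".
// This is an illustration, not PDFBox's actual implementation.
class BufferRoundTrip {

    static final String SEP = System.getProperty("line.separator");

    // Writes each line followed by the platform line separator.
    static String join(String... lines) {
        StringWriter buffer = new StringWriter();
        for (String line : lines) {
            buffer.write(line);
            buffer.write(SEP);
        }
        return buffer.toString();
    }

    // Splits on the same separator, so the original lines come back.
    // Pattern.quote treats the separator as literal text, not as a regex.
    static String[] splitLines(String text) {
        return text.split(Pattern.quote(SEP));
    }

    public static void main(String[] args) {
        String text = join("Sentence 1", "Needed line", "Sentence 3");
        System.out.println(Arrays.toString(splitLines(text)));
    }
}
```

The round trip only holds when the writer and the splitter agree on the separator, which is the situation in the question.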


Therefore, given a well-formatted PDF, the question you are really asking seems to be how to search an array of lines for specific text. Here are some ideas:


First, capture the lines in an array, as you described.


import java.io.File;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class Main {

    static String[] lines;

    public static void main(String[] args) throws Exception {
        // try-with-resources closes the document once the text is extracted
        try (PDDocument document = PDDocument.load(new File("my2.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(document);
            lines = text.split(System.getProperty("line.separator"));
        }
    }

    // the helper methods shown in the rest of this answer go here, inside Main
}

Here's a simple method that returns the complete line at a given index:


// returns the full line at index n (0-based)
static String getLine(int n) {
    return lines[n];
}

Here's a linear search method that finds a string match and returns the first line number where found.


// searches all lines for first line index containing `filter`
static int getLineNumberWithFilter(String filter) {
    int n = 0;
    for (String line : lines) {
        if (line.indexOf(filter) != -1) {
            return n;
        }
        n++; // advance the index alongside the loop
    }
    return -1; // no line contains `filter`
}

With the above, you can print a line by its index:


System.out.println(getLine(8)); // line 8 for example

Or print the entire line that contains your search text (check for -1 first, since the search returns -1 when nothing matches):


System.out.println(lines[getLineNumberWithFilter("Cat dog mouse")]);
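Since the original goal is the line that follows a known line, the search and the lookup can be combined. The following is a minimal sketch, assuming the extracted text has already been split into an array; `NextLineFinder` and `lineAfter` are hypothetical names, and skipping blank lines is an assumption based on the sample layout in the question.

```java
import java.util.Optional;

// Hypothetical helper: returns the first non-blank line after the first
// line that contains `marker`.
class NextLineFinder {

    static Optional<String> lineAfter(String[] lines, String marker) {
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].contains(marker)) {
                // Skip blank lines between the marker and the wanted line.
                for (int j = i + 1; j < lines.length; j++) {
                    if (!lines[j].trim().isEmpty()) {
                        return Optional.of(lines[j]);
                    }
                }
                return Optional.empty(); // nothing non-blank after the marker
            }
        }
        return Optional.empty(); // marker not found
    }

    public static void main(String[] args) {
        String[] lines = {"Sentence 1", "", "Needed line", "", "Sentence 3"};
        System.out.println(lineAfter(lines, "Sentence 1").orElse("<not found>"));
    }
}
```

Returning an `Optional` instead of an index avoids the array lookup with -1 when the search text does not occur.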

This is all fairly straightforward, but it works only if the extracted text can be split into lines by the line separator. If that is not enough, I believe the problem is probably not your use of PDFBox but the PDF you are trying to mine for text.
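If splitting on `line.separator` gives unstable results, one thing to try is making the split and the comparison more tolerant. This is a sketch under the assumption that the instability comes from mixed line endings or stray whitespace in the extracted text, which may not be the case for your file; `TolerantLines`, `splitAnyLineBreak`, and `normalize` are hypothetical names.

```java
import java.util.Arrays;

class TolerantLines {

    // \R matches any line break (\n, \r\n, \r and Unicode breaks); Java 8+.
    static String[] splitAnyLineBreak(String text) {
        return text.split("\\R");
    }

    // Trims the line and collapses runs of whitespace into one space.
    static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String text = "Sentence 1\r\nNeeded   line \nSentence 3";
        String[] lines = splitAnyLineBreak(text);
        for (int i = 0; i < lines.length; i++) {
            lines[i] = normalize(lines[i]);
        }
        System.out.println(Arrays.toString(lines));
    }
}
```

Normalizing both the extracted lines and the search text before comparing them makes an exact match less sensitive to spacing differences in the PDF.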


Here's a link to a tutorial that also does what you are trying to do:



Again, same approach...



