How do I read a PDF file using Java?

Date: 2021-05-23 22:21:50

I want to read some text data from a PDF file using Java. Please help me to do this.

Any help is appreciated.

4 answers

#1


56  

PDFBox is the best library I've found for this purpose. It's comprehensive and quite easy to use if you're just doing basic text extraction. Examples can be found here.

The page explains this, but one thing to watch out for is that the start and end page numbers passed to setStartPage() and setEndPage() are both inclusive. I skipped over that explanation the first time round, and it took me a while to realise why I was getting more than one page back with each call!
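
To make the inclusive-bounds point concrete, here is a minimal sketch that only computes the page ranges, so it runs without PDFBox on the classpath. The PageRanges class and its chunk() method are hypothetical names made up for illustration, not part of PDFBox. The sketch assumes 1-based page numbers, which is what PDFTextStripper uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of PDFBox): splits a document into
// 1-based, inclusive page ranges suitable for setStartPage()/setEndPage().
class PageRanges {

    // Returns {start, end} pairs; both bounds are inclusive.
    static List<int[]> chunk(int totalPages, int pagesPerChunk) {
        if (totalPages < 0 || pagesPerChunk < 1) {
            throw new IllegalArgumentException(
                    "totalPages must be >= 0 and pagesPerChunk must be >= 1");
        }
        List<int[]> ranges = new ArrayList<>();
        for (int start = 1; start <= totalPages; start += pagesPerChunk) {
            // Inclusive end: a chunk of n pages ends at start + n - 1, not start + n.
            int end = Math.min(start + pagesPerChunk - 1, totalPages);
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    public static void main(String[] args) {
        // With PDFBox, each pair would be applied roughly as:
        //   stripper.setStartPage(r[0]);
        //   stripper.setEndPage(r[1]);
        //   String text = stripper.getText(document);
        for (int[] r : chunk(5, 2)) {
            System.out.println(r[0] + "-" + r[1]);
        }
    }
}
```

For a five-page document read two pages at a time this prints 1-2, 3-4 and 5-5. To get exactly one page per call, the start and end values have to be equal, which is the detail that caught me out above.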

iText is another alternative that also works with C#, though I've personally never used it. It's more low-level than PDFBox, so less suited to the job if all you need is basic text extraction.

#2


16  

PDFBox contains tools for text extraction.

iText has more low-level support for text manipulation, but you'd have to write a considerable amount of code to get text extraction.

iText in Action contains a good overview of the limitations of text extraction from PDF, regardless of the library used (Section 18.2: Extracting and editing text), and a convincing explanation of why the library does not have text extraction support. In short, it's relatively easy to write code that handles simple cases, but it's basically impossible to extract text from PDF in general.

#3


12  

With Apache PDFBox it goes like this:

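import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// Assumes PDFBox 2.x (in 3.x, Loader.loadPDF(File) replaces PDDocument.load).
// try-with-resources closes the document even if extraction throws.
try (PDDocument document = PDDocument.load(new File("test.pdf"))) {
    if (!document.isEncrypted()) {
        PDFTextStripper stripper = new PDFTextStripper();
        String text = stripper.getText(document);
        System.out.println("Text:" + text);
    }
}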

#4


2  

Use a PDF library such as iText.
