What is a good way to extract text from a PDF using C# or classic ASP (VBScript)?

Is there a good library for extracting text from a PDF? I'm willing to pay for it if I have to.

Something that works with C# or classic ASP (VBScript) would be ideal and I also need to be able to separate the pages from the PDF.

This question had some interesting stuff, especially pdftotext but I'd like to avoid calling to an external command-line app if I can.

5 answers

#1

You can use the IFilter interface built into Windows to extract text and properties (author, title, etc.) from any supported file type. It's a COM interface, so you would have to use the .NET interop facilities.

You'd also have to download the free PDF IFilter driver from Adobe.

#2

Here is a good list: Open Source Libs for PDF/C#

Most of these are geared toward creating PDFs, but they should have read capability as well.

There is this one as well: iText

I have only played with iText before. Nothing major.
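
For readers who want to see what page-by-page extraction with iText might look like, here is a minimal sketch. It assumes the iTextSharp 5.x .NET port is referenced (that choice of version is mine, not the answer's; other iText versions use different namespaces), and the class name `PdfTextSketch` and the method `ExtractPages` are hypothetical names made up for illustration. Check the library's license terms before using it in a product.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text.pdf;          // PdfReader (iTextSharp 5.x, assumed)
using iTextSharp.text.pdf.parser;   // PdfTextExtractor

// Hypothetical helper class, not part of iText itself.
static class PdfTextSketch
{
    // Returns one string per page, so the pages stay separate.
    public static List<string> ExtractPages(string path)
    {
        var pages = new List<string>();
        var reader = new PdfReader(path);
        try
        {
            // iText page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                pages.Add(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            reader.Close();
        }
        return pages;
    }

    static void Main(string[] args)
    {
        if (args.Length != 1)
        {
            Console.Error.WriteLine("usage: PdfTextSketch <file.pdf>");
            return;
        }

        List<string> pages = ExtractPages(args[0]);
        for (int i = 0; i < pages.Count; i++)
        {
            Console.WriteLine("--- page {0} ---", i + 1);
            Console.WriteLine(pages[i]);
        }
    }
}
```

Because each page comes back as its own string, this covers the "separate the pages" part of the question at the text level; it does not split the PDF file itself, and scanned (image-only) PDFs will yield little or no text.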

#3

We've used Aspose with good results.

#4

Docotic.Pdf library can be used to extract formatted or plain text from PDF documents.

The library can read PDF documents of any version (up to the latest published standard). Extraction of pages is also supported by the library.

Links to sample code:

How to extract text from PDF

How to extract PDF pages

Disclaimer: I work for the vendor of the library.

#5

In addition to the accepted answer: there are also commercial alternatives to the Adobe IFilter for text indexing, which provide a similar API and offer additional premium functionality:

Foxit PDF IFilter: provides much faster text indexing compared to Adobe's plugin.

PDFLib PDF iFilter: includes support for damaged PDF documents, plus an additional API to run your own queries.

If you are looking for a single tool that can be used from both managed .NET apps and legacy programming languages like classic ASP or VB6, the commercial ByteScout PDF Extractor SDK would fit, as it provides both .NET and ActiveX/COM APIs.

Disclaimer: I work for ByteScout
