How do I extract text from PDF, Word and Excel documents?

Time: 2021-06-01 00:01:26

I need a .NET library that I can use to extract text data from PDF, Excel and Word files.

Ideally, a free tool!

Would you recommend any?

many thanks,

6 solutions

#1


28  

As someone who has spent many days looking for free solutions for (nearly) this exact problem, I can tell you fairly honestly that you will not find a free library that will be able to extract text from all of those formats well. The only library that I'm aware of that does a great job with all of those formats (and more) is a commercial library, and it's not actually native to .NET, it's a C++/COM library, with a C++/CLI .NET wrapper.

What are some options?

  • iTextSharp -- This one is absolutely fantastic at extracting text from PDFs. While earlier versions of this library were commercial friendly (LGPL), the authors have since decided that they want to charge for the software, so newer versions are released under the AGPL. Unless you want to release all of your source code, you probably don't want to use one of those versions. However, the last version licensed under the LGPL (4.1.6) can be found all over the internet. This SO question has a link to a version that is under the LGPL.

  • PdfBox -- Another PDF library. This one, IMO, is better because it's under the Apache 2.0 license. There are a few issues with it, as it sometimes (perhaps rarely) will not do as good a job as iTextSharp. I attribute this mostly to the fact that it's a newer library. However, my experience with this library is from months ago. This project is actively developed, and just in the last month, 52 issues have been resolved. I would keep my eye on this one. Please note this is a Java library. (Keep reading below for more information on why I've included this.)

  • POI or NPOI -- These are libraries specifically written for Microsoft Office documents, particularly the pre-2007 OLE binary file formats. They do support the newer OpenXML formats, though I'm not sure how mature that part of the library is. POI is the Java version (keep reading below for more information on why I've included this), whereas NPOI is a native .NET version. However, NPOI only supports Excel documents, whereas POI can do text extraction on many more types.

  • Open XML SDK 2.0 -- A library for reading/modifying Office 2007+ (unencrypted OpenXML) documents, created by Microsoft themselves! This is an amazing library for working with these kinds of documents. However, it is a lower-level library and therefore doesn't actually (as far as I know) have an "it does everything" text extraction class. There's a fairly good example of text extraction from a Word document at this SO answer (I'm not sure it covers certain cases like text in tables, etc.).

  • Tika -- Once again, another Java library (I'm not telling you about Java libraries for no reason. Keep on reading! :)), and this will be as close to "one library" for text extraction as you can get. Tika can extract metadata and structured text content from many different kinds of files, using existing parsing libraries. It actually uses POI and PdfBox under the hood for Office and PDF documents.
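To give a feel for what "lower-level" means for the OpenXML formats: a .docx file is a ZIP package whose main text lives in word/document.xml. The following is only a minimal sketch that uses nothing but the .NET base class library; it is not the Open XML SDK API and not the code from the linked SO answer. The class name DocxText and the method Extract are hypothetical names made up for this illustration. It assumes one output line per w:p paragraph, ignores tabs, line breaks, headers, footers and footnotes, and may repeat text that sits in text boxes nested inside a paragraph.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not part of the Open XML SDK.
public static class DocxText
{
    // Main WordprocessingML namespace used by word/document.xml.
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the document text, one line per <w:p> paragraph.
    public static string Extract(Stream docx)
    {
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found");

            using (Stream xml = entry.Open())
            {
                // PreserveWhitespace keeps runs that consist only of spaces.
                XDocument doc = XDocument.Load(xml, LoadOptions.PreserveWhitespace);

                // Paragraphs inside table cells are descendants too, so they are included.
                var paragraphs = doc.Descendants(W + "p")
                    .Select(p => string.Concat(p.Descendants(W + "t").Select(t => t.Value)));

                return string.Join(Environment.NewLine, paragraphs);
            }
        }
    }
}
```

Under those assumptions, usage would look like `string text = DocxText.Extract(File.OpenRead("report.docx"));`, where report.docx is a placeholder file name. For anything beyond plain body text, one of the libraries listed above is likely the better choice.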

Commercial

  • dtSearch -- This is a library I'm very familiar with. It does a fantastic job, and can parse a ridiculous number of file formats. However, it costs money and is probably overkill for what you need. It's actually exactly what we need, but we're trying to get rid of it ourselves, because we only use it for parsing (it's actually a full-text search engine), and there are plenty of parsing libraries out there that we can use or modify to suit our needs. Still, it honestly blows all these other libraries out of the water. As I mentioned before, it is also not native .NET code. A C++/CLI wrapper is used to interop between the DLL and the .NET runtime.

iFilters can be used, and are mentioned in several other SO answers on different questions, but the text you will get back is unstructured. Sometimes it's just bad...unreadable for humans, at least. I believe that iFilters are also deprecated, and depending on license issues, you might not be able to redistribute them.

Why did I mention all of those Java libraries? Well, for two reasons. First, there are no free .NET equivalents that come close to the quality of these Java libraries. Secondly, you can use these libraries in .NET using IKVM (I've personally done this with these libraries, so I can at least vouch for that). It's an implementation of Java inside of .NET. Here is a good example on using IKVM to convert Tika into a .NET assembly that can be used in your project. Perhaps the scariest thing about IKVM is that it just works!

EDIT: I forgot that the author of that blog had actually posted the code and converted libraries on a github project. So, if you want to quickly check it out, you can do so there. However, it's a much older version of Tika and over a year old. If the results aren't as you expected, I would suggest trying it yourself with the latest version.

#2


6  

You can take a look at toxy.codeplex.com. Toxy is a pure .NET text extraction framework.

It's very simple to use Toxy. For example, to extract an Excel spreadsheet file called test.xlsx:

ParserContext context = new ParserContext("test.xlsx");
ISpreadsheetParser parser = ParserFactory.CreateSpreadsheet(context);
ToxySpreadsheet ss = parser.Parse();
// then you can start handling the result - a ToxySpreadsheet object
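
For comparison, here is roughly what a spreadsheet parser has to do underneath for .xlsx files. An .xlsx file is a ZIP package, and most string cell values are stored once in xl/sharedStrings.xml. The sketch below uses only the .NET base class library; it is not Toxy's implementation, and the names XlsxText and SharedStrings are hypothetical. It assumes you only want the distinct strings, not their cell positions, and it skips numbers, dates and inline strings, which live in the sheet XML instead. Phonetic hint text, if present, would be included as well.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration only; not part of Toxy or NPOI.
public static class XlsxText
{
    // Main SpreadsheetML namespace used by xl/sharedStrings.xml.
    private static readonly XNamespace S =
        "http://schemas.openxmlformats.org/spreadsheetml/2006/main";

    // Returns every entry of the shared string table, in file order.
    public static List<string> SharedStrings(Stream xlsx)
    {
        using (var zip = new ZipArchive(xlsx, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("xl/sharedStrings.xml");
            if (entry == null)
                return new List<string>(); // no shared string table in this workbook

            using (Stream xml = entry.Open())
            {
                XDocument doc = XDocument.Load(xml, LoadOptions.PreserveWhitespace);

                // An <si> item holds either a single <t> or several rich-text runs <r><t>.
                return doc.Descendants(S + "si")
                    .Select(si => string.Concat(si.Descendants(S + "t").Select(t => t.Value)))
                    .ToList();
            }
        }
    }
}
```

If you need values per sheet, row and column, a parser such as the one above that returns a structured spreadsheet object is the more practical route.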

#3


2  

Here's a link on extracting text from a Word document:

How to extract text from MS office documents in C#

For PDF I would use PDFsharp. It is open source and has some good examples on its website:

http://pdfsharp.com/PDFsharp/

#4


1  

For extracting text from PDF, iTextSharp is awesome. It is free and open source.

Reading text from a PDF is very easy with this library.

#5


1  

I would recommend Aspose Total for this. A few years ago I did a project doing pretty much exactly what you are asking, and compared to using Office Interop across different versions of Office (prior to the change to XML), Aspose was the most robust library. You will probably have to do some OCR based on what you are talking about, too. It's not cheap, but I found their APIs pretty solid, and it works on most versions of the file types you are asking about. You should be able to use the free trial to see if it will fit your project. I have no affiliation with Aspose other than that I used their tools in a production environment.

Aspose Total

#6


0  

If you just need text, then you can use iFilter. It is not a single product, but it is free. iFilter is used to extract text to support Microsoft Indexing Service. Search for "iFilter .NET C#" for examples of how to use it. If you need formatted text, then it is not the right tool: it extracts raw text only, with a lot of line breaks.
