Looking for a PDF file parser

Time: 2022-10-29 13:42:26

Does anyone know of a PDF file parser that I could use to pull sections of text out of a plaintext PDF file? Specifically, I want a way to reliably pull out the text that belongs to annotations.

Delphi, C#, or RegEx: I don't mind.

6 solutions

#1


5  

The PDF File Parser article on xactpro seems to be exactly what you need. It explains the format of the PDF and comes with full source code for a parser (and another project for visualisation of the model).

The parser uses format-specific terms, but you could easily use the visualiser to learn what to look for.
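As a rough idea of "what to look for": in an uncompressed PDF, an annotation is usually a dictionary inside an indirect object, and its text sits in a `/Contents` literal string. The C# sketch below is only an illustration under several assumptions, not the xactpro parser: the class and method names are made up, the file is assumed to have no compressed object streams, the writer is assumed to emit the optional `/Type /Annot` entry, and the string is assumed to be a simple literal without nested unescaped parentheses (hex strings and UTF-16 text are not handled).

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of any library mentioned in this thread.
public static class AnnotationSketch
{
    // One indirect object: "12 0 obj ... endobj".
    private static readonly Regex ObjectPattern =
        new Regex(@"\d+\s+\d+\s+obj(.*?)endobj", RegexOptions.Singleline);

    // Marks the object as an annotation dictionary. The /Type key is optional
    // in PDF, so this check is an assumption about the producing software.
    private static readonly Regex AnnotPattern = new Regex(@"/Type\s*/Annot\b");

    // A literal string after /Contents; allows backslash escapes,
    // but not nested unescaped parentheses.
    private static readonly Regex ContentsPattern =
        new Regex(@"/Contents\s*\(((?:\\.|[^\\()])*)\)", RegexOptions.Singleline);

    public static List<string> ExtractAnnotationContents(string pdfText)
    {
        var results = new List<string>();
        foreach (Match obj in ObjectPattern.Matches(pdfText))
        {
            string body = obj.Groups[1].Value;
            if (!AnnotPattern.IsMatch(body))
                continue; // not an annotation object (for example a page)

            Match contents = ContentsPattern.Match(body);
            if (contents.Success)
                results.Add(contents.Groups[1].Value); // escape sequences are left undecoded
        }
        return results;
    }

    public static void Main()
    {
        // Hand-written sample resembling two objects from an uncompressed PDF.
        string sample =
            "5 0 obj\n<< /Type /Annot /Subtype /Text /Rect [10 10 30 30] /Contents (Check this figure) >>\nendobj\n" +
            "6 0 obj\n<< /Type /Page /Contents 7 0 R >>\nendobj\n";

        foreach (string text in ExtractAnnotationContents(sample))
            Console.WriteLine(text); // prints "Check this figure"
    }
}
```

The page object in the sample also has a `/Contents` key, which is why the sketch first checks that the object is an annotation. For real-world files a proper parser is still the safer route, since a regex cannot cope with compressed streams or nested structures.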

#2


2  

You can also take a look at Xpdf (http://www.foolabs.com/xpdf/download.html)

#3


1  

I'm not sure whether it supports the functionality you need, but we've been using ABCpdf with some success.

#4


1  

Check out PDFBox.

#5


1  

ABCpdf does let you extract annotations, and there is a very good section on it in the help. The code to handle it generally looks like this:

    for (int objectIndex = 0; objectIndex < theDoc.ObjectSoup.Count; objectIndex++)
    {
        try
        {
            IndirectObject element = theDoc.ObjectSoup.ElementAt(objectIndex);

            string elementType = element.GetType().ToString();
            switch (elementType)
            {
                case "WebSupergoo.ABCpdf8.Objects.Annotation":
                    // process the annotation, which could be all kinds of stuff
                    WebSupergoo.ABCpdf8.Objects.Annotation annotation = (WebSupergoo.ABCpdf8.Objects.Annotation)element;

                    ProcessAnnotation(annotation);

...

#6


0  

I don't know all the features of these PDF parsers, but Aspose has a pretty comprehensive one. We did, unfortunately, come across two bugs, and I've been waiting a long time for them to be fixed.

iTextSharp seems to be the most common open source PDF parser for .NET.
