How to programmatically search a PDF document in C#

Time: 2021-07-04 23:11:30

I need to search a PDF file to see whether a certain string is present. The string in question is definitely encoded as text (i.e., it is not an image or anything like that). I have tried searching the file as though it were plain text, but this does not work.

Is it possible to do this? Are there any libraries for .NET 2.0 that will extract/decode all the text out of a PDF file for me?

3 solutions

#1


12  

There are a few libraries available out there. Check out http://www.codeproject.com/KB/cs/PDFToText.aspx and http://itextsharp.sourceforge.net/

It takes a little bit of effort but it's possible.

#2


2  

You can use Docotic.Pdf library to search for text in PDF files.

Here is some sample code:

static void searchForText(string path, string text)
{
    using (PdfDocument pdf = new PdfDocument(path))
    {
        for (int i = 0; i < pdf.Pages.Count; i++)
        {
            string pageText = pdf.Pages[i].GetText();
            int index = pageText.IndexOf(text, 0, StringComparison.CurrentCultureIgnoreCase);
            if (index != -1)
                // i is a zero-based index, so report the page number as i + 1.
                Console.WriteLine("'{0}' found on page {1}", text, i + 1);
        }
    }
}

The library can also extract formatted and plain text from the whole document or any document page.
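
Whichever library extracts the text, the matching step itself does not depend on it. As a minimal sketch, assuming the text of each page has already been extracted into a list (the `PageSearch` class and `FindPages` method are hypothetical names, not part of any PDF library), the per-page search could be isolated like this:

```csharp
using System;
using System.Collections.Generic;

public static class PageSearch
{
    // Returns the 1-based numbers of the pages whose text contains the term,
    // ignoring case. Assumption: pageTexts[0] holds the text of page 1.
    public static List<int> FindPages(IList<string> pageTexts, string term)
    {
        List<int> hits = new List<int>();
        if (string.IsNullOrEmpty(term))
            return hits; // nothing to search for

        for (int i = 0; i < pageTexts.Count; i++)
        {
            string pageText = pageTexts[i] ?? string.Empty;
            if (pageText.IndexOf(term, StringComparison.OrdinalIgnoreCase) >= 0)
                hits.Add(i + 1); // convert zero-based index to page number
        }
        return hits;
    }
}
```

Keeping the search separate from the extraction makes it easy to try different libraries, and it sticks to language features available to older compilers. Note that a phrase spanning a page break would not be found by a per-page search like this one.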

Disclaimer: I work for Bit Miracle, vendor of the library.

#3


1  

In the vast majority of cases, it's not possible to search the contents of a PDF directly by opening it up in Notepad. Even in the minority of cases where it is (depending on how the PDF was constructed), you'll only ever be able to search for individual words, because of the way PDF handles text internally.
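
To illustrate why a raw search fails: page content is usually stored in compressed streams, and even when it is uncompressed, a text-showing operator such as `TJ` can split a phrase into fragments separated by spacing adjustments. The sketch below is only a toy under stated assumptions: the fragment is a made-up, uncompressed example, `ContentStreamDemo` and `JoinLiteralStrings` are hypothetical names, and the parser ignores hex strings, font encodings, and most escape sequences. It is not a substitute for a real PDF library.

```csharp
using System;
using System.Text;

public static class ContentStreamDemo
{
    // Concatenates the literal strings "(...)" found in an uncompressed
    // content-stream fragment, dropping everything between them.
    public static string JoinLiteralStrings(string content)
    {
        StringBuilder result = new StringBuilder();
        int depth = 0; // nesting level of parentheses
        for (int i = 0; i < content.Length; i++)
        {
            char c = content[i];
            if (c == '\\' && depth > 0 && i + 1 < content.Length)
            {
                // Keep the escaped character as-is (covers \( \) and \\ only).
                result.Append(content[++i]);
            }
            else if (c == '(')
            {
                if (depth > 0) result.Append(c); // nested parenthesis is text
                depth++;
            }
            else if (c == ')' && depth > 0)
            {
                depth--;
                if (depth > 0) result.Append(c);
            }
            else if (depth > 0)
            {
                result.Append(c);
            }
        }
        return result.ToString();
    }
}
```

For the fragment `[(Hel) -15 (lo W) 20 (orld)] TJ`, a plain `Contains("Hello World")` on the raw fragment returns false, while `JoinLiteralStrings` returns `Hello World`. Real extractors also have to handle compression, fonts, and glyph positioning, which is why a dedicated library is the practical route.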

My company has a commercial solution that will let you extract text from a PDF file. I've included some sample code for you below, as shown on this page, that demonstrates how to search through the text from a PDF file for a particular string.

using System;
using System.IO;
using QuickPDFDLL0718;

namespace QPLConsoleApp
{
    public class QPL
    {
        public static void Main()
        {
            // This example uses the DLL edition of Quick PDF Library
            // Create an instance of the class and give it the path to the DLL
            PDFLibrary QP = new PDFLibrary("QuickPDFDLL0718.dll");

            // Check if the DLL was loaded successfully
            if (QP.LibraryLoaded())
            {
                // Insert license key here / Check the license key
                if (QP.UnlockKey("...") == 1)
                {
                    QP.LoadFromFile(@"C:\Program Files\Quick PDF Library\DLL\GettingStarted.pdf");

                    int iPageCount = QP.PageCount();
                    int PageNumber = 1;
                    int MatchesFound = 0;

                    while (PageNumber <= iPageCount)
                    {
                        QP.SelectPage(PageNumber);
                        string PageText = QP.GetPageText(3);

                        using (StreamWriter TempFile = new StreamWriter(QP.GetTempPath() + "temp" + PageNumber + ".txt"))
                        {
                            TempFile.Write(PageText);
                        }

                        string[] lines = File.ReadAllLines(QP.GetTempPath() + "temp" + PageNumber + ".txt");
                        string[][] grid = new string[lines.Length][];

                        for (int i = 0; i < lines.Length; i++)
                        {
                            grid[i] = lines[i].Split(',');
                        }

                        foreach (string[] line in grid)
                        {
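                            // Skip rows with fewer than 12 fields to avoid an out-of-range index.
                            if (line.Length <= 11) continue;
                            string FindMatch = line[11];

                            // Update this string to the word that you're searching for.
                            // It can be one or more words (e.g. "sunday" or "last sunday").

                            if (FindMatch.Contains("characters"))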
                            {
                                Console.WriteLine("Success! Word match found on page: " + PageNumber);
                                MatchesFound++;
                            }
                        }
                        PageNumber++;
                    }

                    if (MatchesFound == 0)
                    {
                        Console.WriteLine("Sorry! No matches found.");
                    }
                    else
                    {
                        Console.WriteLine();
                        Console.WriteLine("Total: " + MatchesFound + " matches found!");
                    }
                    Console.ReadLine();
                }
            }
        }
    }
}
