How can I parse content from a PDF page with Swift?

The documentation is not really clear to me. So far I reckon I need to set up a CGPDFOperatorTable and then, for each PDF page, create a content stream with CGPDFContentStreamCreateWithPage and a scanner with CGPDFScannerCreate.

The documentation refers to setting up callbacks, but it's unclear to me how to do that. How do I actually obtain the content from a page?

This is my code so far.

    let pdfURL = NSBundle.mainBundle().URLForResource("titleofdocument", withExtension: "pdf")

    // Create pdf document
    let pdfDoc = CGPDFDocumentCreateWithURL(pdfURL)

    // Number of pages in this PDF
    let numberOfPages = CGPDFDocumentGetNumberOfPages(pdfDoc) as Int

    if numberOfPages <= 0 {
        // The number of pages is zero
        return
    }

    let myTable = CGPDFOperatorTableCreate()

    // Let's go through every page
    for pageNr in 1...numberOfPages {

        let thisPage = CGPDFDocumentGetPage(pdfDoc, pageNr)
        let myContentStream = CGPDFContentStreamCreateWithPage(thisPage)
        let myScanner = CGPDFScannerCreate(myContentStream, myTable, nil)

        CGPDFScannerScan(myScanner)

        // Search for Content here?
        // ??

        CGPDFScannerRelease(myScanner)
        CGPDFContentStreamRelease(myContentStream)

    }

    // Release Table
    CGPDFOperatorTableRelease(myTable)

This is similar to the question "PDF Parsing with SWIFT", which has no answers yet.

3 Solutions

#1

Here is an example of the callbacks implemented in Swift:

    let operatorTableRef = CGPDFOperatorTableCreate()

    CGPDFOperatorTableSetCallback(operatorTableRef, "BT") { (scanner, info) in
        print("Begin text object")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "ET") { (scanner, info) in
        print("End text object")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "Tf") { (scanner, info) in
        print("Select font")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "Tj") { (scanner, info) in
        print("Show text")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef, "TJ") { (scanner, info) in
        print("Show text, allowing individual glyph positioning")
    }

    let numPages = CGPDFDocumentGetNumberOfPages(pdfDocument)
    for pageNum in 1...numPages {
        let page = CGPDFDocumentGetPage(pdfDocument, pageNum)
        let stream = CGPDFContentStreamCreateWithPage(page)
        let scanner = CGPDFScannerCreate(stream, operatorTableRef, nil)
        CGPDFScannerScan(scanner)
        CGPDFScannerRelease(scanner)
        CGPDFContentStreamRelease(stream)
    }
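
The callbacks above only print a label. To get at the actual text, each callback has to pop its operands off the scanner. Here is a rough sketch of that for Tj and TJ, assuming Swift 3 or later (so the syntax is newer than in the snippet above); makeTextPrintingTable is a made-up helper name. Note that CGPDFStringCopyTextString only gives readable text when the font uses a standard encoding; for fonts with a custom encoding you would still have to map the bytes through the font's own tables.

```swift
import Foundation
import CoreGraphics

// Hypothetical helper: builds an operator table whose callbacks print the
// string operands of the two text-showing operators.
func makeTextPrintingTable() -> CGPDFOperatorTableRef? {
    guard let table = CGPDFOperatorTableCreate() else { return nil }

    // Tj takes a single string operand.
    CGPDFOperatorTableSetCallback(table, "Tj") { scanner, _ in
        var pdfString: CGPDFStringRef?
        guard CGPDFScannerPopString(scanner, &pdfString),
              let text = CGPDFStringCopyTextString(pdfString) else { return }
        print(text as String)
    }

    // TJ takes an array that mixes strings with numeric spacing adjustments.
    CGPDFOperatorTableSetCallback(table, "TJ") { scanner, _ in
        var array: CGPDFArrayRef?
        guard CGPDFScannerPopArray(scanner, &array), let items = array else { return }
        var line = ""
        for index in 0..<CGPDFArrayGetCount(items) {
            var pdfString: CGPDFStringRef?
            // Entries that are not strings (the numeric adjustments) are skipped.
            if CGPDFArrayGetString(items, index, &pdfString),
               let text = CGPDFStringCopyTextString(pdfString) {
                line += text as String
            }
        }
        print(line)
    }
    return table
}
```

The returned table can be passed to CGPDFScannerCreate exactly like operatorTableRef in the loop above, and should be released with CGPDFOperatorTableRelease once all pages have been scanned.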

#2

You've actually described exactly how to do it; all you need to do is put the pieces together and experiment until it works.

First of all, you need to set up a table with callbacks, as you say yourself at the beginning of your question (all code here is in Objective-C, NOT Swift):

    CGPDFOperatorTableRef operatorTable = CGPDFOperatorTableCreate();
    CGPDFOperatorTableSetCallback(operatorTable, "q", &op_q);
    CGPDFOperatorTableSetCallback(operatorTable, "Q", &op_Q);

This table lists the PDF operators you want to be notified about and associates a callback with each of them. The callbacks are simply functions you define elsewhere:

    static void op_q(CGPDFScannerRef s, void *info) {
        // Do whatever you have to do in here
        // info is whatever you passed to CGPDFScannerCreate
    }

    static void op_Q(CGPDFScannerRef s, void *info) {
        // Do whatever you have to do in here
        // info is whatever you passed to CGPDFScannerCreate
    }

Then you create the scanner, passing it the content stream and the operator table you just defined, and start it:

    // Passing "self" is just an example; whatever you pass here is handed
    // to your callbacks as "info" each time the scanner calls them.
    // (Under ARC the Objective-C pointer needs a __bridge cast to void *.)
    CGPDFScannerRef contentStreamScanner = CGPDFScannerCreate(contentStream, operatorTable, (__bridge void *)self);

    CGPDFScannerScan(contentStreamScanner);
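
Since the question asks about Swift, here is roughly how the same info-pointer pattern might look there, assuming Swift 3 or later. The callbacks are C function pointers, so they cannot capture Swift variables; any state has to travel through info. GraphicsStateTracker and measureGraphicsStateNesting are hypothetical names used only for this sketch, which counts how deeply q/Q pairs nest on a page.

```swift
import CoreGraphics

// Hypothetical state object handed to the callbacks through `info`.
final class GraphicsStateTracker {
    var depth = 0
    var maxDepth = 0
}

// Hypothetical helper: scans one page and returns the deepest q/Q nesting seen.
func measureGraphicsStateNesting(on page: CGPDFPage) -> Int {
    let tracker = GraphicsStateTracker()
    guard let table = CGPDFOperatorTableCreate() else { return 0 }
    defer { CGPDFOperatorTableRelease(table) }

    // q saves the graphics state.
    CGPDFOperatorTableSetCallback(table, "q") { _, info in
        guard let info = info else { return }
        let state = Unmanaged<GraphicsStateTracker>.fromOpaque(info).takeUnretainedValue()
        state.depth += 1
        state.maxDepth = max(state.maxDepth, state.depth)
    }
    // Q restores the previously saved graphics state.
    CGPDFOperatorTableSetCallback(table, "Q") { _, info in
        guard let info = info else { return }
        let state = Unmanaged<GraphicsStateTracker>.fromOpaque(info).takeUnretainedValue()
        state.depth -= 1
    }

    let stream = CGPDFContentStreamCreateWithPage(page)
    defer { CGPDFContentStreamRelease(stream) }

    // The tracker is passed unretained; it stays alive until the function returns.
    let scanner = CGPDFScannerCreate(stream, table, Unmanaged.passUnretained(tracker).toOpaque())
    defer { CGPDFScannerRelease(scanner) }

    CGPDFScannerScan(scanner)
    return tracker.maxDepth
}
```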

If you want to see a complete example, with source code, of how to find and process images, check this website.

#3

To understand why a parser works this way, you need to read the PDF specification a bit more closely. A PDF file contains something close to printing instructions, such as "move to this coordinate, print this character, move there, change the color, print character number 23 from font #23", etc.

The parser gives you a callback for each instruction, with the possibility to retrieve the instruction's parameters. That's all.

So, in order to get the content from a file, you need to rebuild its state manually. That means recomputing the frames of all characters and trying to reverse-engineer the page layout. This is clearly not an easy task, and that's why people have created libraries to do it.
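
To give a feel for what "rebuilding the state" involves, here is a deliberately tiny sketch in plain Swift. It is not how any particular library does it: TextOp, TextRun, collectRuns and readingOrder are made-up names, only three operators (BT, Td, Tj) are modeled, and glyph widths, fonts and the text matrix are ignored entirely. It tracks the current text position, records where each string was shown, and then sorts the runs top to bottom and left to right.

```swift
// Hypothetical, simplified model of three text operators.
enum TextOp {
    case beginText                          // BT: reset the text position
    case moveText(dx: Double, dy: Double)   // Td: offset from the start of the current line
    case showText(String)                   // Tj: show a string at the current position
}

// A string together with the position it was shown at.
struct TextRun: Equatable {
    let x: Double
    let y: Double
    let text: String
}

// Replays the operators and records the position of every shown string.
func collectRuns(_ ops: [TextOp]) -> [TextRun] {
    var x = 0.0
    var y = 0.0
    var runs: [TextRun] = []
    for op in ops {
        switch op {
        case .beginText:
            x = 0
            y = 0
        case let .moveText(dx, dy):
            x += dx
            y += dy
        case let .showText(string):
            runs.append(TextRun(x: x, y: y, text: string))
        }
    }
    return runs
}

// PDF coordinates start at the bottom left, so a larger y is higher on the page.
// Ties on both coordinates fall back to the original drawing order.
func readingOrder(_ runs: [TextRun]) -> [String] {
    let ordered = runs.enumerated().sorted { a, b in
        if a.element.y != b.element.y { return a.element.y > b.element.y }
        if a.element.x != b.element.x { return a.element.x < b.element.x }
        return a.offset < b.offset
    }
    return ordered.map { $0.element.text }
}

// The strings are drawn out of order, as often happens in real files.
let ops: [TextOp] = [
    .beginText, .moveText(dx: 72, dy: 700), .showText("World"),
    .beginText, .moveText(dx: 20, dy: 700), .showText("Hello"),
    .beginText, .moveText(dx: 20, dy: 680), .showText("Second line"),
]
let recovered = readingOrder(collectRuns(ops))
// recovered == ["Hello", "World", "Second line"]
```

A real implementation also has to handle the remaining positioning operators, font metrics and character encodings, which is where most of the work is.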

You may want to have a look at PDFKitten, or at PDFParser, which is a Swift port with some improvements that I made.
