Converting a PDF into a series of images with Python

Date: 2022-05-11 09:00:53

I'm attempting to use Python to convert a multi-page PDF into a series of JPEGs. I can split the PDF into individual pages easily enough with available tools, but I haven't been able to find anything that can convert PDFs to images.

PIL does not work, as it can't read PDFs. The two options I've found are using either Ghostscript or ImageMagick through the shell. Neither is viable for me, since this program needs to be cross-platform and I can't be sure either program will be available on the machines where it will be installed and used.

Are there any Python libraries out there that can do this?


5 solutions

#1


18  

ImageMagick has Python bindings.
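
As a rough sketch of what that could look like, the snippet below assumes the Wand package (one of several Python bindings for ImageMagick; the answer does not name a specific one) and a working ImageMagick install. ImageMagick itself typically hands PDF reading off to Ghostscript, so this does not remove that dependency. The function names and the output naming scheme are made up for illustration.

```python
from pathlib import Path


def page_filename(output_dir, stem, index):
    """Build the output path for a zero-based page index, e.g. report-001.jpg."""
    return Path(output_dir) / f"{stem}-{index + 1:03d}.jpg"


def pdf_to_jpegs(pdf_path, output_dir, dpi=144):
    """Render each page of pdf_path to a JPEG and return the written paths."""
    # Imported here so the module still loads where Wand is not installed.
    from wand.image import Image

    stem = Path(pdf_path).stem
    written = []
    # resolution sets the density (DPI) used when the PDF is rasterized.
    with Image(filename=str(pdf_path), resolution=dpi) as pdf:
        for index, page in enumerate(pdf.sequence):
            # Copy the single page into its own image so it can be saved alone.
            with Image(image=page) as single:
                single.format = "jpeg"
                target = page_filename(output_dir, stem, index)
                single.save(filename=str(target))
                written.append(target)
    return written
```

Calling pdf_to_jpegs("report.pdf", "out") would be expected to write out/report-001.jpg, out/report-002.jpg and so on, provided the out directory already exists.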


#2


4  

You can't avoid the Ghostscript dependency. Even ImageMagick relies on Ghostscript for its PDF reading functions. The reason is the complexity of the PDF format: a PDF doesn't just contain bitmap information, but mostly vector shapes, transparencies, etc. Furthermore, it is quite complex to figure out which of these objects appear on which page.

So the correct rendering of a PDF page is clearly out of scope for a pure Python library.

The good news is that Ghostscript is pre-installed on many Windows and Linux systems, because it is also needed by all those PDF printers (except Adobe Acrobat).

#3


4  

Here's what worked for me using the Python ghostscript module (installed with '$ pip install ghostscript'):

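import ghostscript

def pdf2jpeg(pdf_input_path, jpeg_output_path):
    # The arguments mirror the Ghostscript command line.
    # For a multi-page PDF, put a page placeholder such as %03d in
    # jpeg_output_path so that each page gets its own file.
    args = ["pdf2jpeg",       # program name slot; actual value doesn't matter
            "-dNOPAUSE",      # don't pause for input between pages
            "-dBATCH",        # exit once the input has been processed
            "-sDEVICE=jpeg",  # render with the JPEG output device
            "-r144",          # resolution in DPI
            "-sOutputFile=" + jpeg_output_path,
            pdf_input_path]
    ghostscript.Ghostscript(*args)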

I also installed Ghostscript 9.18 on my computer and it probably wouldn't have worked otherwise.
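
Since the question is about multi-page PDFs, here is a hedged variation on the function above that writes one JPEG per page. It relies on Ghostscript expanding a printf-style placeholder such as %03d in the output file name into the page number. The helper names build_gs_args and pdf_to_jpeg_pages are made up for this sketch, and some versions of the Python package may expect the arguments as byte strings rather than str.

```python
def build_gs_args(pdf_input_path, output_pattern, dpi=144):
    """Return Ghostscript arguments that write one JPEG per page."""
    if "%" not in output_pattern:
        # Without a placeholder every page would be written to the same file.
        raise ValueError("output_pattern needs a page placeholder such as %03d")
    return [
        "pdf2jpeg",        # program name slot; the value is ignored
        "-dNOPAUSE",       # don't pause for input between pages
        "-dBATCH",         # exit once the input has been processed
        "-sDEVICE=jpeg",   # render with the JPEG output device
        f"-r{dpi}",        # resolution in DPI
        "-sOutputFile=" + output_pattern,
        pdf_input_path,
    ]


def pdf_to_jpeg_pages(pdf_input_path, output_pattern, dpi=144):
    """Run Ghostscript through the Python module with the arguments above."""
    # Imported here so build_gs_args can be used without the module installed.
    import ghostscript

    ghostscript.Ghostscript(*build_gs_args(pdf_input_path, output_pattern, dpi))
```

For example, pdf_to_jpeg_pages("input.pdf", "page-%03d.jpg") should produce page-001.jpg, page-002.jpg and so on, assuming Ghostscript itself is installed.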


#4


1  

If you're using Linux, some distributions come with a command-line utility called 'pdftoppm' out of the box (it ships with the Poppler/Xpdf utilities). Check out netpbm for working with the resulting PPM files.
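
A minimal sketch of driving such a tool from Python, assuming the utility meant here is Poppler's pdftoppm and that it is on PATH. The function names are hypothetical. The -jpeg and -r options ask pdftoppm for JPEG output at the given DPI, and the check with shutil.which addresses the concern that the tool may not exist on every machine.

```python
import shutil
import subprocess


def build_pdftoppm_command(pdf_path, output_prefix, dpi=144):
    """Return the pdftoppm command line for JPEG output at the given DPI."""
    return ["pdftoppm", "-jpeg", "-r", str(dpi), pdf_path, output_prefix]


def pdf_to_jpegs_with_pdftoppm(pdf_path, output_prefix, dpi=144):
    """Convert every page of pdf_path, failing early if pdftoppm is missing."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm was not found on PATH")
    # check=True raises CalledProcessError if pdftoppm exits with an error.
    subprocess.run(build_pdftoppm_command(pdf_path, output_prefix, dpi), check=True)
```

pdftoppm writes numbered files whose names start with output_prefix, one per page.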

#5


1  

Perhaps relevant: http://www.swftools.org/gfx_tutorial.html

