Extracting vector images from a PDF file

Time: 2021-04-25 00:10:49

Is there a command-line tool on Linux that extracts figures from a PDF file and saves them in a vector format? I know about pdfimages, but it produces bitmaps, which is not what I need.

3 answers

#1


15  

This does not extract images only, as you seem to need, but

  • pdftocairo

http://poppler.freedesktop.org/

http://www.manpagez.com/man/1/pdftocairo/ (manpage)

is able to render a PDF page to other vector formats such as PS, EPS, and SVG.

Assuming you have a PDF page that contains vector images, you can render the page to SVG and then copy out only the image you are interested in.

Note: pdftocairo cannot render a multipage PDF to a multipage SVG.

To convert several PDF pages to SVG, first pick the page range, then burst that range into single-page PDF files.

Example (converting pages 1-10 of a PDF file to SVG):

pdftk file.pdf cat 1-10 output 1-10.pdf

pdftk 1-10.pdf burst

for f in pg_*.pdf; do pdftocairo -svg "$f" "${f%.pdf}.svg"; done

pdftk burst names its output pg_0001.pdf, pg_0002.pdf, and so on; matching pg_*.pdf rather than *.pdf keeps the loop from also converting file.pdf and 1-10.pdf.

Finally, with sodipodi or inkscape, you can extract the images you are interested in from the SVG rendering of each page.
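The steps above can be collected into a small script. This is only a sketch: the function names svg_name and pages_to_svg are made up for illustration, and it assumes that pdftk and pdftocairo are installed and that pdftk burst writes pg_0001.pdf, pg_0002.pdf, ... into the directory it is run from.

```shell
#!/bin/sh
# svg_name: derive the SVG file name for a single-page PDF (hypothetical helper).
svg_name() {
    printf '%s\n' "${1%.pdf}.svg"
}

# pages_to_svg FILE FIRST LAST: hypothetical wrapper around the commands above.
# Writes one SVG per page into the current directory.
pages_to_svg() {
    src=$1
    first=$2
    last=$3
    workdir=$(mktemp -d) || return 1
    # Select the page range, then split it into one PDF per page.
    pdftk "$src" cat "$first-$last" output "$workdir/range.pdf" || return 1
    (cd "$workdir" && pdftk range.pdf burst) || return 1
    # Render each single-page PDF to SVG.
    for f in "$workdir"/pg_*.pdf; do
        out=$(svg_name "$(basename "$f")")
        pdftocairo -svg "$f" "$out" || return 1
    done
    rm -rf "$workdir"
}
```

After sourcing the script, pages_to_svg file.pdf 1 10 would leave pg_0001.svg through pg_0010.svg in the current directory. Doing the burst in a temporary directory keeps the intermediate PDFs away from the original file.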

#2


3  

What do you consider a "figure"? That concept doesn't exist in PDF. So many tools can extract images from a PDF file because images are a very clearly identified entity.

Your "figures", however, are much less clearly defined. PDF files may contain lots of vector content that you wouldn't call a figure. Text can be stroked, for example, which would make it vector art, and as such it might be confused with your figures. Other decorative elements may be used in the background of the pages. Text may be underlined, which would be a vector element...

In the other direction, your "figure" may contain a caption that is text, further complicating things.

As PDF doesn't have the notion of a figure, you'll have to figure out how to isolate one on a PDF page (perhaps because the creator application always adds metadata to them, or because they use a special color, or...). If you can isolate them, it should be possible to trim everything irrelevant on the page and export what you need as EPS or SVG using some of the techniques described in the other answer.
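If a figure can be isolated as a rectangle on the page, one possible way to do the trimming is with the crop options of pdftocairo (-x, -y, -W, -H, which for vector output are given in points, measured from the top left corner, plus -f and -l to select the page). The sketch below is an assumption-laden illustration, not a tested recipe: crop_flags and crop_figure are made-up names, the coordinates must be integers, and the page size of the resulting SVG may still need adjusting, so check the man page of your pdftocairo version.

```shell
#!/bin/sh
# crop_flags LEFT TOP RIGHT BOTTOM: turn a bounding box (integer points,
# origin at the top left of the page) into pdftocairo crop options.
# Hypothetical helper.
crop_flags() {
    printf -- '-x %s -y %s -W %s -H %s\n' "$1" "$2" "$(( $3 - $1 ))" "$(( $4 - $2 ))"
}

# crop_figure FILE PAGE LEFT TOP RIGHT BOTTOM OUT: render only the given
# rectangle of one page to SVG. Hypothetical wrapper.
crop_figure() {
    # Word splitting of the crop_flags output is intentional here;
    # it only ever contains option names and integers.
    pdftocairo -svg -f "$2" -l "$2" $(crop_flags "$3" "$4" "$5" "$6") "$1" "$7"
}
```

For example, crop_figure file.pdf 3 72 100 300 250 figure.svg would ask for a 228 by 150 point region of page 3. Anything that overlaps the rectangle (captions, background decoration) still ends up in the output, so this does not remove the need to identify the figure first.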

#3


2  

This article describes the tools gpdfx, inkscape and pdf2svg, which are not completely command-line-based but still sound helpful.
