What are the best parameters for running ImageMagick to convert low-quality PDFs to images (for OCR)?

Time: 2021-09-19 08:57:03

I have several low-quality PDFs. I would like to use OCR (Ocropus, to be precise) to get the text out of them. To do so, I first use ImageMagick, a command-line tool, to convert these PDFs into JPG or PNG images.

However, ImageMagick produces very low-quality images, and Ocropus hardly recognizes anything. I would like to know the best parameters for handling low-quality PDFs, so that the OCR step gets images of the best possible quality.

I have found this page, but I do not know where to start.

3 solutions

#1


14  

You can learn about the detailed settings of ImageMagick's "delegates" (external programs IM uses, such as Ghostscript) by typing

convert -list delegate

(On my system, that's a list of 32 different commands.) Now, to see which commands are used to convert to PNG, use this:

convert -list delegate | findstr /i png

OK, that was for Windows. You didn't say which OS you use. [*] If you are on Linux, try this:

convert -list delegate | grep -i png

You'll discover that IM produces PNG only from PS or EPS input. So how does IM get (E)PS from your PDF? Easy:

convert -list delegate | findstr /i PDF
convert -list delegate | grep -i PDF

Ah! It uses Ghostscript to make a PDF => PS conversion, then uses Ghostscript again to make a PS => PNG conversion. That works, but it isn't the most efficient way, given that Ghostscript can do PDF => PNG in one go, faster and in much better quality.

About IM's handling of PDF conversion to images via the Ghostscript delegate, you should know two things first and foremost:

  1. By default, if you don't give an extra parameter, Ghostscript will output images at a resolution of 72 dpi. That's why Karl's answer suggested adding -density 600, which tells Ghostscript to use a 600 dpi resolution for its image output.
  2. IM's detour of calling Ghostscript twice, first for PDF => PS and then for PS => PNG, is a real blunder: in the first step you never gain quality and hardly ever keep it, but very often lose some. Reasons:
    • PDF can handle transparencies, which PostScript cannot.
    • PDF can embed TrueType fonts, which PostScript cannot. And so on. (Conversion in the direction PS => PDF is not that critical.)
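
To get a feel for what the resolution setting means in pixels, here is a small sketch (the helper name px_at_dpi is made up for illustration). PDF page sizes are measured in points, 72 to the inch, so a US Letter page of 612 x 792 points comes out at 612 x 792 pixels at 72 dpi, which is very little for OCR, and at 5100 x 6600 pixels at 600 dpi:

```shell
# Hypothetical helper: pixels along one edge for a length given in PDF points
# (1 point = 1/72 inch) when rendered at the given dpi. Integer arithmetic only.
px_at_dpi() {
  points=$1
  dpi=$2
  echo $(( points * dpi / 72 ))
}

# US Letter is 612 x 792 points
echo "72 dpi:  $(px_at_dpi 612 72) x $(px_at_dpi 792 72)"
echo "600 dpi: $(px_at_dpi 612 600) x $(px_at_dpi 792 600)"
```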

That's why I'd suggest you convert your PDFs to PNG (or JPEG) in one go, using Ghostscript directly. And use the most recent version of Ghostscript (8.71 at the time of writing; 9.01 is due to be released soon)! Here are example commands:

gswin32c.exe ^
  -sDEVICE=pngalpha ^
  -o output/page_%03d.png ^
  -r600 ^
  d:/path/to/your/input.pdf

(This is the command line for Windows. On Linux, use gs instead of gswin32c.exe, and \ instead of ^.) This command expects to find an output subdirectory, where it will store a separate file for each PDF page. To produce JPEGs of good quality, try:

gs \
  -sDEVICE=jpeg \
  -o output/page_%03d.jpeg \
  -r600 \
  -dJPEGQ=95 \
  /path/to/your/input.pdf

(This is the Linux version of the command.) This direct conversion avoids the intermediate PostScript format, which may lose the TrueType font and transparency information that was in the original PDF file.
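
If you do this often, a small wrapper can take care of the output directory for you. This is only a sketch, assuming a POSIX shell and gs on the PATH; the function name pdf_to_pages and its arguments are made up here. It uses Ghostscript's pnggray device (8-bit grayscale) instead of pngalpha, on the assumption that an OCR engine has no use for colour or a transparent background. I have not tested that choice with Ocropus, so swap the device back if it does not help.

```shell
# Hypothetical wrapper: render every page of a PDF to PNG with Ghostscript.
# usage: pdf_to_pages input.pdf [outdir] [dpi]
pdf_to_pages() {
  pdf=$1
  outdir=${2:-output}
  dpi=${3:-600}
  # The gs command expects the output directory to exist already.
  mkdir -p "$outdir" || return 1
  # -o sets the output file pattern; %03d becomes the page number.
  gs -q \
     -sDEVICE=pnggray \
     -r"$dpi" \
     -o "$outdir/page_%03d.png" \
     "$pdf"
}
```

For example, pdf_to_pages scan.pdf pages 300 would write pages/page_001.png, pages/page_002.png and so on at 300 dpi.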


[*] D'oh! I failed to notice your "linux" tag at first...

#2


5  

-density 600 or so should give you what you need.
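
Spelled out as a full command, that could look like the sketch below. The wrapper name im_pdf_to_png and the output file pattern are made up for illustration. Note that -density has to come before the input file, so that it applies while the PDF is being read.

```shell
# Hypothetical wrapper around ImageMagick's convert.
# usage: im_pdf_to_png input.pdf [dpi]
im_pdf_to_png() {
  pdf=$1
  dpi=${2:-600}
  # -density is a read setting: it must precede the PDF to take effect.
  # %03d numbers the output files, one per page.
  convert -density "$dpi" "$pdf" "page_%03d.png"
}
```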

#3


0  

At least two other tools you may want to consider:

  • pdfimages, which comes with the package poppler-utils, makes it easy to extract the images from a PDF without degrading them.
  • pdfsandwich, which can give you an OCR'd file by simply running pdfsandwich inputfile.pdf. You may need to tweak the options to get a decent result. See the official page for more info.
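
For the pdfimages route, a sketch could look like this. It assumes a poppler-utils version whose pdfimages supports the -png option (older versions only write PPM/PBM or JPEG files); the function name extract_scans and the default prefix are made up.

```shell
# Hypothetical wrapper: dump the images embedded in a PDF as PNG files.
# usage: extract_scans input.pdf [prefix]
extract_scans() {
  pdf=$1
  prefix=${2:-scan}
  # Writes prefix-000.png, prefix-001.png, ... into the current directory.
  pdfimages -png "$pdf" "$prefix"
}
```

Running pdfimages -list input.pdf first shows which images the file contains, which helps to decide whether extracting them is worthwhile at all.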
