Tesseract in R fails to recognize smaller fonts within the same document

Time: 2023-01-23 08:59:11

I am a beginner in R and have been asked to write code that converts text from images into a txt file. I am struggling with the Tesseract and Magick packages.

Unfortunately, I cannot upload the original document because it is confidential, but I have done my best to replicate it in the attached dummy image. The original is structured similarly to the attached example.

The document contains one line set in a very small font. The code I am running reads most of the text correctly, but it does not read the text that is much smaller (around 6 to 6.5 pt or less in MS Word).

This is a huge problem because the most vital piece of information is in that smaller text, and not being able to read it makes the whole conversion exercise pretty much useless.

I have tried two different versions of the code, and each comes with its own set of challenges:

Version 1 -->

library(tesseract)  # provides ocr()

# Run OCR directly on the original image
text5 <- ocr("D:/abc/dummy.PNG")
cat(text5)
write.table(text5, "D:/abc/Outputs/dummy.txt", sep = "\t")

Problem with Version 1 --> The output is generated in a few seconds and the regular text is just about perfect, but the text read from the smaller font is not at all acceptable.

Version 2 -->

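library(magick)     # image_read(), image_resize(), image_ocr(), ...
library(tesseract)  # OCR engine used by image_ocr()

test2 <- image_read("D:/abc/dummy.PNG") %>%
  image_resize("3000") %>%                 # resize to a width of 3000 px
  image_convert(colorspace = "gray") %>%   # convert to grayscale
  image_trim() %>%                         # trim the uniform border
  image_ocr()                              # run OCR on the processed image
cat(test2)
write.table(test2, "D:/abc/Outputs/dummy.txt", sep = "\t")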

Problem with Version 2 --> The output is slightly better, but there is still a lot of room for improvement.

I have looked at multiple resources, such as source1 and source2, and I suspect the problem has something to do with the low DPI of that particular line, but I am not sure how to address it. I might be totally wrong, so feel free to correct me.
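One way to probe the low-DPI suspicion is to OCR the same image at several enlargement factors and compare the results. The following is only a sketch under assumptions: `compare_scales` is a hypothetical helper name, and the scale factors are arbitrary starting points, not recommended values. Tesseract's own documentation suggests it tends to work best at roughly 300 DPI, so small text may simply need more pixels per character.

```r
library(magick)
library(tesseract)

# Hypothetical helper: OCR the same image at several enlargement factors.
# Returns a named character vector, one OCR result per scale factor.
compare_scales <- function(path, scales = c("100%", "200%", "300%", "400%")) {
  img <- image_convert(image_read(path), colorspace = "gray")
  vapply(
    scales,
    function(s) image_ocr(image_resize(img, s)),  # enlarge, then OCR
    character(1)
  )
}

# Usage with the path from the question; skipped if the file is absent.
input_path <- "D:/abc/dummy.PNG"
if (file.exists(input_path)) {
  results <- compare_scales(input_path)
  for (s in names(results)) {
    cat("---- scale", s, "----\n", results[[s]], "\n")
  }
}
```

If the small line only becomes legible at the larger factors, that would support the idea that resolution, rather than the OCR engine itself, is the limiting factor.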

I am hoping to get some help from this forum.

1 solution

#1



Does the document have the same format every time, or does it change?

If it is the same, you can crop the region you're struggling with, then enlarge it little by little while applying morphology operations, such as opening, at each step. Each time you make the image larger, unwanted white pixels appear in between your letters; the morphology step fills them in again with black pixels.

http://www.fmwconcepts.com/imagemagick/morphology/index.php
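
The crop, gradual resize, and opening steps described above could be sketched in R with the magick package roughly as follows. This is a minimal sketch, not a tuned solution: `ocr_small_region` is a hypothetical helper name, and the crop geometry, the number of steps, the 150% step size, and the `Disk:1` kernel are all assumptions that would need adjusting to the real document.

```r
library(magick)
library(tesseract)

# Hypothetical helper: crop one region, enlarge it in small steps,
# apply a morphological opening after each step, then run OCR.
ocr_small_region <- function(path, region, steps = 3, factor = "150%") {
  img <- image_read(path)
  img <- image_crop(img, region)                  # keep only the small-font line
  img <- image_convert(img, colorspace = "gray")
  for (i in seq_len(steps)) {
    img <- image_resize(img, factor)              # enlarge a little
    img <- image_morphology(img, method = "Open", # remove small white specks
                            kernel = "Disk:1")
  }
  image_ocr(img)
}

# Usage with the path from the question; the region below is a made-up
# "widthxheight+x+y" geometry and must be replaced with the real location.
input_path <- "D:/abc/dummy.PNG"
if (file.exists(input_path)) {
  small_text <- ocr_small_region(input_path, region = "1200x80+0+400")
  cat(small_text)
}
```

Cropping first keeps the repeated resizing cheap and lets the kernel and step size be tuned for the small line alone, while the rest of the page can still be read with the plain `ocr()` call.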

Edit: Added new comments.
