Inaccurate text when getting OCR data from a PNG image in Qt C++

Time: 2022-12-05 19:30:15

I am using the Tesseract OCR C++ library in Qt to get the text from a PNG image, using this code:

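const char* lang = "eng";
QString filename = "D:/image.png";
// Convert the Qt string once and reuse it below.
const std::string path = filename.toStdString();

tesseract::TessBaseAPI tess;
// Init returns 0 on success. With NULL as the data path, Tesseract falls back
// to its default lookup (for example the TESSDATA_PREFIX environment variable).
if (tess.Init(NULL, lang, tesseract::OEM_DEFAULT) != 0)
{
    std::cout << "Cannot initialize Tesseract for language " << lang << std::endl;
    return;
}
tess.SetPageSegMode(tesseract::PSM_AUTO);

// Check that the image can be opened before handing it to Tesseract.
FILE* fin = fopen(path.c_str(), "rb");
if (fin == NULL)
{
    std::cout << "Cannot open " << path << std::endl;
    return;
}
fclose(fin);

STRING text;
if (tess.ProcessPages(path.c_str(), NULL, 0, &text))
{
    // Show the result in the Qt plain-text widget.
    ui->plainTextEdit->setPlainText(QString::fromUtf8(text.string()));
}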

But the result is not accurate enough, especially for the data in the table, and it gives me strange characters. When I use an online OCR website to convert the same image to text, the result is 100% accurate. So what makes Tesseract give me this wrong text? Is it a problem with the library, or with my code? Or is there a better free library I can use to get more accurate results?

I got the image from a PDF. I used Ghostscript to extract the image at good quality, so the OCR library should be able to give me the correct data.

2 Answers

#1


0  

I am not experienced with C++, but I think your problem most likely relates to the line below:

tess.Init(NULL, lang, tesseract::OEM_DEFAULT);

It must point to the tessdata folder. Instead of NULL you can pass the folder name, for example "C:/tessdata/". Again, I am not experienced with C++, so you will have to decide whether to use a slash "/" or a backslash "\" (inside a C++ string literal a backslash has to be written as "\\"). This folder should contain the language file(s).
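
To make the point about the folder concrete, here is a minimal sketch in plain C++ that checks whether a folder contains the language file before that folder is passed to Init. The helper names trainedDataPath and hasLanguageData are hypothetical and not part of Tesseract; the only Tesseract-specific assumption is the "<lang>.traineddata" file naming. Whether Init expects the tessdata folder itself or its parent directory has differed between Tesseract versions, so treat the directory argument as something to verify for your version.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: builds "<dir>/<lang>.traineddata",
// adding a separator only when dir does not already end with one.
std::string trainedDataPath(const std::string& dir, const std::string& lang)
{
    assert(!lang.empty());
    std::string path = dir;
    if (!path.empty() && path.back() != '/' && path.back() != '\\')
        path += '/';
    return path + lang + ".traineddata";
}

// Hypothetical helper: true if the language file can be opened for reading.
bool hasLanguageData(const std::string& dir, const std::string& lang)
{
    std::ifstream file(trainedDataPath(dir, lang).c_str(), std::ios::binary);
    return file.good();
}
```

If hasLanguageData("C:/tessdata/", "eng") returns false, Init is unlikely to succeed with that folder, which is easier to diagnose than garbled output later on.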

#2


0  

As Eddge mentioned in his comment, you should apply some image preprocessing; there are a bunch of scripts for ImageMagick. And of course OpenCV will help a lot with this as well.
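
As an illustration of what such preprocessing can look like, here is a minimal sketch of one common step, binarization with Otsu's threshold, written in plain C++ rather than with ImageMagick or OpenCV. It assumes the image has already been decoded into an 8-bit grayscale buffer; the function names otsuThreshold and binarize are made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the threshold t (0..255) that maximizes the between-class variance
// when pixels <= t form one class and pixels > t form the other.
int otsuThreshold(const std::vector<std::uint8_t>& gray)
{
    std::array<double, 256> hist{};
    for (std::uint8_t v : gray) hist[v] += 1.0;

    const double total = static_cast<double>(gray.size());
    double sumAll = 0.0;
    for (int i = 0; i < 256; ++i) sumAll += i * hist[i];

    double weightLow = 0.0, sumLow = 0.0, bestVar = -1.0;
    int best = 0;
    for (int t = 0; t < 256; ++t) {
        weightLow += hist[t];
        sumLow += t * hist[t];
        if (weightLow == 0.0) continue;       // no pixels in the low class yet
        const double weightHigh = total - weightLow;
        if (weightHigh == 0.0) break;         // all pixels are in the low class
        const double meanLow = sumLow / weightLow;
        const double meanHigh = (sumAll - sumLow) / weightHigh;
        const double diff = meanLow - meanHigh;
        const double var = weightLow * weightHigh * diff * diff;
        if (var > bestVar) { bestVar = var; best = t; }
    }
    return best;
}

// Maps every pixel above the threshold to white (255) and the rest to black (0).
std::vector<std::uint8_t> binarize(const std::vector<std::uint8_t>& gray, int threshold)
{
    assert(threshold >= 0 && threshold <= 255);
    std::vector<std::uint8_t> out(gray.size());
    for (std::size_t i = 0; i < gray.size(); ++i)
        out[i] = gray[i] > threshold ? 255 : 0;
    return out;
}
```

In practice a library routine does the same job with more options (OpenCV, for instance, offers Otsu thresholding), so this is only meant to show the idea.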

The next point could be the PSM (page segmentation mode), although the default should satisfy your need to extract whole-page information.

Also, the result of the online OCR is not 100% accurate, as you mentioned.

There is "1 S Days" instead of "15 Days"
There is "Mail: finance(a)" instead of "E Mail: finance@"
There is "TiA THE GREEN HOL1 5" instead of "T/A THE GREEN HOU 5"

etc.
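
One way to put a number on such mistakes, rather than judging by eye, is a character error rate based on edit distance. The sketch below is plain C++ and not part of Tesseract; the function names editDistance and charErrorRate are made up for this example, and it compares bytes, so non-ASCII text would need proper UTF-8 handling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance: the minimum number of single-character insertions,
// deletions and substitutions needed to turn a into b.
std::size_t editDistance(const std::string& a, const std::string& b)
{
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({subst, prev[j] + 1, cur[j - 1] + 1});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Edit distance divided by the length of the expected text.
double charErrorRate(const std::string& ocr, const std::string& expected)
{
    assert(!expected.empty());
    return static_cast<double>(editDistance(ocr, expected)) / expected.size();
}
```

For example, editDistance("1 S Days", "15 Days") is 2: one deleted space and one substitution of "S" by "5".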

Which Tesseract version are you using? I highly recommend using 3.05. (4.0 shows much better results, but it is not officially released yet.)

Also the following link could help you with your results: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality

P.S. I hope you are allowed to share such financial documents publicly ;)
