Simple CAPTCHA recognition and training with tesseract-ocr

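For work, I needed to try out a CAPTCHA recognition approach.

This post only covers simple CAPTCHAs; for complex ones you will have to experiment yourself.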

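1. Preprocessing the image

The image quality may be poor. In that case, preprocess the image first: convert it to grayscale, binarize it, and remove noise. Crop the image if necessary.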
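The three preprocessing steps can be sketched in plain Python as follows. This is only an illustration under assumptions that are not in the original workflow: the image is represented as nested lists of (r, g, b) tuples, the threshold of 128 is arbitrary, and "denoising" is reduced to deleting isolated black pixels. The function names are made up for this example; a real project would more likely use an imaging library.

```python
def to_gray(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of 0-255 gray levels."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row]
            for row in rgb_rows]


def binarize(gray_rows, threshold=128):
    """Map each gray level to 0 (ink) or 255 (background)."""
    return [[0 if v < threshold else 255 for v in row] for row in gray_rows]


def remove_isolated(bin_rows):
    """Turn a black pixel white when none of its 8 neighbours is black."""
    h = len(bin_rows)
    w = len(bin_rows[0]) if h else 0
    out = [row[:] for row in bin_rows]  # work on a copy
    for y in range(h):
        for x in range(w):
            if bin_rows[y][x] != 0:
                continue
            neighbours = [
                bin_rows[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            ]
            if 0 not in neighbours:
                out[y][x] = 255  # lone speck: treat it as noise
    return out
```

Chaining the calls as remove_isolated(binarize(to_gray(pixels))) gives a black-and-white image in which single-pixel specks are gone while strokes of two or more connected pixels survive.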

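2. Recognizing the image

Preparation:

Install Tesseract. I used the Windows build, version 3.02; on Linux, install it yourself.

You also need jTessBoxEditor. It is written in Java, so a JDK is required.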

2.1 Recognition without training, using the language data that ships with Tesseract (eng)

tesseract image_name output_base -l language -psm pagesegmode configfile

For example: tesseract xx.jpg res -l eng -psm 7
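If you want to call the same command from a script, one possible wrapper looks like this. It is a sketch, not part of the original workflow: the helper names build_tesseract_cmd and recognize are invented here, and it assumes the tesseract executable is on the PATH and uses the 3.02-style -psm option.

```python
import subprocess


def build_tesseract_cmd(image, output_base, lang="eng", psm=7, configs=()):
    """Return the argument list for a Tesseract 3.02-style command line."""
    if not 0 <= psm <= 10:
        raise ValueError("psm must be between 0 and 10")
    # -l and -psm have to come before any config file names.
    return ["tesseract", image, output_base,
            "-l", lang, "-psm", str(psm), *configs]


def recognize(image, output_base, **kwargs):
    """Run Tesseract and return the text it wrote to <output_base>.txt."""
    subprocess.run(build_tesseract_cmd(image, output_base, **kwargs), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read().strip()
```

Calling recognize("xx.jpg", "res", psm=7) runs the example command above and reads the result back from res.txt.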

The original image is below; its noise was not completely removed.

[image: original CAPTCHA]

[image: recognition result]

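With a different command: tesseract xx.jpg res -l eng -psm 5

[image: recognition result]

The difference between modes 7 and 5 is explained at the end of this post.

The results are clearly not very good and the recognition rate is low, although that also depends on the image quality.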

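Option 3:

With training: train a custom language from a large number of images and generate the language data file from the training output.

The steps are as follows: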

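01.

Merge the samples into a tif file

Pick the sample files, for example 30 jpg files (the more the better), and merge them into one tif file with jTessBoxEditor.

Name it mylang.myfont.exp0.tif

In my case mylang.myfont.exp0.tif already existed because I had generated it before.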

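02.

Generate the box file from mylang.myfont.exp0.tif (command line). The box file records the position of each character in each image. For the meaning of 5, see the end of this post.

tesseract mylang.myfont.exp0.tif mylang.myfont.exp0 -l eng -psm 5 batch.nochop makebox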

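03.

Edit the box file with jTessBoxEditor, that is, open the corresponding tif file mylang.myfont.exp0.tif in jTessBoxEditor.

Correct any characters that were recognized wrongly, and check whether X, Y, W and H need adjusting.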
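To check the box file outside the editor, it helps to know its layout. In Tesseract 3.x each line has the form "char left bottom right top page", with the origin at the bottom-left corner of the image. The sketch below parses such a line and converts it to X, Y, W, H values. It assumes the editor measures Y from the top of the image, and both function names are invented for this example.

```python
def parse_box_line(line):
    """Split one box-file line into (char, left, bottom, right, top, page)."""
    parts = line.split()
    char = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:6])
    return char, left, bottom, right, top, page


def to_xywh(box, image_height):
    """Convert a parsed box to (char, x, y, w, h) with y measured from the top.

    Assumption: the box file uses a bottom-left origin, so the top edge
    in top-down coordinates is image_height - top.
    """
    char, left, bottom, right, top, _page = box
    return char, left, image_height - top, right - left, top - bottom
```

For an image 40 pixels high, the line "A 10 5 30 25 0" describes a 20 by 20 box whose top-left corner is at x=10, y=15.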

04.

Generate the font_properties file. The font here is the custom myfont, matching the name used earlier.

Command (command line):

echo myfont 0 0 0 0 0 >font_properties
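Each line of font_properties has the form "fontname italic bold fixed serif fraktur", where the five flags are 0 or 1. If you train several fonts, writing the file from a script may be less error-prone than echo. The helper names below are made up for this sketch.

```python
def font_properties_line(font, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Build one font_properties line: name followed by five 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(flag)) for flag in flags])


def write_font_properties(path, fonts):
    """Write one line per font name, all flags left at 0."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        for font in fonts:
            f.write(font_properties_line(font) + "\n")
```

write_font_properties("font_properties", ["myfont"]) produces the same single line as the echo command above.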

05.

Generate the training file (command line)

tesseract mylang.myfont.exp0.tif mylang.myfont.exp0 -l eng -psm 5 nobatch box.train

06.

Generate the character set file (command line)

unicharset_extractor mylang.myfont.exp0.box

07.

Generate the shape file (command line)

shapeclustering -F font_properties -U unicharset -O mylang.unicharset mylang.myfont.exp0.tr

08.

Generate the clustered character feature files (command line)

mftraining -F font_properties -U unicharset -O mylang.unicharset mylang.myfont.exp0.tr

09.

Generate the character normalization feature file (command line)

cntraining mylang.myfont.exp0.tr

10.

Rename the output files (Windows command line)

rename normproto myfont.normproto

rename inttemp myfont.inttemp

rename pffmtable myfont.pffmtable

rename unicharset myfont.unicharset

rename shapetable myfont.shapetable

11.

Combine the training files (command line)

combine_tessdata myfont.

Note: in the output, entries 1, 3, 4, 5 and 13 should show positive numbers.

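12.

Test: copy the resulting myfont.traineddata into the tessdata directory under the Tesseract installation directory, then run

tesseract xx.jpg result -l myfont -psm 5

Note: the generated myfont.traineddata can also be used from Java to recognize images.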
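The command-line steps above can be gathered in one place, for example to print them as a checklist. This is a sketch only: the function names are invented, and the commands cannot simply be run back to back, because the box file has to be corrected by hand (step 03) and font_properties created (step 04) between the first and second command.

```python
def training_commands(lang="mylang", font="myfont", exp=0):
    """Return the tool invocations of steps 02 and 05 to 09, in order."""
    base = f"{lang}.{font}.exp{exp}"
    cluster_args = ["-F", "font_properties", "-U", "unicharset",
                    "-O", f"{lang}.unicharset", base + ".tr"]
    return [
        # Step 02: generate the box file (correct it by hand afterwards).
        ["tesseract", base + ".tif", base, "-l", "eng", "-psm", "5",
         "batch.nochop", "makebox"],
        # Step 05: generate the .tr training file.
        ["tesseract", base + ".tif", base, "-l", "eng", "-psm", "5",
         "nobatch", "box.train"],
        # Step 06: extract the character set.
        ["unicharset_extractor", base + ".box"],
        # Steps 07 and 08: shape table and clustered features.
        ["shapeclustering"] + cluster_args,
        ["mftraining"] + cluster_args,
        # Step 09: character normalization features.
        ["cntraining", base + ".tr"],
    ]


def renamed_outputs(font="myfont"):
    """Map each output file of step 10 to its prefixed name."""
    names = ("normproto", "inttemp", "pffmtable", "unicharset", "shapetable")
    return {name: f"{font}.{name}" for name in names}


if __name__ == "__main__":
    for cmd in training_commands():
        print(" ".join(cmd))
    for old, new in renamed_outputs().items():
        print(f"rename {old} {new}")
    print("combine_tessdata myfont.")
```

Keeping the language and font names as parameters makes it harder to mistype the long mylang.myfont.exp0 file names, which is where most of the command-line errors in this process come from.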

Command reference:

Usage: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]

pagesegmode values are:

0 = Orientation and script detection (OSD) only.

1 = Automatic page segmentation with OSD.

2 = Automatic page segmentation, but no OSD, or OCR

3 = Fully automatic page segmentation, but no OSD. (Default)

4 = Assume a single column of text of variable sizes.

5 = Assume a single uniform block of vertically aligned text.

6 = Assume a single uniform block of text.

7 = Treat the image as a single text line.

8 = Treat the image as a single word.

9 = Treat the image as a single word in a circle.

10 = Treat the image as a single character.

-l lang and/or -psm pagesegmode must occur before any configfile.
