Linux: Captcha Recognition

1. Download and install leptonica
http://www.leptonica.org/download.html or
http://code.google.com/p/leptonica/downloads/list

After extracting the archive, change into its directory:
$ ./configure
$ make
$ make install

2. Install tesseract:
leptonica must be installed before tesseract can be installed.
Install tesseract 3.02:
$ wget http://tesseract-ocr.googlecode.com/files/tesseract-3.0.2.tar.gz
$ tar zxvf tesseract-3.02.tar.gz
$ cd tesseract-3.02
$ ./configure
$ make
$ sudo make install

3. Install the Chinese and English language data: chi_sim.traineddata and eng.traineddata
After downloading them, put them in the /tessdata directory.
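
A common reason for tesseract failing at this point is that the language files are not where it looks for them. Here is a minimal sketch of a check, assuming Python 3 is available. The function name missing_traineddata and the directory /usr/local/share/tessdata are my own assumptions for illustration; use whatever tessdata directory your install actually has.

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, langs):
    """Return the language codes whose <lang>.traineddata file is absent."""
    tessdata = Path(tessdata_dir)
    return [lang for lang in langs
            if not (tessdata / f"{lang}.traineddata").is_file()]


if __name__ == "__main__":
    # Assumed location; adjust it to the tessdata directory of your install.
    print(missing_traineddata("/usr/local/share/tessdata", ["eng", "chi_sim"]))
```

An empty list means both files were found in that directory.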

Recognition command format: tesseract [pic_dir] abc -l eng
The result is written to abc.txt. -l eng selects English and -l chi_sim selects Simplified Chinese.
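
If you want to call this from a script, the command above can be wrapped roughly as follows. This is only a sketch: it assumes the tesseract binary is on PATH, and the helper names build_command and recognize, as well as the file name captcha.png in the usage note, are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_command(image_path, output_base, lang="eng"):
    """Build the argument list for: tesseract <image> <outputbase> -l <lang>."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def recognize(image_path, output_base="abc", lang="eng"):
    """Run tesseract and return the text it wrote to <output_base>.txt."""
    subprocess.run(build_command(image_path, output_base, lang), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8").strip()
```

Calling recognize("captcha.png") would run the command and read back abc.txt; check=True makes it raise if tesseract exits with an error.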

Usage:
tesseract --help | --help-psm | --version
tesseract --list-langs [--tessdata-dir PATH]
tesseract --print-parameters [options...] [configfile...]
tesseract imagename|stdin outputbase|stdout [options...] [configfile...]

OCR options:
--tessdata-dir PATH   Specify the location of tessdata path.
--user-words PATH     Specify the location of user words file.
--user-patterns PATH  Specify the location of user patterns file.
-l LANG[+LANG]        Specify language(s) used for OCR.
-c VAR=VALUE          Set value for config variables.
                      Multiple -c arguments are allowed.
-psm NUM              Specify page segmentation mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.

Single options:
-h, --help            Show this help message.
--help-psm            Show page segmentation modes.
-v, --version         Show version information.
--list-langs          List available languages for tesseract engine.
--print-parameters    Print tesseract parameters to stdout.
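
For captchas, the options most likely to matter are the page segmentation mode and the -c config variables. The sketch below builds those extra arguments. Treat it as an illustration under a few assumptions: the helper name ocr_options and the file name captcha.png are made up, the single-dash -psm spelling follows the usage text above (newer releases spell it --psm), and whether the tessedit_char_whitelist variable actually improves results depends on the captcha.

```python
def ocr_options(psm=None, config_vars=None):
    """Return extra CLI arguments for a page segmentation mode and -c variables."""
    args = []
    if psm is not None:
        if psm not in range(11):  # modes 0 to 10 are listed in the usage text
            raise ValueError("psm must be between 0 and 10")
        args += ["-psm", str(psm)]
    for name, value in (config_vars or {}).items():
        args += ["-c", f"{name}={value}"]
    return args


# Example: treat the image as one text line and allow only digits and
# uppercase letters. The options go after the output base name.
command = ["tesseract", "captcha.png", "abc", "-l", "eng"] + ocr_options(
    psm=7,
    config_vars={"tessedit_char_whitelist": "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"},
)
print(" ".join(command))
```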

It installed all right, but it doesn't seem to work.

It can't even recognize a captcha image like the one I tried.

Could it be that I'm using it the wrong way?

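=====================>>> Update:
I was indeed using it the wrong way; it needs some tuning of your own.
Use the jTessBoxEditor tool to train Tesseract 3.02.02 on sample images and improve the captcha recognition rate.
To train and generate your own traineddata, follow these steps:
1. First, use PIL.Image to generate the sample images as TIFF, or convert them to TIFF. The merge in step 2 (jTessBoxEditor) needs TIFF images to build the merged TIFF, which is named lang.henson.exp0.tif
2. Run jTessBoxEditor.jar in a Java environment: java -jar jTessBoxEditor.jar. The Tools menu in the window has a merge option; import the prepared sample images (TIFF format) to generate the merged TIFF.
3. Generate the box file:
tesseract lang.henson.exp0.tif lang.henson.exp0 -l eng -psm 7 batch.nochop makebox
4. Use the Box Editor in jTessBoxEditor to correct the values in the box file.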
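
For step 1 above, the conversion to TIFF could look roughly like this. It is a sketch, assuming Pillow (the maintained PIL fork) is installed. The directory names samples and tif and both function names are hypothetical, and converting to grayscale is my own choice here, not something the original steps require.

```python
from pathlib import Path


def tiff_path_for(src_path, out_dir):
    """Map e.g. samples/0001.png to <out_dir>/0001.tif."""
    return Path(out_dir) / (Path(src_path).stem + ".tif")


def convert_to_tiff(src_path, out_dir):
    """Convert one image to an 8-bit grayscale TIFF and return the new path."""
    from PIL import Image  # Pillow; imported here so the helper above works without it

    dst = tiff_path_for(src_path, out_dir)
    dst.parent.mkdir(parents=True, exist_ok=True)
    with Image.open(src_path) as img:
        img.convert("L").save(dst, format="TIFF")
    return dst


if __name__ == "__main__":
    # "samples" and "tif" are hypothetical directory names.
    for src in sorted(Path("samples").glob("*.png")):
        print(convert_to_tiff(src, "tif"))
```

The resulting .tif files are the ones to import in the merge step of jTessBoxEditor.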

5. Generate font_properties (the font name must match the one in the TIFF file name): echo henson 0 0 0 0 0 > font_properties
6. Generate the training file lang.henson.exp0.tr:
tesseract lang.henson.exp0.tif lang.henson.exp0 -l eng -psm 7 nobatch box.train
7. Generate the character set file unicharset:
unicharset_extractor lang.henson.exp0.box
8. Generate the shape file:
shapeclustering -F font_properties -U unicharset -O lang.unicharset lang.henson.exp0.tr
9. Generate the clustered character feature files. This produces three files: unicharset, inttemp, and pffmtable:
mftraining -F font_properties -U unicharset -O lang.unicharset lang.henson.exp0.tr
10. Generate the character normalization feature file normproto:
cntraining lang.henson.exp0.tr
11. Rename normproto, inttemp, pffmtable, unicharset, and shapetable so they can be combined into the trained data file later, like this:
mv normproto henson.normproto
…
12. Generate the henson.traineddata file:
combine_tessdata henson.

13. Put the generated henson.traineddata file in /usr/share/tesseract-ocr/tessdata
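
Since steps 6 to 12 are always run in the same order, they could be scripted roughly as below. This is a sketch under assumptions: the training tools are on PATH, the box file has already been corrected, and the font properties file is the font_properties created in step 5. The names training_commands, train, and FILES_TO_PREFIX are made up for illustration, and the commands simply mirror the ones listed in the steps.

```python
import os
import subprocess

# Files that step 11 renames with the "henson." prefix.
FILES_TO_PREFIX = ["normproto", "inttemp", "pffmtable", "unicharset", "shapetable"]


def training_commands(base="lang.henson.exp0", font_properties="font_properties"):
    """Return the commands for steps 6 to 10, in order, as argument lists."""
    tr = f"{base}.tr"
    return [
        ["tesseract", f"{base}.tif", base, "-l", "eng", "-psm", "7",
         "nobatch", "box.train"],
        ["unicharset_extractor", f"{base}.box"],
        ["shapeclustering", "-F", font_properties, "-U", "unicharset",
         "-O", "lang.unicharset", tr],
        ["mftraining", "-F", font_properties, "-U", "unicharset",
         "-O", "lang.unicharset", tr],
        ["cntraining", tr],
    ]


def train(prefix="henson", workdir="."):
    """Run steps 6 to 12 in workdir, stopping at the first failing command."""
    for cmd in training_commands():
        subprocess.run(cmd, cwd=workdir, check=True)
    for name in FILES_TO_PREFIX:  # step 11: add the prefix
        os.rename(os.path.join(workdir, name),
                  os.path.join(workdir, f"{prefix}.{name}"))
    # step 12: combine the prefixed files into <prefix>.traineddata
    subprocess.run(["combine_tessdata", f"{prefix}."], cwd=workdir, check=True)
```

Running train() in the directory that holds the merged TIFF, the box file, and font_properties should leave henson.traineddata there, which still has to be copied into the tessdata directory by hand.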

Run a recognition test:
tesseract test.tif output -l henson -psm 7

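Finally, the results were not very good. That may be because I did not add many sample images: correcting them one by one is exhausting, so I only tried twenty. Painful.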
