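Training a Chinese language data file for Tesseract 3.02
Download the chi_sim.traineddata language data file
Download tesseract-ocr-setup-3.02.02.exe
Download jTessBoxEditor, which is used to edit box files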
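0. Preparation
For convenience, name the image file in the format [lang].[fontname].exp[num].tif
where lang is the language and fontname is the font.
For example, to train a custom language named mjorcen with the font name normal,
rename the image file to mjorcen.normal.exp0.jpg (the sample image here is a .jpg rather than a .tif).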
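As a rough illustration of the naming convention, the following shell sketch splits such a file name into its parts. The function name parse_training_name is hypothetical and not part of Tesseract; it relies only on POSIX parameter expansion.

```shell
# Split a training image name of the form [lang].[fontname].exp[num].<ext>.
# parse_training_name is a hypothetical helper, not a Tesseract command.
parse_training_name() {
    file=$1
    case "$file" in
        *.*.exp[0-9]*.*) ;;   # looks like lang.font.expN.ext
        *) echo "unexpected name: $file" >&2; return 1 ;;
    esac
    base=${file%.*}      # drop the extension      -> mjorcen.normal.exp0
    num=${base##*.exp}   # text after ".exp"       -> 0
    rest=${base%.exp*}   # text before ".exp"      -> mjorcen.normal
    lang=${rest%%.*}     # first dot-separated part -> mjorcen
    font=${rest#*.}      # remainder               -> normal
    echo "lang=$lang font=$font num=$num base=$base"
}

parse_training_name "mjorcen.normal.exp0.jpg"
# prints: lang=mjorcen font=normal num=0 base=mjorcen.normal.exp0
```

The base value is the output name passed as the second argument to tesseract in the commands of this walkthrough.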
Now start training:
1. Generate the .box file
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
Keep the image file and the box file in the same directory.
2. Open the image in jTessBoxEditor.jar and correct the box file as needed
3. Generate the .tr file
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train
4. Generate a unicharset file
unicharset_extractor mjorcen.normal.exp0.box
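5. Create a new file named font_properties
containing the line normal 0 0 0 0 0, which describes a plain default font (the five flags are italic, bold, fixed, serif and fraktur)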
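The font_properties entry can also be produced from a small shell helper. This is only a sketch: font_properties_line is a hypothetical name, and it assumes the one-line-per-font layout of font name followed by the italic, bold, fixed, serif and fraktur flags, each 0 or 1.

```shell
# Print one font_properties entry:
#   <fontname> <italic> <bold> <fixed> <serif> <fraktur>
# Flags that are not supplied default to 0.
# font_properties_line is a hypothetical helper name.
font_properties_line() {
    printf '%s %s %s %s %s %s\n' \
        "$1" "${2:-0}" "${3:-0}" "${4:-0}" "${5:-0}" "${6:-0}"
}

font_properties_line normal            # prints: normal 0 0 0 0 0
font_properties_line ocra 0 0 1 0 0    # prints: ocra 0 0 1 0 0
```

Redirecting the output, for example font_properties_line normal > font_properties, creates the file used by the commands in this walkthrough.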
6. Run the following commands
shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
cntraining mjorcen.normal.exp0.tr
The output looks like this:
E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!
E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Clustering ...
Writing normproto ...
7. Add the prefix normal. to the five files unicharset, inttemp, pffmtable, shapetable and normproto in this directory
8. Run combine_tessdata normal.
9. Copy normal.traineddata into the tessdata folder under the Tesseract-OCR installation directory
10. Test
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal
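The renaming in step 7 can be sketched as a small shell function. prefix_training_files is a hypothetical helper name; the sketch assumes it is run inside the training directory after step 6 has finished, and it stops at the first missing file rather than producing an incomplete set.

```shell
# Prefix the training outputs with "<lang>." so that
# "combine_tessdata <lang>." can find them (steps 7 and 8).
# prefix_training_files is a hypothetical helper name.
prefix_training_files() {
    lang=$1
    for f in unicharset inttemp pffmtable shapetable normproto; do
        if [ -f "$f" ]; then
            mv "$f" "$lang.$f"
        else
            echo "missing training output: $f" >&2
            return 1
        fi
    done
}

# Usage, inside the training directory:
#   prefix_training_files normal && combine_tessdata normal.
```

Chaining with && means combine_tessdata only runs when all five files were renamed.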
Debug output:
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
Too many unichars in ambiguity on line 22358424
Too many unichars in ambiguity on line 22358424
Too many unichars in ambiguity on line 14941344
Tesseract Open Source OCR Engine v3.02 with Leptonica
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 6
Found 6 good blobs.
TRAINING ... Font name = normal
Generated training data for 2 words
E:\data\Users\Administrator\Desktop\ocrBuider3>unicharset_extractor mjorcen.normal.exp0.box
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.
E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!
E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Clustering ...
Writing normproto ...
E:\data\Users\Administrator\Desktop\ocrBuider3>combine_tessdata normal.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 123920
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal
Tesseract Open Source OCR Engine v3.02 with Leptonica
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp1 -l chi_sim
Too many unichars in ambiguity on line 15280712
Too many unichars in ambiguity on line 15280712
Too many unichars in ambiguity on line 4324296
Tesseract Open Source OCR Engine v3.02 with Leptonica
Result with the newly trained normal data:
应收: 119
Result with the stock Chinese data (chi_sim):
应收= II苜
Script (requires a Java environment):
The script is as follows:
Windows:
@echo off
set "src=%1%"
set "font_name=%2%"
set "desc=%3%"
if not defined src set /p src=" please pass your filename : "
if not defined font_name set /p font_name=" please pass your font_name : "
rem Validate the arguments
if not defined src echo IllegalArgumentException arg1 must not be null & pause>nul & exit
if not defined font_name echo IllegalArgumentException arg2 must not be null & pause>nul & exit
if not defined desc set "desc=%src:~0,-4%"
echo desc %desc%
rem If font_properties does not exist in this directory, create it and write the font entry
if exist font_properties (
echo font_properties exist
) else (
ECHO %font_name% 0 0 0 0 0 >"font_properties"
)
rem Delete files left over from a previous run
if exist %font_name%.unicharset ECHO DEL %font_name%.unicharset & DEL /Q %font_name%.unicharset
if exist %font_name%.inttemp ECHO DEL %font_name%.inttemp & DEL /Q %font_name%.inttemp
if exist %font_name%.pffmtable ECHO DEL %font_name%.pffmtable & DEL /Q %font_name%.pffmtable
if exist %font_name%.shapetable ECHO DEL %font_name%.shapetable & DEL /Q %font_name%.shapetable
if exist %font_name%.normproto ECHO DEL %font_name%.normproto & DEL /Q %font_name%.normproto
if exist %font_name%.font_properties ECHO DEL %font_name%.font_properties & DEL /Q %font_name%.font_properties
rem makebox
tesseract %src% %desc% -l chi_sim batch.nochop makebox
java -Xms128m -Xmx512m -jar jTessBoxEditor/jTessBoxEditor.jar
ECHO Please change your results , and press any key to continue
pause>nul
tesseract %src% %desc% nobatch box.train
unicharset_extractor %desc%.box
shapeclustering -F font_properties -U unicharset %desc%.tr
mftraining -F font_properties -U unicharset -O unicharset %desc%.tr
cntraining %desc%.tr
rem Rename the new files with the font_name prefix
if exist unicharset ECHO rename unicharset %font_name%.unicharset & rename unicharset %font_name%.unicharset
if exist inttemp ECHO rename inttemp %font_name%.inttemp & rename inttemp %font_name%.inttemp
if exist pffmtable ECHO rename pffmtable %font_name%.pffmtable & rename pffmtable %font_name%.pffmtable
if exist shapetable ECHO rename shapetable %font_name%.shapetable & rename shapetable %font_name%.shapetable
if exist normproto ECHO rename normproto %font_name%.normproto & rename normproto %font_name%.normproto
combine_tessdata %font_name%.
if exist font_properties ECHO rename font_properties %font_name%.font_properties & rename font_properties %font_name%.font_properties
ECHO press any key to continue
pause>nul
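Usage:
Note: argument 1 is the full file name, argument 2 is the font name, and argument 3 is the output base name, which defaults to the file name without its extension when omitted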
E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
Example run:
E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
desc mjorcen.normal.exp0
font_properties exist
Too many unichars in ambiguity on line 2188584
Too many unichars in ambiguity on line 2188584
Too many unichars in ambiguity on line 2686128
Tesseract Open Source OCR Engine v3.02 with Leptonica
Please change your results , and press any key to continue
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 6
Found 6 good blobs.
TRAINING ... Font name = normal
Generated training data for 2 words
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!
Reading mjorcen.normal.exp0.tr ...
Clustering ...
Writing normproto ...
rename unicharset normal.unicharset
rename inttemp normal.inttemp
rename pffmtable normal.pffmtable
rename shapetable normal.shapetable
rename normproto normal.normproto
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 123920
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
rename font_properties normal.font_properties
E:\data\Users\Administrator\Desktop\ocrBuider3>
Linux (from the documentation: http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.asc):
#!/bin/bash
tesseract zzz.ocra.exp0.tif zzz.ocra.exp0 nobatch box.train
unicharset_extractor zzz.ocra.exp0.box
echo "ocra 0 0 1 0 0" >font_properties
shapeclustering -F font_properties -U unicharset zzz.ocra.exp0.tr
mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
cntraining zzz.ocra.exp0.tr
cp normproto zzz.normproto
cp inttemp zzz.inttemp
cp pffmtable zzz.pffmtable
cp shapetable zzz.shapetable
combine_tessdata zzz.
cp zzz.traineddata /home/youruserid/tessdata/.
sudo cp zzz.traineddata /usr/share/tesseract-ocr/tessdata/.
tesseract zzz.ocra.exp0.tif output -l zzz