光学字符识别引擎 tesseract-ocr 简介

时间:2021-06-08 09:01:51
Tesseract是一个 由HP实验室开发 由Google维护的 开源的 光学字符识别 (OCR)引擎,可以在  Apache 2.0 许可 下获得。
它可以直接使用,或者(对于程序员)使用 API​​ 从图像中提取输入,包括手写的或打印的文本。

与Microsoft Office Document Imaging(MODI)相比,我们可以不断的训练的库,使图像转换文本的能力不断增强;

如果团队深度需要,还可以以它为模板,开发出符合自身需求的OCR引擎。

源码地址为:https://github.com/tesseract-ocr/tesseract

你可以训练它

大体流程为:安装jTessBoxEditor -> 获取样本文件 -> Merge样本文件 –> 生成BOX文件 -> 定义字符配置文件 -> 字符矫正 -> 执行批处理文件 -> 将生成的 traineddata 放 入tessdata 中。

具体细节参考:光学字符识别引擎 tesseract-ocr 样体训练

它是跨平台的,支持:

Linux


macOS


Windows

Tesseract-OCR4.0 版本在 Win7 上的安装过程

Tesseract-OCR4.0版本在VS2015上的编译与运行

它支持很多种的语言,包括:

Lang Code Language 4.0 traineddata
afr Afrikaans afr.traineddata
amh Amharic amh.traineddata
ara Arabic ara.traineddata
asm Assamese asm.traineddata
aze Azerbaijani aze.traineddata
aze_cyrl Azerbaijani - Cyrillic aze_cyrl.traineddata
bel Belarusian bel.traineddata
ben Bengali ben.traineddata
bod * bod.traineddata
bos Bosnian bos.traineddata
bul Bulgarian bul.traineddata
cat Catalan; Valencian cat.traineddata
ceb Cebuano ceb.traineddata
ces Czech ces.traineddata
chi_sim Chinese - Simplified chi_sim.traineddata
chi_tra Chinese - Traditional chi_tra.traineddata
chr Cherokee chr.traineddata
cym Welsh cym.traineddata
dan Danish dan.traineddata
deu German deu.traineddata
dzo Dzongkha dzo.traineddata
ell Greek, Modern (1453-) ell.traineddata
eng English eng.traineddata
enm English, Middle (1100-1500) enm.traineddata
epo Esperanto epo.traineddata
est Estonian est.traineddata
eus Basque eus.traineddata
fas Persian fas.traineddata
fin Finnish fin.traineddata
fra French fra.traineddata
frk Frankish frk.traineddata
frm French, Middle (ca. 1400-1600) frm.traineddata
gle Irish gle.traineddata
glg Galician glg.traineddata
grc Greek, Ancient (-1453) grc.traineddata
guj Gujarati guj.traineddata
hat Haitian; Haitian Creole hat.traineddata
heb Hebrew heb.traineddata
hin Hindi hin.traineddata
hrv Croatian hrv.traineddata
hun Hungarian hun.traineddata
iku Inuktitut iku.traineddata
ind Indonesian ind.traineddata
isl Icelandic isl.traineddata
ita Italian ita.traineddata
ita_old Italian - Old ita_old.traineddata
jav Javanese jav.traineddata
jpn Japanese jpn.traineddata
kan Kannada kan.traineddata
kat Georgian kat.traineddata
kat_old Georgian - Old kat_old.traineddata
kaz Kazakh kaz.traineddata
khm Central Khmer khm.traineddata
kir Kirghiz; Kyrgyz kir.traineddata
kor Korean kor.traineddata
kur Kurdish kur.traineddata
lao Lao lao.traineddata
lat Latin lat.traineddata
lav Latvian lav.traineddata
lit Lithuanian lit.traineddata
mal Malayalam mal.traineddata
mar Marathi mar.traineddata
mkd Macedonian mkd.traineddata
mlt Maltese mlt.traineddata
msa Malay msa.traineddata
mya Burmese mya.traineddata
nep Nepali nep.traineddata
nld Dutch; Flemish nld.traineddata
nor Norwegian nor.traineddata
ori Oriya ori.traineddata
pan Panjabi; Punjabi pan.traineddata
pol Polish pol.traineddata
por Portuguese por.traineddata
pus Pushto; Pashto pus.traineddata
ron Romanian; Moldavian; Moldovan ron.traineddata
rus Russian rus.traineddata
san Sanskrit san.traineddata
sin Sinhala; Sinhalese sin.traineddata
slk Slovak slk.traineddata
slv Slovenian slv.traineddata
spa Spanish; Castilian spa.traineddata
spa_old Spanish; Castilian - Old spa_old.traineddata
sqi Albanian sqi.traineddata
srp Serbian srp.traineddata
srp_latn Serbian - Latin srp_latn.traineddata
swa Swahili swa.traineddata
swe Swedish swe.traineddata
syr Syriac syr.traineddata
tam Tamil tam.traineddata
tel Telugu tel.traineddata
tgk Tajik tgk.traineddata
tgl Tagalog tgl.traineddata
tha Thai tha.traineddata
tir Tigrinya tir.traineddata
tur Turkish tur.traineddata
uig Uighur; * uig.traineddata
ukr Ukrainian ukr.traineddata
urd Urdu urd.traineddata
uzb Uzbek uzb.traineddata
uzb_cyrl Uzbek - Cyrillic uzb_cyrl.traineddata
vie Vietnamese vie.traineddata
yid Yiddish yid.traineddata
参考: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files