Python 3.x: How to Recognize Text in Images

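Before installing the pytesseract library, you must first install its dependencies, PIL and tesseract-ocr. PIL is an image-processing library, and tesseract-ocr is Google's OCR engine.

Pillow can be used in place of PIL.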

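I. Install the tesseract-ocr recognition engine

Download (unpack and install): https://sourceforge.net/projects/tesseract-ocr/

Note this statement on that page: "Currently, there is no official Windows installer for newer versions." In other words, there is no official Windows installer for the latest release, only for the somewhat older 3.02.02 version, which can be downloaded from https://sourceforge.net/projects/tesseract-ocr-alt/files/.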


Add the installation directory to the Path environment variable: D:\Program Files (x86)\Tesseract-OCR

Set the environment variable: TESSDATA_PREFIX=D:\Program Files (x86)\Tesseract-OCR\tessdata

Open a command prompt (DOS window) and type tesseract. If the command prints its usage information, the installation succeeded.
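The same check can be scripted. The following is a minimal sketch, assuming the executable is named tesseract and that it accepts the -v flag to print its version; the helper names find_executable and first_output_line are made up for this illustration.

```python
import shutil
import subprocess
import sys


def find_executable(name="tesseract"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


def first_output_line(args):
    """Run a command and return the first line it prints (stdout or stderr)."""
    proc = subprocess.run(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # older tesseract builds print to stderr
        text=True,
    )
    lines = proc.stdout.splitlines()
    return lines[0].strip() if lines else ""


if __name__ == "__main__":
    path = find_executable()
    if path is None:
        print("tesseract was not found on PATH", file=sys.stderr)
    else:
        print(path)
        print(first_output_line([path, "-v"]))
```

If the script reports that tesseract was not found, the Path entry from the previous step is probably missing or a new command prompt has not been opened since it was set.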

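Test the recognition:

Change to the directory that contains the image with cd /d E:\pydevworkspaces, then run tesseract tttt.png result. This recognizes tttt.png and writes the text to result.txt in the same directory.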

[Screenshots: running the command, the tttt.png image, and the contents of result.txt]

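The recognition rate does not seem very high: the third digit was already misrecognized.

Training Tesseract with your own samples (search for "tesseract OCR training samples") can improve the recognition rate.

Either way, this shows that the installation succeeded.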

tesseract syntax:

tesseract code.jpg result  -l chi_sim -psm 7 nobatch

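-l chi_sim selects the Simplified Chinese language data. The language file has to be downloaded, unpacked, and placed in the tessdata directory. Language files use the extension .traineddata; the Simplified Chinese file is named chi_sim.traineddata.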

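-psm 7 tells tesseract that the image code.jpg contains a single line of text, which can reduce recognition errors. The default is 3. (Tesseract 4 and later spell this option --psm.)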

configfile is the name of a file in the tessdata\configs or tessdata\tessconfigs directory (nobatch in the command above).
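To make the argument order concrete, here is a small sketch that assembles such a command line as a list suitable for subprocess.run. The function build_tesseract_command and its parameters are hypothetical helpers for this illustration, not part of tesseract or pytesseract. The legacy_psm_flag switch is an assumption about which version is installed: it picks between the 3.x spelling -psm used in this article and the --psm spelling of Tesseract 4 and later.

```python
def build_tesseract_command(image, output_base, lang=None, psm=None,
                            configfiles=(), legacy_psm_flag=True,
                            executable="tesseract"):
    """Build the argument list for a tesseract command line.

    legacy_psm_flag=True emits `-psm N` (the 3.x spelling);
    False emits `--psm N` (Tesseract 4 and later).
    """
    cmd = [executable, str(image), str(output_base)]
    if lang is not None:
        cmd += ["-l", lang]  # language data, e.g. chi_sim
    if psm is not None:
        cmd += ["-psm" if legacy_psm_flag else "--psm", str(psm)]
    cmd += list(configfiles)  # config file names come last
    return cmd


if __name__ == "__main__":
    args = build_tesseract_command("code.jpg", "result", lang="chi_sim",
                                   psm=7, configfiles=["nobatch"])
    print(" ".join(args))
```

Passing the list to subprocess.run(args, check=True) would then run the recognition and leave the text in result.txt, provided tesseract is installed and on Path.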

II. Install the third-party libraries (pytesseract, pillow)

# Install pytesseract
pip install pytesseract
# Install Pillow
pip install pillow

Note: change the tesseract path used by pytesseract.

(1) File: D:\Python36\Lib\site-packages\pytesseract\pytesseract.py

(2) Change: tesseract_cmd = 'D:/Program Files (x86)/Tesseract-OCR/tesseract.exe'
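Editing the installed file is not the only option: pytesseract also lets the path be set at runtime through pytesseract.pytesseract.tesseract_cmd. The sketch below assumes that approach. The candidate paths and the helper names first_existing and configure_tesseract are illustrative only and should be adjusted to the actual installation.

```python
import os

try:
    import pytesseract  # third-party: pip install pytesseract
except ImportError:  # keep the helpers usable even without pytesseract
    pytesseract = None

# Hypothetical candidate locations; adjust them to your own install.
CANDIDATES = [
    r"D:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
]


def first_existing(paths):
    """Return the first path that is an existing file, or None."""
    for path in paths:
        if os.path.isfile(path):
            return path
    return None


def configure_tesseract(paths=CANDIDATES):
    """Point pytesseract at the first executable found; return that path or None."""
    found = first_existing(paths)
    if found is not None and pytesseract is not None:
        pytesseract.pytesseract.tesseract_cmd = found
    return found


if __name__ == "__main__":
    print(configure_tesseract())
```

Setting the path in the script keeps the change out of site-packages, so it survives reinstalling or upgrading pytesseract.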

III. Example code

# python3
# author lizm
# datetime 2018-01-26 12:00:00
'''
    Demo: use pytesseract to read the text in an image
'''
import pytesseract
from PIL import Image

# Set the path to the tesseract executable if it is not on PATH
# pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files/Tesseract-OCR/tesseract'

# Open the image and run OCR on it
image = Image.open('tttt.png')
code = pytesseract.image_to_string(image)
print(code)

IV. Recognizing Chinese

1. Get the Chinese language data file: chi_sim.traineddata

2. Copy it into the D:\Program Files (x86)\Tesseract-OCR\tessdata directory

3. Code example:

# python3
# author lizm
# datetime 2018-09-21 12:00:00
'''
    Demo: use pytesseract to read Chinese text in an image
'''
import pytesseract
from PIL import Image

# lang='chi_sim' selects the Simplified Chinese language data
code = pytesseract.image_to_string(Image.open('8.jpg'), lang='chi_sim')
print(code)

Note: chi_sim.traineddata must match the version of the installed tessdata, otherwise it will not take effect.
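A missing language file is a common reason for the Chinese example to fail, so it can help to check for it before calling pytesseract. This is a rough sketch: it assumes TESSDATA_PREFIX points at the tessdata directory as configured earlier, and the helper names tessdata_dir and has_language are invented for this example. It only checks that the file exists, not that its version matches the engine.

```python
import os

# Default taken from the install location used in this article.
DEFAULT_TESSDATA = r"D:\Program Files (x86)\Tesseract-OCR\tessdata"


def tessdata_dir(default=DEFAULT_TESSDATA):
    """Return the tessdata directory from TESSDATA_PREFIX, or the default."""
    return os.environ.get("TESSDATA_PREFIX", default)


def has_language(lang, directory):
    """Return True if <directory>/<lang>.traineddata exists."""
    return os.path.isfile(os.path.join(directory, lang + ".traineddata"))


if __name__ == "__main__":
    directory = tessdata_dir()
    if has_language("chi_sim", directory):
        print("chi_sim.traineddata found in", directory)
    else:
        print("chi_sim.traineddata is missing from", directory)
```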
