Tesseract OCR text recognition

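I. Environment setup (based on VS2010)

1. Download and install tesseract-ocr-setup-3.02.02.exe. It is best to run the installer while connected through a proxy or VPN that can get past the firewall. (Be sure to check the "Tesseract development files" option during installation.)

  Installer download: http://pan.baidu.com/s/1pKAbyvp  Password: iicm

2. Extract tesseract-3.02.02-win32-lib-include-dirs.zip over the tesseract-ocr installation directory.

  Link: http://pan.baidu.com/s/1cEfU6U  Password: o80p

3. Extract DLL.zip (the newer VS2010 builds) over the old VS2008 DLLs in the tesseract-ocr installation directory.

  Link: http://download.csdn.net/detail/xadxyz/9789395

4. Extract the Chinese recognition language data into the tessdata folder under the tesseract-ocr installation directory: C:\Tesseract-OCR\tessdata

  Link: http://pan.baidu.com/s/1i5ojm1f  Password: oqqb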
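A missing language file is an easy mistake to make in step 4, so it can help to check for it before running any OCR code. Below is a minimal sketch using only the standard library. It assumes the language data is stored as tessdata\<lang>.traineddata under the installation directory. The helper names (traineddata_path, file_exists, has_language_data) are made up for this illustration and are not part of the Tesseract API.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Build "<install_dir>\tessdata\<lang>.traineddata".
// Both arguments come from the caller; nothing is read from the
// registry or from environment variables.
std::string traineddata_path(const std::string& install_dir, const std::string& lang)
{
    assert(!install_dir.empty() && !lang.empty());
    std::string path = install_dir;
    const char last = path[path.size() - 1];
    if (last != '\\' && last != '/')
        path += '\\';               // add a separator only if one is missing
    return path + "tessdata\\" + lang + ".traineddata";
}

// True if the file can be opened for reading.
bool file_exists(const std::string& path)
{
    std::ifstream f(path.c_str(), std::ios::binary);
    return f.good();
}

// Hypothetical convenience wrapper: is the language data in place?
bool has_language_data(const std::string& install_dir, const std::string& lang)
{
    return file_exists(traineddata_path(install_dir, lang));
}
```

For the setup described here, has_language_data("C:\\Tesseract-OCR", "chi_sim") should return true once the Chinese data has been extracted. If it returns false, the archive probably went into the wrong folder.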

II. Create the project

1. Add the include and lib paths of the installation directory to the VS project configuration.

2. Sample code

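// TestOCR.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <windows.h>   // MultiByteToWideChar, WideCharToMultiByte
#include <cstdlib>     // system
#include <iostream>
#include <string>
#include "strngs.h"    // tesseract STRING
#include "baseapi.h"   // tesseract::TessBaseAPI
using namespace std;

// Link against the debug build of the library (note the trailing "d").
#pragma comment(lib,"libtesseract302d.lib")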

// Convert a UTF-8 string to the system ANSI code page (CP_ACP)
// so that the Windows console can display it.
std::string UTF8_To_string(const std::string & str)
{
	if (str.empty())
		return std::string();

	// UTF-8 -> UTF-16. An explicit length is passed, so no terminator is counted.
	int nwLen = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), (int)str.length(), NULL, 0);
	if (nwLen <= 0)
		return std::string();
	std::wstring wide(nwLen, L'\0');
	MultiByteToWideChar(CP_UTF8, 0, str.c_str(), (int)str.length(), &wide[0], nwLen);

	// UTF-16 -> ANSI code page.
	int nLen = WideCharToMultiByte(CP_ACP, 0, wide.c_str(), nwLen, NULL, 0, NULL, NULL);
	if (nLen <= 0)
		return std::string();
	std::string retStr(nLen, '\0');
	WideCharToMultiByte(CP_ACP, 0, wide.c_str(), nwLen, &retStr[0], nLen, NULL, NULL);
	return retStr;
}


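int _tmain(int argc, _TCHAR* argv[])
{
	tesseract::TessBaseAPI api;
	// "chi_sim" selects the Simplified Chinese data; Init returns 0 on success.
	if (api.Init(NULL, "chi_sim", tesseract::OEM_DEFAULT) != 0)
	{
		cerr << "Failed to initialize tesseract (is the chi_sim data in tessdata?)" << endl;
		return 1;
	}
	STRING text_out;
	if (!api.ProcessPages("test.jpg", NULL, 0, &text_out))
	{
		cerr << "Failed to process test.jpg" << endl;
		api.End();
		return 1;
	}
	// Tesseract returns UTF-8; convert it before printing to the console.
	cout << UTF8_To_string(text_out.string()) << endl;
	api.End();
	system("pause");
	return 0;
}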

3. Recognition result

  [Image: Tesseract OCR text recognition]

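The recognition error rate with the Chinese language data is still fairly high; the language data needs further tuning and training.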

　 http://blog.csdn.net/problc/article/details/8065011
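To put a number on the error rate mentioned above, one common approach is the character error rate: the edit distance between the OCR output and a hand-typed reference, divided by the length of the reference. The sketch below is only an illustration and is not part of the original project. It assumes both strings are well-formed UTF-8 (which is what Tesseract returns) and compares Unicode code points rather than bytes, so each Chinese character counts once. All function names are made up for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Decode UTF-8 into Unicode code points. Assumes well-formed input;
// stray bytes are passed through as single code points, not rejected.
std::vector<unsigned long> decode_utf8(const std::string& s)
{
    std::vector<unsigned long> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        unsigned long cp = c;
        std::size_t extra = 0;                      // continuation bytes to read
        if (c >= 0xF0)      { cp = c & 0x07; extra = 3; }
        else if (c >= 0xE0) { cp = c & 0x0F; extra = 2; }
        else if (c >= 0xC0) { cp = c & 0x1F; extra = 1; }
        ++i;
        for (; extra > 0 && i < s.size(); --extra, ++i)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i]) & 0x3F);
        out.push_back(cp);
    }
    return out;
}

// Levenshtein distance (insert, delete, substitute all cost 1),
// computed with two rolling rows.
std::size_t edit_distance(const std::vector<unsigned long>& a,
                          const std::vector<unsigned long>& b)
{
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t subst = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min(subst, std::min(prev[j], cur[j - 1]) + 1);
        }
        prev.swap(cur);
    }
    return prev[b.size()];
}

// Character error rate: edit distance divided by reference length.
double char_error_rate(const std::string& reference_utf8, const std::string& ocr_utf8)
{
    std::vector<unsigned long> ref = decode_utf8(reference_utf8);
    assert(!ref.empty());                           // avoid dividing by zero
    return static_cast<double>(edit_distance(ref, decode_utf8(ocr_utf8))) / ref.size();
}
```

In the sample program, the OCR side would be text_out.string(), compared against a reference transcription of test.jpg saved as UTF-8. Tracking this value before and after retraining the language data gives a rough sense of whether the training helped.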

  Download link for all the resources used: http://download.csdn.net/detail/xadxyz/9789381

Sample project source code: http://download.csdn.net/detail/xadxyz/9789417

  QQ for discussion: 0x7317AF28
