How do I use Tesseract OCR (or another free OCR) in a small C++ project?

So what I heard after research is that the only solid free OCR options are either Tesseract or CuneiForm.

Now, the Tesseract docs are plain horrible. All they give you is a bunch of Visual Studio code (I'm on Windows), and from there you are on your own in an ocean of their API. All you can do is compile the exe and then run it on a TIFF image.

I was expecting at least some short documentation that tells you how to call their API to run OCR, even just for a small example, but no, there's nothing like that in their docs.

CuneiForm: I downloaded it and "great" everything is in Russian. :(

Is it really that hard for those guys to provide a small example? Instead they supply us with a bunch of irrelevant info that probably 90% of people will never get to. How can you get there without starting on small things? And they explain none of it!

So I have a bunch of API, but how the hell am I supposed to use it if it's explained nowhere?... Maybe someone can offer me advice and a solution? I'm not asking for a miracle, just something small to show me how things work.

4 Answers

#1

You might have given up, but there may be others who are still trying. So here is what you need to get started with Tesseract:

First of all, you should read all the documentation about Tesseract. You may find something useful in the wiki.

To start using the API (v3.0.1, currently in trunk; also read the README and ChangeLog from trunk), you should check out baseapi.h. The documentation on how to use the API is right there: a comment above each function.

For starters:

- Include baseapi.h and construct a TessBaseAPI object.
- Call Init().
- Optionally:
  - Change some params with SetVariable(). You can see all the params and their values by printing them to a file with PrintVariables().
  - Change the segmentation mode with SetPageSegMode(). This tells Tesseract what the image you are about to OCR represents: a block or line of text, a word, or a character.
- Call SetImage().
- Call GetUTF8Text().

(Again, that is just for starters.)

You can check the Tesseract community for already answered questions or ask your own here.

#2

I'm digging into it... so far I've generated Doxygen docs for it, and that's helping. Still reading all the docs though.

Some links that help me:

- The dev Google group is full of broken examples from desperate devs
- A slightly old (v2.0) "hacking Tesseract" how-to

Anyway, I downloaded the svn from Google Code: http://code.google.com/p/tesseract-ocr/

and made and installed it, then used doxygen to generate my own API reference docs. Very useful.

The way I did it is:

- I used 'make install' and it put some stuff in /usr/include/tesseract
- I copied that dir to my home dir
- doxygen -g doxygen.conf; # To generate a doxygen config file
- Go through the file it generates and set the output dir, project name, and so on. I used 'doxy-doc' as my output dir
- doxygen doxygen.conf; # To generate the docs from that config
- chromium-browser doxy-doc/html/index.html

Hope that helps a bit.

#3

I figured it out. If you are using Visual Studio 2010 with Windows Forms / the designer, you can add it easily this way with no issues.

Add the following projects to your project (I am warning you once: do not add the Tesseract solution, or change any setting in the projects you add, unless you love to hate yourself):

ccmain ccstruct ccutil classify cube cutil dict image libtesseract neural_networks textord viewer wordrec

you can add the others but you don’t really want all that built into your project do you? naaa, build those separately

Go to your project properties and add libtesseract as a reference, which you can do now that it is visible as a project. This lets your project build fast without examining the millions of warnings within Tesseract. [common properties]->[add reference]

Right-click your project in the Solution Explorer and click Project Dependencies. Make sure it is dependent on libtesseract, or even on all of them; it just means they build before your project.

The Tesseract 2010 Visual Studio projects contain a number of configurations: release, release.dll, debug, and debug.dll. It seems that the release.dll settings produce the right files. First, set the solution output to release.dll. Click your project properties, then click Configuration Manager. If that is not available, click the SOLUTION's properties in the solution tree and click the configuration tab; you will see a list of projects and their associated configuration settings. You will notice your project is not set to release.dll even though the output is. If you took the second route you still need to click Configuration Manager. Then you can edit the settings: click New on your project's settings, call it release.dll (exactly the same as the rest of them), and copy the settings from release. Do the same thing for debug, so that you have a debug.dll configuration copied from the debug settings. Whew... almost done.

Don't try to change Tesseract's settings to match yours... that won't work... and when a new release comes out you won't be able to just "throw it in" and go. Accept the fact that in this state your new modes are Release.dll and Debug.dll. Don't stress out... you can go back when it is finished and remove the projects from your solution.

Guess where the libraries and DLLs come out? In your project. You may or may not need to add the library directories. Some people say to dump all the headers into a single folder so they only need to add one folder to the includes, but not me. I want to be able to delete the Tesseract folder and reload it from the zips without extra work... and be fully ready to update in one move, or restore it if I made a mess of the code. It's a bit of work, and you can do it with code instead of the settings, which is the way I do it, but you should include all the folders that contain header files within the 2010 Tesseract project folder and leave them alone.

There is no need to add any files to your project, just these lines of code... I have included some additional code that converts from a foreign data set to the TIFF-friendly version with no need to save/load a file. Aren't I nice?

Now you can fully debug in debug.dll and release.dll. Once you have successfully built it into your project, even once, you can remove all the added projects and it will be peeerfect. No extra compiling or errors. Fully debuggable, all natural.

If I remember right, I could not get around the fact I had to copy the files in 2008/lib/ into my projects release folder….darn it.

In my project's “functions.h” I put:

#pragma comment (lib, "liblept.lib" )
#define _USE_TESSERACT_
#ifdef _USE_TESSERACT_
#pragma comment (lib, "libtesseract.lib" )
#include <baseapi.h>
#endif
#include <allheaders.h>

In my main project I put this in a class as a member:

tesseract::TessBaseAPI *readSomeNombers;

And of course I included “functions.h” somewhere.

Then I put this in my class's constructor:

readSomeNombers = new tesseract::TessBaseAPI();
readSomeNombers ->Init(NULL, "eng" );
readSomeNombers ->SetVariable( "tessedit_char_whitelist", "0123456789,." );

Then I created this class member function, plus a class member to serve as the output. Don't hate: I don't like returning variables. Not my style. I originally believed the memory for the pix did not need to be destroyed when used inside a member function this way, but pixCreate() allocates a new Pix on every call and GetUTF8Text() returns a buffer that the caller owns, so both are released before the function returns. But by all means, you can do whatever.

void Gaara::scanTheSpot()
{
    // 200x40 region at 32 bits per pixel, filled from scanImage starting at (87, 42)
    Pix *someNewPix = pixCreate( 200 , 40 , 32 );
    convertEasyBmpToPix( &scanImage, someNewPix, 87, 42 );

    readSomeNombers->SetImage(someNewPix);
    char* outText = readSomeNombers->GetUTF8Text();
    classMemeberVariable = ( outText != NULL ) ? outText : "";
    //pixWrite( "test.bmp", someNewPix, IFF_BMP );

    delete [] outText;          // GetUTF8Text() hands ownership of the buffer to the caller
    pixDestroy( &someNewPix );  // release the Pix allocated by pixCreate()
}

The object that has the information that I want to scan is in memory and is pointed to by &scanImage. It is from the “EasyBMP” library but that is not important.

I deal with it in a function in “functions.h”/“functions.cpp”. By the way, I am doing a little extra processing here while I am in the loop, namely thinning the characters, making the image black and white, and reversing black and white, which is unnecessary. At this phase in my development I am still looking for ways to improve the recognition, though for my purposes this has not yielded bad data yet. My view is to use the default Tess data for simplicity. I am acting heuristically to solve a very complex problem.

// Copies a window of sourceImage, starting at (startX, startY) and sized like
// outputImage, into outputImage as inverted black and white.
void convertEasyBmpToPix( BMP *sourceImage, PIX *outputImage, unsigned startX, unsigned startY )
{
    int endX = startX + pixGetWidth( outputImage );
    int endY = startY + pixGetHeight( outputImage );
    int destinationY = 0;
    for( int yLoop = startY; yLoop < endY; yLoop++ )
    {
        int destinationX = 0;
        for( int xLoop = startX; xLoop < endX; xLoop++ )
        {
            // GetPixel() returns by value, so keep a named copy before taking its address
            RGBApixel pixel = sourceImage->GetPixel( xLoop, yLoop );
            if( isWhite( &pixel ) )
            {
                // light source pixels become black in the output
                pixSetRGBPixel( outputImage, destinationX, destinationY, 0,0,0 );
            }
            else
            {
                // dark source pixels become white in the output
                pixSetRGBPixel( outputImage, destinationX, destinationY, 255,255,255 );
            }
            destinationX++;
        }
        destinationY++;
    }
}
// A pixel counts as "white" unless at least one channel is below 50.
bool isWhite( RGBApixel *image )
{
    return !( ( image->Red   < 50 ) ||
              ( image->Blue  < 50 ) ||
              ( image->Green < 50 ) );
}
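
If you want to play with the thresholding idea without EasyBMP or Leptonica, here is a rough standalone sketch of the same rule. Rgb, isBackground and binarizeInverted are made-up names for illustration, not part of either library; the only thing taken from the code above is the logic (a pixel is "white" when every channel is at least 50, and white and black are swapped in the output).

```cpp
#include <cstddef>
#include <vector>

// Stand-in for an RGB pixel; not an EasyBMP or Leptonica type.
struct Rgb {
    unsigned char red, green, blue;
};

// Same rule as isWhite(): a pixel counts as "white" (background)
// only if every channel is at least `threshold`.
bool isBackground(const Rgb& p, unsigned char threshold = 50) {
    return p.red >= threshold && p.green >= threshold && p.blue >= threshold;
}

// Returns one byte per pixel: 0 (black) for background pixels and
// 255 (white) for everything else, i.e. thresholded and inverted.
std::vector<unsigned char> binarizeInverted(const std::vector<Rgb>& pixels,
                                            unsigned char threshold = 50) {
    std::vector<unsigned char> out;
    out.reserve(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        out.push_back(isBackground(pixels[i], threshold) ? 0 : 255);
    }
    return out;
}
```

Keeping the threshold as a parameter makes it easy to try other cut-offs while tuning recognition; 50 is simply the value used above.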

One thing I don't like is the way I declare the size of the pix outside the function. It seems that if I try to do it within the function I get unexpected results... if the memory is allocated while inside, it is destroyed when I leave.
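
This is only a guess at the cause, but a common way to hit that problem is to assign the new image to a pointer parameter that was passed by value, so the caller never sees it. One way around it is to have the function create the image and return it to the caller. The sketch below shows that pattern with a stand-in Image type and a made-up makeImage helper (neither is Leptonica). With Leptonica the equivalent would be to return the Pix* from pixCreate() and let the caller call pixDestroy() when done.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Stand-in for an image buffer; not Leptonica's Pix.
struct Image {
    int width;
    int height;
    std::vector<unsigned char> data;  // one byte per pixel, zero-initialised

    Image(int w, int h)
        : width(w), height(h), data(static_cast<std::size_t>(w) * h, 0) {}
};

// Hypothetical factory: allocates inside the function and hands ownership
// to the caller, so the image outlives the call and is freed automatically.
std::unique_ptr<Image> makeImage(int w, int h) {
    return std::unique_ptr<Image>(new Image(w, h));
}
```

A caller would write std::unique_ptr<Image> img = makeImage(200, 40); and use img->width and img->data as usual. The image is released when img goes out of scope.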

g m a i l Certainly not my most elegant work but I also gutted the hell out of it for simplicity. Why I bother to share this I don't know. I should have kept it to myself. What is my name? Kage.Sabaku.No.Gaara

Before I let you go, I should mention the subtle differences between my Windows Forms app and the default settings: namely, I use the "multi-byte" character set (in project properties and such). Give a dog a bone, maybe a vote?

P.P.S. I hate to say it, but I made one change to host.c; if you use 64-bit you can do the same. Otherwise you're on your own... but my reason was a bit insane, so you don't have to.

typedef unsigned int uinT32;
#if (_MSC_VER >= 1200)            //%%% vkr for VC 6.0
typedef _int64 inT64;
typedef unsigned _int64 uinT64;
#else
typedef long long int inT64;
typedef unsigned long long int uinT64;
#endif                           //%%% vkr for VC 6.0
typedef float FLOAT32;
typedef double FLOAT64;
typedef unsigned char BOOL8;

#4

Marko, I've tried to write a quick C++ app as well using Tesseract and ran into the same problems.

In a nutshell, I found it confusing, with few examples/docs, but I don't fault the product. Heck, it's open source, and the contributors are probably more interested in improving it than marketing it.

You could try poking around in the source code, and with some time you might get an understanding, but I can totally relate to your frustration.

Good luck!
