Tesseract v3.03: example of rendering a PDF with searchable text

Time: 2021-01-24 08:58:09

According to the tesseract v3.03 release notes, tesseract now supports rendering PDF output with searchable text, but I don't know how to use this feature in my code.
I currently use tess-two in my Android app, and I wonder whether this feature can work on Android.

It would be great if you could give me an example that uses the tesseract API to render a PDF; I will then try to port the missing functions to the tess-two library.
Thanks in advance.

P.S.: I can see the pdfrenderer file, which may handle rendering PDF output, but I don't know how to use it with the base API.

Update: here is my attempt:

// Create a PDF renderer, passing the data path reported by the API.
tesseract::TessResultRenderer* renderer = new tesseract::TessPDFRenderer(nat->api.GetDatapath());
__android_log_print(ANDROID_LOG_ERROR, "Test_tesseract", "data path = %s", nat->api.GetDatapath());

// Run OCR on the input file and feed every page to the renderer.
if (!nat->api.ProcessPages(c_file_name, NULL, 0, renderer)) {
    __android_log_print(ANDROID_LOG_ERROR, "Test_tesseract", "process page failed");
    delete renderer;
    return;
}

FILE* fout = fopen(c_pdf_file_name, "wb");
if (fout == NULL) {
    __android_log_print(ANDROID_LOG_ERROR, "Test_tesseract", "Cannot create output file %s\n", c_pdf_file_name);
    delete renderer;
    return;
}

// Fetch the rendered PDF bytes and write them to the output file.
const char* data;
int dataLength;
if (renderer->GetOutput(&data, &dataLength)) {
    fwrite(data, 1, dataLength, fout);
} else {
    __android_log_print(ANDROID_LOG_ERROR, "Test_tesseract", "Cannot get output file");
}

// Close the file and release the renderer on both paths.
fclose(fout);
delete renderer;

My code fails at the ProcessPages call. After adding log statements (I have trouble debugging in the NDK), I found that the PDF renderer's BeginDocument always returns false inside the TessBaseAPI::ProcessPages method of baseapi.cpp:

if (renderer && !renderer->BeginDocument(kUnknownTitle)) {
    success = false;
}

Am I missing something?

P.S.: I use tess-two, which prefers baseapi over capi.

1 Solution

#1


It's as follows:

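// Create a PDF renderer for the given data path.
TessResultRenderer renderer = api.TessPDFRendererCreate(dataPath);
// Run OCR on the image file and feed the pages to the renderer.
api.TessBaseAPIProcessPages1(handle, image, null, 0, renderer);
// Fetch a pointer to the rendered PDF data and its length.
PointerByReference data = new PointerByReference();
IntByReference dataLength = new IntByReference();
api.TessResultRendererGetOutput(renderer, data, dataLength);
// Copy the native buffer into a Java byte array.
byte[] bytes = data.getValue().getByteArray(0, dataLength.getValue());
// Then write the bytes array to a file with a .pdf extension.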
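For the C++/NDK side used in the question, the final "write the bytes to a file" step could be sketched as below. This is only a minimal illustration: `WriteBufferToFile` is a hypothetical helper, not part of tesseract or tess-two, and it assumes the buffer and length have already been obtained from the renderer's GetOutput call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper (not a tesseract or tess-two function): writes `length`
// bytes from `data` to `path` and reports whether every byte reached the file.
bool WriteBufferToFile(const char* path, const char* data, std::size_t length) {
    assert(path != nullptr && data != nullptr);
    std::FILE* out = std::fopen(path, "wb");
    if (out == nullptr) {
        return false;  // the output file could not be created
    }
    const std::size_t written = std::fwrite(data, 1, length, out);
    // fclose flushes buffered data, so its result is checked as well.
    const bool closed = (std::fclose(out) == 0);
    return written == length && closed;
}
```

With the names from the question, a call might look like `WriteBufferToFile(c_pdf_file_name, data, dataLength)`, with a log message when it returns false.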

If you have trouble following the code, check out the renderer example in this post.
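As for BeginDocument returning false in the question, one cause worth ruling out is a missing resource file. This is an assumption, not something confirmed in this thread: the 3.03 PDF renderer appears to load a font file named pdf.ttf from the data directory passed to its constructor, and it may fail at BeginDocument if that file cannot be opened. The sketch below is a plain pre-flight check; `DataFileReadable` is a hypothetical helper and the file name is part of the assumption.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical pre-flight check (not a tesseract API): returns true if
// `file_name` can be opened for reading inside `data_dir`.
bool DataFileReadable(const std::string& data_dir, const std::string& file_name) {
    assert(!file_name.empty());
    std::string path = data_dir;
    // Add a separator only when the directory does not already end with one.
    if (!path.empty() && path.back() != '/') {
        path += '/';
    }
    path += file_name;
    std::FILE* fp = std::fopen(path.c_str(), "rb");
    if (fp == nullptr) {
        return false;  // missing file or no read permission
    }
    std::fclose(fp);
    return true;
}
```

In the question's code this could be called as `DataFileReadable(nat->api.GetDatapath(), "pdf.ttf")` before creating the renderer, logging the result next to the data path.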
