How to give Tesseract a word list (.NET wrapper)

TL;DR version:

Does anyone have a working 'bazaar' config for Tesseract using the .NET wrapper that I could see?

I'm pretty sure that's what I want (only recognise some words from a list), but it doesn't seem to do anything

I have a pretty short list of possible strings I'm trying to find (1-4 words). The documentation for Tesseract states:

If you want to replace the whole dictionary, you will need to unpack the .traineddata file, create a new word-dawg file, and then pack the files back into a .traineddata file. See TrainingTesseract for more details.

That sounds like what I want! So I look at TrainingTesseract and see:

The traineddata file is simply a concatenation of the input files, with a table of contents that contains the offsets of the known file types. See ccutil/tessdatamanager.h in the source code for a list of the currently accepted filenames.

Great. So how do I go about unpacking this simple concatenation of input files, modifying the content and header and re-packing it, then? :)

This post appears to be the same question - which involves simply turning off the default dictionary and using user-words instead:

let’s suppose you want to OCR in English, but suppress the normal dictionary and load an alternative word list and an alternative list of patterns — these two files are the most commonly used extra data files.

If your language pack is in /path/to/eng.traineddata and the hocr config is in /path/to/configs/hocr then create three new files:

/path/to/eng.user-words: -snip

/path/to/eng.user-patterns: -snip

/path/to/configs/bazaar: -snip

Now, if you pass the word bazaar as a trailing command line parameter to Tesseract, Tesseract will not bother loading the system dictionary nor the dictionary of frequent words and will load and use the eng.user-words and eng.user-patterns files you provided. The former is a simple word list, one per line. The format of the latter is documented in dict/trie.h on read_pattern_list().

But having done this it's made no difference at all!

I'm creating the engine with:

using (engine = new TesseractEngine(@"C:\src\x\tessdata", "eng", EngineMode.Default, @"C:\src\x\tessdata\engine.config"))

Having made a (UTF-8, unix line endings) file engine.config:

load_system_dawg     F
load_freq_dawg       F
user_words_suffix    user-words
user_patterns_suffix user-patterns

And created eng.user-patterns and eng.user-words (UTF-8, Unix line endings) files alongside the eng.traineddata.

1 solution

#1

Did you figure this out?

It looks like there is a way to increase its preference for dictionary words:

https://github.com/tesseract-ocr/tesseract/wiki/FAQ#how-to-increase-the-trust-instrength-of-the-dictionary

How to increase the trust in/strength of the dictionary?

For tesseract-ocr < 3.01 try upping NON_WERD and GARBAGE_STRING in dict/permute.cpp to maybe 3 or even 5.

For tesseract-ocr >= 3.01 try increasing the variables language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word in a config file. By default they are 0.1 and 0.15 respectively.
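
As a rough sketch of that second suggestion, the two variables could go in a config file in the same "name value" form as the engine.config in the question. The values 0.5 and 0.75 are illustrative guesses and not recommended settings (only the defaults of 0.1 and 0.15 come from the FAQ), and the comment lines assume that Tesseract skips config lines beginning with #:

```text
# Assumed example values; the FAQ defaults are 0.1 and 0.15.
# Higher penalties should make non-dictionary words less likely to be chosen.
language_model_penalty_non_freq_dict_word 0.5
language_model_penalty_non_dict_word      0.75
```

These lines could presumably be added to the existing engine.config and tuned by trial and error against sample images; whether they help when the system dictionaries are turned off and only user-words is loaded is something I have not verified.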
