TL;DR version:
Does anyone have a working 'bazaar' config for Tesseract, used through the .NET wrapper, that I could see?
I'm pretty sure that's what I want (only recognise words from a list), but it doesn't seem to do anything.
I have a pretty short list of possible strings I'm trying to find (1-4 words). The documentation for Tesseract states:
If you want to replace the whole dictionary, you will need to unpack the .traineddata file, create a new word-dawg file, and then pack the files back into a .traineddata file. See TrainingTesseract for more details.
That sounds like what I want! So I look at TrainingTesseract and see:
The traineddata file is simply a concatenation of the input files, with a table of contents that contains the offsets of the known file types. See ccutil/tessdatamanager.h in the source code for a list of the currently accepted filenames.
Great. So how do I go about unpacking this simple concatenation of input files, modifying the content and header and re-packing it, then? :)
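For what it's worth, the "concatenation with a table of contents" description can be sketched in a few lines of Python. This is only an illustration under an assumed layout that is not stated in the quoted documentation: a little-endian 32-bit entry count, followed by that many 64-bit offsets, with -1 marking an absent entry. The function names are hypothetical, and the demo packs a synthetic blob rather than a real eng.traineddata. Tesseract ships its own combine_tessdata tool, which is probably the safer route for real files.

```python
import struct


def read_toc(data):
    """Map slot index -> (start, end) byte range for each entry that is present.

    Assumed layout (not confirmed by the quoted docs): int32 count, then
    `count` int64 offsets, where a negative offset means "no such entry".
    """
    (count,) = struct.unpack_from("<i", data, 0)
    offsets = struct.unpack_from("<%dq" % count, data, 4)
    present = sorted((off, slot) for slot, off in enumerate(offsets) if off >= 0)
    toc = {}
    for n, (start, slot) in enumerate(present):
        # An entry runs until the next present entry, or to the end of the blob.
        end = present[n + 1][0] if n + 1 < len(present) else len(data)
        toc[slot] = (start, end)
    return toc


def split_entries(data):
    """Return slot index -> raw bytes of that entry."""
    return {slot: data[s:e] for slot, (s, e) in read_toc(data).items()}


def build_demo_blob(parts, slots):
    """Pack `parts` (slot -> bytes) into a synthetic blob with the same assumed layout."""
    header_size = 4 + 8 * slots
    offsets = [-1] * slots
    body = b""
    for slot, payload in sorted(parts.items()):
        offsets[slot] = header_size + len(body)
        body += payload
    return struct.pack("<i%dq" % slots, slots, *offsets) + body


if __name__ == "__main__":
    blob = build_demo_blob({0: b"config", 2: b"unicharset-bytes"}, slots=3)
    for slot, payload in split_entries(blob).items():
        print(slot, len(payload))
```

Replacing the word list would then mean swapping one entry's bytes and rewriting every later offset, which is why re-packing by hand is more fiddly than the documentation makes it sound.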
This post appears to be the same question - which involves simply turning off the default dictionary and using user-words instead:
let’s suppose you want to OCR in English, but suppress the normal dictionary and load an alternative word list and an alternative list of patterns — these two files are the most commonly used extra data files.
If your language pack is in /path/to/eng.traineddata and the hocr config is in /path/to/configs/hocr then create three new files:
/path/to/eng.user-words: -snip
/path/to/eng.user-patterns: -snip
/path/to/configs/bazaar: -snip
Now, if you pass the word bazaar as a trailing command line parameter to Tesseract, Tesseract will not bother loading the system dictionary nor the dictionary of frequent words and will load and use the eng.user-words and eng.user-patterns files you provided. The former is a simple word list, one per line. The format of the latter is documented in dict/trie.h on read_pattern_list().
But having done this it's made no difference at all!
I'm creating the engine with:
using (engine = new TesseractEngine(@"C:\src\x\tessdata", "eng", EngineMode.Default, @"C:\src\x\tessdata\engine.config"))
I made a file engine.config (UTF-8, Unix line endings) containing:
load_system_dawg F
load_freq_dawg F
user_words_suffix user-words
user_patterns_suffix user-patterns
I also created eng.user-patterns and eng.user-words files (UTF-8, Unix line endings) alongside eng.traineddata.
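To rule out encoding or line-ending mistakes in those three files, they can be generated rather than hand-edited. The following is a minimal sketch, not a fix for the underlying problem: the helper names are made up for illustration, the config lines are the four shown above, and the word list is whatever short list of strings you want recognised.

```python
from pathlib import Path

# The four settings from engine.config above, one "name value" pair per line.
CONFIG_LINES = [
    "load_system_dawg F",
    "load_freq_dawg F",
    "user_words_suffix user-words",
    "user_patterns_suffix user-patterns",
]


def write_lines_lf(path, lines):
    """Write lines as UTF-8 without a BOM, each terminated by a bare LF."""
    text = "".join(line + "\n" for line in lines)
    Path(path).write_bytes(text.encode("utf-8"))


def write_bazaar_files(tessdata_dir, lang, words, patterns=()):
    """Create engine.config, <lang>.user-words and <lang>.user-patterns in tessdata_dir."""
    tessdata = Path(tessdata_dir)
    write_lines_lf(tessdata / "engine.config", CONFIG_LINES)
    write_lines_lf(tessdata / (lang + ".user-words"), words)
    write_lines_lf(tessdata / (lang + ".user-patterns"), patterns)
```

Writing bytes directly, rather than opening the file in text mode, avoids the platform translating `\n` to `\r\n` on Windows.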
1 Answer
Did you figure this out?
It looks like there is a way to increase its preference for dictionary words:
How to increase the trust in/strength of the dictionary?
For tesseract-ocr < 3.01 try upping NON_WERD and GARBAGE_STRING in dict/permute.cpp to maybe 3 or even 5.
For tesseract-ocr >= 3.01 try increasing the variables language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word in a config file. By default they are 0.1 and 0.15 respectively.
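As a rough illustration for the >= 3.01 case, the two variables could be added to a config file in the same one-pair-per-line form as the question's engine.config. The values here are arbitrary examples above the quoted defaults of 0.1 and 0.15, not recommended settings, so expect to tune them against your own images.

```
language_model_penalty_non_freq_dict_word 0.3
language_model_penalty_non_dict_word 0.5
```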