In Tesseract's Google Code documentation at https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3, there is an instruction that I have to get the Unicode for the generated characters in my box files. A box file looks like this:
s 734 494 751 519 0
p 753 486 776 518 0
r 779 494 796 518 0
i 799 494 810 527 0
n 814 494 837 518 0
g 839 485 862 518 0
t 865 492 878 521 0
u 101 453 122 484 0
b 126 453 146 486 0
e 149 452 168 477 0
r 172 453 187 476 0
d 211 451 232 484 0
e 236 451 255 475 0
n 259 452 281 475 0
My question is: where or how do I get this? I am developing an OCR engine for the Bangla language.
1 Answer
#1
The box file is a UTF-8 encoded text file. You can use a Unicode-compatible text editor, or a box file editor, to open and edit the characters with your favorite Bangla input method.
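If you would rather inspect or fix the file from a script than in an editor, something like the following Python sketch may help. It assumes each line has the layout shown in the question (symbol, left, bottom, right, top, page, separated by whitespace); the names `Box`, `parse_box_line`, `code_points`, `read_boxes` and `write_boxes` are made up for this illustration and are not part of Tesseract.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """One line of a box file (hypothetical helper type, not a Tesseract API)."""
    glyph: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the glyph is whatever precedes the five numbers.
    # A Bangla conjunct can be several code points long, so do not assume
    # that the glyph is a single character.
    glyph, *numbers = line.rstrip("\n").rsplit(None, 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(glyph, left, bottom, right, top, page)


def code_points(text: str) -> List[str]:
    # List the Unicode code points that make up a glyph, e.g. "U+0995".
    return [f"U+{ord(ch):04X}" for ch in text]


def read_boxes(path: str) -> List[Box]:
    # Box files are UTF-8 text; name the encoding explicitly instead of
    # relying on the platform default.
    with open(path, encoding="utf-8") as handle:
        return [parse_box_line(line) for line in handle if line.strip()]


def write_boxes(path: str, boxes: List[Box]) -> None:
    # Write the boxes back as UTF-8 with plain "\n" line endings.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for box in boxes:
            handle.write(" ".join(str(field) for field in box) + "\n")


if __name__ == "__main__":
    box = parse_box_line("s 734 494 751 519 0")
    # Replace the wrongly recognised Latin letter with the Bangla letter KA.
    fixed = box._replace(glyph="\u0995")
    print(fixed.glyph, code_points(fixed.glyph))
```

The coordinates stay untouched; only the first field of each line is replaced with the correct Bangla text, which is the same thing a box file editor does for you.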