Where can I find samples of hOCR files?

Date: 2020-12-10 08:54:54

Where can I find samples or examples of files in the hOCR format? (The format in which OCR-extracted text is stored together with its page coordinates.)

I've been looking on Google, but can't find any samples.

Thanks!

2 Answers

#1


2  

You can use Tesseract's "hocr" command-line option to write the results in hOCR format:

tesseract yourimage.tif out hocr
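
If you want to generate samples from a script, the same command can be wrapped roughly as below. This is only a sketch: `build_hocr_command` and `run_hocr` are names made up for illustration, and it assumes the `tesseract` executable is on your PATH. As far as I know, the output lands in `out.hocr` or `out.html` depending on the Tesseract version, so check which one appears.

```python
import shutil
import subprocess


def build_hocr_command(image_path, output_base):
    """Return the argument list for: tesseract <image> <output_base> hocr."""
    return ["tesseract", image_path, output_base, "hocr"]


def run_hocr(image_path, output_base):
    """Run Tesseract with the hocr option; raises if it is missing or fails."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_hocr_command(image_path, output_base), check=True)


if __name__ == "__main__":
    # Hypothetical file names, matching the command shown above.
    run_hocr("yourimage.tif", "out")
```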

#2


1  

Here is a fragment of an hOCR file, with a few newlines added for readability. Unfortunately, I don't remember which tool generated it (possibly ocropus). Note that this fragment has a bounding box for each letter, whereas I think tesseract 3.01, and maybe other tools, define the bounding box for each word in their hOCR output.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"   "http://www.w3.org/TR/html4/loose.dtd">
<html>
  <head>
    <title>
    </title>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
  </head>
  <body>
    <p>
      <span title="bbox 171 287 184 302">B</span><span title="bbox 186 292 195 302">a</span><span title="bbox 196 292 205 302">n</span><span title="bbox 209 287 217 302">k</span> <span title="bbox 226 287 239 302">A</span><span title="bbox 242 292 250 303">c</span><span title="bbox 252 292 260 303">c</span><span title="bbox 262 292 272 303">o</span><span title="bbox 274 293 283 303">u</span><span title="bbox 285 293 294 303">n</span><span title="bbox 297 291 302 303">t</span> <span title="bbox 309 288 323 303">N</span><span title="bbox 326 293 335 303">u</span><span title="bbox 337 293 353 303">m</span><span title="bbox 354 288 364 303">b</span><span title="bbox 366 293 375 303">e</span><span title="bbox 377 293 380 303">r</span> 
    </p>
    <p>
      <span title="bbox 170 340 183 355">B</span><span title="bbox 186 345 195 355">a</span><span title="bbox 196 345 205 355">n</span><span title="bbox 208 340 217 355">k</span> <span title="bbox 225 341 239 355">A</span><span title="bbox 242 340 252 356">d</span><span title="bbox 253 340 263 356">d</span><span title="bbox 264 345 271 355">r</span><span title="bbox 272 345 280 356">e</span><span title="bbox 282 345 289 356">s</span><span title="bbox 291 345 298 356">s</span> 
    </p>
  </body>
</html>
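
To show how the coordinates in a fragment like this can be read back, here is a minimal sketch using only Python's standard library. `BBoxParser` and `parse_bboxes` are hypothetical names, and the code assumes flat (non-nested) `<span>` elements whose `title` starts with `bbox x0 y0 x1 y1`, as in the fragment above. Word-level hOCR with nested spans would need a bit more bookkeeping.

```python
from html.parser import HTMLParser


class BBoxParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for each <span title="bbox ...">."""

    def __init__(self):
        super().__init__()
        self.boxes = []
        self._current = None  # bbox of the span currently being read, if any

    def handle_starttag(self, tag, attrs):
        title = dict(attrs).get("title") or ""
        if tag == "span" and title.startswith("bbox "):
            # Title properties may be separated by ';'; keep the bbox part only.
            coords = title.split(";")[0].split()[1:5]
            self._current = tuple(int(c) for c in coords)

    def handle_data(self, data):
        # Only text inside a bbox span is recorded; whitespace is skipped.
        if self._current is not None and data.strip():
            self.boxes.append((data.strip(), self._current))

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None


def parse_bboxes(hocr_text):
    """Return a list of (text, bbox) pairs found in the given hOCR markup."""
    parser = BBoxParser()
    parser.feed(hocr_text)
    return parser.boxes


sample = (
    '<p><span title="bbox 171 287 184 302">B</span>'
    '<span title="bbox 186 292 195 302">a</span></p>'
)
print(parse_bboxes(sample))
# → [('B', (171, 287, 184, 302)), ('a', (186, 292, 195, 302))]
```

Each tuple is left, top, right, bottom in pixels, so the width of the "B" box above is 184 - 171 = 13.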
