Extracting paragraph breaks from OCR text?

Time: 2021-07-30 08:58:05

I'm trying to recreate the paragraphs and indentations from the output of OCR'd image text, like so:

Input (imagine that this is an image, not typed):

[image not preserved]

Output (with a few mistakes):

[OCR output not preserved]

As you can see, no paragraph breaks or indentations are preserved.

Using Python, I tried an approach like this, but it doesn't work (fails too often):

Code:

def smart_format(text):
  # Strip each OCR line once instead of on every check.
  lines = [line.strip() for line in text.split('\n')]
  average_len = sum(len(line) for line in lines) / len(lines)

  temp = ''
  for line in lines:
    # A line that ends a sentence and is clearly shorter than average is
    # treated as the last line of a paragraph, so the marker follows it.
    if line.endswith(('.', '!', '?')) and average_len - len(line) > 7:
      temp += line + '{{ paragraph }}'
    else:
      temp += line + '\n'

  # Rejoin words hyphenated across lines, unwrap the remaining line breaks,
  # then turn each marker into a blank line plus an indent.
  return temp.replace(' -\n', '').replace('-\n', '').replace('\n', ' ').replace('{{ paragraph }}', '\n\n      ')

Does anyone have any suggestions as to how I could recreate this layout? I'm working with old books and was hoping to re-typeset them with LaTeX, since it's quite simple to write a Python script that does that.

Thanks!

2 answers

#1

3

You can break up the image into multiple paragraphs by looking at the entropy of each 5-10 pixel horizontal slice.

This means you divide the image into a bunch of horizontal strips, each 5-10 pixels tall. If a strip is not "busy" (its entropy is low), you can assume that there is no text there. Runs of such empty strips let you isolate the paragraphs. You then take each paragraph individually and feed it into your OCR.
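A minimal sketch of the strip idea, not a tested pipeline. It assumes the page has already been loaded as a grayscale image and converted to a list of rows of 0-255 integers (how you load it is up to your imaging library). The names `strip_entropy` and `find_text_blocks`, the 16 grey levels, the 0.1 entropy threshold, and the two-strip minimum gap are all illustrative assumptions that would need tuning on real scans.

```python
import math
from collections import Counter


def strip_entropy(rows, levels=16):
    """Shannon entropy (in bits) of the grey levels in one horizontal strip."""
    # Quantise 0-255 values into a few levels so that faint scanner noise
    # on blank paper is less likely to register as "busy".
    counts = Counter(pixel * levels // 256 for row in rows for pixel in row)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def find_text_blocks(gray, strip_height=8, threshold=0.1, min_gap_strips=2):
    """Return (top, bottom) pixel rows of blocks separated by quiet strips.

    `gray` is a list of pixel rows; `bottom` is exclusive. Rows left over
    after the last full strip are ignored.
    """
    n_strips = len(gray) // strip_height
    busy = [
        strip_entropy(gray[i * strip_height:(i + 1) * strip_height]) > threshold
        for i in range(n_strips)
    ]

    blocks = []
    start = None      # index of the first busy strip in the current block
    quiet_run = 0     # consecutive quiet strips seen inside the block
    for i, is_busy in enumerate(busy):
        if is_busy:
            if start is None:
                start = i
            quiet_run = 0
        elif start is not None:
            quiet_run += 1
            if quiet_run >= min_gap_strips:
                first_quiet = i - quiet_run + 1
                blocks.append((start * strip_height, first_quiet * strip_height))
                start = None
                quiet_run = 0

    # Close a block that runs to the bottom of the page.
    if start is not None:
        blocks.append((start * strip_height, (n_strips - quiet_run) * strip_height))
    return blocks
```

Each returned `(top, bottom)` pair can be used to crop the page before passing the crop to the OCR engine. Note that `min_gap_strips` is what separates a paragraph gap from ordinary line spacing, so it only helps when paragraphs are set apart by extra vertical space; pages that mark paragraphs by indentation alone would need a different cue.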

#2

0

You could try to tell if the first word on a line could have easily fit on the previous line, indicating an intentional newline, instead of purely looking for short lines. Apart from that (and paying close attention to punctuation like you're doing in your example), I'd think the only other option is going back to the original images.
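One way this heuristic might look in Python, as a rough sketch: the function name `split_paragraphs` is made up for illustration, the column width is assumed to be the length of the longest OCR line, and character counts stand in for real glyph widths, which is only an approximation for proportionally spaced type.

```python
def split_paragraphs(text):
    """Group OCR lines into paragraphs using the 'would it have fit?' test."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []

    # Assumption: the longest line approximates the full column width.
    width = max(len(line) for line in lines)

    paragraphs = []
    current = lines[0]
    for prev, line in zip(lines, lines[1:]):
        first_word = line.split()[0]
        if prev.endswith('-'):
            # Word hyphenated across the line break: rejoin the two halves.
            current = current[:-1] + line
        elif len(prev) + 1 + len(first_word) <= width:
            # The next word would have fit on the previous line, so the
            # line break was probably intentional: start a new paragraph.
            paragraphs.append(current)
            current = line
        else:
            current += ' ' + line
    paragraphs.append(current)
    return paragraphs
```

The returned list can then be joined with blank lines and whatever indent the target format needs. Treating every trailing hyphen as a split word will also swallow genuine hyphens and dashes at line ends, so the punctuation checks from the question are still worth combining with this test.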

Tags: python tesseract latex OCR
