How to preserve document structure in tesseract?

Time: 2022-09-24 08:54:27

I am using tesseract ocr to extract text from an image. Preserving the structure of the document is very important to me. Currently tesseract does not preserve the structure; in fact, it changes the order of the text. My input is the image below.

[Image: input document]

The output I am getting is as follows:

Someto the left

Some in the middle

Some with some tab

Some with some space between them

Sometext here

this much

How do I get the desired output, with the same structure as in the image?

i.e. as follows:

                                                 Some text here

Some to the left

                    Some in the middle

        Some with some tab

Some with some space between them                       this much

3 solutions

#1


11  

Newer versions of tesseract (3.04 and later) have an option called preserve_interword_spaces, which should do what you want.

Note that the number of spaces tesseract detects between words may not always be the same between similar lines. So words that are left-aligned with a run of spaces preceding them (as in your example) may not be output this way. The preserve_interword_spaces option does not attempt to do anything fancy; it merely preserves the spaces that extraction found. By default tesseract collapses runs of spaces into one.

Details on this option are here.
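
As a rough illustration of the option above, here is a minimal Python sketch that builds a tesseract command line with preserve_interword_spaces switched on through the CLI's -c VAR=VALUE flag. The function names (build_tesseract_command, run_ocr) and the file names are made up for this example, and it assumes the tesseract executable is on your PATH.

```python
import shutil
import subprocess


def build_tesseract_command(image_path, output_base, preserve_spaces=True):
    """Build the argument list for a tesseract CLI call (hypothetical helper)."""
    command = ["tesseract", image_path, output_base]
    if preserve_spaces:
        # -c sets a single config variable for this run.
        command += ["-c", "preserve_interword_spaces=1"]
    return command


def run_ocr(image_path, output_base):
    """Run tesseract if it is installed; the text is written to <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    return output_base + ".txt"
```

Calling run_ocr("page.png", "out") would then leave the text in out.txt. As the answer says, the spacing you get back is whatever tesseract detected, so alignment across lines is not guaranteed.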

#2


4  

The only reliable way would be to enable hOCR output and parse it. It contains the position of each word on the page in pixels, as in the original image.

You can do it by specifying tessedit_create_hocr 1 in Tesseract's config file, or in whatever API you use.
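
A small sketch of the config-file route, assuming the tesseract CLI accepts a path to a config file as a trailing argument (the man page describes this, but check your version). The helper names and the config file name are hypothetical; only the line tessedit_create_hocr 1 comes from the answer.

```python
from pathlib import Path


def write_hocr_config(config_path):
    """Write a one-line tesseract config file that turns on hOCR output."""
    Path(config_path).write_text("tessedit_create_hocr 1\n", encoding="utf-8")
    return str(config_path)


def build_hocr_command(image_path, output_base, config_path):
    """Build the argument list; config files go after the output base."""
    return ["tesseract", image_path, output_base, config_path]
```

The resulting list can be passed to subprocess.run. Setting the same variable with -c tessedit_create_hocr=1 on the command line should have the same effect without a separate file.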

hOCR is a subset of HTML, and what Tesseract generates isn't always valid XML, so you can either use an HTML parser or write your own, but you can't reliably use an XML parser.
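
To make the parsing step concrete, here is a sketch that reads word boxes out of hOCR with Python's standard html.parser and lays the words out on a character grid. It assumes words are marked with class ocrx_word and carry a title attribute of the form "bbox x0 y0 x1 y1; ...", which is what Tesseract's hOCR output normally looks like. HocrWordParser, render_layout, char_width and line_height are names and parameters invented for this example, and the grid sizes would need tuning for a real scan.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for every element with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word currently being read, if any
        self._depth = 0    # open tags inside the current word element
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # nested tag such as <strong> inside a word
            return
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._depth = 1
                self._text = []

    def handle_endtag(self, tag):
        if self._bbox is None:
            return
        self._depth -= 1
        if self._depth == 0:  # the word element itself just closed
            text = "".join(self._text).strip()
            if text:
                self.words.append((text,) + self._bbox)
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)


def render_layout(words, char_width, line_height):
    """Place words on a character grid using their pixel positions."""
    rows = {}
    for text, x0, y0, _x1, _y1 in words:
        rows.setdefault(round(y0 / line_height), []).append((x0, text))
    lines = []
    for row in sorted(rows):
        line = ""
        for x0, text in sorted(rows[row]):
            column = round(x0 / char_width)
            # Pad up to the target column, keeping at least one space between words.
            pad = max(column - len(line), 1 if line else 0)
            line += " " * pad + text
        lines.append(line)
    return "\n".join(lines)
```

Usage would be along the lines of parser = HocrWordParser(); parser.feed(hocr_text); print(render_layout(parser.words, 10, 20)). Grouping rows by the top edge of each word is a simplification; words on the same visual line with slightly different y values may need a tolerance or the enclosing ocr_line box instead.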

#3


3  

Tesseract's code compresses spaces in the output. You will need to change the code to preserve them. See the post "Tesseract - ambiguity in space and tab".
