Using Tesseract on Heroku with Django

Time: 2023-01-23 08:59:29

I would like to add OCR capabilities to my Django app running on Heroku. I suspect the easiest way is to use Tesseract. I've noticed that there are a number of Python wrappers for Tesseract's API, but what is the best way to get Tesseract installed and running on Heroku? Perhaps via a custom buildpack such as heroku-buildpack-tesseract?

1 solution

#1

Here are some notes on the solution I arrived at.

My .buildpacks file:

https://github.com/heroku/heroku-buildpack-python
https://github.com/clearideas/heroku-buildpack-ghostscript
https://github.com/marcolinux/heroku-buildpack-libraries

My .buildpacks_bin_download file:

tesseract-ocr https://s3.amazonaws.com/tesseract-ocr/heroku/tesseract-ocr-3.02.02.tar.gz 3.02 eng,spa
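After a deploy it can help to confirm that the tools are actually reachable from the dyno, for example from a `heroku run python` session. The following is only a minimal sketch for Python 3: it assumes the buildpacks expose command-line executables named `tesseract` and `gs` on PATH, and the helper name `missing_tools` is made up for illustration. If a buildpack installs only shared libraries, this check will report the tools as missing even though library bindings may still work.

```python
import shutil


def missing_tools(tools=("tesseract", "gs")):
    """Return the names in `tools` that cannot be found on PATH."""
    # The default executable names are assumptions, not taken from the buildpacks.
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all tools found")
```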

Here is the key piece of Python that OCRs PDF files:

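        # Additional processing: OCR PDF attachments and store the text in document.notes.
        # Assumed imports (not shown in the original): `from unipath import Path`, whose
        # API matches the calls below, and `import ghostscript`. ReadBot and `name`
        # (the attachment's file name) come from code that is not shown.
        document_path = Path(str(document.attachment_file))

        if document_path.ext == '.pdf':
            working_path = Path('temp', document.directory)
            working_path.mkdir(parents=True)

            # Copy the upload to disk so Ghostscript can read it; PDFs are binary.
            input_path = Path(working_path, name)
            input_path.write_file(document.attachment_file.read(), 'wb')

            rb = ReadBot()

            # Render each page to a 600 dpi grayscale PNG.
            args = [
                'VBEZ',  # placeholder for argv[0]; Ghostscript ignores it
                # '-sDEVICE=tiffg4',
                '-sDEVICE=pnggray',
                '-dNOPAUSE',
                '-r600x600',
                # Zero-padded page numbers keep the files in page order when sorted.
                '-sOutputFile=' + str(working_path) + '/page-%03d.png',
                str(input_path)
            ]

            ghostscript.Ghostscript(*args)
            image_paths = sorted(working_path.listdir(pattern='*.png'))
            txt = ''

            # OCR each page image in order and concatenate the results.
            for image_path in image_paths:
                ocrtext = rb.interpret(str(image_path))
                txt = txt + ocrtext

            document.notes = txt
            document.save()
            working_path.rmtree()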
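For readers without the same helper libraries, the same render-then-OCR flow can be sketched with the standard library and the two command-line tools. This is not the code from the answer: it assumes Python 3 and that `gs` and `tesseract` executables are on PATH, and the function names (`gs_command`, `page_number`, `ocr_pdf`) and the 300 dpi default are chosen here for illustration.

```python
import re
import subprocess
import tempfile
from pathlib import Path


def gs_command(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a grayscale PNG."""
    return [
        "gs",
        "-dNOPAUSE",
        "-dBATCH",  # exit after the last page instead of waiting for input
        "-sDEVICE=pnggray",
        "-r{0}".format(dpi),
        "-sOutputFile={0}".format(Path(out_dir) / "page-%03d.png"),
        str(pdf_path),
    ]


def page_number(image_path):
    """Extract the page number from a name such as page-012.png (0 if absent)."""
    match = re.search(r"(\d+)\.png$", str(image_path))
    return int(match.group(1)) if match else 0


def ocr_pdf(pdf_path, lang="eng"):
    """Render a PDF with Ghostscript, OCR each page with tesseract, return the text."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(gs_command(pdf_path, tmp), check=True)
        pages = sorted(Path(tmp).glob("page-*.png"), key=page_number)
        chunks = []
        for image in pages:
            out_base = image.with_suffix("")  # tesseract appends .txt itself
            subprocess.run(
                ["tesseract", str(image), str(out_base), "-l", lang],
                check=True,
            )
            chunks.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
        return "\n".join(chunks)
```

Calling `ocr_pdf("scan.pdf")` would return the concatenated page text, which could then be stored on the model as in the snippet above. Sorting by the numeric page number keeps the pages in order even if the file names are not zero-padded.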
