How do I use the Keras OCR example?

Date: 2021-03-26 08:58:19

I found examples/image_ocr.py, which seems to be for OCR. Hence it should be possible to give the model an image and receive text. However, I have no idea how to do so. How do I feed the model a new image? Which kind of preprocessing is necessary?

What I did

Installing the dependencies:

  • Install cairocffi: sudo apt-get install python-cairocffi
  • Install editdistance: sudo -H pip install editdistance
  • Change train to return the model and save the trained model.
  • Run the script to train the model.

Now I have a model.h5. What's next?

See https://github.com/MartinThoma/algorithms/tree/master/ML/ocr/keras for my current code. I know how to load the model (see below) and this seems to work. The problem is that I don't know how to feed new scans of images with text to the model.

Related side questions

  • What is CTC? Connectionist Temporal Classification?
  • Are there algorithms which reliably detect the rotation of a document?
  • Are there algorithms which reliably detect lines / text blocks / tables / images (hence make a reasonable segmentation)? I guess edge detection with smoothing and line-wise histograms already works reasonably well for that?

What I tried

#!/usr/bin/env python

from keras import backend as K
import keras
from keras.models import load_model
import os

from image_ocr import ctc_lambda_func, create_model, TextImageGenerator
from keras.layers import Lambda
from keras.utils.data_utils import get_file
import scipy.ndimage
import numpy

img_h = 64
img_w = 512
pool_size = 2
words_per_epoch = 16000
val_split = 0.2
val_words = int(words_per_epoch * (val_split))
if K.image_data_format() == 'channels_first':
    input_shape = (1, img_w, img_h)
else:
    input_shape = (img_w, img_h, 1)

fdir = os.path.dirname(get_file('wordlists.tgz',
                                origin='http://www.mythic-ai.com/datasets/wordlists.tgz', untar=True))

img_gen = TextImageGenerator(monogram_file=os.path.join(fdir, 'wordlist_mono_clean.txt'),
                             bigram_file=os.path.join(fdir, 'wordlist_bi_clean.txt'),
                             minibatch_size=32,
                             img_w=img_w,
                             img_h=img_h,
                             downsample_factor=(pool_size ** 2),
                             val_split=words_per_epoch - val_words
                             )
print("Input shape: {}".format(input_shape))
model, _, _ = create_model(input_shape, img_gen, pool_size, img_w, img_h)

model.load_weights("my_model.h5")

x = scipy.ndimage.imread('example.png', mode='L').transpose()
x = x.reshape(x.shape + (1,))

# Does not work
print(model.predict(x))

This gives:

2017-07-05 22:07:58.695665: I tensorflow/core/common_runtime/gpu/gpu_device.cc:996] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX TITAN Black, pci bus id: 0000:01:00.0)
Traceback (most recent call last):
  File "eval_example.py", line 45, in <module>
    print(model.predict(x))
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 1567, in predict
    check_batch_axis=False)
  File "/usr/local/lib/python2.7/dist-packages/keras/engine/training.py", line 106, in _standardize_input_data
    'Found: array with shape ' + str(data.shape))
ValueError: The model expects 4 arrays, but only received one array. Found: array with shape (512, 64, 1)

3 Answers

#1


Here, you created a model that needs 4 inputs:

model = Model(inputs=[input_data, labels, input_length, label_length], outputs=loss_out)

Your predict attempt, on the other hand, is loading just an image; hence the message: The model expects 4 arrays, but only received one array.

From your code, the necessary inputs are:

input_data = Input(name='the_input', shape=input_shape, dtype='float32')
labels = Input(name='the_labels', shape=[img_gen.absolute_max_string_len],dtype='float32')
input_length = Input(name='input_length', shape=[1], dtype='int64')
label_length = Input(name='label_length', shape=[1], dtype='int64')

The original code and your training work because they're using the TextImageGenerator. This generator takes care of giving you the four necessary inputs for the model.

So, what you have to do is predict using the generator. Just as you have the fit_generator() method for training with the generator, you also have the predict_generator() method for predicting with it.
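
A minimal sketch of that (assuming the img_gen from the question; its next_val() yields the same four-input batches used in training, and predict_generator ignores the target part of each batch):

out = model.predict_generator(img_gen.next_val(), steps=1)
# note: on this training model the output is the CTC loss, not the decoded text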


Now, for a complete answer and solution, I'd have to study your generator and see how it works (which would take me some time). But now that you know what has to be done, you can probably figure it out.

You can either use the generator as it is and predict a probably huge amount of data, or you can try to replicate a generator that yields just one or a few images with the necessary labels, input length and label length.

Or maybe, if possible, just create the 3 remaining arrays manually, making sure they have the same shapes as the generator outputs (except for the first dimension, which is the batch size).

The one thing you must assert, though, is this: have 4 arrays with the same shapes as the generator outputs, except for the first dimension.
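
A minimal sketch of that manual option (assuming model, x, img_gen, img_w and pool_size from the question; the dummy values are placeholders, since the labels only matter for the loss):

import numpy as np

batch_size = 1
x_batch = x.reshape((batch_size,) + x.shape)  # add the batch axis the error message hints at
dummy_labels = np.zeros((batch_size, img_gen.absolute_max_string_len))
dummy_input_length = np.array([[img_w // (pool_size ** 2)]])  # downsampled width (assumption)
dummy_label_length = np.array([[1]])

out = model.predict([x_batch, dummy_labels, dummy_input_length, dummy_label_length])
# note: this model outputs the CTC loss, not the decoded text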

#2


Now I have a model.h5. What's next?

First I should point out that model.h5 contains the weights of your network; if you wish to save the architecture of your network as well, you should save it as JSON, like in this example:

model_json = model.to_json()
with open("model_arch.json", "w") as json_file:
    json_file.write(model_json)

Now, once you have your model and its weights you can load them on demand by doing the following:

from keras.models import model_from_json
from keras.optimizers import SGD

json_file = open('model_arch.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into the new model
# (if you already have a loaded model and don't need to save, start from here)
loaded_model.load_weights("model.h5")
# compile the loaded model with certain specifications
sgd = SGD(lr=0.01)
loaded_model.compile(loss="binary_crossentropy", optimizer=sgd, metrics=["accuracy"])

Then, with that loaded_model you can proceed to predict the classification of a certain input like this:

prediction = loaded_model.predict(some_input, batch_size=20, verbose=0)

This will return the classification of that input.

About the Side Questions:

  1. CTC seems to be a term they define in the paper you referred to. Quoting from it:

In what follows, we refer to the task of labelling unsegmented data sequences as temporal classification (Kadous, 2002), and to our use of RNNs for this purpose as connectionist temporal classification (CTC).

  2. To compensate for the rotation of a document, images, or similar, you could either generate more data from your current set by applying such transformations (take a look at this blog post that explains a way to do that; see also the sketch after this list), or you could use a Convolutional Neural Network approach, which is actually what the Keras example you are using does, as we can see from that git:

This example uses a convolutional stack followed by a recurrent stack and a CTC logloss function to perform optical character recognition of generated text images.

You can check this tutorial, which is related to what you are doing and also explains more about Convolutional Neural Networks.

  3. Well, this one is a broad question, but to detect lines you could use the Hough Line Transform; Canny Edge Detection could also be a good option (see the sketch below).
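
For item 2, a minimal augmentation sketch with Keras' ImageDataGenerator (the rotation range is illustrative):

from keras.preprocessing.image import ImageDataGenerator

# generate rotated variants of the training images on the fly
datagen = ImageDataGenerator(rotation_range=15)

For item 3, a minimal sketch with OpenCV (the file name and thresholds are illustrative, not from the original answer):

import cv2
import numpy as np

img = cv2.imread('document.png', cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(img, 50, 150)  # Canny edge detection
# probabilistic Hough transform: returns line segments as (x1, y1, x2, y2)
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                        minLineLength=50, maxLineGap=10)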

Edit: The error you are getting is because predict expects more inputs than the single array you provided; from the Keras docs we can see:

predict(self, x, batch_size=32, verbose=0)

Raises ValueError: In case of mismatch between the provided input data and the model's expectations, or in case a stateful model receives a number of samples that is not a multiple of the batch size.

#3


Well, I will try to answer everything you asked here:

As commented in the OCR code, Keras doesn't support losses with multiple parameters, so it calculates the NN loss in a Lambda layer. What does this mean in this case?

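For reference, a minimal sketch of the backend call that Lambda layer wraps (the actual ctc_lambda_func in image_ocr.py also trims the first couple of RNN outputs before the loss):

from keras import backend as K

def ctc_lambda_func(args):
    y_pred, labels, input_length, label_length = args
    return K.ctc_batch_cost(labels, y_pred, input_length, label_length)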

The neural network may look confusing because it uses 4 inputs ([input_data, labels, input_length, label_length]) and loss_out as output. Besides input_data, everything else is information used only to calculate the loss, which means it is only needed for training. What we actually want is something like line 468 of the original code:

Model(inputs=input_data, outputs=y_pred).summary()

which means "I have an image as input, please tell me what is written here". So how do we achieve that?

1) Keep the original training code as it is and do the training normally;

2) After training, save this model Model(inputs=input_data, outputs=y_pred) in a .h5 file to be loaded wherever you want;

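A minimal sketch of step 2 (assuming input_data and y_pred are still in scope from the training code; the file name is illustrative):

from keras.models import Model

# prediction-only model: image in, per-timestep character probabilities out
predict_model = Model(inputs=input_data, outputs=y_pred)
predict_model.save('predict_model.h5')  # saves architecture and weights together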

3) Do the prediction: if you take a look at the code, the input image is inverted and transposed, so you can use this code to make it easy:

import numpy as np
from scipy.misc import imread, imresize
# use the width and height from your neural network here

def load_for_nn(img_file):
    image = imread(img_file, flatten=True)
    image = imresize(image, (height, width))
    image = image.T

    # change 1 to however many images you want to predict at once;
    # here I just want to predict one
    images = np.ones((1, width, height))
    images[0] = image
    images = images[:, :, :, np.newaxis]
    images /= 255

    return images

With the image loaded, let's do the prediction:

def predict_image(image_path):  # insert the path of your image
    image = load_for_nn(image_path)  # load using the snippet above
    raw_word = model.predict(image)  # run the neural network
    # the network outputs only numbers; use decode_output from image_ocr.py
    # to get the desired string
    final_word = decode_output(raw_word)[0]
    return final_word
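
For example (assuming model here is the prediction-only model saved in step 2):

print(predict_image('example.png'))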

This should be enough. From my experience, the images used in the training are not good enough to make good predictions; I will release code using other datasets that improved my results later, if necessary.

Answering related questions:

  • What is CTC? Connectionist Temporal Classification?

It is a technique used to improve sequence classification. The original paper shows that it improves results on recognizing what is said in audio; in this case it is a sequence of characters. The explanation is a bit tricky, but you can find a good one here.

  • Are there algorithms which reliably detect the rotation of a document?

I am not sure, but you could take a look at the attention mechanism in neural networks. I don't have any good link right now, but I know it could be the case.

  • Are there algorithms which reliably detect lines / text blocks / tables / images (hence make a reasonable segmentation)? I guess edge detection with smoothing and line-wise histograms already works reasonably well for that?

OpenCV implements Maximally Stable Extremal Regions (known as MSER). I really like the results of this algorithm; it is fast and was good enough for me when I needed it.

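A minimal sketch of MSER with OpenCV's Python bindings (the file name is illustrative):

import cv2

img = cv2.imread('document.png', cv2.IMREAD_GRAYSCALE)
mser = cv2.MSER_create()
# detectRegions returns the point sets of the regions and their bounding boxes
regions, bboxes = mser.detectRegions(img)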

As I said before, I will release the code soon. When I do, I will edit the question to point to the repository, but I believe the information here is enough to get the example running.
