Deep learning solution for digit recognition in natural scenes

I am working on a problem where I want to automatically read the number in images such as the following:

[Two example images]

As can be seen, the images are quite challenging! Not only are the digits not always made of connected lines, but the contrast also varies a lot. My first attempt was to use pytesseract after some preprocessing. I also created a StackOverflow post here.

While this approach works fine on an individual image, it does not generalize, because the preprocessing requires too much manual tuning per image. The best solution I have so far is to iterate over hyperparameters such as the threshold value, the kernel size for erosion/dilation, etc. However, this is computationally expensive!
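A minimal sketch of such a sweep, to make the cost concrete. The helper names `preprocess_and_ocr` and `sweep`, the parameter ranges, and the "keep digit-only reads" rule are assumptions for illustration, not the code from the linked post. OpenCV, NumPy and pytesseract are imported lazily so the sweep logic itself has no hard dependency on them.

```python
from itertools import product


def preprocess_and_ocr(gray, thresh, kernel_size):
    """Binarize, erode then dilate, and OCR a grayscale uint8 image.

    Assumes opencv-python, numpy and pytesseract are installed.
    """
    import cv2
    import numpy as np
    import pytesseract

    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    cleaned = cv2.dilate(cv2.erode(binary, kernel), kernel)
    # --psm 7 treats the image as a single text line; restrict to digits.
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    return pytesseract.image_to_string(cleaned, config=config).strip()


def sweep(image, ocr, thresholds, kernel_sizes):
    """Call `ocr` once per parameter combination and tally digit-only reads.

    Returns (candidates, n_calls), where candidates maps each digit string
    to the number of parameter settings that produced it.
    """
    candidates = {}
    n_calls = 0
    for thresh, ksize in product(thresholds, kernel_sizes):
        n_calls += 1
        text = ocr(image, thresh, ksize)
        if text.isdigit():
            candidates[text] = candidates.get(text, 0) + 1
    return candidates, n_calls
```

For example, `sweep(gray, preprocess_and_ocr, range(10, 250, 10), (1, 2, 3, 5))` already makes 24 × 4 = 96 OCR calls for a single image, which is where the cost comes from.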

Therefore, I have come to believe that the solution I am looking for must be deep-learning-based. I have two ideas here:

  • Using a pre-trained network on a similar task
  • Splitting the input images into separate digits and training / fine-tuning a network myself in an MNIST fashion

Regarding the first approach, I have not found anything good yet. Does anyone have an idea for that?

Regarding the second approach, I would first need a method to automatically generate images of the separate digits. I guess this should also be deep-learning-based. Afterward, I could maybe achieve good results with some data augmentation.

Does anyone have ideas? 🙂


Regarding your first approach:

There are two synthetically generated datasets available:

  1. Text Recognition Data, which consists of 9M images.
  2. SynthText in the Wild, which consists of 8M images.

I have used the above datasets for text recognition on slab images. The images were quite challenging, but I now achieve more than 90% accuracy on them. I used the following models to solve this task:

  1. CRAFT for text localization.
  2. Deep Text Recognition for text recognition.

If you are working only with images of this kind ([image]), I highly encourage you to try Deep Text Recognition. It is a four-stage framework:
[Figure: the four stages of the framework]

  1. For the Transformation stage, you can choose TPS or None. TPS has shown higher performance. The authors implemented it with Spatial Transformer Networks.

  2. For the Feature Extraction stage, the options are ResNet or VGG.

  3. For the Sequence Modeling stage, BiLSTM.

  4. For the Prediction stage, Attn or CTC.
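To illustrate what the CTC option in the prediction stage does at inference time, here is a minimal sketch of greedy CTC decoding in plain Python. It is not the repository's implementation; the function name `ctc_greedy_decode`, the blank at index 0, and the digit-only character set are assumptions for illustration.

```python
BLANK = 0  # assumption: class index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_scores, charset="0123456789"):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks.

    frame_scores holds one list of scores per time step, each of length
    len(charset) + 1. Index 0 is the blank; index i + 1 maps to charset[i].
    """
    # Best class per time step.
    best = [max(range(len(s)), key=s.__getitem__) for s in frame_scores]
    out = []
    prev = None
    for idx in best:
        # Emit a character only when the class changes and is not blank.
        if idx != prev and idx != BLANK:
            out.append(charset[idx - 1])
        prev = idx
    return "".join(out)
```

The blank is what lets the model output the same digit twice in a row: two identical predictions separated by a blank are kept as two characters, while adjacent identical predictions are merged into one.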

They achieved the best accuracy with the TPS-ResNet-BiLSTM-Attn version. You can easily fine-tune this network, and I hope it can solve your task. The model was trained on the datasets mentioned above.
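As a rough sketch of what fine-tuning could look like, the snippet below assembles a command line for the `train.py` script of the Deep Text Recognition repository. The flag names are written from memory of that repository and should be checked against its README. All paths are placeholders, and the helper `build_finetune_command` is hypothetical.

```python
import shlex


def build_finetune_command(train_lmdb, valid_lmdb, pretrained_weights):
    """Assemble (but do not run) a fine-tuning command as an argument list."""
    return [
        "python", "train.py",
        "--train_data", train_lmdb,      # placeholder path to training data
        "--valid_data", valid_lmdb,      # placeholder path to validation data
        # One flag per stage of the framework.
        "--Transformation", "TPS",
        "--FeatureExtraction", "ResNet",
        "--SequenceModeling", "BiLSTM",
        "--Prediction", "Attn",
        # Start from pretrained weights and fine-tune (assumed flag names).
        "--saved_model", pretrained_weights,
        "--FT",
    ]


cmd = build_finetune_command(
    "data/train_lmdb", "data/valid_lmdb", "TPS-ResNet-BiLSTM-Attn.pth"
)
print(shlex.join(cmd))
```

For a custom dataset, the data-selection options (`--select_data`, `--batch_ratio`) will probably need adjusting as well; the repository's README describes them.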