Yes … I was surprised by how easy it is to do Optical Character Recognition (OCR) with Python 2.x … if you have the right tools installed.
I decided to try OCR because I received a WhatsApp message with a photo of the monthly menu at school, and … why not study what the children are eating?
If this post helps you, buy me a beer!
Installing the tools
I did a Google search, and … it doesn’t look very difficult. I followed the instructions found here: http://fosshelp.blogspot.com.es/2013/04/how-to-convert-jpg-to-tiff-for-ocr-with.html. The first step is to install PIL, a package for working with images. If you are using the Anaconda distribution, you don’t need to install anything, because it’s already done for you!
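If you are not sure whether PIL is already there, a quick check like this can tell you. It is only a minimal sketch: the helper name is_importable is made up for this example, and it simply tries the same import the OCR script uses.

```python
from __future__ import print_function  # lets the same file run on Python 2.7 and 3

import importlib


def is_importable(module_name):
    """Return True if `module_name` can be imported in this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # "PIL.Image" is the module behind `from PIL import Image`.
    print("PIL available:", is_importable("PIL.Image"))
```

If it prints False, PIL still needs to be installed before going on.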
Click next, and in one step, some elements need to be downloaded:
The next step is to download the PyTesser library (https://code.google.com/p/pytesser/downloads/detail?name=pytesser_v0.0.1.zip&can=2&q=), the tool that lets Python work with the Tesseract OCR engine. Unzip the file into the same directory as your code.
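The unzip step can also be done from Python itself. This is just a sketch with the standard zipfile module: the helper name unzip_into is made up, and I am assuming the downloaded archive kept its name, pytesser_v0.0.1.zip, and sits next to the script. I have not checked how the archive is laid out inside, so look at the printed names to see where the files ended up.

```python
import os
import zipfile


def unzip_into(zip_path, target_dir):
    """Extract every member of `zip_path` into `target_dir` and return their names."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        return archive.namelist()


if __name__ == "__main__":
    archive_name = "pytesser_v0.0.1.zip"  # assumption: saved next to this script
    if os.path.exists(archive_name):
        # "." is the current directory, i.e. where your code lives.
        for name in unzip_into(archive_name, "."):
            print(name)
```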
Trying OCR with one image
Let’s write some code. All you need is the image and this little script:
from PIL import Image
from pytesser import *

image_file = 'menu.tif'
im = Image.open(image_file)

# Three alternative calls; each one overwrites `text`,
# so only the result of the last call is printed.
text = image_to_string(im)
text = image_file_to_string(image_file)
text = image_file_to_string(image_file, graceful_errors=True)

print "=====output=======\n"
print text
In my case, menu.tif is the image I want to try.
As you can see in the next image, the outcome is not perfect (as I expected):
The code and the tool work, but … they work better if the image is of better quality.
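Besides taking a better photo, one thing that might help is cleaning up the image before the OCR step. The sketch below is not something I tested on the menu: the function names are made up, and the scale factor of 2 and the cutoff of 128 are arbitrary starting values. It converts the image to grayscale, enlarges it, and turns every pixel into pure black or pure white using PIL.

```python
def threshold_table(cutoff=128):
    """Build a 256-entry lookup table: values below `cutoff` become 0, the rest 255."""
    return [0 if value < cutoff else 255 for value in range(256)]


def preprocess_for_ocr(src_path, dst_path, scale=2, cutoff=128):
    """Save a grayscale, enlarged, black-and-white copy of the image at `dst_path`."""
    from PIL import Image  # imported here so threshold_table() works without PIL

    im = Image.open(src_path).convert("L")  # "L" = 8-bit grayscale
    width, height = im.size
    im = im.resize((width * scale, height * scale), Image.BICUBIC)
    im = im.point(threshold_table(cutoff))  # map each gray level to 0 or 255
    im.save(dst_path)
    return dst_path
```

For example, preprocess_for_ocr('menu.tif', 'menu_clean.tif') writes a cleaned copy, and menu_clean.tif can then be used as image_file in the script above. Whether the result actually improves depends on the photo, so try a few cutoff values.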
Trying an example
Let’s verify that OCR works better with a prepared image; the best option is to try the image included with PyTesser. The image is this:
And the outcome:
And … that’s all. If you like it, I encourage you to donate so I can keep this blog going!!
For now, I haven’t been able to analyze the menu because the data extraction method is not perfect. To improve the extraction, I’m considering two options: better photos, or another method. The first is the easiest.
Have a nice day, and keep coding!