Hands-on Optical Character Recognition (OCR)
Lockdown era. It’s exam season. You get up. You set up your machines. Exam starts. You open the question paper. You jot down your answers. You scan your answers using CamScanner (which later got banned; MAKE IN INDIA), Adobe Scan, or some other application. Hit upload. Phew.
This very concept of scanning information from physical documents & *digitizing* it is known as Optical Character Recognition (OCR). We are living in the era of data, & making data publicly available starts with digitizing it. Now, we can digitize it in 2 ways:
- Typing it out (Sounds like work)
- Scanning it & letting the machine do the work (All Hail OCR)
Alright. OCR is awesome. Now, how does it actually work?
It consists of 2 main steps: scanning & then applying OCR on the scanned document. The scanning part is not much of a hassle, as we can do it using scanners or by taking pictures of documents. But we don’t know how to apply OCR yet. So, let’s take a deep dive into it, shall we?
Let’s say we have to implement an OCR algorithm from scratch. How should we approach this? I need to “recognize” the text from an image. Ah, Machine Learning & Image Processing should come in handy, right? So, now we know what to apply; let’s see how to apply it.
This would work in the following steps:
- Acquiring Training Data
- Pre-processing Data
- Image Segmentation & Feature Extraction
- Model Training & Optimization
Classic ol’ Machine Learning life-cycle. But, hold up! Let’s not get all caught up in Machine Learning right now. My goal is to familiarize you with OCR & its basic working, hands-on. So for now, we will be using the pytesseract library, which is a wrapper for Google’s Tesseract-OCR Engine. This open-source library is nothing short of a blessing, as it supports a wide range of image formats such as JPEG, PNG, GIF, TIFF, etc. & helps automate the text extraction procedure.
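To give you a feel for how little code this takes, here’s a minimal sketch (assuming Tesseract itself is installed on your machine; `sample.png` is just a placeholder file name):

```python
import pytesseract
from PIL import Image

# If the Tesseract binary isn't on your PATH, point pytesseract at it:
# pytesseract.pytesseract.tesseract_cmd = r"/usr/local/bin/tesseract"

# Open any supported format (JPEG, PNG, GIF, TIFF, ...) via Pillow
image = Image.open("sample.png")  # placeholder file name

# image_to_string runs the full Tesseract pipeline & returns plain text
print(pytesseract.image_to_string(image))
```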
Alright. Pytesseract makes the whole text extraction part easy. But, to make sure that the accuracy of text recognition & extraction is good, it’s on us to pre-process the images. So, which image pre-processing techniques could we use? Mostly, we have to tailor the techniques to our input images, but to generalize, we can do the following:
- Convert the image to grayscale
- Apply thresholding & binarization
- Apply skew correction if the image is rotated
Now, let’s not get overwhelmed with these terms just yet. To give you an overview, here’s what they mean:
a. Grayscaling: To convert an image from other color spaces (RGB, HSV, CMYK, etc.) to shades of gray
b. Thresholding: To separate the foreground & background of an image, i.e., to ‘segment’ the image into 2 parts. We can also call it ‘binarizing’ the image, since the output contains just 2 colors: black & white. (A quick sketch of both follows this list.)
c. Skew Correction: Skew is the tilt of the text in an image relative to the horizontal. A common way to estimate it is to evaluate the image’s pixel distribution (for instance, via a projection histogram); based on that evaluation, we check if the image/text is rotated & rotate it back to make the text recognizable.
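To make (a) & (b) concrete, here’s a minimal OpenCV sketch (`scan.png` is a placeholder file name; Otsu’s method, used below, is one common way to pick the threshold automatically). Skew correction gets its own sketch later in the post.

```python
import cv2

# Load the scanned document (placeholder file name)
img = cv2.imread("scan.png")

# a. Grayscaling: collapse the 3 BGR channels into shades of gray
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# b. Thresholding/binarization: Otsu's method picks the foreground/
#    background cut-off automatically (the 0 is ignored when using it)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("binary.png", binary)  # pure black & white output
```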
Let’s try it out. Suppose we have to extract text from this image:
We use the OpenCV library for the image pre-processing, Tkinter to create a GUI for selecting input images, & pytesseract for the text extraction.
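Here’s a sketch of how these three fit together (the exact code of my script isn’t reproduced here, so treat the details as illustrative):

```python
import cv2
import pytesseract
from tkinter import Tk, filedialog

# Hide the root Tk window; we only need the file-open dialog
root = Tk()
root.withdraw()

# Let the user pick an input image through a GUI dialog
path = filedialog.askopenfilename(
    title="Select an image",
    filetypes=[("Images", "*.png *.jpg *.jpeg *.tif *.tiff")],
)

# Pre-process: grayscale, then Otsu binarization
img = cv2.imread(path)
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Extract the text (pytesseract accepts NumPy arrays directly)
text = pytesseract.image_to_string(binary)
print(text)
```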
Now, since our input image is already in B/W format, there’s no significant visible difference after applying grayscaling & thresholding.
Now, I am saving the output of our algorithm to a text file. In the script, that’s just a plain file write at the end (`output.txt` is my choice of name):
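```python
# Persist the recognized text so we can inspect it later
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(text)
```

Here’s our output: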
Our algorithm successfully extracted text from our input image. Now let’s try it on this image:
After applying grayscaling & thresholding, we get —
Now, since our image is rotated, let’s see how well our skew correction works.
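The skew correction here follows the usual OpenCV recipe rather than anything pytesseract provides: fit a minimum-area rectangle around the foreground pixels, read off its angle, & rotate the image back. A sketch (note that `minAreaRect`’s angle convention changed between OpenCV versions, so the normalization below may need adjusting):

```python
import cv2
import numpy as np

def correct_skew(binary):
    # Collect coordinates of all foreground (white) pixels
    coords = np.column_stack(np.where(binary > 0)).astype(np.float32)

    # Fit the tightest rotated rectangle around them & read its angle.
    # In older OpenCV versions the angle lies in (-90, 0]; normalize it
    # to a small rotation around the horizontal.
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = -(90 + angle)
    else:
        angle = -angle

    # Rotate the image around its center by the estimated angle
    h, w = binary.shape[:2]
    M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    return cv2.warpAffine(binary, M, (w, h),
                          flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```

Running pytesseract on `correct_skew(binary)` instead of `binary` is what produces the result below.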
We didn’t achieve a 100% accurate output. This means we have to calibrate our skew correction parameters & also look into how pytesseract handles special characters like “!” & “)”.