Although businesses have plenty of digital information at their disposal, employees still have to handle printed invoices, flyers, brochures, and forms, either as hard copies or as images of text saved in .jpg, .png, or .pdf formats. Processing the data in these files manually is tedious, time-consuming, and error-prone. Such files cannot be edited directly: they must first be made editable, or a tool must read the content from the image and extract it for further processing. Most of us have used an online or offline tool to convert an image into editable text. This is made possible by Optical Character Recognition (OCR).
This article was published as a part of the Data Science Blogathon.
The acronym ‘OCR’ stands for Optical Character Recognition. Commonly known as ‘text recognition,’ it is a popular technique for extracting text from images. An OCR program is a tool that extracts and repurposes data from scanned documents, camera images, and image-only PDFs. An OCR system combines hardware, such as optical scanners, with software capable of image processing. For text extraction, OCR tools (OCR libraries) employ several machine learning algorithms for pattern recognition to identify the presence and layout of the text in an image file.
These tools are trained to identify the shapes of letters and digits in an image in order to recognize its text. They then reconstruct the extracted text in a machine-readable format, so that it can be selected, edited, or copied and pasted like regular text. Put simply, OCR converts text held in image form into editable text, such as a word processing document. Thankfully, many free and commercial tools (offline and online) use OCR technology to extract text from images.
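To make the idea of recognizing character shapes concrete, here is a deliberately tiny sketch of template matching on 3x3 bitmaps. It is an illustration only: the templates, the `pixel_distance` and `classify_glyph` helpers, and the noisy sample are all made up for this example, and modern OCR libraries (Keras-OCR included) rely on trained neural networks rather than fixed templates.

```python
# Toy 3x3 "glyph" templates: 1 = ink, 0 = background (illustrative only).
TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}


def pixel_distance(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def classify_glyph(bitmap, templates=TEMPLATES):
    """Return the label of the template closest to the given bitmap."""
    return min(templates, key=lambda label: pixel_distance(bitmap, templates[label]))


# A slightly noisy "L": one background pixel flipped to ink.
noisy_l = ((1, 0, 0),
           (1, 0, 1),
           (1, 1, 1))
print(classify_glyph(noisy_l))  # prints: L
```

The noisy sample differs from the "L" template in one pixel and from the other two in seven, so the closest match is still "L". Real images bring changes in scale, rotation, and font that this approach cannot handle, which is why learned models are used in practice.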
Currently, OCR tools are pretty advanced due to the implementation of techniques such as intelligent character recognition (ICR), which can identify languages, handwriting styles, etc.
In this article, we will discuss OCR, the benefits of OCR, why we need text extraction from documents, OCR libraries available in Python, and an example of text extraction from an image using the Keras-OCR library in Python.
As mentioned in the previous section, the primary benefit of OCR technology is that it automates manual, time-consuming data entry tasks: with OCR, we can create digital documents that can be edited and stored as required. An OCR tool processes the image to identify the text and creates a hidden layer of text behind the image. A computer can read this additional layer easily, which makes the content of the image recognizable and searchable. This is crucial for businesses, which have to deal with media and content daily. OCR also offers the following benefits:
A typical example of an OCR application can be seen in medical insurance claim form processing. With OCR, it is easier to compare the insurance claim with the policyholder’s details. OCR-equipped systems can flag any anomalies in the data to the concerned teams and prevent possible fraud.
Even though OCR can usually extract text from images with ease, it sometimes faces challenges. These arise when the text appears in images of natural scenes, or when the image has geometric distortions, heavy noise, cluttered or complex backgrounds, or fonts other than the regular ones. Still, OCR technology has increasingly strong potential in deep learning applications, such as tools for reading license plates on vehicles, digitizing invoices or menus, scanning ID cards, comparing claim forms, and so on.
Now that we have understood OCR and its uses, let us look at some commonly used open-source Python libraries for text recognition and extraction.
In this section, we will build a Keras-OCR pipeline to extract text from a few sample images. I am using Google Colab for this tutorial.
Let’s begin by installing the keras-ocr library (supports Python >= 3.6 and TensorFlow >= 2.0.0) using the following code –
!pip install -q keras-ocr
You can also use the following command to install the package directly from the project's GitHub repository.
pip install git+https://github.com/faustomorales/keras-ocr.git#egg=keras-ocr
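After installing, a quick check can confirm that the package is visible to the current Python environment before any models are loaded. This is a small sketch using only the standard library; `is_installed` is a hypothetical helper name, not part of keras-ocr.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


print("keras_ocr installed:", is_installed("keras_ocr"))
```

Note that the package is installed as `keras-ocr` but imported as `keras_ocr`, so the check uses the underscore form.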
We must import matplotlib and the newly-installed Keras-ocr library to process the images and extract text from them.
import keras_ocr
import matplotlib.pyplot as plt
Let’s set up a pipeline with Keras-ocr. The pipeline bundles a text detector and a text recognizer, both loaded with pre-trained weights.
pipeline = keras_ocr.pipeline.Pipeline()
We will use two images to test the capabilities of the Keras-ocr library. You can try the same with any other image with text of your choice.
# Read images from folder path to image object
images = [
    keras_ocr.tools.read(img)
    for img in ['/content/Image1.png', '/content/Image2.png']
]
Here are the two images we used for this tutorial on the Keras-ocr library. One is a plain image with text in a handwriting-style font, and the other is an image containing text.
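If you want to try more than a couple of images, typing every path by hand gets tedious. The sketch below gathers image paths from a folder with the standard library. The `list_image_paths` helper and the set of file suffixes are assumptions made for this example, not part of keras-ocr.

```python
from pathlib import Path

# File extensions treated as images in this sketch (an assumption; extend as needed).
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_image_paths(folder):
    """Return sorted paths (as strings) of the image files directly inside folder."""
    return sorted(
        str(path)
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


# Usage with the tutorial's code (needs keras-ocr, so it is left commented out):
# images = [keras_ocr.tools.read(p) for p in list_image_paths("/content")]
```

Sorting the paths keeps the order of `images` predictable, which matters later when predictions are looked up by index.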
Now, let us run the pipeline recognizer on images and make predictions about the text in these images.
# generate text predictions from the images
prediction_groups = pipeline.recognize(images)
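`pipeline.recognize` returns one list of (word, box) pairs per input image. With a large set of images, it may be more comfortable to feed them to the pipeline a few at a time, for example when memory is limited. Here is a minimal sketch of that idea. `recognize_in_batches` and its `batch_size` parameter are hypothetical names introduced here, and whether batching helps depends on your images and hardware.

```python
def recognize_in_batches(recognize, images, batch_size=4):
    """Call recognize() on slices of images and concatenate the per-image results."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    prediction_groups = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        # recognize() is expected to return one list of predictions per image.
        prediction_groups.extend(recognize(batch))
    return prediction_groups


# Usage with the tutorial's pipeline (needs keras-ocr, so it is left commented out):
# prediction_groups = recognize_in_batches(pipeline.recognize, images, batch_size=2)
```

Passing the recognizer in as a function keeps the helper independent of keras-ocr, so it can be tried with a stand-in function first.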
We can plot the predictions from the model using the following code –
# plot the text predictions
fig, axs = plt.subplots(nrows=len(images), figsize=(10, 20))
for ax, image, predictions in zip(axs, images, prediction_groups):
    keras_ocr.tools.drawAnnotations(image=image, predictions=predictions, ax=ax)
We get the predicted output as –
The Keras-OCR library performed well on both images. It was able to correctly identify the text’s location and extract the words from the input images.
We can also print the identified text from the images as –
predicted_image = prediction_groups[1]
for text, box in predicted_image:
    print(text)
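The words may not be printed in the order a person would read them. One simple way to approximate reading order is to group words into lines by their vertical position and then sort each line from left to right. The sketch below assumes each box is a sequence of four (x, y) corner points. `box_center`, `words_in_reading_order`, and the `line_tolerance` value (in pixels) are illustrative choices for this example, not part of keras-ocr.

```python
def box_center(box):
    """Average the four (x, y) corner points of a box."""
    xs = [float(point[0]) for point in box]
    ys = [float(point[1]) for point in box]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def words_in_reading_order(predictions, line_tolerance=10.0):
    """Group words into lines by vertical center, then sort each line left to right."""
    centered = sorted(
        ((text, *box_center(box)) for text, box in predictions),
        key=lambda item: item[2],  # sort by vertical center first
    )
    lines = []
    for text, cx, cy in centered:
        # Stay on the current line while the word is vertically close to its first word.
        if lines and abs(cy - lines[-1]["y"]) <= line_tolerance:
            lines[-1]["words"].append((cx, text))
        else:
            lines.append({"y": cy, "words": [(cx, text)]})
    return [" ".join(text for _, text in sorted(line["words"])) for line in lines]


# Hand-made sample in the same (text, box) shape as the pipeline output.
sample = [
    ("world", [(60, 0), (100, 0), (100, 20), (60, 20)]),
    ("hello", [(0, 2), (50, 2), (50, 22), (0, 22)]),
    ("again", [(0, 40), (50, 40), (50, 60), (0, 60)]),
]
print(words_in_reading_order(sample))  # prints: ['hello world', 'again']
```

This heuristic works for roughly horizontal text. Rotated or curved text would need a different grouping rule, and the tolerance should be tuned to the text height in your images.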
If required, the text recognized in the images above can be saved in .csv or .txt format for further use.
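As one possible way to do this, the sketch below writes the predictions to a CSV file with one row per recognized word. The `save_predictions_csv` helper, the column names, and the choice to store only the first corner point of each box are assumptions made for this example.

```python
import csv


def save_predictions_csv(prediction_groups, csv_path):
    """Write one row per recognized word: image index, word, and first box corner."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["image_index", "text", "x", "y"])
        for image_index, predictions in enumerate(prediction_groups):
            for text, box in predictions:
                x, y = box[0]  # first corner point of the box (assumed to be (x, y))
                writer.writerow([image_index, text, float(x), float(y)])


# Usage with the tutorial's output (file name is only an example):
# save_predictions_csv(prediction_groups, "predictions.csv")
```

Keeping the image index and a box coordinate next to each word makes it possible to trace a row in the file back to its place in the original image.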
In this tutorial, we discussed OCR, its advantages to businesses for image processing, and different open-source OCR libraries in Python. Next, we learned how to extract text from multiple images using the Keras-OCR library. Here are a few key takeaways from the article-
That’s it for this tutorial. Try the Keras-ocr library to see how accurately it can identify the text in your images.
I hope you enjoyed reading this article and learned about Keras-ocr. The code for this text extraction tutorial is available on my GitHub repository.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.