Document image analysis refers to the algorithms and methods used to turn the pixels of an image into a description that a computer can process. Optical Character Recognition (OCR) uses computer vision to find and read the text in images, and modern OCR models can often return a result within milliseconds. OCR was one of the first problems that computer vision tried to solve, and it has come a long way since then. Using OCR models, we developed a method for detecting labels on invoices, such as the vendor’s name, the bill date, the bill number, the bill amount, and the total number of items. To reach a high level of accuracy, we used an ensemble approach in which different OCR models handle the detection and the recognition of the labels separately.
Learning Objectives
Below are the major learning objectives of this article:
This article was published as a part of the Data Science Blogathon.
Since the input is an image of an invoice, preprocessing is an important step that helps us get better results. We used skew correction, binarisation, noise filtering, and contour detection as the preprocessing steps.
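# binarisation
import cv2
import matplotlib.pyplot as plt

# adaptiveThreshold needs a single-channel 8-bit image; img is assumed to be a BGR image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
res = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                            cv2.THRESH_BINARY, 11, 2)
plt.figure(figsize=(100, 60))
plt.imshow(res, 'gray')
plt.show()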
# noise filtering (img must be a colour image; keep the returned result)
denoised = cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 21)
# skew correction
import numpy as np
from skimage import io
from skimage.transform import rotate
from skimage.color import rgb2gray
from deskew import determine_skew

image = io.imread(_img)  # _img is the path to the invoice image
grayscale = rgb2gray(image)
angle = determine_skew(grayscale)  # estimated skew angle in degrees
# rotate() returns floats in [0, 1], so scale back to 0-255 before casting
rotated = rotate(image, angle, resize=True) * 255
rotated = rotated.astype(np.uint8)
Contour detection is needed because the invoice can appear at a different position in each image, so we first have to locate it. We find the image’s largest contour, crop the image to it, and show the result. The cv2.findContours() function finds the contours, the cv2.contourArea() method gives the area of each one so that the largest can be selected, and the image is then cropped to that contour.
# thresh is the binarised image; width and height are the desired output size
contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)
# Find Biggest Contour
areas = [cv2.contourArea(c) for c in contours]
max_index = np.argmax(areas)
# Find approxPoly Of Biggest Contour
epsilon = 0.1 * cv2.arcLength(contours[max_index], True)
approx = cv2.approxPolyDP(contours[max_index], epsilon, True)
# Crop The Image (this assumes approx has exactly four corners, in the same order as points)
points1 = np.float32(approx).reshape(4, 2)
points = np.float32([[0, 0], [width, 0], [width, height], [0, height]])
# Compute the perspective transform first, then warp the image with it
matrix = cv2.getPerspectiveTransform(points1, points)
result = cv2.warpPerspective(img, matrix, (width, height))
Next, the MultiOcr model is built, with EasyOCR as the detection model and PaddleOCR as the recognition model, to get the coordinates of the labels for each invoice template.
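import easyocr
from paddleocr import PaddleOCR

reader = easyocr.Reader(['en'])  # EasyOCR: text detection
ocr = PaddleOCR(lang='en')       # PaddleOCR: text recognition

# detection
def detect_text_blocks(img_path):
    detection_result = reader.detect(img_path, width_ths=0.7, mag_ratio=1.5)
    # horizontal boxes of the first image, each as [x_min, x_max, y_min, y_max]
    text_coordinates = detection_result[0][0]
    return text_coordinates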
For each template invoice, the MultiOcr model finds the coordinates of the label names listed in the template labels dataset and stores them in a table (CSV file). Because the number of items on an invoice can vary, the starting and ending coordinates of the items table in the invoice image are also supplied, so that the number of items on the invoice can be predicted.
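As a rough illustration of such a table, the sketch below writes one CSV row per label and reads the rows back for a given template. The column names, the template name `vendor_a`, and the sample coordinates are all hypothetical; the box order `[x_min, x_max, y_min, y_max]` is assumed to follow EasyOCR's horizontal boxes.

```python
import csv
import io

# Hypothetical column layout for the template table
FIELDS = ["template", "label", "x_min", "x_max", "y_min", "y_max"]

def save_template_labels(fileobj, template, label_boxes):
    """Write one row per label; label_boxes maps label -> [x_min, x_max, y_min, y_max]."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    for label, (x_min, x_max, y_min, y_max) in label_boxes.items():
        writer.writerow({"template": template, "label": label,
                         "x_min": x_min, "x_max": x_max,
                         "y_min": y_min, "y_max": y_max})

def load_template_labels(fileobj, template):
    """Return {label: [x_min, x_max, y_min, y_max]} for the given template."""
    boxes = {}
    for row in csv.DictReader(fileobj):
        if row["template"] == template:
            boxes[row["label"]] = [int(row[key]) for key in FIELDS[2:]]
    return boxes

# Made-up coordinates, kept in memory instead of on disk for the demo
buffer = io.StringIO()
save_template_labels(buffer, "vendor_a", {
    "bill_number": [420, 560, 80, 110],
    "bill_date": [420, 560, 120, 150],
})
buffer.seek(0)
loaded = load_template_labels(buffer, "vendor_a")
print(loaded["bill_date"])  # [420, 560, 120, 150]
```

With a real file, the same functions can be passed a handle opened with `open(path, "w", newline="")` or `open(path, newline="")`.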
When the size of the items table in the invoice image changes, the position of labels such as the “total amount” changes as well, because the total amount comes after the table of invoice items. To handle this, a relative positioning method can be used to locate the total amount: we store the coordinates of the strings that surround the total amount label in the invoice. This works because the text of these strings does not change between different invoices that come from the same template.
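The relative positioning idea can be sketched as follows, under the assumption that we store the offset between a fixed anchor string (here a hypothetical "Total" caption) and the amount box in the template, and then re-apply that offset wherever the anchor is detected on a new invoice. The helper names and all coordinates are illustrative only.

```python
def box_offset(anchor_box, target_box):
    """Per-coordinate offset of target_box from anchor_box.
    Boxes are [x_min, x_max, y_min, y_max]."""
    return [t - a for a, t in zip(anchor_box, target_box)]

def apply_offset(anchor_box, offset):
    """Shift the anchor box by a stored offset to predict the target box."""
    return [a + o for a, o in zip(anchor_box, offset)]

def find_anchor(detections, text):
    """Return the box of the first detection whose text matches, else None.
    detections is a list of (text, box) pairs."""
    for detected_text, box in detections:
        if detected_text.strip().lower() == text.lower():
            return box
    return None

# Template invoice: made-up boxes for the "Total" caption and the amount beside it
template_anchor = [300, 360, 700, 725]
template_amount = [420, 520, 700, 725]
offset = box_offset(template_anchor, template_amount)

# New invoice from the same template, with a longer items table:
# the caption is detected further down the page
new_detections = [("Invoice", [40, 140, 20, 50]), ("Total", [300, 360, 860, 885])]
new_anchor = find_anchor(new_detections, "total")
predicted_amount = apply_offset(new_anchor, offset)
print(predicted_amount)  # [420, 520, 860, 885]
```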
The document similarity method is applied to three images: image1 and image2 are from the same vendor, and image3 is from a different vendor. The document similarity results are shown below:
From the document similarity results, we can see that the distance between image1 and image2 is smaller than the distance between image1 and image3. This indicates that image1 and image2 come from the same vendor.
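The similarity measure itself is not spelled out here, so the sketch below swaps in a simple stand-in: Jaccard distance between the sets of words read from each invoice. This is an assumption for illustration, not necessarily the measure behind the results above, and the word lists are made up.

```python
def jaccard_distance(words_a, words_b):
    """1 - |A ∩ B| / |A ∪ B| over the two word sets; 0.0 means identical sets."""
    a, b = set(words_a), set(words_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def closest_template(invoice_words, templates):
    """Return the name of the template whose word set is nearest to the invoice."""
    return min(templates, key=lambda name: jaccard_distance(invoice_words, templates[name]))

# Made-up word lists standing in for the text read from three invoice images
image1 = ["acme", "invoice", "bill", "date", "total"]
image2 = ["acme", "invoice", "bill", "date", "total", "gst"]
image3 = ["globex", "tax", "invoice", "amount", "due"]

d12 = jaccard_distance(image1, image2)
d13 = jaccard_distance(image1, image3)
print(d12 < d13)  # True: image1 is closer to image2 than to image3

print(closest_template(image2, {"vendor_a": image1, "vendor_b": image3}))  # vendor_a
```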
Once the template has been identified, its label coordinates are read from the table (CSV file) and used to identify the labels on the invoice image.
Example: When an invoice image like the one below is given as input, the method first identifies the invoice’s template. The label coordinates for that template are then read from the table (CSV file) and used to locate the labels on the invoice.
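This last step can be sketched as a crop followed by recognition for each stored label box. To keep the example free of dependencies, the image is a plain list of pixel rows and `recognize` is a hypothetical callable standing in for the recognition model; with a NumPy image the crop would simply be `image[y_min:y_max, x_min:x_max]`.

```python
def crop(image, box):
    """Cut the region [x_min, x_max, y_min, y_max] out of an image given as rows of pixels."""
    x_min, x_max, y_min, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]

def read_labels(image, label_boxes, recognize):
    """Apply the recognizer to each label's region and collect the results by label."""
    return {label: recognize(crop(image, box)) for label, box in label_boxes.items()}

# Tiny 4x6 dummy image and made-up label boxes
image = [[row * 10 + col for col in range(6)] for row in range(4)]
label_boxes = {"bill_number": [1, 3, 0, 2], "bill_amount": [3, 6, 2, 4]}

def fake_recognize(region):
    """Placeholder recognizer: reports the size of the cropped region as rows x columns."""
    return f"{len(region)}x{len(region[0])}"

result = read_labels(image, label_boxes, fake_recognize)
print(result)  # {'bill_number': '2x2', 'bill_amount': '2x3'}
```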
In conclusion, this work proposes an algorithm for detecting labels on invoices using the MultiOcr model. It detects the positions of the labels for each template, as well as the labels on any new invoice that belongs to one of the given templates. The model uses EasyOCR for detection and PaddleOCR for recognition, and we found that combining the two models in this way improved our results.
Key takeaways of this article