Scene text recognition (STR) continues to challenge researchers because of the sheer diversity of text appearances in natural environments: detecting text in a scanned document is one thing; reading the text on a person's T-shirt in a photograph is quite another. Multi-Granularity Prediction for Scene Text Recognition (MGP-STR), presented at ECCV 2022, represents a transformative approach in this domain. MGP-STR merges the robustness of Vision Transformers (ViT) with multi-granularity linguistic predictions, enhancing its ability to handle complex scene text recognition tasks. The result is a simple yet powerful STR solution with improved accuracy and usability across a variety of challenging real-world scenarios.
MGP-STR is a vision-based STR model designed to excel without relying on an independent language model. Instead, it integrates linguistic information directly within its architecture through the Multi-Granularity Prediction (MGP) strategy. This implicit approach allows MGP-STR to outperform both pure vision models and language-augmented methods, achieving state-of-the-art results in STR.
The architecture comprises two primary components, both pivotal to the model's exceptional performance and its ability to handle diverse scene text challenges:
- A Vision Transformer (ViT) backbone that extracts visual features from the text image.
- A³ modules that refine the ViT outputs, mapping token combinations to character-, subword-, and word-level predictions.
The fusion of predictions at the character, subword, and word levels via a straightforward yet effective strategy ensures that MGP-STR captures the intricacies of both visual and linguistic features; a minimal sketch of this fusion idea follows.
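To make the fusion strategy concrete, here is a minimal illustrative sketch in Python: each granularity head yields a string and a confidence score, and the prediction with the highest score is kept. The strings and scores below are hypothetical, and this is a conceptual sketch rather than the library's internal code.

# Hypothetical (string, confidence) predictions from the three granularity heads
candidates = {
    "character": ("crocodiles", 0.91),
    "subword": ("crocodiles", 0.95),
    "word": ("crocodile", 0.88),
}

# Confidence-based fusion: keep the prediction with the highest score
best = max(candidates, key=lambda k: candidates[k][1])
text, score = candidates[best]
print(f"fused prediction: {text!r} (from {best} head, score={score})")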
MGP-STR is primarily designed for optical character recognition (OCR) on text images. Its unique ability to incorporate linguistic knowledge implicitly makes it particularly effective in real-world scenarios where text variations and distortions are common, such as reading street signs, product labels, or text printed on clothing.
Before diving into the code, let's understand its purpose and prerequisites. The example below demonstrates how to use the MGP-STR model to perform scene text recognition on sample images. Ensure you have PyTorch, the Transformers library, and the required dependencies (PIL and requests) installed in your environment so the code runs seamlessly.
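If any of these are missing, a typical installation from PyPI (the package names below are the standard ones) looks like this:

pip install torch transformers pillow requests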
Begin by importing the essential libraries required for MGP-STR: transformers for model loading and inference, PIL for image manipulation, and requests for fetching images online. These libraries provide the foundational tools to process and display text images effectively.
from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition
import requests
import base64
from io import BytesIO
from PIL import Image
from IPython.display import display, Image as IPImage
Load the MGP-STR base model and its processor from the Hugging Face Transformers library. This initializes the pre-trained model and its accompanying utilities, enabling seamless processing and prediction of scene text from images.
processor = MgpstrProcessor.from_pretrained('alibaba-damo/mgp-str-base')
model = MgpstrForSceneTextRecognition.from_pretrained('alibaba-damo/mgp-str-base')
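As an optional, standard PyTorch step (not specific to MGP-STR), you can switch the model to evaluation mode before inference; the demo below runs comfortably on a CPU, so no device transfer is needed:

model.eval()  # disable training-time behavior such as dropout during inference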
Define a helper function that takes an image URL, processes the image with the MGP-STR processor, and generates a text prediction. The function handles image conversion, base64 encoding for inline display, and uses the model's outputs to decode the recognized text.
def predict(url):
    # Fetch the image and ensure it is in RGB format
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    # Process the image to prepare it for the model
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Run the model and decode the recognized text from its logits
    outputs = model(pixel_values)
    generated_text = processor.batch_decode(outputs.logits)['generated_text']

    # Convert the image to base64 so it can be displayed inline
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    image_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")
    display(IPImage(data=base64.b64decode(image_base64)))
    print("\n\n")

    return generated_text
predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_7.png?raw=true")
['7']
predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_BAR.png?raw=true")
['bar']
predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_CROCODILES.png?raw=true")
['crocodiles']
predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_DAY.png?raw=true")
['day']
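Beyond the fused string, the processor's batch_decode output also exposes the individual character-, BPE-, and WordPiece-level predictions along with their scores. The key names below follow the current Hugging Face Transformers implementation; verify them against your installed version.

# Inspect the per-granularity predictions behind a fused result
url = "https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_CROCODILES.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
decoded = processor.batch_decode(model(pixel_values).logits)
print(decoded["char_preds"], decoded["bpe_preds"], decoded["wp_preds"])
print(decoded["scores"])  # confidence scores used for fusion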
As the sample images show, the predictions are accurate despite the varied appearance of the text. The model also runs on a CPU alone and uses less than 3 GB of RAM, which makes it practical to fine-tune for domain-specific use cases.
MGP-STR exemplifies the combination of vision and language knowledge within a unified framework. By integrating multi-granularity predictions into the STR pipeline, it blends character-, subword-, and word-level insights into a holistic approach to scene text recognition. The result is enhanced accuracy, adaptability to diverse datasets, and efficient performance without reliance on an external language model, all within a simplified architecture. For researchers and developers in OCR and STR, MGP-STR offers a state-of-the-art tool that is both effective and accessible. With its open-source implementation and comprehensive documentation, it is poised to drive further advances in scene text recognition.
Q1: What is MGP-STR, and how does it differ from traditional STR models?
A1: MGP-STR is a scene text recognition model that integrates linguistic predictions directly into its vision-based framework using Multi-Granularity Prediction (MGP). Unlike traditional STR models, it eliminates the need for independent language models, simplifying the pipeline and enhancing accuracy.
Q2: What datasets was MGP-STR trained on?
A2: The base-sized MGP-STR model was trained on the MJSynth and SynthText datasets, which are widely used for scene text recognition tasks.
Q3: Can MGP-STR handle distorted or low-quality text images?
A3: Yes, MGP-STR's multi-granularity prediction mechanism enables it to handle diverse challenges, including distorted or low-quality text images.
Q4: Does MGP-STR support languages other than English?
A4: While the current implementation is optimized for English, the architecture can be adapted to support other languages by training it on relevant datasets.
Q5: What role does the A³ module play in MGP-STR?
A5: The A³ module refines ViT outputs by mapping token combinations to characters and enabling subword-level predictions, embedding linguistic insights into the model.