Scene Text Recognition (STR) Using Vision-Based Text Recognition

Mobarak Inuwa | Last Updated: 21 Dec, 2024
6 min read

Scene text recognition (STR) continues to challenge researchers because of the sheer diversity of text appearances in natural environments: detecting text in a scanned document is one thing; recognizing it on a person’s T-shirt is quite another. Multi-Granularity Prediction for Scene Text Recognition (MGP-STR), presented at ECCV 2022, represents a transformative approach in this domain. It merges the robustness of Vision Transformers (ViT) with multi-granularity linguistic predictions, improving accuracy and usability across a variety of challenging real-world scenarios while remaining a simple yet powerful solution for STR tasks.

Learning Objectives

  • Understand the architecture and components of MGP-STR, including Vision Transformers (ViT).
  • Learn how multi-granularity predictions enhance the accuracy and versatility of scene text recognition.
  • Explore the practical applications of MGP-STR in real-world OCR tasks.
  • Gain hands-on experience in implementing and using MGP-STR with PyTorch for scene text recognition.

This article was published as a part of the Data Science Blogathon.

What is MGP-STR?

MGP-STR is a vision-based STR model designed to excel without relying on an independent language model. Instead, it integrates linguistic information directly within its architecture through the Multi-Granularity Prediction (MGP) strategy. This implicit approach allows MGP-STR to outperform both pure vision models and language-augmented methods, achieving state-of-the-art results in STR.

The architecture comprises two primary components, both of which are pivotal for ensuring the model’s exceptional performance and ability to handle diverse scene text challenges:

  • Vision Transformer (ViT)
  • A³ Modules

The fusion of predictions at character, subword, and word levels via a straightforward yet effective strategy ensures that MGP-STR captures the intricacies of both visual and linguistic features.
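To make the fusion strategy concrete, here is a minimal, illustrative sketch. This is not the library’s internal code: each granularity is decoded separately, and the prediction whose decoder reports the highest confidence wins. The texts and confidence values below are hypothetical.

def fuse_predictions(candidates):
    # candidates: (text, confidence) pairs, one per granularity
    return max(candidates, key=lambda pair: pair[1])

# Hypothetical outputs for one image at each granularity:
char_level = ("crocodiles", 0.91)     # character-level decoding
subword_level = ("crocodiles", 0.88)  # BPE subword decoding
word_level = ("crocodiles", 0.95)     # word-piece decoding

print(fuse_predictions([char_level, subword_level, word_level]))
# -> ('crocodiles', 0.95)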


Applications and Use Cases of MGP-STR

MGP-STR is primarily designed for optical character recognition (OCR) tasks on text images. Its unique ability to incorporate linguistic knowledge implicitly makes it particularly effective in real-world scenarios where text variations and distortions are common. Examples include:

  • Reading text from natural scenes, such as street signs, billboards, and store names in outdoor environments.
  • Extracting handwritten or printed text from scanned forms and official documents.
  • Analyzing text in industrial applications, such as reading labels, barcodes, or serial numbers on products.
  • Translating or transcribing text in augmented reality (AR) applications for travel or education.
  • Extracting information from scanned documents or photographs of printed materials.
  • Assisting accessibility solutions, such as screen readers for visually impaired users.

Key Features and Advantages

  • Elimination of Independent Language Models: linguistic knowledge is captured implicitly through the MGP strategy, which simplifies the recognition pipeline.
  • Multi-Granularity Predictions: fusing character-, subword-, and word-level outputs makes the model robust to diverse text appearances.
  • State-of-the-Art Performance: the implicit approach outperforms both pure vision models and language-augmented methods on STR benchmarks.
  • Ease of Use: a pre-trained checkpoint and processor are available through the Hugging Face Transformers library.

Getting Started with MGP-STR

Before diving into the code, let’s understand its purpose and prerequisites. This example demonstrates how to use the MGP-STR model to perform scene text recognition on sample images. Ensure you have PyTorch, the Transformers library, and the other dependencies (PIL and requests) installed in your environment. The code below is written as notebook cells.
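If any of these packages are missing, a typical one-time installation looks like the following (the package names are the standard PyPI ones; drop the leading “!” when running outside a notebook):

!pip install torch transformers pillow requests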

Step 1: Importing Dependencies

Begin by importing the required libraries: transformers for the model and processor, PIL for image manipulation, requests for fetching images online, and base64, BytesIO, and IPython.display for rendering images inside the notebook.

from transformers import MgpstrProcessor, MgpstrForSceneTextRecognition  # model and processor classes
import requests  # fetch demo images over HTTP
import base64  # encode images for inline display
from io import BytesIO  # in-memory buffer for image bytes
from PIL import Image  # image loading and conversion
from IPython.display import display, Image as IPImage  # render images in the notebook

Step 2: Loading the Base Model

Load the MGP-STR base model and its processor from the Hugging Face Transformers library. This initializes the pre-trained model and its accompanying utilities, enabling seamless processing and prediction of scene text from images.

processor = MgpstrProcessor.from_pretrained('alibaba-damo/mgp-str-base')
model = MgpstrForSceneTextRecognition.from_pretrained('alibaba-damo/mgp-str-base')
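The model runs fine on a CPU, but two standard PyTorch touches are worth adding before inference. This is generic PyTorch usage rather than anything MGP-STR requires:

import torch

# Switch off dropout and similar layers for deterministic inference
model.eval()

# Optional: use a GPU if one is available. If you do, remember to move the
# inputs to the same device inside the helper below, e.g. pixel_values.to(device).
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)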

Step 3: Helper Function for Predicting Text from an Image

Define a helper function that takes an image URL, processes the image with the MGP-STR processor, runs the model, and decodes the recognized text. The function also converts the image to base64 so it can be displayed inline in the notebook alongside the prediction.

def predict(url):
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

    # Process the image to prepare it for the model
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Generate the text from the model
    outputs = model(pixel_values)
    generated_text = processor.batch_decode(outputs.logits)['generated_text']

    # Convert the image to base64 so it can be displayed inline in the notebook
    buffered = BytesIO()
    image.save(buffered, format="PNG")
    image_base64 = base64.b64encode(buffered.getvalue()).decode("utf-8")

    display(IPImage(data=base64.b64decode(image_base64)))
    print("\n\n")

    return generated_text
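The same pipeline works for local files. Below is a small variant that reads from disk and skips the download and base64 display steps (the file path is illustrative):

def predict_local(path):
    # Identical model pipeline to predict(), minus the HTTP fetch and display
    image = Image.open(path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    outputs = model(pixel_values)
    return processor.batch_decode(outputs.logits)['generated_text']

# Hypothetical local file:
# print(predict_local("my_scene_text.png"))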

Example 1:

predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_7.png?raw=true")
['7']

Example 2:

predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_BAR.png?raw=true")
['bar']

Example 3:

predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_CROCODILES.png?raw=true")
['crocodiles']

Example 4:

predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_DAY.png?raw=true")
['day']

From the nature of these images, you can see that the predictions are accurate even on irregular, stylized text. The model also runs on a CPU alone, using less than 3 GB of RAM, which makes it practical to deploy as-is or to fine-tune for domain-specific tasks.
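To sanity-check the CPU claim on your own machine, timing a single call to the helper above is enough; the exact number will vary with your hardware:

import time

start = time.perf_counter()
text = predict("https://github.com/AlibabaResearch/AdvancedLiterateMachinery/blob/main/OCR/MGP-STR/demo_imgs/CUTE80_BAR.png?raw=true")
print(f"Predicted {text} in {time.perf_counter() - start:.2f} s")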


Conclusion

MGP-STR exemplifies the combination of vision and language knowledge within a unified framework. By integrating multi-granularity predictions into the STR pipeline, it blends character-, subword-, and word-level insights into a holistic approach to scene text recognition. The result is enhanced accuracy and adaptability across diverse datasets without reliance on external language models, all within a simplified architecture. For researchers and developers in OCR and STR, MGP-STR offers a state-of-the-art tool that is both effective and accessible. With its open-source implementation and comprehensive documentation, it is poised to drive further advances in the field of scene text recognition.

Key Takeaways

  • MGP-STR integrates vision and linguistic knowledge without relying on independent language models, streamlining the STR process.
  • The use of multi-granularity predictions enables MGP-STR to excel across diverse text recognition challenges.
  • MGP-STR sets a new benchmark for STR models by achieving state-of-the-art results with a simple and effective architecture.
  • Developers can easily adapt and deploy MGP-STR for a variety of OCR tasks, enhancing both research and practical applications.

Frequently Asked Questions

Q1: What is MGP-STR, and how does it differ from traditional STR models?

A1: MGP-STR is a scene text recognition model that integrates linguistic predictions directly into its vision-based framework using Multi-Granularity Prediction (MGP). Unlike traditional STR models, it eliminates the need for independent language models, simplifying the pipeline and enhancing accuracy.

Q2: What datasets were used to train MGP-STR?

A2: The base-sized MGP-STR model was trained on the MJSynth and SynthText datasets, which are widely used for scene text recognition tasks.

Q3: Can MGP-STR handle distorted or low-quality text images?

A3: Yes, MGP-STR’s multi-granularity prediction mechanism enables it to handle diverse challenges, including distorted or low-quality text images.

Q4: Is MGP-STR suitable for languages other than English?

A4: While the current implementation is optimized for English, the architecture can be adapted to support other languages by training it on relevant datasets.

Q5: How does the A³ module contribute to MGP-STR’s performance?

A5: The A³ module refines ViT outputs by mapping token combinations to characters and enabling subword-level predictions, embedding linguistic insights into the model.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

