TrOCR and ZhEn Latex OCR: A Comparison of Image-to-Text and Latex Models

Maigari David Last Updated : 25 Sep, 2024

Introduction

AI models such as language models are widely used in real tasks like virtual assistance and content creation. However, there is still a lot to explore with image-to-text models. Optical Character Recognition (OCR) is the foundation on which many encoder-decoder image-to-text models are built.

When you present an image to such a model as a sequence of patches, the text decoder generates tokens that spell out the characters shown in the image.

These models perform differently depending on their specialization. Two popular image-to-text models with great potential are TrOCR and ZhEn Latex OCR; each is efficient at a different kind of image-to-text task.

Learning Objective

Learn about the optimal use of both TrOCR and ZhEn Latex OCR.
Gain insight into the architecture of these models.
Run inference with image-to-text models and explore their use cases.
Understand the real-life applications of these models.

This article was published as a part of the Data Science Blogathon.

TrOCR: Encoder-Decoder Model for Image-to-Text
Architecture of TrOCR
How About ZhEn Latex OCR?
TrOCR Vs. ZhEn Latex OCR
How to Use TrOCR?
Using ZhEn Latex OCR for Mathematical and LaTeX Image Recognition
Improvements in TrOCR and ZhEn Latex OCR
Real-Life Application of OCR Models
Frequently Asked Questions

TrOCR: Encoder-Decoder Model for Image-to-Text

Transformer-based Optical Character Recognition (TrOCR) is an encoder-decoder model that reads the text in an image as a sequence. It combines an image transformer and a text transformer: the image transformer is the encoder, while the text transformer acts as the decoder.

The training of OCR models like this often goes unnoticed. TrOCR models fall into two categories. The first is the pre-trained models, also known as stage 1 models. These are trained on synthetic data generated at a large scale, so their dataset can include millions of images of printed text lines.

The second family is the fine-tuned models, which come after pre-training. These models are usually fine-tuned on the IAM handwritten text images and the SROIE printed receipts dataset. SROIE consists of thousands of samples of printed receipt text, and the fine-tuned models come in small, base, and large sizes: TrOCR-small-SROIE, TrOCR-base-SROIE, and TrOCR-large-SROIE.
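
To make the naming concrete, here is a minimal sketch of a helper that builds a checkpoint identifier for the Hugging Face Hub. The helper name `trocr_checkpoint` is hypothetical, and the pattern `microsoft/trocr-<size>-<variant>` is inferred from `microsoft/trocr-base-printed`, the checkpoint used later in this article. The mapping of "printed" to SROIE and "handwritten" to IAM is an assumption about how the fine-tuned checkpoints are published, so confirm that a given combination exists before relying on it.

```python
# Hypothetical helper: builds a TrOCR checkpoint name from a size and a variant.
# Assumption: "printed" checkpoints are the SROIE fine-tunes and "handwritten"
# checkpoints are the IAM fine-tunes.
SIZES = ("small", "base", "large")
VARIANTS = ("printed", "handwritten")


def trocr_checkpoint(size: str = "base", variant: str = "printed") -> str:
    if size not in SIZES:
        raise ValueError(f"size must be one of {SIZES}, got {size!r}")
    if variant not in VARIANTS:
        raise ValueError(f"variant must be one of {VARIANTS}, got {variant!r}")
    return f"microsoft/trocr-{size}-{variant}"


print(trocr_checkpoint("base", "printed"))  # microsoft/trocr-base-printed
```

Validating the arguments up front means a typo fails immediately rather than as a download error later.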

Architecture of TrOCR

OCR models have traditionally used CNN and RNN architectures: a CNN for understanding the image, and an RNN for generating the text character by character. However, in the case of the TrOCR model, the authors (Li et al.) opted for something different.

Vision and language transformer models were used to construct the TrOCR architecture, which brings to light the encoder-decoder mechanism we mentioned earlier. This architecture processes the data in two stages:

The encoder stage has a pre-trained vision transformer model.
The decoder stage consists of a pre-trained language transformer model.

The TrOCR model first breaks the image into patches and encodes them by passing them through a multi-head attention block, followed by a feed-forward block that produces image embeddings. After this, the language transformer model processes these embeddings, and its decoder generates the output as a sequence of token IDs.

Finally, these token IDs are decoded to extract the text from the image. One important part of this process is that images are resized to a fixed size and split into patches of 16×16 pixels before they are passed to the image encoder.
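
As a rough illustration of the patch arithmetic, the sketch below counts how many 16×16 patches an image yields. The 384×384 input size is an assumption made for the example (the article only states the patch size), and `count_patches` is a hypothetical helper, not part of any library.

```python
# Sketch of the patch arithmetic for a vision transformer encoder.
# Assumption: a 384x384 input image; only the 16x16 patch size comes from the text.
def count_patches(height: int, width: int, patch_size: int = 16) -> int:
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


print(count_patches(384, 384))  # 576
```

Each patch becomes one element of the sequence the encoder attends over, so the sequence length grows with the square of the image side.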

How About ZhEn Latex OCR?

MixTex’s ZhEn Latex OCR is another fascinating open-source model with great specialization. It also employs an encoder-decoder model to convert images to text. However, it is highly specialized in generating LaTeX code from images of mathematical formulas and text. ZhEn Latex OCR can recognize complex LaTeX math formulas with high accuracy, and it can also recognize tables and generate LaTeX table code.

A fascinating feature of this model is that it can recognize and differentiate between words, text, formulas, and tables while providing accurate recognition results. ZhEn Latex OCR is also bilingual, providing recognition in English and Chinese environments.

TrOCR Vs. ZhEn Latex OCR

TrOCR is great but works efficiently only on single-line text images. Thanks to its effective pre-training, it compares well on accuracy and run-time speed with other OCR models like EasyOCR. However, GPT-4o remains the most balanced in all aspects.

On the other hand, ZhEn Latex OCR works for mathematical formulas and code. Tools like Anki and Mathpix Snip can help with mathematical equations. But the former can be stressful because you have to retype the LaTeX formula, while the latter is limited on the free plan and has an expensive paid package.

ZhEn Latex OCR comes in handy to solve this problem. You input an image to the encoder, and the decoder transformer converts it to LaTeX. Gemini is another alternative to this model but is only great for solving general math problems. ZhEn Latex OCR’s specialization in converting images to LaTeX makes it stand out. This model also handles mixed content, recognizing and processing images that contain words, formulas, tables, and text.

TrOCR is efficient at extracting text from images with single-line text. For mathematical problems, you have many options, but ZhEn Latex OCR can help you with LaTeX recognition.

How to Use TrOCR?

We will explore using the TrOCR model that is fine-tuned on the SROIE dataset. This model is already tailored to deliver accurate results with one-line text images, and we will look at a few steps that make it run.

Step 1: Importing Tools from the Transformers Library

This code sets up the environment for OCR using the TrOCR model. It imports the tools needed for loading and processing images, and for making HTTP requests to fetch images from the internet.

from transformers import TrOCRProcessor, VisionEncoderDecoderModel
from PIL import Image
import requests

Step 2: Loading Image from the Database

To load an image from this database, you have to define the URL of an image from the IAM handwriting database, use the `requests` library to download the image from the specified URL, open the image using the `PIL.Image` module, and convert it to RGB format for consistent color processing. This is the first step of input to get the transformer model to encode the text on the image.

# load image from the IAM database (actually this model is meant to be used on printed text)
url = 'https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

Step 3: Initializing the TrOCR Model from its Pre-trained Processor

processor = TrOCRProcessor.from_pretrained('microsoft/trocr-base-printed')
model = VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-base-printed')
pixel_values = processor(images=image, return_tensors="pt").pixel_values

This step is to initialize the TrOCR model by loading the pre-trained processor. The TrOCRProcessor processes the input image, converting it into a format the model can understand. The processed image is then converted into a tensor format with pixel values, which are necessary for the model to perform OCR on the image. The final output, pixel_values, is the tensor representation of the image, ready to be fed into the model for text recognition.
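
To give a feel for what "pixel values" means here, the sketch below normalizes a single 8-bit channel value. It assumes the processor rescales values to the range 0 to 1 and then normalizes with a mean and standard deviation of 0.5; the actual settings live in the checkpoint's preprocessing configuration, so treat these numbers as an assumption. `normalize_pixel` is a hypothetical helper for illustration only.

```python
# Sketch of per-pixel preprocessing similar to what an image processor applies.
# Assumption: rescale by 1/255, then normalize with mean 0.5 and std 0.5.
def normalize_pixel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    if not 0 <= value <= 255:
        raise ValueError("expected an 8-bit channel value in [0, 255]")
    return (value / 255.0 - mean) / std


print(normalize_pixel(0))    # -1.0
print(normalize_pixel(255))  # 1.0
```

Under these assumptions, black maps to -1.0 and white maps to 1.0, which keeps the inputs centered around zero for the encoder.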

Step 4: Text Generation

In this step, the model takes the pixel values of the image as input and generates the output text as a sequence of token IDs. These token IDs are then decoded back into readable text. The code looks like this:

generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

In a notebook, you can display the image by evaluating ‘image’ in a cell. This helps us confirm the output.

image

This is a one-line text image. For this image, the model returns the text ‘INDLUS THE.’, and you can call ‘generated_text.lower()’ to get the same text in lowercase.

generated_text

generated_text.lower()

Note: the second line returns the output in lowercase.
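
The four steps above can be wrapped into two small functions. This is a minimal sketch rather than an official API: `load_trocr` and `recognize_line` are hypothetical names, and the only library calls used are the ones already shown in the steps above.

```python
def load_trocr(checkpoint: str = "microsoft/trocr-base-printed"):
    # Imported lazily so that recognize_line can be used or tested without
    # the transformers package installed.
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(checkpoint)
    model = VisionEncoderDecoderModel.from_pretrained(checkpoint)
    return processor, model


def recognize_line(image, processor, model) -> str:
    # Image -> pixel tensor -> token IDs -> decoded string.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

To use it, call `processor, model = load_trocr()` once and then `recognize_line(image, processor, model)` for each single-line image. Loading the model once and reusing it avoids repeating the download and initialization for every image.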

Using ZhEn Latex OCR for Mathematical and LaTeX Image Recognition

ZhEn Latex OCR recognizes mathematical formulas and equations. Its architecture is similar to that of TrOCR models, employing a vision encoder-decoder model.

Let us look at a few steps for running this model to recognize images with LaTeX.

Step 1: Importing the Necessary Modules

from transformers import AutoTokenizer, VisionEncoderDecoderModel, AutoImageProcessor
from PIL import Image
import requests


feature_extractor = AutoImageProcessor.from_pretrained("MixTex/ZhEn-Latex-OCR")
tokenizer = AutoTokenizer.from_pretrained("MixTex/ZhEn-Latex-OCR", max_len=296)
model = VisionEncoderDecoderModel.from_pretrained("MixTex/ZhEn-Latex-OCR")

This code initializes an OCR pipeline using the ZhEn Latex OCR model. It imports the necessary modules and loads a pre-trained image processor (`AutoImageProcessor`) and tokenizer (`AutoTokenizer`) from the ZhEn Latex OCR checkpoint. These components are configured to handle images and text tokens for LaTeX symbol recognition.

The `VisionEncoderDecoderModel` is also loaded from the same ZhEn Latex OCR checkpoint. Combined, these components process images and generate LaTeX-formatted text.

Step 2: Loading Image and Printing through the Model Decoder

# English sample image; the commented line below loads a Chinese sample instead
imgen = Image.open(requests.get('https://cdn-uploads.huggingface.co/production/uploads/62dbaade36292040577d2d4f/eOAym7FZDsjic_8ptsC-H.png', stream=True).raw)
#imgzh = Image.open(requests.get('https://cdn-uploads.huggingface.co/production/uploads/62dbaade36292040577d2d4f/m-oVg8dsQbQZ1fDWbwKtO.png', stream=True).raw)
pixel_values = feature_extractor(imgen, return_tensors="pt").pixel_values
latex = tokenizer.decode(model.generate(pixel_values)[0])
print(latex.replace('\\[','\\begin{align*}').replace('\\]','\\end{align*}'))

In this step, we load the image using the ‘PIL.Image’ module before processing it. The ‘feature_extractor’ in this code converts the image to a tensor format suitable for ZhEn Latex OCR.

The model.generate() function then generates LaTeX code from the image, and the resulting token IDs are decoded into a readable format using the tokenizer.decode() method. Finally, the decoded LaTeX code is printed, with specific replacements made to format the output with \begin{align*} and \end{align*} tags.
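
The replacement step can be isolated into a small function so it is easy to test on its own. This is a sketch with a hypothetical name, `to_align_env`, and it uses only plain string replacement, exactly as in the code above.

```python
def to_align_env(latex: str) -> str:
    # Swap the display-math delimiters \[ and \] for an align* environment.
    return latex.replace("\\[", "\\begin{align*}").replace("\\]", "\\end{align*}")


sample = r"\[ a^2 + b^2 = c^2 \]"
print(to_align_env(sample))  # \begin{align*} a^2 + b^2 = c^2 \end{align*}
```

Inline delimiters such as `\(` and `\)` are left untouched, since only display formulas are converted.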

The LaTeX output for the image is shown in the code block below:

\begin{align*} 
\widetilde{t}_{j,k}^{\left[ p,q,L1\right] }=\frac{t_{j,k+\widetilde{p}-1}-t_{j,k+1}}{t_{j,k+\widetilde{p}}-t_{j,k}}\widetilde{t}_{j,k}^{\left[ p,q,L1b\right] }, 
 \end{align*} 
capabilities and protocols that employ the XOR operator can be modeled by these theories. Our 
 \begin{align*} 
\mathrm{eu}\,\,\mathbb{H}^{*}\left(S^3_{-d}(K),a\right)=-\sum_{\substack{j\equiv a(\mathrm{mod}\,d)\\ 0\leq j\leq M}}\mathrm{eu}\,\,\mathbb{H}^{*}\left(T_j,W\right).
 \end{align*} 
reduction allows us to carry out protocol analysis by  \(-537\) tools, such as ProVerif, that cannot deal with XOR, but are very efficient in the XORfree case. We
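
Because the output mixes display formulas with ordinary text, it can be handy to pull out just the formulas. The sketch below does this with a regular expression from Python's standard library. `extract_formulas` is a hypothetical helper, and the sample string is made up for the example rather than taken from the model.

```python
import re

# Matches each align* environment, including formulas that span several lines.
ALIGN_BLOCK = re.compile(r"\\begin\{align\*\}(.*?)\\end\{align\*\}", re.DOTALL)


def extract_formulas(output: str) -> list[str]:
    # Return the body of every align* block, stripped of surrounding whitespace.
    return [body.strip() for body in ALIGN_BLOCK.findall(output)]


sample = (
    "\\begin{align*}\nE = mc^2\n\\end{align*}\n"
    "some surrounding prose\n"
    "\\begin{align*}\na + b = c\n\\end{align*}\n"
)
print(extract_formulas(sample))  # ['E = mc^2', 'a + b = c']
```

The non-greedy `.*?` keeps each match inside a single environment, so prose between two formulas is not swallowed.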

In a notebook, evaluating ‘imgen’ in a cell displays the image of the equation.

imgen

Improvements in TrOCR and ZhEn Latex OCR

Both models have some limitations, which can be improved in future updates. TrOCR cannot effectively recognize curved texts and images. It also has limitations with images of natural scenes such as banners, billboards, and costumes.

This problem concerns the vision and language transformer models. If the vision transformer model has seen curved texts, it could recognize such images. Similarly, the language transformer would need to understand the different tokens within the texts.

On the other hand, ZhEn Latex OCR could also use some updates. This model currently supports only formulas in printed fonts and simple tables. An upgrade would help it convert complex tables into LaTeX code and work with handwritten mathematical formulas.

Real-Life Application of OCR Models

Many use cases and applications of OCR models exist in the modern digital space. The best part is how useful OCR models can be to different industries. Here are just a few applications of this technology in different industries.

Finance: This technology can help extract data from receipts, invoices, and bank statements. The process has a huge advantage, as accuracy and efficiency can be improved.
Healthcare: This is another vital industry that needs the accuracy of records that OCR technology brings. OCR software can help by converting patients’ records into digital formats. It can also extract data from handwritten prescriptions, streamlining the medication process and minimizing errors.
Government: Public offices can use this technology to enhance various application processes. OCR models can be helpful in record keeping, form processing, and digitizing all government documents.

Conclusion

OCR models like TrOCR and ZhEn Latex OCR efficiently perform image-to-text and image-to-LaTeX tasks. They reduce errors and provide useful applications in different industries. However, it is important to note that these models have strengths and weaknesses, so using each of them for what it does best is the best way to achieve accuracy.

Key Takeaways

These models have many talking points as they have unique and specific strengths with their architecture. Here are some of the key takeaways from the use cases of TrOCR and Zhen Latex OCR models:

TrOCR is suitable for processing single-line text images, using its encoder-decoder architecture to generate accurate text outputs.
ZhEn Latex OCR excels at recognizing and converting complex mathematical formulas and LaTeX code from images, making it highly specialized for academic and technical purposes.
While both models have unique strengths, optimizing them for specific use cases—like TrOCR for printed text and ZhEn Latex OCR for LaTeX and mathematical content—yields the best results.

Frequently Asked Questions

Q1: What is the primary difference between TrOCR and ZhEn Latex OCR?

A: TrOCR specializes in extracting text from images of printed and handwritten text. On the other hand, ZhEn Latex OCR converts images of mathematical equations into LaTeX code.

Q2: When should I use ZhEn Latex OCR over TrOCR?

A: Use ZhEn Latex OCR when dealing with mathematical formulas or LaTeX code. Use TrOCR when extracting plain text from images, especially single-line text, as it is optimized for this task.

Q3: Can ZhEn Latex OCR handle handwritten mathematical equations?

A: ZhEn Latex OCR currently does not support handwritten mathematical equations. However, upgrades being considered would bring improvements, such as multimodal features, bilingual support, and a handwritten database for mathematical equations.

Q4: What industries can benefit from OCR models?

A: OCR models benefit industries like finance for data extraction, healthcare for digitizing patient records, banking for customer transactional records, and government for processing and digitizing documents.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Maigari David

Hey there! I'm David Maigari, a dynamic professional with a passion for technical writing, Web Development, and the AI world. David is also an enthusiast of ML/AI innovations. Reach out to me on X (Twitter) at @maigari_david
