Exploring Microsoft’s UDOP: Integrated DocumentAI

Mobarak Inuwa Last Updated : 24 Jun, 2024

Introduction

Microsoft Research has introduced Universal Document Processing (UDOP), a Document AI model that handles text, image, and layout analysis in a single framework, so one model can both understand and generate documents. This article covers UDOP’s architecture, capabilities, and real-world applications, and then walks through running inference with it.

As we will see, UDOP unifies the text, image, and layout modalities together with varied task formats, including document understanding and generation. It uses the spatial correlation between textual content and the document image to model all three modalities with one uniform representation.

Learning Objectives

Understand the architecture and capabilities of UDOP in handling diverse document modalities.
Explore the real-world applications and use cases of UDOP across industries.
Learn how UDOP sets new benchmarks in Document AI tasks and its impact on streamlining document processing workflows.

This article was published as a part of the Data Science Blogathon.

Introduction
UDOP Architecture Overview
UDOP’s Technical Foundation
Key Features and Benefits of UDOP
Use Cases of UDOP
Building a Document AI with UDOP
How to Leverage UDOP?
Conclusion
Frequently Asked Questions

UDOP Architecture Overview

UDOP’s architecture is based on a Vision-Text-Layout (VTL) Transformer, which dynamically fuses and unites image pixels and text tokens based on layout information. The VTL encoder consists of a unified representation for vision, text, and layout, which is achieved through layout-induced vision-text embedding. This embedding is then fed into the VTL transformer encoder.
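To make "fusing based on layout" more concrete, here is a minimal sketch of one way to picture it. As I read the UDOP paper, each text token is paired with the image patch that contains the center of the token's bounding box, and the two embeddings are added. The function names, the 224-pixel image size, the 16-pixel patch size, and the toy embeddings below are all assumptions chosen for illustration; this is not the model's actual implementation.

```python
def patch_index_for_box(bbox, image_size=224, patch_size=16):
    """Return the row-major index of the patch containing the box center.

    bbox is (x0, y0, x1, y1) in pixels of a square image_size x image_size image.
    image_size and patch_size are illustrative defaults, not confirmed settings.
    """
    x0, y0, x1, y1 = bbox
    patches_per_side = image_size // patch_size
    cx = (x0 + x1) / 2
    cy = (y0 + y1) / 2
    # Clamp so a center on the right or bottom edge still maps to a valid patch.
    col = min(int(cx // patch_size), patches_per_side - 1)
    row = min(int(cy // patch_size), patches_per_side - 1)
    return row * patches_per_side + col


def fuse(text_embedding, patch_embeddings, bbox):
    """Add the matching patch embedding to a token embedding, element-wise."""
    patch = patch_embeddings[patch_index_for_box(bbox)]
    return [t + p for t, p in zip(text_embedding, patch)]


# Toy data: 14 x 14 = 196 patches, each with a 2-dimensional embedding.
toy_patches = [[float(i), 0.0] for i in range(196)]
token_box = (20, 40, 44, 56)  # center at (32, 48) -> column 2, row 3
print(patch_index_for_box(token_box))
print(fuse([1.0, 1.0], toy_patches, token_box))
```

In this picture, patches that contain no text keep their own vision embedding, so the encoder sees one sequence that mixes fused text-vision tokens with plain vision tokens.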

The VTL decoder consists of a text-layout decoder and a vision decoder. The text-layout decoder is a uni-directional Transformer decoder that generates text and layout tokens in a sequence-to-sequence manner. The vision decoder, on the other hand, adopts the decoder of MAE and directly generates image pixels with text and layout information.

UDOP unifies text, image, and layout analysis in a single framework built on the Vision-Text-Layout Transformer, and the large model has 742M parameters. It is reported to outperform comparable models available on Hugging Face, such as Donut and Pix2Struct, on document processing tasks. UDOP works together with a traditional OCR engine, which supplies the words and their bounding boxes, so it combines the strengths of generative models and OCR technologies for document processing and customization.

UDOP’s Technical Foundation

UDOP’s Vision-Text-Layout Transformer is based on the T5 architecture, seamlessly integrating text, image, and layout modalities. This unified representation simplifies preprocessing and enhances model performance. The model is pre-trained on large-scale unlabeled document corpora using innovative self-supervised objectives, achieving state-of-the-art performance across diverse Document AI tasks.

UDOP uses a unified generative pretraining approach, which includes both self-supervised and supervised pretraining tasks. The self-supervised pretraining tasks include layout modeling, visual text recognition, joint text-layout reconstruction, and masked image reconstruction with text and layout. The supervised pretraining tasks include classification, layout analysis, information extraction, question answering, and document natural language inference.

Key Features and Benefits of UDOP

Unified Representation: UDOP’s Vision-Text-Layout Transformer represents text, image, and layout in one form, which simplifies preprocessing.
State-of-the-Art Performance: It achieves state-of-the-art performance across diverse Document AI tasks, outperforming similar models.
Neural Document Editing: It supports neural document editing and content customization.
Vision-Text-Layout Transformer: An encoder-decoder Transformer architecture based on T5 that integrates the text, image, and layout modalities.
Self-Supervised Pretraining: UDOP is pretrained on large-scale unlabeled document corpora with self-supervised objectives.

Use Cases of UDOP

Classification Task: The task uses the RVL-CDIP dataset to predict document types, with the prompt “Document Classification on (Dataset Name)” followed by text tokens, aiming to determine the document class.
Layout Analysis Task: In this task, the goal is to predict the locations of entities within the document, such as titles, paragraphs, etc. UDOP utilizes the PubLayNet dataset. The task prompt format is “Layout Analysis on (Dataset Name),” followed by the entity name. The target output includes all bounding boxes covering the specified entity.
Information Extraction Task: UDOP uses datasets such as DocBank, Kleister Charity (KLC), PWC, and DeepForm to predict an entity's type and location from a text query. The output consists of the entity label and its bounding box.
Question Answering Task: Here, the aim is to answer questions related to the document image. UDOP uses datasets such as WebSRC, VisualMRC, DocVQA, InfographicsVQA, and WTQ (WikiTableQuestions) for this task. The task prompt format is “Question Answering on (Dataset Name),” followed by the question and all document tokens. The target is to provide the answer.
Document NLI Task: This task involves predicting the entailment relationship between two sentences within a document. The task prompt format is “Document NLI on (Dataset Name),” followed by the sentence pair. The target output indicates whether the sentences entail each other or not. For this task, UDOP utilizes the TabFact dataset.
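The prompt formats listed above all follow the same "task name on dataset name" pattern, so they can be assembled with a small helper. This is only a sketch: build_task_prompt and TASK_PREFIXES are hypothetical names rather than part of UDOP or transformers, and the period used as a separator is my assumption, since the formats above do not specify the exact punctuation.

```python
# Illustrative only: these names are hypothetical, not a UDOP or transformers API.
TASK_PREFIXES = {
    "classification": "Document Classification on {dataset}",
    "layout_analysis": "Layout Analysis on {dataset}",
    "question_answering": "Question Answering on {dataset}",
    "document_nli": "Document NLI on {dataset}",
}


def build_task_prompt(task, dataset, query=""):
    """Return the task prefix for a dataset, optionally followed by a query.

    The ". " separator is an assumption made for readability.
    """
    if task not in TASK_PREFIXES:
        raise ValueError(f"Unknown task: {task!r}")
    prefix = TASK_PREFIXES[task].format(dataset=dataset)
    query = query.strip()
    return f"{prefix}. {query}" if query else f"{prefix}."


print(build_task_prompt("question_answering", "DocVQA", "What is the title?"))
print(build_task_prompt("layout_analysis", "PubLayNet", "Title"))
```

Keeping the prefixes in one dictionary makes it easy to check that every request uses a task name the model saw during pretraining.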

The results of UDOP on various datasets, including DUE-Benchmark, FUNSD, CORD, and RVL-CDIP, demonstrate its state-of-the-art performance in document AI tasks. UDOP achieves high-quality neural document editing and content customization, making it a versatile and powerful tool for document processing tasks.

Building a Document AI with UDOP

Let us now build a document AI with UDOP:

Running inference with UDOP

Here we will write a wrapper script and run an inference to see how well the model works and what it needs to work. I have also prepared a Django project to help you start with the model usage here.

Step 1: Install Dependencies

Let us first install all dependencies that we require to build a document AI with UDOP.

# Install required libraries
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q sentencepiece
!sudo apt install tesseract-ocr
!pip install -q pytesseract

# Install for GPU Device
!pip install -U accelerate

# Check Tesseract version
!tesseract --version

Step 2: Import Dependencies and Load the Model

Let us now load the model.

# Import necessary libraries
from transformers import UdopProcessor, UdopForConditionalGeneration
from huggingface_hub import hf_hub_download
from PIL import Image
import torch

# Model repository on the Hugging Face Hub
repo_id = "microsoft/udop-large"

# Load processor and model
processor = UdopProcessor.from_pretrained(repo_id)
model = UdopForConditionalGeneration.from_pretrained(repo_id)

Now we will write a function for inference.

def perform_inference(prompt, image_path, max_tokens=200):
    # Load image
    image = Image.open(image_path).convert("RGB")

    # Encode prompt and image
    encoding = processor(images=image, text=prompt, return_tensors="pt")

    # Generate outputs using the model
    outputs = model.generate(**encoding, max_new_tokens=max_tokens)

    # Decode the generated text
    generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]

    return generated_text

Step 3: Loading the Document

The next step is to load the document.

import matplotlib.pyplot as plt

# Example usage:
image_path = hf_hub_download(
    repo_id="hf-internal-testing/fixtures_docvqa", filename="document_2.png", repo_type="dataset"
)

fig, ax = plt.subplots(figsize=(12, 10))  # figure size in inches (width, height)
ax.imshow(plt.imread(image_path))
plt.show()

Step 4: Testing the Model

Now let us test the model.

Example 1:

prompt = "Question answering. What is the title of the presentation?"
result = perform_inference(prompt, image_path)
print(result)

ITC's Brands: An Asset for the Nation

Example 2:

prompt = "Layout Analysis. Where are the cover images located?"
result = perform_inference(prompt, image_path)
print(result)

India

Example 3:

prompt = "Information Extraction. What is the text content?"
result = perform_inference(prompt, image_path)
print(result)

ITC's new FMCG businesses are the fastest growing among the top consumer goods companies operating in India.

How to Leverage UDOP?

To achieve optimal results with UDOP, users should:

Pass bounding boxes (bbox) alongside the input IDs to UdopForConditionalGeneration
Normalize each bounding box, given as (x0, y0, x1, y1), to a 0-1000 scale relative to the page width and height
Prepare images and text with UdopProcessor, which can run OCR and build these inputs for you
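The bounding box normalization mentioned above can be sketched as follows. This mirrors the helper shown in the Hugging Face documentation for layout-aware models, which expects boxes as (x0, y0, x1, y1) scaled to a 0-1000 range. The example box and page size are made-up values.

```python
def normalize_bbox(bbox, width, height):
    """Scale a pixel-space (x0, y0, x1, y1) box to the 0-1000 range.

    width and height are the dimensions of the page image in pixels.
    """
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]


# Made-up example: a word box on a 1000 x 500 pixel page image.
print(normalize_bbox((100, 50, 300, 200), width=1000, height=500))
# [100, 100, 300, 400]
```

This step is only needed when you supply your own words and boxes. When UdopProcessor runs OCR itself, as in the walkthrough above, it prepares the boxes for you.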

UDOP’s impact extends to various domains and use cases, including:

Automated Data Extraction: Convert paper-based documents into editable and searchable text, streamlining data entry processes.
Compliance Checks: Scan and review legal and regulatory documents, ensuring compliance with industry standards.
Analytical Insights: Perform deep analysis of business reports and financial statements, extracting actionable insights for decision-making.

Conclusion

UDOP is a Document AI model that unifies the text, image, and layout modalities together with varied task formats. Its architecture, based on a Vision-Text-Layout Transformer, fuses image pixels and text tokens according to layout information. Combined with its unified generative pretraining approach, this gives UDOP state-of-the-art results on Document AI benchmarks and lets a single model handle both understanding and generation tasks. Its versatility, performance, and real-world applications make it a useful tool for businesses and organizations seeking to improve document processing efficiency and accuracy.

Key Takeaway

UDOP’s results on various datasets demonstrate its state-of-the-art performance in document AI tasks.
UDOP’s impact extends to various domains and use cases, including automated data extraction, compliance checks, and analytical insights.
UDOP’s modular architecture and flexibility make it easy to integrate with other AI models and customize for specific use cases.

Frequently Asked Questions

Q1. What distinguishes UDOP from other Document AI models?

A. UDOP’s integration of text, image, and layout modalities sets it apart, enabling comprehensive document understanding and generation.

Q2. Can UDOP handle complex document structures and formats?

A. Yes, its Vision-Text-Layout Transformer architecture is designed to handle complex document structures effectively.

Q3. Is UDOP available for commercial use?

A. Yes, it is available for commercial use and can be accessed through the provided documentation and resources.

Q4. What are some best practices for integrating UDOP into existing workflows?

A. Best practices include thorough data preprocessing, leveraging task-specific prompts, and fine-tuning the model for domain-specific tasks.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Mobarak Inuwa

I am an AI Engineer with a deep passion for research, and solving complex problems. I provide AI solutions leveraging Large Language Models (LLMs), GenAI, Transformer Models, and Stable Diffusion.
