Exploring Microsoft’s UDOP: Integrated Document AI

Mobarak Inuwa | Last Updated: 24 Jun, 2024

Introduction

Microsoft Research has introduced a groundbreaking Document AI model called Universal Document Processing (UDOP), which represents a significant leap in AI capabilities. UDOP integrates text, image, and layout analysis in a single framework, enabling the understanding and generation of documents with unprecedented accuracy and efficiency. This article delves into the technical intricacies, capabilities, and real-world applications of UDOP, a game-changing model that is set to revolutionize document processing workflows.

As we will see, UDOP unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. It leverages the spatial correlation between textual content and the document image to model all three modalities with one uniform representation.

Learning Objectives

  • Understand the architecture and capabilities of UDOP in handling diverse document modalities.
  • Explore the real-world applications and use cases of UDOP across industries.
  • Learn how UDOP sets new benchmarks in Document AI tasks and its impact on streamlining document processing workflows.

This article was published as a part of the Data Science Blogathon.

UDOP Architecture Overview

UDOP’s architecture is based on a Vision-Text-Layout (VTL) Transformer, which dynamically fuses image pixels and text tokens based on layout information. The VTL encoder builds a unified representation of vision, text, and layout through a layout-induced vision-text embedding, which is then fed into the Transformer encoder stack.

The VTL decoder consists of a text-layout decoder and a vision decoder. The text-layout decoder is a uni-directional Transformer decoder that generates text and layout tokens in a sequence-to-sequence manner. The vision decoder, on the other hand, adopts the decoder of the Masked Autoencoder (MAE) and directly generates image pixels conditioned on text and layout information.


UDOP is a large model, with 742M parameters, that unifies text, image, and layout analysis in a single framework built on the Vision-Text-Layout Transformer. It achieves strong performance in document processing tasks, outperforming similar models like Donut and Pix2Struct on Hugging Face. By integrating a traditional OCR engine, UDOP combines the strengths of generative models and OCR technologies, enabling advanced document processing and customization capabilities.
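You can verify the reported model size yourself. The following is a minimal sketch that loads the public microsoft/udop-large checkpoint from Hugging Face and counts its parameters (the weights download on first run):

from transformers import UdopForConditionalGeneration

# Load the public UDOP-large checkpoint (weights download on first run)
model = UdopForConditionalGeneration.from_pretrained("microsoft/udop-large")

# Count parameters; for udop-large this should come out to roughly 742M
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e6:.0f}M")

# The model is a T5-style encoder-decoder Transformer
print(type(model.get_encoder()).__name__, type(model.get_decoder()).__name__)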

UDOP’s Technical Foundation

UDOP’s Vision-Text-Layout Transformer is based on the T5 architecture, seamlessly integrating text, image, and layout modalities. This unified representation simplifies preprocessing and enhances model performance. The model is pre-trained on large-scale unlabeled document corpora using innovative self-supervised objectives, achieving state-of-the-art performance across diverse Document AI tasks.

UDOP uses a unified generative pretraining approach, which includes both self-supervised and supervised pretraining tasks. The self-supervised pretraining tasks include layout modeling, visual text recognition, joint text-layout reconstruction, and masked image reconstruction with text and layout. The supervised pretraining tasks include classification, layout analysis, information extraction, question answering, and document natural language inference.

Key Features and Benefits of UDOP

  • Unified Representation: UDOP’s Vision-Text-Layout Transformer provides a unified representation of documents, simplifying preprocessing and enhancing model performance.
  • State-of-the-Art Performance: It achieves state-of-the-art performance across diverse Document AI tasks, outperforming similar models.
  • Neural Document Editing: It excels in neural document editing and content customization, surpassing previous benchmarks and setting new standards in the field.
  • Vision-Text-Layout Transformer: An encoder-decoder Transformer architecture based on T5, seamlessly integrating text, image, and layout modalities. This unified representation simplifies preprocessing and enhances model performance.
  • Self-Supervised Pretraining: Pretrained on large-scale unlabeled document corpora, UDOP employs innovative self-supervised objectives to achieve state-of-the-art performance across diverse Document AI tasks.

Use Cases of UDOP

  • Classification Task: This task uses the RVL-CDIP dataset to predict the document type. The prompt is “Document Classification on (Dataset Name),” followed by the document’s text tokens, and the target is the document class.
  • Layout Analysis Task: In this task, the goal is to predict the locations of entities within the document, such as titles, paragraphs, etc. UDOP utilizes the PubLayNet dataset. The task prompt format is “Layout Analysis on (Dataset Name),” followed by the entity name. The target output includes all bounding boxes covering the specified entity.
  • Information Extraction Task: UDOP uses datasets such as DocBank, Kleister Charity (KLC), PWC, and DeepForm to predict an entity’s type and location from a text query; the output consists of the entity label and its bounding box.
  • Question Answering Task: Here, the aim is to answer questions related to the document image. UDOP uses datasets such as WebSRC, VisualMRC, DocVQA, InfographicsVQA, and WTQ (WikiTableQuestions) for this task. The task prompt format is “Question Answering on (Dataset Name),” followed by the question and all document tokens. The target is to provide the answer.
  • Document NLI Task: This task involves predicting the entailment relationship between two sentences within a document. The task prompt format is “Document NLI on (Dataset Name),” followed by the sentence pair. The target output indicates whether the sentences entail each other or not. For this task, UDOP utilizes the TabFact dataset. (The sketch after this list shows these prompt patterns in code.)
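To make these prompt conventions concrete, here is a minimal sketch that assembles the task prompts as plain strings. The helper names and example arguments are hypothetical, for illustration only; UDOP consumes these prompts as ordinary text, not through a special API:

# Hypothetical helpers mirroring the task prompt formats described above
def classification_prompt(dataset_name):
    return f"Document Classification on {dataset_name}."

def layout_analysis_prompt(dataset_name, entity):
    return f"Layout Analysis on {dataset_name}. {entity}"

def question_answering_prompt(dataset_name, question):
    return f"Question Answering on {dataset_name}. {question}"

print(question_answering_prompt("DocVQA", "What is the title of the presentation?"))
# Question Answering on DocVQA. What is the title of the presentation?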

The results of UDOP on various datasets, including DUE-Benchmark, FUNSD, CORD, and RVL-CDIP, demonstrate its state-of-the-art performance in document AI tasks. UDOP achieves high-quality neural document editing and content customization, making it a versatile and powerful tool for document processing tasks.

Building a Document AI with UDOP

Let us now build a document AI with UDOP:

Running inference with UDOP

Here we will write a wrapper script and run inference to see how well the model works and what it needs. I have also prepared a Django project to help you get started with the model.

Step 1: Install Dependencies

Let us first install all dependencies that we require to build a document AI with UDOP.

# Install required libraries
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q sentencepiece
!sudo apt install tesseract-ocr
!pip install -q pytesseract

# Install for GPU Device
!pip install -U accelerate

# Check Tesseract version
!tesseract --version
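Optionally, before loading the model, you can confirm that PyTorch actually sees a GPU; UDOP-large also runs on CPU, just much more slowly:

import torch

# Optional sanity check before loading the model
print("CUDA available:", torch.cuda.is_available())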

Step 2: Import Dependencies and Load the Model

Let us now import the libraries and load the processor and model.

# Import necessary libraries
from transformers import UdopProcessor, UdopForConditionalGeneration
from huggingface_hub import hf_hub_download
from PIL import Image
import torch

# Load processor and model from the Hugging Face Hub
repo_id = "microsoft/udop-large"
processor = UdopProcessor.from_pretrained(repo_id)
model = UdopForConditionalGeneration.from_pretrained(repo_id)

# Move the model to GPU if one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

Now we will write a function for inference.

def perform_inference(prompt, image_path, max_tokens=200):
    # Load the document image
    image = Image.open(image_path).convert("RGB")

    # Encode prompt and image; by default the processor also runs Tesseract OCR
    # on the image to extract the document's words and bounding boxes
    encoding = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

    # Generate output tokens with the model
    outputs = model.generate(**encoding, max_new_tokens=max_tokens)

    # Decode the generated tokens back into text
    generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]

    return generated_text

Step 3: Load the Document

The next step is to load a sample document and display it.

import matplotlib.pyplot as plt

# Example usage:
image_path = hf_hub_download(
    repo_id="hf-internal-testing/fixtures_docvqa", filename="document_2.png", repo_type="dataset"
)

fig, ax = plt.subplots(figsize=(12, 10))  # set the figure size in inches
ax.imshow(plt.imread(image_path))
plt.show()
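Before querying the model, it can be useful to peek at what UdopProcessor produces for this document. A minimal sketch, assuming the default processor configuration, which runs Tesseract OCR on the image to extract words and bounding boxes:

# Inspect the processor output for this document (default config runs Tesseract OCR)
sample = processor(
    images=Image.open(image_path).convert("RGB"),
    text="Question answering. What is the title of the presentation?",
    return_tensors="pt",
)
print(list(sample.keys()))        # e.g. input_ids, attention_mask, bbox, pixel_values
print(sample["input_ids"].shape)  # prompt tokens plus the OCR'd document tokens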

Step 4: Test the Model

Now let us test the model.

Example 1:

prompt = "Question answering. What is the title of the presentation?"
result = perform_inference(prompt, image_path)
print(result)
ITC's Brands: An Asset for the Nation

Example 2:

prompt = "Layout Analysis. Where are the cover images located?"
result = perform_inference(prompt, image_path)
print(result)
India

Example 3:

prompt = "Information Extraction. What is the text content?"
result = perform_inference(prompt, image_path)
print(result)
ITC's new FMCG businesses are the fastest growing among the top consumer goods companies operating in India.

How to Leverage UDOP?

To achieve optimal results with UDOP, users should:

  • Provide bounding boxes alongside the input IDs when using UdopForConditionalGeneration directly
  • Normalize bounding boxes to the coordinate scale the model expects
  • Prepare images and text appropriately with UdopProcessor, which handles OCR and tokenization (see the sketch after this list)
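The sketch below illustrates the first two points: supplying your own OCR words and already-normalized bounding boxes instead of relying on the processor’s built-in Tesseract OCR. Treat it as a sketch rather than a definitive recipe: the words and box coordinates are made-up placeholders, and the call signature (text for the prompt, text_pair for the words, boxes for their coordinates) mirrors the related LayoutLM processors, so verify it against the UdopProcessor documentation for your transformers version:

# Minimal sketch: provide your own words and bounding boxes (placeholders here)
# instead of letting UdopProcessor run Tesseract OCR internally.
processor.image_processor.apply_ocr = False  # assumption: we supply words/boxes ourselves

words = ["ITC's", "Brands:", "An", "Asset", "for", "the", "Nation"]
# One (x0, y0, x1, y1) box per word, normalized to the scale the checkpoint expects
boxes = [[80, 40, 150, 60], [160, 40, 250, 60], [260, 40, 290, 60],
         [300, 40, 370, 60], [380, 40, 410, 60], [420, 40, 450, 60],
         [460, 40, 540, 60]]

image = Image.open(image_path).convert("RGB")
encoding = processor(
    images=image,
    text="Question answering. What is the title of the presentation?",
    text_pair=words,
    boxes=boxes,
    return_tensors="pt",
)
outputs = model.generate(**encoding.to(model.device), max_new_tokens=50)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])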

UDOP’s impact extends to various domains and use cases, including:

  • Automated Data Extraction: Convert paper-based documents into editable and searchable text, streamlining data entry processes.
  • Compliance Checks: Scan and review legal and regulatory documents, ensuring compliance with industry standards.
  • Analytical Insights: Perform deep analysis of business reports and financial statements, extracting actionable insights for decision-making.

Conclusion

UDOP is a revolutionary Document AI model that unifies text, image, and layout modalities together with varied task formats. Its Vision-Text-Layout Transformer dynamically fuses image pixels and text tokens based on layout information, and its unified generative pretraining delivers state-of-the-art performance across Document AI tasks. Representing a paradigm shift in the field, UDOP offers a single model that handles document understanding and generation seamlessly; its versatility, performance, and real-world applications make it an indispensable tool for businesses and organizations seeking to enhance document processing efficiency and accuracy.

Key Takeaways

  • UDOP’s results on various datasets demonstrate its state-of-the-art performance in document AI tasks.
  • UDOP’s impact extends to various domains and use cases, including automated data extraction, compliance checks, and analytical insights.
  • UDOP’s modular architecture and flexibility make it easy to integrate with other AI models and customize for specific use cases.

Frequently Asked Questions

Q1. What distinguishes UDOP from other Document AI models?

A. UDOP’s integration of text, image, and layout modalities sets it apart, enabling comprehensive document understanding and generation.

Q2. Can UDOP handle complex document structures and formats?

A. Yes, its Vision-Text-Layout Transformer architecture is designed to handle complex document structures effectively.

Q3. Is UDOP available for commercial use?

A. Yes, it is available for commercial use and can be accessed through the provided documentation and resources.

Q4. What are some best practices for integrating UDOP into existing workflows?

A. Best practices include thorough data preprocessing, leveraging task-specific prompts, and fine-tuning the model for domain-specific tasks.


The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

I am an AI Engineer with a deep passion for research and for solving complex problems. I provide AI solutions leveraging Large Language Models (LLMs), GenAI, Transformer Models, and Stable Diffusion.
