Microsoft Research has introduced Universal Document Processing (UDOP), a Document AI model that integrates text, image, and layout analysis in a single framework for both understanding and generating documents. This article covers UDOP's architecture, capabilities, and real-world applications, and then walks through running inference with the model.
As we will see, UDOP unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. It exploits the spatial correlation between a document's textual content and its image to model all three modalities with one uniform representation.
This article was published as a part of the Data Science Blogathon.
UDOP’s architecture is based on a Vision-Text-Layout (VTL) Transformer, which dynamically fuses image pixels and text tokens based on layout information. The VTL encoder builds a unified representation of vision, text, and layout through a layout-induced vision-text embedding, which is then fed into the Transformer encoder.
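To make the fusion step more concrete, here is a minimal sketch of the idea in plain Python. It is not UDOP's actual implementation. It assumes the image is cut into a grid of fixed-size patches, that every text token comes with an OCR bounding box, and that a token is fused with the patch containing the centre of its box by adding the two embeddings. Patches that contain no token are kept as standalone vision entries. The function names, the toy 2-dimensional embeddings, and the patch size are all illustrative assumptions.

```python
def patch_index(box, patch_size, patches_per_row):
    """Return the index of the patch containing the centre of `box`.

    `box` is (x0, y0, x1, y1) in pixels; patches are numbered row by row.
    """
    cx = (box[0] + box[2]) // 2
    cy = (box[1] + box[3]) // 2
    return (cy // patch_size) * patches_per_row + (cx // patch_size)


def fuse(token_embs, token_boxes, patch_embs, patch_size, patches_per_row):
    """Toy layout-induced fusion (illustrative only).

    Each token embedding is summed with the embedding of the patch that
    holds the token's box centre. Patches without any token are returned
    separately so they can still be passed to the encoder.
    """
    used = set()
    fused = []
    for emb, box in zip(token_embs, token_boxes):
        idx = patch_index(box, patch_size, patches_per_row)
        used.add(idx)
        fused.append([t + p for t, p in zip(emb, patch_embs[idx])])
    leftover = [p for i, p in enumerate(patch_embs) if i not in used]
    return fused, leftover


# A 2x2 grid of 16-pixel patches with made-up 2-d embeddings.
patch_embs = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
token_embs = [[0.5, 0.5], [0.25, 0.75]]
token_boxes = [(2, 2, 10, 10), (18, 20, 30, 28)]  # centres fall in patch 0 and patch 3

fused, leftover = fuse(token_embs, token_boxes, patch_embs, 16, 2)
print(fused)     # token embeddings with their patch embeddings added
print(leftover)  # patches 1 and 2, which contain no text
```

The point of the sketch is that layout decides which pixels and which tokens are merged, so the encoder receives one sequence instead of separate text and image streams.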
The VTL decoder consists of a text-layout decoder and a vision decoder. The text-layout decoder is a uni-directional Transformer decoder that generates text and layout tokens in a sequence-to-sequence manner. The vision decoder, on the other hand, adopts the decoder of MAE (Masked Autoencoder) and generates image pixels directly, conditioned on text and layout information.
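Because the text-layout decoder emits layout as ordinary tokens, a bounding box has to be turned into a handful of discrete symbols. The sketch below shows one way this quantisation could work. The layout vocabulary size of 500, the `<loc_N>` spelling, and both function names are assumptions made for illustration, so check the tokenizer of the checkpoint you use for the real token names.

```python
def box_to_layout_tokens(box, page_width, page_height, vocab_size=500):
    """Quantise a pixel bounding box into discrete layout tokens.

    Each coordinate is scaled by the page size and mapped to an integer
    bin in [0, vocab_size - 1]. The "<loc_N>" spelling is illustrative.
    """
    x0, y0, x1, y1 = box
    scaled = [
        x0 * vocab_size // page_width,
        y0 * vocab_size // page_height,
        x1 * vocab_size // page_width,
        y1 * vocab_size // page_height,
    ]
    bins = [min(max(v, 0), vocab_size - 1) for v in scaled]
    return [f"<loc_{b}>" for b in bins]


def layout_tokens_to_box(tokens, page_width, page_height, vocab_size=500):
    """Map layout tokens back to approximate pixel coordinates."""
    bins = [int(t[len("<loc_"):-1]) for t in tokens]
    sizes = [page_width, page_height, page_width, page_height]
    return tuple(b * s // vocab_size for b, s in zip(bins, sizes))


tokens = box_to_layout_tokens((100, 200, 500, 600), 1000, 1000)
print(tokens)
print(layout_tokens_to_box(tokens, 1000, 1000))
```

Quantising coordinates this way lets one decoder produce both words and positions from a single vocabulary, at the cost of some spatial precision.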
UDOP unifies text, image, and layout analysis in a single framework built on the Vision-Text-Layout Transformer, and it is a large model with 742M parameters. It performs strongly on document processing tasks and is reported to outperform comparable models available on Hugging Face, such as Donut and Pix2Struct. Because it works together with a traditional OCR engine, UDOP combines the strengths of generative models and OCR technologies, enabling advanced document processing and customization.
UDOP’s Vision-Text-Layout Transformer is based on the T5 architecture and integrates text, image, and layout modalities. This unified representation simplifies preprocessing and improves model performance. The model is pre-trained on large-scale unlabeled document corpora using self-supervised objectives, achieving state-of-the-art performance across diverse Document AI tasks.
UDOP uses a unified generative pretraining approach, which includes both self-supervised and supervised pretraining tasks. The self-supervised pretraining tasks include layout modeling, visual text recognition, joint text-layout reconstruction, and masked image reconstruction with text and layout. The supervised pretraining tasks include classification, layout analysis, information extraction, question answering, and document natural language inference.
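As a rough illustration of what a generative pretraining example can look like, the sketch below builds one input/target pair in the spirit of joint text-layout reconstruction: a span of words is replaced by a sentinel in the input, and the target contains the sentinel followed by the missing text and the layout tokens of the span's bounding box. The sentinel name `<text_layout_0>`, the `<loc_N>` token spelling, the task prefix string, and the `make_example` helper are hypothetical, and the boxes are assumed to be already quantised.

```python
def make_example(words, boxes, start, end,
                 prefix="Joint Text-Layout Reconstruction."):
    """Build a toy (input, target) pair by masking words[start:end].

    `boxes` holds one already-quantised (x0, y0, x1, y1) box per word.
    The sentinel and "<loc_N>" token spellings are hypothetical.
    """
    sentinel = "<text_layout_0>"
    masked_words = words[start:end]
    span_boxes = boxes[start:end]
    # The span's box is the smallest box covering every masked word.
    union = (
        min(b[0] for b in span_boxes),
        min(b[1] for b in span_boxes),
        max(b[2] for b in span_boxes),
        max(b[3] for b in span_boxes),
    )
    source = " ".join([prefix] + words[:start] + [sentinel] + words[end:])
    layout = "".join(f"<loc_{v}>" for v in union)
    target = f"{sentinel} {' '.join(masked_words)} {layout}"
    return source, target


words = ["Ship", "Date", "to", "Retail"]
boxes = [(100, 350, 108, 372), (110, 350, 118, 372),
         (120, 350, 124, 372), (126, 350, 140, 372)]
source, target = make_example(words, boxes, 0, 2)
print(source)
print(target)
```

Framing every task as "prompt in, token sequence out" is what allows the self-supervised and supervised tasks to share one model and one training loop.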
The results of UDOP on various datasets, including DUE-Benchmark, FUNSD, CORD, and RVL-CDIP, demonstrate its state-of-the-art performance in document AI tasks. UDOP achieves high-quality neural document editing and content customization, making it a versatile and powerful tool for document processing tasks.
Let us now build a document AI with UDOP:
Running inference with UDOP
Here we will write a wrapper script and run an inference to see how well the model works and what it needs to work. I have also prepared a Django project to help you start with the model usage here.
Let us first install all dependencies that we require to build a document AI with UDOP.
# Install required libraries
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q sentencepiece
!sudo apt install -y tesseract-ocr
!pip install -q pytesseract
# Install for GPU Device
!pip install -U accelerate
# Check Tesseract version
!tesseract --version
Let us now load the model.
# Import necessary libraries
from transformers import UdopProcessor, UdopForConditionalGeneration
from huggingface_hub import hf_hub_download
from PIL import Image
import torch
# Model repository on the Hugging Face Hub
repo_id = "microsoft/udop-large"
# Load processor and model
processor = UdopProcessor.from_pretrained(repo_id)
model = UdopForConditionalGeneration.from_pretrained(repo_id)
Now we will write a function for inference.
def perform_inference(prompt, image_path, max_tokens=200):
    # Load image
    image = Image.open(image_path).convert("RGB")
    # Encode prompt and image
    encoding = processor(images=image, text=prompt, return_tensors="pt")
    # Generate outputs using the model
    outputs = model.generate(**encoding, max_new_tokens=max_tokens)
    # Decode the generated text
    generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return generated_text
The next step is to load the document.
import matplotlib.pyplot as plt
# Example usage:
image_path = hf_hub_download(
    repo_id="hf-internal-testing/fixtures_docvqa",
    filename="document_2.png",
    repo_type="dataset",
)
fig, ax = plt.subplots(figsize=(12, 10))  # figure size is 12x10 inches
ax.imshow(plt.imread(image_path))
plt.show()
Now let us test the model.
Example 1:
prompt = "Question answering. What is the title of the presentation?"
result = perform_inference(prompt, image_path)
print(result)
ITC's Brands: An Asset for the Nation
Example 2:
prompt = "Layout Analysis. Where are the cover images located?"
result = perform_inference(prompt, image_path)
print(result)
India
Example 3:
prompt = "Information Extraction. What is the text content?"
result = perform_inference(prompt, image_path)
print(result)
ITC's new FMCG businesses are the fastest growing among the top consumer goods companies operating in India.
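The three examples above follow the same pattern: a task prefix followed by a query. A small helper can keep those prefixes in one place. `build_prompt` and the `TASK_PREFIXES` keys are hypothetical names, and only the prefixes used in this article are included. Whether other prefixes work depends on the checkpoint.

```python
# Task prefixes copied from the three examples above.
TASK_PREFIXES = {
    "qa": "Question answering.",
    "layout": "Layout Analysis.",
    "extraction": "Information Extraction.",
}


def build_prompt(task, query):
    """Return "<prefix> <query>" for a known task key (hypothetical helper)."""
    if task not in TASK_PREFIXES:
        raise ValueError(f"unknown task: {task!r}")
    return f"{TASK_PREFIXES[task]} {query.strip()}"


prompt = build_prompt("qa", "What is the title of the presentation?")
print(prompt)
```

The resulting string can be passed to `perform_inference` exactly like the hand-written prompts.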
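To achieve optimal results with UDOP, users should preprocess their data thoroughly, use task-specific prompts, and fine-tune the model for domain-specific tasks.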
UDOP’s impact extends to various domains and use cases, including:
UDOP is a Document AI model that unifies text, image, and layout modalities together with varied task formats. Its Vision-Text-Layout Transformer dynamically fuses image pixels and text tokens based on layout information, and its unified generative pretraining approach delivers state-of-the-art performance on Document AI tasks. This combination of versatility and performance makes it a useful tool for businesses and organizations seeking to improve document processing efficiency and accuracy.
A. UDOP’s integration of text, image, and layout modalities sets it apart, enabling comprehensive document understanding and generation.
A. Yes, its Vision-Text-Layout Transformer architecture is designed to handle complex document structures effectively.
A. Yes, it is available for commercial use and can be accessed through the provided documentation and resources.
A. Best practices include thorough data preprocessing, leveraging task-specific prompts, and fine-tuning the model for domain-specific tasks.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.