Document information extraction uses computer algorithms to extract structured data (such as employee name, address, designation, and phone number) from unstructured or semi-structured documents, such as reports, emails, and web pages. The extracted information can be used for various purposes, such as analysis and classification. DocVQA (Document Visual Question Answering) is a cutting-edge approach that combines computer vision and natural language processing techniques to automatically answer questions about a document’s content. This article explores information extraction using DocVQA with Google’s Pix2Struct package.
This article was published as a part of the Data Science Blogathon.
Document extraction automatically pulls relevant information from unstructured documents, such as invoices, receipts, contracts, and forms, and many sectors benefit from it.
Document extraction has many applications in industries that deal with large volumes of unstructured data. Automating document processing tasks can help organizations save time, reduce errors, and improve efficiency.
There are several challenges associated with document information extraction. The major challenge is the variability in document formats and structures. For example, different documents may have various forms and layouts, making it difficult to extract information consistently. Another challenge is noise in the data, such as spelling errors and irrelevant information. This can lead to inaccurate or incomplete extraction results.
The process of document information extraction typically involves several steps: ingesting the document, preprocessing it (for example, with OCR), identifying the relevant fields, and structuring the extracted values.
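As a minimal illustration of these steps, a toy rule-based extractor might look like the following (the field patterns here are hypothetical examples, not a production system):

```python
import re

def extract_fields(text):
    """Toy rule-based extractor: identify fields with regular expressions."""
    patterns = {
        "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
        "phone": r"\+?\d[\d\s-]{7,}\d",
    }
    # Structure the extracted values as a field -> matches mapping
    return {field: re.findall(pat, text) for field, pat in patterns.items()}

doc = "Contact John Doe at john.doe@example.com or +1 555-123-4567."
print(extract_fields(doc))
```

Rule-based extraction like this works only when the document format is known in advance, which is exactly the limitation the learned approaches below address.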
Researchers are developing new algorithms and techniques for document information extraction to address these challenges. These include techniques for handling variability in document structures, such as using deep learning algorithms to learn document structures automatically. They also include techniques for handling noisy data, such as using natural language processing techniques to identify and correct spelling errors.
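For instance, a simple form of spelling normalization for noisy OCR output can be sketched with Python’s standard library (a toy example with a hand-picked vocabulary, not the NLP models mentioned above):

```python
import difflib

VOCAB = ["invoice", "total", "address", "quantity", "amount"]

def correct_token(token, vocab=VOCAB, cutoff=0.75):
    """Replace a noisy token with its closest vocabulary entry, if any."""
    matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

print(correct_token("invoce"))  # OCR dropped a letter
print(correct_token("totol"))   # OCR substituted a letter
```

Real systems use context-aware language models rather than a fixed vocabulary, but the principle of mapping noisy tokens to their most likely intended form is the same.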
DocVQA stands for Document Visual Question Answering. It is a task in computer vision and natural language processing that aims to answer questions about the content of a given document image. The questions can be about any aspect of the document text. DocVQA is a challenging task because it requires understanding the document’s visual content and the ability to read and comprehend the text in it. This task has numerous real-world applications, such as document retrieval, information extraction, etc.
LayoutLM, Flan-T5, and Donut are three approaches to document layout analysis and text recognition for Document Visual Question Answering (DocVQA).
LayoutLM is a pre-trained language model that incorporates visual information such as document layout, OCR text positions, and textual content. It can be fine-tuned for various NLP tasks, including DocVQA. For example, LayoutLM in DocVQA can help accurately locate the relevant text and other visual elements in the document, which is essential for answering questions that require context-specific information.
Flan-T5 is a method that uses a transformer-based architecture to perform both text recognition and layout analysis. This model is trained end-to-end on document images and can handle multi-lingual documents, making it suitable for various applications. Using Flan-T5 in DocVQA allows for accurate text recognition and layout analysis, which can improve the system’s performance.
Donut is a deep learning model that uses a novel architecture to perform text recognition on documents with irregular layouts. Using Donut in DocVQA can help accurately extract text from documents with complex layouts, which is essential for answering questions that require specific information. Its significant advantage is that it is OCR-free.
Overall, using these models in DocVQA can improve the accuracy and performance of the system by accurately extracting text and other relevant information from document images. Please check out my previous blogs on Donut, Flan-T5, and LayoutLM.
The paper presents Pix2Struct from Google, a pre-trained image-to-text model for understanding visually-situated language. The model is trained using a novel learning technique to parse masked screenshots of web pages into simplified HTML, providing a pretraining data source well suited to a range of downstream activities. In addition to the novel pretraining strategy, the paper introduces a more flexible integration of linguistic and visual inputs and a variable-resolution input representation. As a result, the model achieves state-of-the-art results on six of nine tasks across four domains: documents, illustrations, user interfaces, and natural images. The following image shows the details of the considered domains. (The picture below is on the 5th page of the Pix2Struct research paper.)
Pix2Struct is a pre-trained model that combines the simplicity of purely pixel-level inputs with the generality and scalability provided by self-supervised pretraining on diverse and abundant web data. It does this by proposing a screenshot-parsing objective that requires predicting an HTML-based parse from a partially masked screenshot of a web page. Given the diversity and complexity of textual and visual elements found on the web, Pix2Struct learns rich representations of the underlying structure of web pages, which transfer effectively to various downstream visual language understanding tasks.
Pix2Struct is based on the Vision Transformer (ViT), an image-encoder-text-decoder model. However, Pix2Struct proposes a small but impactful change to the input representation to make the model more robust to various forms of visually-situated language. Standard ViT extracts fixed-size patches after scaling input images to a predetermined resolution. This distorts the proper aspect ratio of the image, which can be highly variable for documents, mobile UIs, and figures.
Also, transferring these models to downstream tasks with higher resolution is challenging, as the model only observes one specific resolution during pretraining. Pix2Struct proposes to scale the input image up or down to extract the maximum number of patches that fit within the given sequence length. This approach is more robust to extreme aspect ratios, common in the domains Pix2Struct experiments with. Additionally, the model can handle on-the-fly changes to the sequence length and resolution. To handle variable resolutions unambiguously, 2-dimensional absolute positional embeddings are used for the input patches.
Results
The Pix2Struct-Large model has outperformed the previous state-of-the-art Donut model on the DocVQA dataset. The LayoutLMv3 model achieves high performance on this task using three components, including an OCR system and pre-trained encoders. However, the Pix2Struct model performs competitively without using in-domain pretraining data and relies solely on visual representations. (We consider only DocVQA results.)
Implementation
Let us walk through the implementation of DocVQA. For demo purposes, let us consider a sample invoice from Mendeley Data.
1. Install the packages
!pip install git+https://github.com/huggingface/transformers pdf2image
!sudo apt install poppler-utils
2. Import the packages
from pdf2image import convert_from_path
import torch
from functools import partial
from PIL import Image
from transformers import Pix2StructForConditionalGeneration as psg
from transformers import Pix2StructProcessor as psp
3. Initialize the model with pretrained weights
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model = psg.from_pretrained("google/pix2struct-docvqa-large").to(DEVICE)
processor = psp.from_pretrained("google/pix2struct-docvqa-large")
4. Processing functions
def generate(model, processor, img, questions):
    inputs = processor(images=[img for _ in range(len(questions))],
                       text=questions, return_tensors="pt").to(DEVICE)
    predictions = model.generate(**inputs, max_new_tokens=256)
    return zip(questions, processor.batch_decode(predictions, skip_special_tokens=True))

def convert_pdf_to_image(filename, page_no):
    return convert_from_path(filename)[page_no-1]
5. Specify the path and page number of the PDF file
questions = ["what is the seller name?",
             "what is the date of issue?",
             "What is Delivery address?",
             "What is Tax Id of client?"]
FILENAME = "/content/invoice_107_charspace_108.pdf"
PAGE_NO = 1
6. Generate the answers
image = convert_pdf_to_image(FILENAME, PAGE_NO)
print("pdf to image conversion complete.")
generator = partial(generate, model, processor)
completions = generator(image, questions)
for completion in completions:
    print(f"{completion}")
## answers
('what is the seller name?', 'Campbell, Callahan and Gomez')
('what is the date of issue?', '09/25/2011')
('What is Delivery address?', '2969 Todd Orchard Apt. 721')
('What is Tax Id of client?', '941-79-6209')
Try out your example on Hugging Face Spaces.
Notebooks: pix2struct notebook
In conclusion, document information extraction is an essential area of research with applications in many domains. It involves using computer algorithms to identify and extract relevant information from text-based documents. Although several challenges are associated with document information extraction, researchers are developing new algorithms and techniques to address these challenges and improve the accuracy and reliability of the extracted information.
However, like all deep learning models, DocVQA has some limitations. For example, it requires a lot of training data to perform well and may need help with complex documents or rare symbols and fonts. It may also be sensitive to the quality of the input image and the accuracy of the OCR (optical character recognition) system used to extract text from the document.
To learn more, kindly get in touch on LinkedIn. Please acknowledge this article or repo if you cite it.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.