Transforming PDF Images into Interactive Dialogues with AI

Sagar Tate | Last Updated: 24 May, 2024

Introduction

In our digital era, where information is predominantly shared through electronic formats, PDFs serve as a crucial medium. However, the data within them, especially images, often remains underutilized due to format constraints. This blog post introduces an approach that not only liberates this data but also maximizes its utility. By employing Python and advanced AI technologies, we’ll demonstrate how to extract images from PDF files and interact with them using sophisticated AI models such as LLaVA, orchestrated through LangChain. This method opens up new avenues for data interaction, enhancing our ability to analyze and utilize information locked away in PDFs.


Learning Objectives

  1. Extract and categorize elements from PDFs using the unstructured library.
  2. Set up a Python environment for PDF data extraction and AI interaction.
  3. Isolate and convert PDF images to base64 format for AI analysis.
  4. Use AI models like LLaVA, together with LangChain, to analyze and interact with PDF images.
  5. Integrate conversational AI into applications for enhanced data utility.
  6. Explore practical applications of AI-driven PDF content analysis.

This article was published as a part of the Data Science Blogathon.

Unstructured Data Extraction from PDFs

Setting Up the Environment

The first step in transforming PDF content involves preparing your computing environment with essential software tools. This setup is crucial for handling and extracting unstructured data from PDFs efficiently.

!pip install "unstructured[all-docs]" unstructured-client

Installing these packages equips your Python environment with the unstructured library, a powerful tool for dissecting and extracting diverse elements from PDF documents.

Extracting Data from PDF

The process of extracting data begins by dissecting the PDF into individual manageable elements. Using the unstructured library, you can easily partition a PDF into different elements, including text and images. The function partition_pdf from the unstructured.partition.pdf module is pivotal here.

from unstructured.partition.pdf import partition_pdf

# Specify the path to your PDF file
filename = "data/gpt4all.pdf"

# Directory where extracted image blocks will be saved
path = "images"

# Extract elements from the PDF
raw_pdf_elements = partition_pdf(
    filename=filename,
    # Unstructured first finds embedded image blocks
    # (only applicable if strategy="hi_res")
    extract_images_in_pdf=True,
    strategy="hi_res",
    infer_table_structure=True,
    # Directory to save image blocks (only applicable if strategy="hi_res")
    extract_image_block_output_dir=path,
)

This function returns a list of the elements present in the PDF. Each element can be text, an image, a table, or another type of content embedded within the document. The images found in the PDF are saved to the images directory specified above.
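To get a quick sense of what was extracted before going further, you can tally the element categories. A minimal sketch, assuming raw_pdf_elements from the call above:

from collections import Counter

# Count how many elements of each category were found in the PDF
category_counts = Counter(el.category for el in raw_pdf_elements)
print(category_counts)
# e.g. Counter({'NarrativeText': 42, 'Title': 10, 'Image': 3, 'Table': 2})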

Identifying and Extracting Images

Once we have identified all the elements within the PDF, the next crucial step is to isolate the images for further interaction:

images = [el for el in raw_pdf_elements if el.category == "Image"]

This list now contains all the images extracted from the PDF, which can be further processed or analyzed.

This simple yet effective line of code filters the images out of the mix of extracted elements, setting the stage for more sophisticated data handling and analysis. The image files themselves are written to the images directory specified earlier, from where they can be displayed in a notebook or passed on to an AI model.
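Below is a minimal sketch for viewing the extracted files inside a Jupyter notebook, assuming they were saved as JPEG or PNG files in the images directory by partition_pdf above:

import os
from IPython.display import Image, display

# Render every image file that unstructured saved to the output directory
image_dir = "images"
for file_name in sorted(os.listdir(image_dir)):
    if file_name.lower().endswith((".jpg", ".jpeg", ".png")):
        display(Image(filename=os.path.join(image_dir, file_name)))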

Conversational AI with LLaVA and LangChain

Setup and Configuration

To interact with the extracted images, we employ advanced AI tooling. Installing LangChain and its community packages is essential for facilitating AI-driven dialogues with the images.

Please refer to the official Ollama documentation for detailed instructions on setting up LLaVA and Ollama locally. Then install the packages below.

!pip install langchain langchain_core langchain_community

This installation introduces essential tools for integrating conversational AI capabilities into our application.
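If you run the model locally through Ollama, the LLaVA weights also need to be pulled before they can be used. A typical command, assuming a working local Ollama installation:

!ollama pull llava:7b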

Converting Saved Images to Base64

To make the images understandable to AI, we convert them into a format that AI models can interpret—base64 strings. 

import base64
from io import BytesIO

from IPython.display import HTML, display
from PIL import Image


def convert_to_base64(pil_image):
    """
    Convert PIL images to Base64 encoded strings

    :param pil_image: PIL image
    :return: Base64-encoded string
    """

    buffered = BytesIO()
    pil_image.save(buffered, format="JPEG")  # You can change the format if needed
    img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
    return img_str


def plt_img_base64(img_base64):
    """
    Display base64 encoded string as image

    :param img_base64:  Base64 string
    """
    # Create an HTML img tag with the base64 string as the source
    image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
    # Display the image by rendering the HTML
    display(HTML(image_html))


file_path = "./images/figure2.jpg"
pil_image = Image.open(file_path)
image_b64 = convert_to_base64(pil_image)
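As a quick sanity check, you can render the encoded string back into an image with the helper defined above:

# Verify the encoding by rendering the base64 string back in the notebook
plt_img_base64(image_b64)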

Analyzing Images with LLaVA and Ollama via LangChain

LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture; in other words, a multimodal LLM fine-tuned for chat and instruction following.

Once the images are converted into a suitable format (base64 strings), they can be passed as context to LLaVA, which can then provide descriptions or other relevant information.

from langchain_community.llms import Ollama
llm = Ollama(model="llava:7b")

# Use LLavA to interpret the image
llm_with_image_context = llm.bind(images=[image_b64])
response = llm_with_image_context.invoke("Explain the image")

Output:

‘ The image is a graph showing the growth of GitHub repositories over time. The graph includes three lines, each representing different types of repositories:\n\n1. Lama: This line represents a single repository called “Lama,” which appears to be growing steadily over the given period, starting at 0 and increasing to just under 5,00 by the end of the timeframe shown on the graph.\n\n2. Alpaca: Similar to the Lama repository, this line also represents a single repository called “Alpaca.” It also starts at 0 but grows more quickly than Lama, reaching approximately 75,00 by the end of the period.\n\n3. All repositories (average): This line represents an average growth rate across all repositories on GitHub. It shows a gradual increase in the number of repositories over time, with less variability than the other two lines.\n\nThe graph is marked with a timestamp ranging from the start to the end of the data, which is not explicitly labeled. The vertical axis represents the number of repositories, while the horizontal axis indicates time.\n\nAdditionally, there are some annotations on the image:\n\n- “GitHub repo growth” suggests that this graph illustrates the growth of repositories on GitHub.\n- “Lama, Alpaca, all repositories (average)” labels each line to indicate which set of repositories it represents.\n- “100s,” “1k,” “10k,” “100k,” and “1M” are milestones marked on the graph, indicating the number of repositories at specific points in time.\n\nThe source code for GitHub is not visible in the image, but it could be an important aspect to consider when discussing this graph. The growth trend shown suggests that the number of new repositories being created or contributed to is increasing over time on this platform. ‘

This integration allows the model to “see” the image and provide insights, descriptions, or answer questions related to the image content.
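Because the image context remains bound to the model, you can continue querying the same image with more targeted prompts. A small sketch; the question below is illustrative:

# Ask a more specific follow-up question about the same bound image
followup = llm_with_image_context.invoke(
    "Which repository shows the fastest growth in this chart, and roughly when does it take off?"
)
print(followup)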

Conclusion

The ability to extract images from PDFs and then utilize AI to engage with these images opens up numerous possibilities for data analysis, content management, and automated processing. The techniques described here leverage powerful libraries and AI models to effectively handle and interpret unstructured data.

Key Takeaways

  • Efficient Extraction: The unstructured library provides a seamless method to extract and categorize different elements within PDF documents.
  • Advanced AI Interaction: Converting images to a suitable format and using models like LLavA can enable sophisticated AI-driven interactions with document content.
  • Broad Applications: These capabilities are applicable across various fields, from automated document processing to AI-based content analysis.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Frequently Asked Questions

Q1. What types of content can the unstructured library extract from PDFs?

A. The unstructured library is designed to handle many elements embedded within PDF documents. Specifically, it can extract:

a. Text: Any textual content, including paragraphs, headers, footers, and annotations.
b. Images: Embedded images within the PDF, including photos, graphics, and diagrams.
c. Tables: Structured data presented in tabular format.

This versatility makes the unstructured library a powerful, comprehensive PDF data extraction tool.

Q2. How does LLavA interact with images?

A. LLavA, a conversational AI model, interacts with images by first requiring them to be converted into a format it can process, typically base64 encoded strings. Once images are encoded:

a. Description Generation: LLavA can describe the contents of the image in natural language.
b. Question Answering: It can answer questions about the image, providing insights or explanations based on its visual content.
c. Contextual Analysis: LLavA can integrate the image context into broader conversational interactions, enhancing the understanding of complex documents that combine text and visuals.

Q3. Are there limitations to the image quality that can be extracted?

A. Yes, there are several factors that can affect the quality of images extracted from PDFs:

a. Original Image Quality: The resolution and clarity of the original images in the PDF.
b. PDF Compression: Some PDFs use compression techniques that can reduce image quality.
c. Extraction Settings: The settings used in the unstructured library (e.g., strategy=hi_res for high-resolution extraction) can impact the quality.
d. File Format: The format in which images are saved after extraction (e.g., JPEG, PNG) can affect the fidelity of the extracted images.

Q4. Can I use other AI models besides LLavA for image interaction?

A. Yes, you can use other AI models besides LLaVA for image interaction. Here are some alternative multimodal models that support working with images:

a. CLIP (Contrastive Language-Image Pre-Training) by OpenAI: CLIP jointly understands images and their textual descriptions. It can classify images and retrieve images based on textual queries, and it underpins many captioning and search pipelines.
b. DALL-E by OpenAI: DALL-E generates images from textual descriptions. It is primarily a text-to-image model, so it complements rather than replaces image-understanding models in this workflow.
c. VisualGPT: This model couples a pretrained language model with visual inputs, allowing it to generate descriptive text (captions) based on images.
d. Florence by Microsoft: Florence is a multimodal image and text understanding model. It can perform tasks such as image captioning, object detection, and answering questions about images.

These models enable sophisticated interactions with visual content by providing descriptions, answering questions, or performing analyses, much as LLaVA does in this article.

Q5. Is programming knowledge necessary to implement these solutions?

A. Basic programming knowledge, particularly in Python, is essential to implement these solutions effectively. Key skills include:

a. Setting Up the Environment: Installing necessary libraries and configuring the environment.
b. Writing and Running Code: Using Python to write data extraction and interaction scripts.
c. Understanding AI Models: Integrating and utilizing AI models like LLavA or others.
d. Debugging: Troubleshooting and resolving issues that may arise during implementation.

While some familiarity with programming is required, the process can be streamlined with clear documentation and examples, making it accessible to those with fundamental coding skills.

