The latest frontier in the evolution of Large Language Models (LLMs) is the integration of multimodality, spearheaded initially by OpenAI’s GPT-4. Google has recently entered the arena with the launch of Gemini, unveiling its API to the public on December 13th. This marks a pivotal moment for the LLM landscape, and though certain features are currently restricted, the prospect of exploring the capabilities of the new Gemini model in terms of multimodality is undeniably intriguing.
This article delves into the Gemini Vision model, which is specifically designed for image prompting. By focusing on this aspect, we aim to unravel the potential applications of Gemini in various image-based use cases, with particular emphasis on its role in data extraction and on building an app using the model. As the field of LLMs advances into the realm of multimodality, understanding the nuances of the Gemini model’s capabilities provides a glimpse into the future of language models that seamlessly integrate diverse data types, promising a more comprehensive and interactive user experience.
This article was published as a part of the Data Science Blogathon.
Current LLMs mainly focus on text-based interactions, which means we use only a single mode of input and output. Multimodal LLMs expand these boundaries by allowing inputs beyond text, such as images, video, and audio, and by producing outputs that are not limited to text but can also be images, video, or audio.
In the past, distinct models were designed for specific media types, such as Imagen, DALL-E, and Stable Diffusion for generating images from text prompts. Simultaneously, dedicated audio models like OpenAI’s Whisper excelled at understanding audio content and generating concise summaries. The landscape has evolved, ushering in a new era of unified models capable of seamlessly handling diverse tasks across multiple mediums. Examples include OpenAI’s GPT-4 and Google’s Gemini, marking a significant leap towards comprehensive AI frameworks that transcend traditional media-centric boundaries.
Gemini, the revolutionary multimodal model family from Google, stands at the forefront of AI innovation. This cutting-edge series demonstrates remarkable progress in natural language understanding, code interpretation, image analysis, audio processing, and video comprehension. Crafted to redefine the limits of AI capabilities, Gemini sets its sights on achieving state-of-the-art performance across a spectrum of benchmarks. With versatility as its hallmark, Gemini seamlessly navigates through multiple modalities, establishing itself as a powerhouse in the realm of artificial intelligence.
Gemini is currently slated to be offered in three distinct model sizes. They are:
1. Gemini Nano: A compact version of the model that can run on edge devices. Google currently uses this model on its Pixel phones; you can read more about it here. It is competent in various tasks, including natural language understanding, code interpretation, and basic image and audio comprehension.
2. Gemini Pro: The version Google has made available to the public. It is a medium-scale model, comparable to the text-bison model of the PaLM family, but with several enhanced capabilities. Gemini Pro currently comes in two variants: one for text-only input (models/gemini-pro) and another for image-based input along with text (models/gemini-pro-vision); see the sketch after this list for loading both.
3. Gemini Ultra: The largest model of the Gemini series, with a large-scale architecture. It can handle complex video and audio processing tasks and scores highly against human expert performance on benchmarks.
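As a quick illustration of the two publicly available Gemini Pro variants, here is a minimal sketch that loads each of them with the google-generativeai library. It assumes the package is installed and an API key has already been configured, as shown in the setup section below; the example prompt is illustrative only.

```python
import google.generativeai as genai

# Assumes genai.configure(api_key=...) has already been run
# (full setup is shown in the "Using Gemini API Directly" section below).

# Text-only variant
text_model = genai.GenerativeModel('gemini-pro')
print(text_model.generate_content("Summarize multimodality in one line.").text)

# Text + image variant, used throughout this article
vision_model = genai.GenerativeModel('gemini-pro-vision')
```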
Also Read: Building an LLM Model using Google Gemini API
Unlocking diverse possibilities, Gemini’s image prompting feature lends itself to numerous use cases. In this exploration, we’ll focus on extracting data from images in various formats, offering insights alongside practical Python code implementations. From structured data extraction to image analysis, Gemini simplifies technical complexities for seamless applications.
Imagine you have a complex chart or graph embedded in an image, and you need to extract the underlying data for analysis. This is where the multimodal capabilities of Large Language Models (LLMs) like Gemini come into play. By utilizing Gemini’s image prompting feature, you can instruct the model to interpret the chart or graph and provide a textual representation of the data.
Through a carefully crafted prompt, you can guide Gemini to not only recognize the visual elements of the chart but also comprehend the data it represents. The model’s multimodal prowess allows it to combine visual and language understanding, making it proficient in tasks like reading and extracting numerical or categorical information from graphical representations. The extracted data can then be easily processed for further analysis or integration into other applications, streamlining the workflow of data extraction from visual elements with the power of multimodal LLMs like Gemini.
We will use the chart image below, sourced from the Government’s Factsheet website.
Using Gemini API Directly
# Install Google's Gemini libraries
!pip install -q -U google-generativeai

# Import libraries
import os
import google.generativeai as genai
from google.colab import userdata
from IPython.display import display
from IPython.display import Markdown
import PIL.Image

# Configure the API key and initialise the model
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = userdata.get('api_key')
genai.configure(api_key=os.environ['GOOGLE_API_KEY'])

# List the models that support content generation
for m in genai.list_models():
    if 'generateContent' in m.supported_generation_methods:
        print(m.name)
Now add the image to your working directory, or upload it to Colab if you are using Colab. Here, “gemini_image.png” is the image file being opened with the PIL library. We then load the vision model using the GenerativeModel class of the genai library, and pass the image along with a text question to the model via generate_content.
# Directly Calling API
image = PIL.Image.open('gemini_image.png')
vision_model = genai.GenerativeModel('gemini-pro-vision')
response = vision_model.generate_content(["What was the budget in the year 2021-22?",image])
print(response.text)
We can observe that the model gives the correct answer. Next, we will try to extract the chart data into JSON format; with just one query, the model converts it easily and accurately.
response = vision_model.generate_content(["Convert the chart in image into data in json format?",image])
print(response.text)
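Since the model typically wraps JSON output in a markdown code fence, a small amount of post-processing makes it usable downstream. The sketch below is one minimal way to do this; the fence-stripping logic and the pandas conversion are assumptions about the output shape, not part of the Gemini API.

```python
import json
import pandas as pd

# Strip the markdown code fence the model usually wraps around JSON output
raw = response.text.strip()
if raw.startswith("```"):
    raw = raw.strip("`").removeprefix("json").strip()

chart_data = json.loads(raw)  # now a regular Python dict or list

# If the model returned a list of records, it drops straight into a DataFrame
df = pd.DataFrame(chart_data if isinstance(chart_data, list) else [chart_data])
print(df.head())
```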
LangChain has launched a standalone package for Gemini API integration. Currently, it supports limited LangChain core functionality. We will use the langchain-google-genai package.
# Install the LangChain standalone package
!pip install -U --quiet langchain-google-genai pillow

from langchain_core.messages import HumanMessage
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-pro-vision")

# Example: ask the model what is in the chart image
image = PIL.Image.open('gemini_image.png')
hmessage1 = HumanMessage(
    content=[
        {
            "type": "text",
            "text": "What's in this image?",
        },  # You can optionally provide text parts
        {"type": "image_url", "image_url": image},
    ]
)
message1 = llm.invoke([hmessage1])
print(message1.content)
Consider a scenario where a business receives a multitude of invoices or bills in various formats. Extracting crucial information, such as vendor details, invoice amounts, and due dates, can be time-consuming. Leveraging a multimodal Large Language Model (LLM) like Gemini can streamline this process. By providing the model with image prompts of the invoices, it can intelligently recognize and extract relevant data, transforming the information into a structured format for easy integration into financial systems.
We will use the Amazon invoice below to demonstrate this use case and try to extract the invoice data into JSON format.
# Example: convert invoice data into JSON
image = PIL.Image.open('invoice_bill.jpg')
hmessage1 = HumanMessage(
    content=[
        {
            "type": "text",
            "text": "Convert Invoice data into json format with appropriate json tags as required for the data in image ",
        },  # You can optionally provide text parts
        {"type": "image_url", "image_url": image},
    ]
)
message1 = llm.invoke([hmessage1])
print(message1.content)
# OUTPUT
```json
{
  "GSTIN": "26ADCDER3836R1Z",
  "Invoice Number": "INV-13",
  "Invoice Date": "06 April 2022",
  "Billed To": {
    "Name": "Gaurav Gupta",
    "Billing Address": "Babuganj, Hasanganj, Lucknow, 226007 Uttar Pradesh"
  },
  "Shipped To": {
    "Name": "Gaurav Gupta",
    "Shipping Address": "Babuganj, Hasanganj, Lucknow, 226007 Uttar Pradesh"
  },
  "Place of Supply": "UTTAR PRADESH",
  "Items": [
    {
      "Item": "Samsung Galaxy F23",
      "HSN": "8517",
      "Color": "Aqua Green",
      "Storage": "128 GB",
      "Rate": 15677.10,
      "Quantity": 1,
      "Taxable Value": 15677.10,
      "Tax Amount": 2821.88,
      "Tax Rate": 18,
      "Amount": 18499.00
    },
    {
      "Item": "Samsung 45 Watt Travel Adapter",
      "HSN": "8504",
      "Model": "EP-TA845XBNGIN",
      "Color": "Black",
      "Rate": 2541.53,
      "Quantity": 1,
      "Taxable Value": 2541.53,
      "Tax Amount": 457.48,
      "Tax Rate": 18,
      "Amount": 2999.00
    }
  ],
  "Total Amount": 21498.00,
  "Taxable Amount": 18218.63,
  "IGST": 18.00,
  "SGST": 18.00,
  "CGST": 18.00,
  "Payment Method": "UPI",
  "Bank Details": {
    "Bank": "Yes Bank",
    "Account Number": "9999999999",
    "IFSC": "YESB0000009",
    "Branch": "Kodihalli"
  },
  "Notes": "Thank you for the business"
}
```
As we can see, the model identified all elements correctly and created its own JSON structure based on the given image, forming appropriate nested tags; for example, “Billed To” has two nested tags, Name and Billing Address. Thus, the model is able to logically group the data and create an appropriate output format without requiring us to provide any parser structure.
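To show how readily this output plugs into downstream systems, here is a minimal sketch that parses the nested structure and tabulates the line items. The fence-stripping one-liner mirrors the assumption made earlier about the output shape, and the tag names are simply the ones Gemini happened to choose above.

```python
import json
import pandas as pd

# Strip the markdown fence and parse (assumes the output shape shown above)
raw = message1.content.strip().strip("`").removeprefix("json").strip()
invoice = json.loads(raw)

# Nested tags are just nested dicts
print(invoice["Billed To"]["Name"])

# The line items become a table in one step
items = pd.DataFrame(invoice["Items"])
print(items[["Item", "Quantity", "Amount"]])

# A quick consistency check against the invoice total
assert abs(items["Amount"].sum() - invoice["Total Amount"]) < 0.01
```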
Consider a scenario in retail or manufacturing where companies deal with a variety of products, each with unique labels containing crucial information. Extracting data from product label images can be a labor-intensive task. However, by employing a multimodal Large Language Model (LLM) like Gemini, businesses can streamline this process. The model, with its image prompting feature, can intelligently recognize and extract key information from product labels, such as product names, ingredients, nutritional facts, and expiration dates. This data extraction can be automated, saving time and minimizing errors associated with manual transcription.
We will use the product label image below to illustrate this use case. Note that the image is a bit blurry, so let us see whether the model can interpret the information from a blurry image.
# Example: extract specific fields from the product label
image = PIL.Image.open('product_1 (1).png')
product_msg = HumanMessage(
    content=[
        {
            "type": "text",
            "text": "Create a json with following tags extracted from image and use information only from image for value of each tag - 'product_name','manufactured_date','expiry_date','manufactured_by','marketed_by','ingredients'",
        },  # You can optionally provide text parts
        {"type": "image_url", "image_url": image},
    ]
)
prod_output = llm.invoke([product_msg])
print(prod_output.content)
The output of the above code, shown below, reveals that the model hallucinated when filling the ingredients tag: the actual ingredients in the image are quite different. It also got the years of the manufacturing and expiry dates right, and the expiry month is correct, but the manufacturing month is wrong. Finally, the product name was incorrectly written as a body wash when it is actually a face wash.
```json
{
  "product_name": "Nivea Milk Delight Nourishing Body Wash",
  "manufactured_date": "07.07.2022",
  "expiry_date": "06.07.2024",
  "manufactured_by": "Nivea India Pvt. Ltd.",
  "marketed_by": "Nivea India Pvt. Ltd.",
  "ingredients": "Aqua, Sodium Laureth Sulfate, Glycerin, Sodium Chloride, Cocamidopropyl Betaine, Prunus Amygdalus Dulcis (Sweet Almond) Oil, Glyceryl Glucoside, Sodium Benzoate, Salicylic Acid, Parfum, Sodium Acetate, Tetrasodium EDTA, Citric Acid, Sodium Hydroxide, Phenoxyethanol, Methylparaben, Propylparaben, Butylparaben, Ethylparaben, Isobutylparaben"
}
```
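Because hallucinations like these are easy to miss, it can help to run cheap sanity checks on the extracted fields before trusting them. The sketch below is one possible check, assuming the DD.MM.YYYY date format seen in this output; it is illustrative, not part of the Gemini API.

```python
import json
from datetime import datetime

# Strip the markdown fence and parse (same assumption as earlier)
label = json.loads(prod_output.content.strip().strip("`").removeprefix("json").strip())

# Assumes DD.MM.YYYY, as in the output above
mfd = datetime.strptime(label["manufactured_date"], "%d.%m.%Y")
exp = datetime.strptime(label["expiry_date"], "%d.%m.%Y")

# An expiry date on or before the manufacturing date signals a bad extraction
if exp <= mfd:
    print("Suspicious dates: flag this extraction for manual review")
```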
Now we ask another question, this time passing more specific input: instead of “expiry date” we write “Use Before (U.B.)”, since this is the wording given on the product label.
product_msg = HumanMessage(
    content=[
        {
            "type": "text",
            "text": "Extract Product Name, MFD Date,U.B. Use Before, Price,Marketed By , Quantity from Image and display in json format",
        },  # You can optionally provide text parts
        {"type": "image_url", "image_url": image},
    ]
)
prod_output = llm.invoke([product_msg])
print(prod_output.content)
As we can see, this time the model predicted everything correctly because we passed the exact field names as they appear on the product label.
Based on our experiments above, we have the following important observations:
1. The model is able to interpret visual graphs and easily convert them into an appropriate JSON tag structure without having to be explicitly given a JSON template.
2. The model can easily find insights from a graph; for example, it was able to deduce that there was a 7-fold increase in the budget allocated to the Jal Jeevan Mission.
3. The model was able to read text from a blurry image easily, as seen in the product label use case. However, it did hallucinate in some places.
Some important limitations of the current version of the model and its availability:
1. The Gemini Vision model does not currently support multi-turn chat conversations; we can only pass a single human message, not a list of messages in alternating Human-AI-Human format, so it cannot retain conversation history. Each query or message is treated as an independent query (a workaround is sketched after this list).
2. System messages are not accepted by the model; only Human and AI messages are.
3. The model can currently only understand images as input; we cannot provide videos yet. This feature may be released soon, possibly next year.
4. The model hallucinates in situations where the text is not sufficiently clear (as we saw with the ingredients list in the product label use case).
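Until multi-turn chat is supported, one common workaround is to fold the running conversation history into each single message yourself. This is a minimal sketch of that idea, not an official API feature; it reuses the llm and image objects defined in the earlier examples.

```python
from langchain_core.messages import HumanMessage

history = []  # (question, answer) pairs we maintain ourselves

def ask_with_history(llm, image, question):
    # Prepend prior turns as plain text inside the single allowed message
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in history)
    prompt = (context + "\n" if context else "") + f"Q: {question}\nA:"
    msg = HumanMessage(content=[
        {"type": "text", "text": prompt},
        {"type": "image_url", "image_url": image},
    ])
    answer = llm.invoke([msg]).content
    history.append((question, answer))
    return answer

print(ask_with_history(llm, image, "What product is this?"))
print(ask_with_history(llm, image, "And what is its expiry date?"))
```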
In conclusion, the advent of multimodal Large Language Models (LLMs) like Google’s Gemini series represents a pivotal shift in AI capabilities, enabling seamless integration of text, images, audio, and video inputs. These models, such as Gemini Nano, Pro, and Ultra, showcase unprecedented versatility in tasks ranging from natural language understanding to complex video and audio processing.
The integration of Gemini into real-world scenarios, demonstrated through use cases like financial analysis, invoice parsing, and product label interpretation, highlights its transformative potential in automating diverse tasks. Despite notable achievements, current limitations, such as occasional hallucinations in image interpretation, underscore the ongoing evolution of multimodal LLMs.
However, we can expect these limitations to be lifted as advanced feature access becomes available in the future.
Q1. Is the Gemini API free to use?
A. Currently, it is free to use; it was released on December 13th for developer access and may be charged in the future. View pricing details here.
Q2. Is Gemini available on Google Cloud Platform?
A. Yes, Gemini is available in the Vertex AI offering on Google Cloud Platform; sample tutorials and notebooks are available here.
Q3. Is there a rate limit on the Gemini API?
A. Yes, API requests are currently limited to 60 requests per minute.
Q4. Does LangChain’s Gemini integration support LLM chains?
A. Currently, LangChain’s standalone package for Google Gemini integration does not support any LLM chains.
Q5. Can Gemini be used for Retrieval Augmented Generation (RAG)?
A. Yes, but ready-made package support for RAG is not yet available; it can be implemented using Vertex AI platform functions, and a sample notebook can be found here.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.