Have you ever found yourself staring at a product’s ingredients list, googling unfamiliar chemical names to figure out what they mean? It’s a common struggle – deciphering complex product information on the spot can be overwhelming and time-consuming. Traditional methods, like searching for each ingredient individually, often lead to fragmented and confusing results. But what if there was a smarter, faster way to analyze product ingredients and get clear, actionable insights instantly? In this article, we’ll walk you through building a Product Ingredients Analyzer using Gemini 2.0, Phidata, and Tavily Web Search. Let’s dive in and make sense of those ingredient lists once and for all!
This article was published as a part of the Data Science Blogathon.
Multimodal systems process and understand multiple types of input data—like text, images, audio, and video—simultaneously. Vision-language models, such as Gemini 2.0 Flash, GPT-4o, Claude Sonnet 3.5, and Pixtral-12B, excel at understanding relationships between these modalities, extracting meaningful insights from complex inputs.
In this context, we focus on vision-language models that analyze images and generate textual insights. These systems combine computer vision and natural language processing to interpret visual information based on user prompts.
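To make this concrete, here is what a single vision-language call looks like with the google-generativeai SDK, before any agent tooling is added. Treat it as a minimal sketch: the model id, image path, and prompt are placeholder examples, and the package and API key are set up in the next section.

import google.generativeai as genai
from PIL import Image

genai.configure(api_key="<your-google-api-key>")

# One prompt plus one image in, text out -- the core capability the agent builds on
model = genai.GenerativeModel("gemini-2.0-flash-exp")
response = model.generate_content(
    [Image.open("images/sample_label.jpg"), "List the ingredients printed on this label."]
)
print(response.text)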
Multimodal systems are transforming industries, from retail and e-commerce to healthcare and education.
The shift from single-mode AI to multimodal agents marks a major leap in how we interact with AI systems. What makes them so effective is the combination of visual perception, language understanding, and external tools such as web search in a single workflow.
Let’s break down the implementation of a Product Ingredient Analysis Agent:
!pip install phidata google-generativeai tavily-python streamlit pillow
In this step, we will set up the environment variables and gather the required API credentials to run this use case.
from phi.agent import Agent
from phi.model.google import Gemini  # needs a Google API key
from phi.tools.tavily import TavilyTools  # also needs a Tavily API key
import os
TAVILY_API_KEY = "<replace-your-api-key>"
GOOGLE_API_KEY = "<replace-your-api-key>"
os.environ['TAVILY_API_KEY'] = TAVILY_API_KEY
os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY
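Hardcoding keys works for a quick demo, but a cleaner option is to keep them in a .env file and load them at startup with python-dotenv. The sketch below is optional and assumes a local .env file containing TAVILY_API_KEY and GOOGLE_API_KEY entries.

# pip install python-dotenv
from dotenv import load_dotenv
import os

load_dotenv()  # reads key=value pairs from a local .env file into os.environ
TAVILY_API_KEY = os.getenv("TAVILY_API_KEY")
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")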
To get better responses from language models, you need to write better prompts. This involves clearly defining the role and providing detailed instructions in the system prompt for the LLM.
Let’s define the role and responsibilities of an Agent with expertise in ingredient analysis and nutrition. The instructions should guide the Agent to systematically analyze food products, assess ingredients, consider dietary restrictions, and evaluate health implications.
SYSTEM_PROMPT = """
You are an expert Food Product Analyst specialized in ingredient analysis and nutrition science.
Your role is to analyze product ingredients, provide health insights, and identify potential concerns by combining ingredient analysis with scientific research.
You use your nutritional knowledge and current research to provide evidence-based insights, making complex ingredient information accessible and actionable for users.
Return your response in Markdown format.
"""
INSTRUCTIONS = """
* Read ingredient list from product image
* Remember the user may not be educated about the product, break it down in simple words like explaining to 10 year kid
* Identify artificial additives and preservatives
* Check against major dietary restrictions (vegan, halal, kosher). Include this in response.
* Rate nutritional value on scale of 1-5
* Highlight key health implications or concerns
* Suggest healthier alternatives if needed
* Provide brief evidence-based recommendations
* Use Search tool for getting context
"""
The Agent, built using Phidata, is configured to render Markdown output and to follow the system prompt and instructions defined earlier. The reasoning model used in this example is Gemini 2.0 Flash, which handles image and video inputs particularly well.
For tool integration, we will use Tavily Search, an advanced web search engine that provides relevant context directly in response to user queries, avoiding unnecessary descriptions, URLs, and irrelevant parameters.
agent = Agent(
    model=Gemini(id="gemini-2.0-flash-exp"),
    tools=[TavilyTools()],
    markdown=True,
    system_prompt=SYSTEM_PROMPT,
    instructions=INSTRUCTIONS,
)
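If you want to see the kind of context Tavily feeds the agent, you can query it directly with the tavily-python client before wiring it into the Agent. This is a minimal sketch; the query string is just an example.

from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
# Returns a dict with a "results" list of title/url/content/score entries
results = client.search(query="health effects of maltodextrin in food", max_results=3)
for item in results["results"]:
    print(item["title"], "->", item["url"])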
With the Agent components now in place, the next step is to provide user input. This can be done in two ways: either by passing the image path or the URL, along with a user prompt specifying what information needs to be extracted from the provided image.
agent.print_response(
    "Analyze the product image",
    images=["images/bournvita.jpg"],
    stream=True,
)
Output:
agent.print_response(
    "Analyze the product image",
    images=["https://beardo.in/cdn/shop/products/9_2ba7ece4-0372-4a34-8040-5dc40c89f103.jpg?v=1703589764&width=1946"],
    stream=True,
)
Output:
Now that we know how to execute the Multimodal Agent, let’s build the UI part using Streamlit.
import streamlit as st
from PIL import Image
from io import BytesIO
from tempfile import NamedTemporaryFile
st.title("🔍 Product Ingredient Analyzer")
To optimize performance, define the Agent creation inside a cached function so the same instance is reused across runs.
Streamlit re-executes the entire script whenever a widget is triggered, so decorating the function with st.cache_resource ensures the Agent is built only once and then served from the cache instead of being re-initialized on every rerun.
@st.cache_resource
def get_agent():
    return Agent(
        model=Gemini(id="gemini-2.0-flash-exp"),
        system_prompt=SYSTEM_PROMPT,
        instructions=INSTRUCTIONS,
        tools=[TavilyTools(api_key=os.getenv("TAVILY_API_KEY"))],
        markdown=True,
    )
When the user provides a new image, the analyze_image function runs the Agent returned by get_agent. For both camera capture and file uploads, the image first needs to be saved temporarily so it can be processed.
The image is stored in a temporary file, and once the execution is completed, the temporary file is deleted to free up resources. This can be done using the NamedTemporaryFile function from the tempfile library.
def analyze_image(image_path):
    agent = get_agent()
    with st.spinner('Analyzing image...'):
        response = agent.run(
            "Analyze the given image",
            images=[image_path],
        )
    st.markdown(response.content)

def save_uploaded_file(uploaded_file):
    with NamedTemporaryFile(dir='.', suffix='.jpg', delete=False) as f:
        f.write(uploaded_file.getbuffer())
    return f.name
Images chosen by the user will have varying resolutions and sizes. To maintain a consistent layout and display each image clearly, we resize the uploaded or captured image so it fits neatly on the screen.
The LANCZOS resampling algorithm provides high-quality resizing, particularly beneficial for product images where text clarity is crucial for ingredient analysis.
MAX_IMAGE_WIDTH = 300

def resize_image_for_display(image_file):
    img = Image.open(image_file)
    aspect_ratio = img.height / img.width
    new_height = int(MAX_IMAGE_WIDTH * aspect_ratio)
    img = img.resize((MAX_IMAGE_WIDTH, new_height), Image.Resampling.LANCZOS)
    buf = BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()
The interface is divided into three navigation tabs, so the user can choose how to provide an image: example products, an uploaded image, or a photo taken with the camera.
We repeat the same logical flow in all three tabs: display the resized image, then run the analysis when the user clicks the analyze button.
State management is handled through Streamlit’s session state, tracking selected examples and analysis status.
def main():
    if 'selected_example' not in st.session_state:
        st.session_state.selected_example = None
    if 'analyze_clicked' not in st.session_state:
        st.session_state.analyze_clicked = False

    tab_examples, tab_upload, tab_camera = st.tabs([
        "📚 Example Products",
        "📤 Upload Image",
        "📸 Take Photo"
    ])

    with tab_examples:
        example_images = {
            "🥤 Energy Drink": "images/bournvita.jpg",
            "🥔 Potato Chips": "images/lays.jpg",
            "🧴 Shampoo": "images/shampoo.jpg"
        }
        cols = st.columns(3)
        for idx, (name, path) in enumerate(example_images.items()):
            with cols[idx]:
                if st.button(name, use_container_width=True):
                    st.session_state.selected_example = path
                    st.session_state.analyze_clicked = False

    with tab_upload:
        uploaded_file = st.file_uploader(
            "Upload product image",
            type=["jpg", "jpeg", "png"],
            help="Upload a clear image of the product's ingredient list"
        )
        if uploaded_file:
            resized_image = resize_image_for_display(uploaded_file)
            st.image(resized_image, caption="Uploaded Image", use_container_width=False, width=MAX_IMAGE_WIDTH)
            if st.button("🔍 Analyze Uploaded Image", key="analyze_upload"):
                temp_path = save_uploaded_file(uploaded_file)
                analyze_image(temp_path)
                os.unlink(temp_path)

    with tab_camera:
        camera_photo = st.camera_input("Take a picture of the product")
        if camera_photo:
            resized_image = resize_image_for_display(camera_photo)
            st.image(resized_image, caption="Captured Photo", use_container_width=False, width=MAX_IMAGE_WIDTH)
            if st.button("🔍 Analyze Captured Photo", key="analyze_camera"):
                temp_path = save_uploaded_file(camera_photo)
                analyze_image(temp_path)
                os.unlink(temp_path)

    if st.session_state.selected_example:
        st.divider()
        st.subheader("Selected Product")
        resized_image = resize_image_for_display(st.session_state.selected_example)
        st.image(resized_image, caption="Selected Example", use_container_width=False, width=MAX_IMAGE_WIDTH)
        if st.button("🔍 Analyze Example", key="analyze_example") and not st.session_state.analyze_clicked:
            st.session_state.analyze_clicked = True
            analyze_image(st.session_state.selected_example)
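The script still needs an entry point that calls main(), and the app is then launched with the Streamlit CLI. Assuming everything above lives in a single file named app.py (the filename is just an example), the final lines and run command look like this:

if __name__ == "__main__":
    main()

# Launch the app from the terminal:
#   streamlit run app.py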
Multimodal AI agents represent a major leap forward in how we interact with and understand complex information in our daily lives. By combining vision processing, natural language understanding, and web search capabilities, systems like the Product Ingredient Analyzer can provide instant, comprehensive analysis of products and their ingredients, making informed decision-making more accessible to everyone.
Q. What are some open-source multimodal vision-language models?
A. LLaVA (Large Language and Vision Assistant), Pixtral-12B by Mistral AI, Multimodal-GPT by OpenFlamingo, NVILA by NVIDIA, and the Qwen-VL models are a few open-source (or open-weights) multimodal vision-language models that process text and images for tasks like visual question answering.
Q. Is Llama 3 multimodal?
A. The original Llama 3 models are text-only, but the Llama 3.2 Vision models (11B and 90B parameters) process both text and images, enabling tasks like image captioning and visual reasoning.
Q. What is the difference between a multimodal LLM and a multimodal agent?
A. A multimodal Large Language Model (LLM) processes and generates data across modalities such as text, images, and audio. A multimodal agent, in contrast, uses such a model to interact with its environment, perform tasks, and make decisions based on multimodal inputs, often integrating additional tools and systems to execute complex actions.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.