Build a Multimodal Agent for Product Ingredient Analysis

Tarun R Jain Last Updated : 20 Jan, 2025
8 min read

Have you ever found yourself staring at a product’s ingredients list, googling unfamiliar chemical names to figure out what they mean? It’s a common struggle – deciphering complex product information on the spot can be overwhelming and time-consuming. Traditional methods, like searching for each ingredient individually, often lead to fragmented and confusing results. But what if there was a smarter, faster way to analyze product ingredients and get clear, actionable insights instantly? In this article, we’ll walk you through building a Product Ingredients Analyzer using Gemini 2.0, Phidata, and Tavily Web Search. Let’s dive in and make sense of those ingredient lists once and for all!

Learning Objectives

  • Design a Multimodal AI Agent architecture using Phidata and Gemini 2.0 for vision-language tasks.
  • Integrate Tavily Web Search into agent workflows for better context and information retrieval.
  • Build a Product Ingredient Analyzer Agent that combines image processing and web search for detailed product insights.
  • Learn how system prompts and instructions guide agent behavior in multimodal tasks.
  • Develop a Streamlit UI for real-time image analysis, nutrition details, and health-based suggestions.

This article was published as a part of the Data Science Blogathon.

What are Multimodal Systems?

Multimodal systems process and understand multiple types of input data—like text, images, audio, and video—simultaneously. Vision-language models, such as Gemini 2.0 Flash, GPT-4o, Claude 3.5 Sonnet, and Pixtral-12B, excel at understanding relationships between these modalities, extracting meaningful insights from complex inputs.

In this context, we focus on vision-language models that analyze images and generate textual insights. These systems combine computer vision and natural language processing to interpret visual information based on user prompts.

Multimodal Real-world Use Cases

Multimodal systems are transforming industries:

  • Finance: Users can take screenshots of unfamiliar terms in online forms and get instant explanations.
  • E-commerce: Shoppers can photograph product labels to receive detailed ingredient analysis and health insights.
  • Education: Students can capture textbook diagrams and receive simplified explanations.
  • Healthcare: Patients can scan medical reports or prescription labels for simplified explanations of terms and dosage instructions.

Why Multimodal Agent?

The shift from single-mode AI to multimodal agents marks a major leap in how we interact with AI systems. Here’s what makes multimodal agents so effective:

  • They process both visual and textual information simultaneously, delivering more accurate and context-aware responses.
  • They simplify complex information, making it accessible to users who may struggle with technical terms or detailed content.
  • Instead of manually searching for individual components, users can upload an image and receive comprehensive analysis in one step.
  • By combining tools like web search and image analysis, they provide more complete and reliable insights.

Building Product Ingredient Analyzer Agent

(Agent workflow diagram. Source: Author)

Let’s break down the implementation of the Product Ingredient Analyzer Agent:

Step 1: Setup Dependencies 

  • Gemini 2.0 Flash: Handles multimodal processing with enhanced vision capabilities
  • Tavily Search: Provides web search integration for additional context
  • Phidata: Orchestrates the Agent system and manages workflows
  • Streamlit: Turns the prototype into a web-based application
!pip install phidata google-generativeai tavily-python streamlit pillow

Step 2: API Setup and Configuration 

In this step, we will set up the environment variables and gather the required API credentials to run this use case. 

from phi.agent import Agent
from phi.model.google import Gemini # needs an API key
from phi.tools.tavily import TavilyTools # also needs an API key

import os
TAVILY_API_KEY = "<replace-your-api-key>"
GOOGLE_API_KEY = "<replace-your-api-key>"
os.environ['TAVILY_API_KEY'] = TAVILY_API_KEY
os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY
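
Hardcoding keys is fine for a quick test, but if you prefer to keep them out of the script, a minimal alternative sketch (standard library only; the prompt strings are illustrative) is to read them interactively:

import os
from getpass import getpass  # standard library; keeps keys out of the source file

# Prompt only if the keys are not already set in the environment
if "TAVILY_API_KEY" not in os.environ:
    os.environ["TAVILY_API_KEY"] = getpass("Tavily API key: ")
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass("Google API key: ")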

Step 3: System Prompt and Instructions

To get better responses from language models, you need to write better prompts. This involves clearly defining the role and providing detailed instructions in the system prompt for the LLM.

Let’s define the role and responsibilities of an Agent with expertise in ingredient analysis and nutrition. The instructions should guide the Agent to systematically analyze food products, assess ingredients, consider dietary restrictions, and evaluate health implications.

SYSTEM_PROMPT = """
You are an expert Food Product Analyst specialized in ingredient analysis and nutrition science. 
Your role is to analyze product ingredients, provide health insights, and identify potential concerns by combining ingredient analysis with scientific research. 
You utilize your nutritional knowledge and research works to provide evidence-based insights, making complex ingredient information accessible and actionable for users.
Return your response in Markdown format. 
"""

INSTRUCTIONS = """
* Read ingredient list from product image 
* Remember the user may not be educated about the product; break it down in simple words, like explaining to a 10-year-old kid
* Identify artificial additives and preservatives
* Check against major dietary restrictions (vegan, halal, kosher). Include this in response. 
* Rate nutritional value on a scale of 1-5
* Highlight key health implications or concerns
* Suggest healthier alternatives if needed
* Provide brief evidence-based recommendations
* Use the Search tool to gather additional context
"""

Step 4: Define the Agent Object

The Agent, built using Phidata, is configured to render Markdown output and operate based on the system prompt and instructions defined earlier. The model used in this example is Gemini 2.0 Flash, known for its strong ability to understand images and videos.

For tool integration, we will use Tavily Search, an advanced web search engine that provides relevant context directly in response to user queries, avoiding unnecessary descriptions, URLs, and irrelevant parameters.

agent = Agent(
    model=Gemini(id="gemini-2.0-flash-exp"),
    tools=[TavilyTools()],
    markdown=True,
    system_prompt=SYSTEM_PROMPT,
    instructions=INSTRUCTIONS,
)
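
Before passing images, you can optionally sanity-check that the API keys and the Tavily tool are wired up correctly with a plain text query; the question below is just an illustrative example.

# Optional sanity check: a text-only query that should trigger the Tavily search tool
agent.print_response(
    "Is aspartame considered safe in moderate amounts?",
    stream=True,
)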

Step 5: Multimodal – Understanding the Image

With the Agent components now in place, the next step is to provide user input. This can be done in two ways: either by passing the image path or the URL, along with a user prompt specifying what information needs to be extracted from the provided image.

Approach 1: Using the Image Path

agent.print_response(
    "Analyze the product image",
    images = ["images/bournvita.jpg"],
    stream=True
)

Output:

Approach 2: Using a URL

agent.print_response(
    "Analyze the product image",
    images = ["https://beardo.in/cdn/shop/products/9_2ba7ece4-0372-4a34-8040-5dc40c89f103.jpg?v=1703589764&width=1946"],
    stream=True
)

Output:


Step 6: Develop the Web App using Streamlit

Now that we know how to execute the Multimodal Agent, let’s build the UI part using Streamlit. 

import streamlit as st
from PIL import Image
from io import BytesIO
from tempfile import NamedTemporaryFile

st.title("🔍 Product Ingredient Analyzer")

To optimize performance, define the Agent inference inside a cached function. The cache decorator improves efficiency by reusing the same Agent instance across runs.

Since Streamlit reruns the entire script after every event loop or widget trigger, adding st.cache_resource ensures the function is executed only once and its result is served from the cache afterwards.

@st.cache_resource
def get_agent():
    return Agent(
        model=Gemini(id="gemini-2.0-flash-exp"),
        system_prompt=SYSTEM_PROMPT,
        instructions=INSTRUCTIONS,
        tools=[TavilyTools(api_key=os.getenv("TAVILY_API_KEY"))],
        markdown=True,
    )

When the user provides a new image path, the analyze_image function runs and executes the Agent object returned by get_agent. For both real-time capture and uploaded images, the file needs to be saved temporarily for processing.

The image is stored in a temporary file, and once execution is complete, the temporary file is deleted to free up resources. This can be done using the NamedTemporaryFile function from the tempfile library.

def analyze_image(image_path):
    agent = get_agent()
    with st.spinner('Analyzing image...'):
        response = agent.run(
            "Analyze the given image",
            images=[image_path],
        )
        st.markdown(response.content)

def save_uploaded_file(uploaded_file):
    with NamedTemporaryFile(dir='.', suffix='.jpg', delete=False) as f:
        f.write(uploaded_file.getbuffer())
        return f.name

Images selected by the user will come in varying resolutions and sizes. To maintain a consistent layout and display the image properly, we resize the uploaded or captured image so it fits clearly on the screen.

The LANCZOS resampling algorithm provides high-quality resizing, particularly beneficial for product images where text clarity is crucial for ingredient analysis.

MAX_IMAGE_WIDTH = 300

def resize_image_for_display(image_file):
    img = Image.open(image_file)
    
    aspect_ratio = img.height / img.width
    new_height = int(MAX_IMAGE_WIDTH * aspect_ratio)
    img = img.resize((MAX_IMAGE_WIDTH, new_height), Image.Resampling.LANCZOS)
    
    buf = BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

Step 7: UI Features for Streamlit 

The interface is divided into three navigation tabs from which the user can choose:

  • Tab-1: Example Products that users can select to test the app
  • Tab-2: Upload an Image of your choice if it’s already saved.
  • Tab-3: Capture or Take a live photo and analyze the product.

We repeat the same logical flow for all three tabs:

  • First, choose the image of your choice and resize it for display on the Streamlit UI using st.image.
  • Second, save that image to a temporary file so it can be passed to the Agent object.
  • Third, analyze the image, which runs the Agent using the Gemini 2.0 model and the Tavily Search tool.

State management is handled through Streamlit’s session state, tracking selected examples and analysis status. 

def main():
    if 'selected_example' not in st.session_state:
        st.session_state.selected_example = None
    if 'analyze_clicked' not in st.session_state:
        st.session_state.analyze_clicked = False
    
    tab_examples, tab_upload, tab_camera = st.tabs([
        "📚 Example Products", 
        "📤 Upload Image", 
        "📸 Take Photo"
    ])
    
    with tab_examples:
        example_images = {
            "🥤 Energy Drink": "images/bournvita.jpg",
            "🥔 Potato Chips": "images/lays.jpg",
            "🧴 Shampoo": "images/shampoo.jpg"
        }
        
        cols = st.columns(3)
        for idx, (name, path) in enumerate(example_images.items()):
            with cols[idx]:
                if st.button(name, use_container_width=True):
                    st.session_state.selected_example = path
                    st.session_state.analyze_clicked = False
    
    with tab_upload:
        uploaded_file = st.file_uploader(
            "Upload product image", 
            type=["jpg", "jpeg", "png"],
            help="Upload a clear image of the product's ingredient list"
        )
        if uploaded_file:
            resized_image = resize_image_for_display(uploaded_file)
            st.image(resized_image, caption="Uploaded Image", use_container_width=False, width=MAX_IMAGE_WIDTH)
            if st.button("🔍 Analyze Uploaded Image", key="analyze_upload"):
                temp_path = save_uploaded_file(uploaded_file)
                analyze_image(temp_path)
                os.unlink(temp_path) 
    
    with tab_camera:
        camera_photo = st.camera_input("Take a picture of the product")
        if camera_photo:
            resized_image = resize_image_for_display(camera_photo)
            st.image(resized_image, caption="Captured Photo", use_container_width=False, width=MAX_IMAGE_WIDTH)
            if st.button("🔍 Analyze Captured Photo", key="analyze_camera"):
                temp_path = save_uploaded_file(camera_photo)
                analyze_image(temp_path)
                os.unlink(temp_path) 
    
    if st.session_state.selected_example:
        st.divider()
        st.subheader("Selected Product")
        resized_image = resize_image_for_display(st.session_state.selected_example)
        st.image(resized_image, caption="Selected Example", use_container_width=False, width=MAX_IMAGE_WIDTH)
        
        if st.button("🔍 Analyze Example", key="analyze_example") and not st.session_state.analyze_clicked:
            st.session_state.analyze_clicked = True
            analyze_image(st.session_state.selected_example)
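
The listing above defines main() but does not call it; in the full script the app presumably ends with a standard entry point and is launched through Streamlit's CLI. A minimal sketch, assuming the file is saved as app.py:

if __name__ == "__main__":
    main()  # render the tabs and run the analysis flow when Streamlit executes the script

# Launch from the terminal (the file name app.py is an assumption):
#   streamlit run app.py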
  • You can find the full code here.
  • Replace the “<replace-your-api-key>” placeholders with your keys.
  • For tab_examples, you need an images folder containing the example product images. The GitHub repository with the images directory is linked here.
  • If you want to try the use case directly, the deployed app is linked here.

Conclusion 

Multimodal AI agents represent a major leap forward in how we can interact with and understand complex information in our daily lives. By combining vision processing, natural language understanding, and web search capabilities, these systems, like the Product Ingredient Analyzer, can provide instant, comprehensive analysis of products and their ingredients, making informed decision-making more accessible to everyone.

Key Takeaways

  • Multimodal AI agents improve how we understand product information by combining text and image analysis.
  • Phidata, an open-source framework, lets us build and manage agent systems powered by models like GPT-4o and Gemini 2.0.
  • Agents use tools such as vision processing and web search to make their analysis more complete and accurate; since LLMs have limited knowledge on their own, tools help them handle complex tasks.
  • Streamlit makes it easy to build web apps around LLM-based tools such as RAG pipelines and multimodal agents.
  • Well-crafted system prompts and instructions guide the agent toward useful and accurate responses.

Frequently Asked Questions

Q1. Which multimodal vision-language models are open source?

A. LLaVA (Large Language and Vision Assistant), Pixtral-12B by Mistral AI, Multimodal-GPT (built on OpenFlamingo), NVILA by NVIDIA, and the Qwen-VL models are a few open-source or open-weights multimodal vision-language models that process text and images for tasks like visual question answering.

Q2. Is Llama 3 Multimodal?

A. The original Llama 3 models are text-only, but the Llama 3.2 Vision models (11B and 90B parameters) process both text and images, enabling tasks like image captioning and visual reasoning.

Q3. How is Multimodal LLM different from Multimodal Agent?

A. A Multimodal Large Language Model (LLM) processes and generates data across various modalities, such as text, images, and audio. In contrast, a Multimodal Agent utilizes such models to interact with its environment, perform tasks, and make decisions based on multimodal inputs, often integrating additional tools and systems to execute complex actions.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Data Scientist at AI Planet || YouTube- AIWithTarun || Google Developer Expert in ML || Won 5 AI hackathons || Co-organizer of TensorFlow User Group Bangalore || Pie & AI Ambassador at DeepLearningAI
