Build a Multimodal Agent for Product Ingredient Analysis

Tarun R Jain Last Updated : 20 Jan, 2025
8 min read

Have you ever found yourself staring at a product’s ingredients list, googling unfamiliar chemical names to figure out what they mean? It’s a common struggle – deciphering complex product information on the spot can be overwhelming and time-consuming. Traditional methods, like searching for each ingredient individually, often lead to fragmented and confusing results. But what if there was a smarter, faster way to analyze product ingredients and get clear, actionable insights instantly? In this article, we’ll walk you through building a Product Ingredients Analyzer using Gemini 2.0, Phidata, and Tavily Web Search. Let’s dive in and make sense of those ingredient lists once and for all!

Learning Objectives

  • Design a Multimodal AI Agent architecture using Phidata and Gemini 2.0 for vision-language tasks.
  • Integrate Tavily Web Search into agent workflows for better context and information retrieval.
  • Build a Product Ingredient Analyzer Agent that combines image processing and web search for detailed product insights.
  • Learn how system prompts and instructions guide agent behavior in multimodal tasks.
  • Develop a Streamlit UI for real-time image analysis, nutrition details, and health-based suggestions.

This article was published as a part of the Data Science Blogathon.

What are Multimodal Systems?

Multimodal systems process and understand multiple types of input data—like text, images, audio, and video—simultaneously. Vision-language models, such as Gemini 2.0 Flash, GPT-4o, Claude 3.5 Sonnet, and Pixtral-12B, excel at understanding relationships between these modalities, extracting meaningful insights from complex inputs.

In this context, we focus on vision-language models that analyze images and generate textual insights. These systems combine computer vision and natural language processing to interpret visual information based on user prompts.

Multimodal Real-world Use Cases

Multimodal systems are transforming industries:

  • Finance: Users can take screenshots of unfamiliar terms in online forms and get instant explanations.
  • E-commerce: Shoppers can photograph product labels to receive detailed ingredient analysis and health insights.
  • Education: Students can capture textbook diagrams and receive simplified explanations.
  • Healthcare: Patients can scan medical reports or prescription labels for simplified explanations of terms and dosage instructions.

Why Multimodal Agent?

The shift from single-mode AI to multimodal agents marks a major leap in how we interact with AI systems. Here’s what makes multimodal agents so effective:

  • They process both visual and textual information simultaneously, delivering more accurate and context-aware responses.
  • They simplify complex information, making it accessible to users who may struggle with technical terms or detailed content.
  • Instead of manually searching for individual components, users can upload an image and receive comprehensive analysis in one step.
  • By combining tools like web search and image analysis, they provide more complete and reliable insights.

Building Product Ingredient Analyzer Agent

(Agent workflow diagram. Source: Author)

Let’s break down the implementation of the Product Ingredient Analyzer Agent:

Step 1: Setup Dependencies 

  • Gemini 2.0 Flash: Handles multimodal processing with enhanced vision capabilities
  • Tavily Search: Provides web search integration for additional context
  • Phidata: Orchestrates the Agent system and manages workflows
  • Streamlit: Turns the prototype into a web-based application
!pip install phidata google-generativeai tavily-python streamlit pillow

Step 2: API Setup and Configuration 

In this step, we will set up the environment variables and gather the required API credentials to run this use case. 

from phi.agent import Agent
from phi.model.google import Gemini # needs an API key
from phi.tools.tavily import TavilyTools # also needs an API key

import os
TAVILY_API_KEY = "<replace-your-api-key>"
GOOGLE_API_KEY = "<replace-your-api-key>"
os.environ['TAVILY_API_KEY'] = TAVILY_API_KEY
os.environ['GOOGLE_API_KEY'] = GOOGLE_API_KEY
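
Hardcoding keys is fine for a quick test, but if you prefer to keep them out of the script, a minimal alternative sketch (standard library only; the prompt strings are illustrative) is to read them interactively:

import os
from getpass import getpass  # standard library; keeps keys out of the source file

# Prompt only if the keys are not already set in the environment
if "TAVILY_API_KEY" not in os.environ:
    os.environ["TAVILY_API_KEY"] = getpass("Tavily API key: ")
if "GOOGLE_API_KEY" not in os.environ:
    os.environ["GOOGLE_API_KEY"] = getpass("Google API key: ")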

Step 3: System Prompt and Instructions

To get better responses from language models, you need to write better prompts. This involves clearly defining the role and providing detailed instructions in the system prompt for the LLM.

Let’s define the role and responsibilities of an Agent with expertise in ingredient analysis and nutrition. The instructions should guide the Agent to systematically analyze food products, assess ingredients, consider dietary restrictions, and evaluate health implications.

SYSTEM_PROMPT = """
You are an expert Food Product Analyst specialized in ingredient analysis and nutrition science. 
Your role is to analyze product ingredients, provide health insights, and identify potential concerns by combining ingredient analysis with scientific research. 
You utilize your nutritional knowledge and research works to provide evidence-based insights, making complex ingredient information accessible and actionable for users.
Return your response in Markdown format. 
"""

INSTRUCTIONS = """
* Read ingredient list from product image 
* Remember the user may not be educated about the product; break it down in simple words, like explaining to a 10-year-old kid
* Identify artificial additives and preservatives
* Check against major dietary restrictions (vegan, halal, kosher). Include this in response. 
* Rate nutritional value on a scale of 1-5
* Highlight key health implications or concerns
* Suggest healthier alternatives if needed
* Provide brief evidence-based recommendations
* Use the Search tool to gather additional context
"""

Step 4: Define the Agent Object

The Agent, built using Phidata, is configured to render Markdown output and operate based on the system prompt and instructions defined earlier. The model used in this example is Gemini 2.0 Flash, known for its strong ability to understand images and videos.

For tool integration, we will use Tavily Search, an advanced web search engine that provides relevant context directly in response to user queries, avoiding unnecessary descriptions, URLs, and irrelevant parameters.

agent = Agent(
    model=Gemini(id="gemini-2.0-flash-exp"),
    tools=[TavilyTools()],
    markdown=True,
    system_prompt=SYSTEM_PROMPT,
    instructions=INSTRUCTIONS,
)
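
Before passing images, you can optionally sanity-check that the API keys and the Tavily tool are wired up correctly with a plain text query; the question below is just an illustrative example.

# Optional sanity check: a text-only query that should trigger the Tavily search tool
agent.print_response(
    "Is aspartame considered safe in moderate amounts?",
    stream=True,
)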

Step 5: Multimodal – Understanding the Image

With the Agent components now in place, the next step is to provide user input. This can be done in two ways: either by passing the image path or the URL, along with a user prompt specifying what information needs to be extracted from the provided image.

Approach 1: Using the Image Path

agent.print_response(
    "Analyze the product image",
    images = ["images/bournvita.jpg"],
    stream=True
)

Output:

Approach 2: Using a URL

agent.print_response(
    "Analyze the product image",
    images = ["https://beardo.in/cdn/shop/products/9_2ba7ece4-0372-4a34-8040-5dc40c89f103.jpg?v=1703589764&width=1946"],
    stream=True
)

Output:


Step 6: Develop the Web App using Streamlit

Now that we know how to execute the Multimodal Agent, let’s build the UI part using Streamlit. 

import streamlit as st
from PIL import Image
from io import BytesIO
from tempfile import NamedTemporaryFile

st.title("🔍 Product Ingredient Analyzer")

To optimize performance, define the Agent inference inside a cached function. The cache decorator improves efficiency by reusing the same Agent instance across runs.

Since Streamlit reruns the entire script after every event loop or widget trigger, adding st.cache_resource ensures the function is executed only once and its result is served from the cache afterwards.

@st.cache_resource
def get_agent():
    return Agent(
        model=Gemini(id="gemini-2.0-flash-exp"),
        system_prompt=SYSTEM_PROMPT,
        instructions=INSTRUCTIONS,
        tools=[TavilyTools(api_key=os.getenv("TAVILY_API_KEY"))],
        markdown=True,
    )

When the user provides a new image path, the analyze_image function runs and executes the Agent object returned by get_agent. For both real-time capture and uploaded images, the file needs to be saved temporarily for processing.

The image is stored in a temporary file, and once execution is complete, the temporary file is deleted to free up resources. This can be done using the NamedTemporaryFile function from the tempfile library.

def analyze_image(image_path):
    agent = get_agent()
    with st.spinner('Analyzing image...'):
        response = agent.run(
            "Analyze the given image",
            images=[image_path],
        )
        st.markdown(response.content)

def save_uploaded_file(uploaded_file):
    with NamedTemporaryFile(dir='.', suffix='.jpg', delete=False) as f:
        f.write(uploaded_file.getbuffer())
        return f.name

Images selected by the user will come in varying resolutions and sizes. To maintain a consistent layout and display the image properly, we resize the uploaded or captured image so it fits clearly on the screen.

The LANCZOS resampling algorithm provides high-quality resizing, particularly beneficial for product images where text clarity is crucial for ingredient analysis.

MAX_IMAGE_WIDTH = 300

def resize_image_for_display(image_file):
    img = Image.open(image_file)
    
    aspect_ratio = img.height / img.width
    new_height = int(MAX_IMAGE_WIDTH * aspect_ratio)
    img = img.resize((MAX_IMAGE_WIDTH, new_height), Image.Resampling.LANCZOS)
    
    buf = BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

Step 7: UI Features for Streamlit 

The interface is divided into three navigation tabs from which the user can choose:

  • Tab-1: Example Products that users can select to test the app
  • Tab-2: Upload an Image of your choice if it’s already saved.
  • Tab-3: Capture or Take a live photo and analyze the product.

We repeat the same logical flow for all three tabs:

  • First, choose the image of your choice and resize it for display on the Streamlit UI using st.image.
  • Second, save that image to a temporary file so it can be passed to the Agent object.
  • Third, analyze the image, which runs the Agent using the Gemini 2.0 model and the Tavily Search tool.

State management is handled through Streamlit’s session state, tracking selected examples and analysis status. 

def main():
    if 'selected_example' not in st.session_state:
        st.session_state.selected_example = None
    if 'analyze_clicked' not in st.session_state:
        st.session_state.analyze_clicked = False
    
    tab_examples, tab_upload, tab_camera = st.tabs([
        "📚 Example Products", 
        "📤 Upload Image", 
        "📸 Take Photo"
    ])
    
    with tab_examples:
        example_images = {
            "🥤 Energy Drink": "images/bournvita.jpg",
            "🥔 Potato Chips": "images/lays.jpg",
            "🧴 Shampoo": "images/shampoo.jpg"
        }
        
        cols = st.columns(3)
        for idx, (name, path) in enumerate(example_images.items()):
            with cols[idx]:
                if st.button(name, use_container_width=True):
                    st.session_state.selected_example = path
                    st.session_state.analyze_clicked = False
    
    with tab_upload:
        uploaded_file = st.file_uploader(
            "Upload product image", 
            type=["jpg", "jpeg", "png"],
            help="Upload a clear image of the product's ingredient list"
        )
        if uploaded_file:
            resized_image = resize_image_for_display(uploaded_file)
            st.image(resized_image, caption="Uploaded Image", use_container_width=False, width=MAX_IMAGE_WIDTH)
            if st.button("🔍 Analyze Uploaded Image", key="analyze_upload"):
                temp_path = save_uploaded_file(uploaded_file)
                analyze_image(temp_path)
                os.unlink(temp_path) 
    
    with tab_camera:
        camera_photo = st.camera_input("Take a picture of the product")
        if camera_photo:
            resized_image = resize_image_for_display(camera_photo)
            st.image(resized_image, caption="Captured Photo", use_container_width=False, width=MAX_IMAGE_WIDTH)
            if st.button("🔍 Analyze Captured Photo", key="analyze_camera"):
                temp_path = save_uploaded_file(camera_photo)
                analyze_image(temp_path)
                os.unlink(temp_path) 
    
    if st.session_state.selected_example:
        st.divider()
        st.subheader("Selected Product")
        resized_image = resize_image_for_display(st.session_state.selected_example)
        st.image(resized_image, caption="Selected Example", use_container_width=False, width=MAX_IMAGE_WIDTH)
        
        if st.button("🔍 Analyze Example", key="analyze_example") and not st.session_state.analyze_clicked:
            st.session_state.analyze_clicked = True
            analyze_image(st.session_state.selected_example)
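
The listing above defines main() but does not call it; in the full script the app presumably ends with a standard entry point and is launched through Streamlit's CLI. A minimal sketch, assuming the file is saved as app.py:

if __name__ == "__main__":
    main()  # render the tabs and run the analysis flow when Streamlit executes the script

# Launch from the terminal (the file name app.py is an assumption):
#   streamlit run app.py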
  • You can find the full code here.
  • Replace the “<replace-your-api-key>” placeholders with your keys.
  • For tab_examples, you need an images folder containing the example product images. The GitHub repository with the images directory is linked here.
  • If you want to try the use case directly, the deployed app is linked here.

Conclusion 

Multimodal AI agents represent a major leap forward in how we can interact with and understand complex information in our daily lives. By combining vision processing, natural language understanding, and web search capabilities, these systems, like the Product Ingredient Analyzer, can provide instant, comprehensive analysis of products and their ingredients, making informed decision-making more accessible to everyone.

Key Takeaways

  • Multimodal AI agents improve how we understand product information by combining text and image analysis.
  • Phidata, an open-source framework, lets us build and manage agent systems powered by models like GPT-4o and Gemini 2.0.
  • Agents use tools such as vision processing and web search to make their analysis more complete and accurate; since LLMs have limited knowledge on their own, tools help them handle complex tasks.
  • Streamlit makes it easy to build web apps around LLM-based tools such as RAG pipelines and multimodal agents.
  • Well-crafted system prompts and instructions guide the agent toward useful and accurate responses.

Frequently Asked Questions

Q1. Which multimodal vision-language models are open source?

A. LLaVA (Large Language and Vision Assistant), Pixtral-12B by Mistral AI, Multimodal-GPT (built on OpenFlamingo), NVILA by NVIDIA, and the Qwen-VL models are a few open-source or open-weights multimodal vision-language models that process text and images for tasks like visual question answering.

Q2. Is Llama 3 Multimodal?

A. The original Llama 3 models are text-only, but the Llama 3.2 Vision models (11B and 90B parameters) process both text and images, enabling tasks like image captioning and visual reasoning.

Q3. How is Multimodal LLM different from Multimodal Agent?

A. A Multimodal Large Language Model (LLM) processes and generates data across various modalities, such as text, images, and audio. In contrast, a Multimodal Agent utilizes such models to interact with its environment, perform tasks, and make decisions based on multimodal inputs, often integrating additional tools and systems to execute complex actions.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Data Scientist at AI Planet || YouTube- AIWithTarun || Google Developer Expert in ML || Won 5 AI hackathons || Co-organizer of TensorFlow User Group Bangalore || Pie & AI Ambassador at DeepLearningAI
