Gemini 2.0: Google’s New Model for the Agentic Era

Janvi Kumari Last Updated : 13 Dec, 2024

Google DeepMind has launched Gemini 2.0, its latest milestone in artificial intelligence, which marks the beginning of a new era in agentic AI. The announcement was made by Demis Hassabis, CEO of Google DeepMind, and Koray Kavukcuoglu, CTO of Google DeepMind, on behalf of the Gemini team.

A Note from Sundar Pichai

Sundar Pichai, CEO of Google and Alphabet, highlighted how Gemini 2.0 advances Google’s mission of organizing the world’s information to make it both accessible and actionable. Gemini 2.0 represents a leap in making technology more useful and impactful by processing information across diverse inputs and outputs.

Pichai pointed to the introduction of Gemini 1.0 last December as a milestone in multimodal AI: a model capable of understanding and processing text, video, images, audio, and code. Together with Gemini 1.5, it has enabled millions of developers to build within Google’s ecosystem, including the seven Google products that each have over 2 billion users. NotebookLM was cited as a prime example of the transformative power of multimodality and long-context capabilities.

Reflecting on the past year, Pichai discussed Google’s focus on agentic AI—models designed to understand their environment, plan multiple steps ahead, and take supervised actions. For instance, agentic AI could power tools like universal assistants that organize schedules, offer real-time navigation suggestions, or perform complex data analysis for businesses. The launch of Gemini 2.0 marks a significant leap forward, showcasing Google’s progress toward these practical and impactful applications.

The experimental release of Gemini 2.0 Flash is now available to developers and testers. Alongside it, Google is introducing Deep Research, a feature in Gemini Advanced that explores complex topics and compiles reports on the user’s behalf. Additionally, AI Overviews, a popular feature reaching 1 billion users, will use Gemini 2.0’s reasoning capabilities to tackle more complex queries, with broader availability planned for early next year.

Pichai also mentioned that Gemini 2.0 is built on a decade of innovation and powered entirely by Trillium, Google’s sixth-generation TPUs. This technological foundation represents a major step in making information not only accessible but also actionable and impactful.

What is Gemini 2.0 Flash?

The first release in the Gemini 2.0 family is an experimental model called Gemini 2.0 Flash. Designed as a workhorse model, it delivers low latency and enhanced performance, embodying cutting-edge technology at scale. This model sets a new benchmark for efficiency and capability in AI applications.

Gemini 2.0 Flash builds on the success of 1.5 Flash, a widely popular model among developers. It outperforms 1.5 Pro on key benchmarks at twice the speed, while keeping the fast response times expected of a Flash model. It also introduces new capabilities: in addition to multimodal inputs like images, video, and audio, it supports multimodal outputs such as natively generated images mixed with text and steerable multilingual text-to-speech (TTS) audio. Additionally, it can natively call tools like Google Search and code execution, as well as third-party user-defined functions.

The goal is to make these models accessible safely and quickly. Over the past month, early experimental versions of Gemini 2.0 were shared, receiving valuable feedback from developers. Gemini 2.0 Flash is now available as an experimental model to developers via the Gemini API in Google AI Studio and Vertex AI. Multimodal input and text output are accessible to all developers, while TTS and native image generation are available to early-access partners. General availability is set for January, alongside additional model sizes.

To support dynamic and interactive applications, a new Multimodal Live API is also being released. It features real-time audio and video streaming input and the ability to use multiple, combined tools. For example, telehealth applications could leverage this API to seamlessly integrate real-time patient video feeds with diagnostic tools and conversational AI for instant medical consultations.

Key Features of Gemini 2.0 Flash

  • Better Performance: Gemini 2.0 Flash is more powerful than 1.5 Pro while maintaining speed and efficiency. Key improvements include enhanced multimodal text, code, video, spatial understanding, and reasoning performance. Spatial understanding advancements allow for more accurate bounding box generation and better object identification in cluttered images.
  • New Output Modalities: Gemini 2.0 Flash enables developers to generate integrated responses combining text, audio, and images through a single API call. Features include:
    • Multilingual native audio output: Fine-grained control over text-to-speech with high-quality voices and multiple languages.
    • Native image output: Support for conversational, multi-turn editing with interleaved text and images, ideal for multimodal content like recipes.
  • Native Tool Use: Gemini 2.0 Flash can natively call tools like Google Search and code execution, as well as custom third-party functions. This leads to more factual and comprehensive answers and enhanced information retrieval. Parallel searches improve accuracy by integrating multiple relevant facts.
  • Multimodal Live API: The API supports real-time multimodal applications with audio and video streaming inputs. It integrates tools for complex use cases, enabling conversational patterns like interruptions and voice activity detection.
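
To make the "custom third-party functions" point concrete, here is a minimal sketch of the application side of tool use: the model names a function and supplies arguments, and your code looks that function up and runs it. The `get_order_status` function, the `TOOLS` registry, and the dictionary shape of `call` are hypothetical stand-ins chosen for illustration. They are not part of the Gemini SDK, which represents function calls with its own types.

```python
from typing import Any, Callable


def get_order_status(order_id: str) -> dict[str, Any]:
    """Hypothetical user-defined tool; a real one would query a backend."""
    return {"order_id": order_id, "status": "shipped"}


# Registry mapping tool names to local callables (hypothetical).
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {
    "get_order_status": get_order_status,
}


def dispatch(call: dict[str, Any]) -> dict[str, Any]:
    """Run the tool requested by the model and return its result."""
    name = call["name"]
    if name not in TOOLS:
        # Never execute a function the application did not register.
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("args", {}))


# Stand-in for a function call emitted by the model.
result = dispatch({"name": "get_order_status", "args": {"order_id": "A123"}})
print(result)
```

Keeping an explicit registry means the model can only trigger functions the application has deliberately exposed, which matters once agents start acting on a user's behalf.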

Benchmark Comparison: Gemini 2.0 Flash vs. Previous Models

[Figure: Gemini 2.0 benchmark results. Source: Google]

Gemini 2.0 Flash demonstrates significant improvements across multiple benchmarks compared to its predecessors, Gemini 1.5 Flash and Gemini 1.5 Pro. Key highlights include:

  • General Performance (MMLU-Pro): Gemini 2.0 Flash scores 76.4%, outperforming Gemini 1.5 Pro’s 75.8%.
  • Code Generation (Natural2Code): A substantial leap to 92.9%, compared to 85.4% for Gemini 1.5 Pro.
  • Factuality (FACTS Grounding): Achieves 83.6%, indicating enhanced accuracy in generating factual responses.
  • Math Reasoning (MATH): Scores 89.7%, excelling in complex problem-solving tasks.
  • Image Understanding (MMMU): Demonstrates multimodal advancements with a 70.7% score, surpassing Gemini 1.5 models.
  • Video Understanding (EgoSchema): Scores 71.5%, reflecting its enhanced video analysis capabilities.

These results showcase Gemini 2.0 Flash’s enhanced multimodal capabilities, reasoning skills, and ability to tackle complex tasks with greater precision and efficiency.
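
For the two benchmarks where the list above gives scores for both models, the size of the improvement is simple arithmetic. The short sketch below computes the gain in percentage points. It uses only the figures quoted above, and the variable and function names are illustrative.

```python
# Scores (percent) quoted in the list above: (Gemini 1.5 Pro, Gemini 2.0 Flash).
scores = {
    "MMLU-Pro": (75.8, 76.4),
    "Natural2Code": (85.4, 92.9),
}


def gain_in_points(old: float, new: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(new - old, 1)


gains = {name: gain_in_points(old, new) for name, (old, new) in scores.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

This works out to a 0.6-point gain on MMLU-Pro and a 7.5-point gain on Natural2Code, so the code-generation improvement is by far the larger of the two.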

Gemini 2.0 in the Gemini App

Starting today, Gemini users globally can access a chat-optimized version of 2.0 Flash by selecting it in the model drop-down on desktop and mobile web. It will soon be available in the Gemini mobile app, offering an enhanced AI assistant experience. Early next year, Gemini 2.0 will be expanded to more Google products.

Agentic Experiences Powered by Gemini 2.0

Gemini 2.0 Flash’s advanced capabilities, including multimodal reasoning, long-context understanding, complex instruction following, and native tool use, enable a new class of agentic experiences. These advancements are being explored through research prototypes:

Project Astra

A universal AI assistant with enhanced dialogue, memory, and tool use, now being tested on prototype glasses.

Project Mariner

A browser-focused AI agent capable of understanding and interacting with web elements.

Jules

An AI-powered code agent integrated into GitHub workflows to assist developers.

Agents in Games and Beyond

Google DeepMind has a history of using games to refine AI models’ abilities in logic, planning, and rule-following. Recently, the Genie 2 model was introduced, capable of generating diverse 3D worlds from a single image. Building on this tradition, Gemini 2.0 powers agents that help players navigate video games by reasoning about on-screen action and offering real-time suggestions.

In collaboration with developers like Supercell, Gemini-powered agents are being tested on games ranging from strategy titles like “Clash of Clans” to simulators like “Hay Day.” These agents can also access Google Search to connect users with extensive gaming knowledge.

Beyond gaming, these projects highlight the potential of AI agents to assist with complex tasks in other domains, including web navigation and physical robotics.

Gemini 2.0 Flash: Experimental Preview Release

Gemini 2.0 Flash is now available as an experimental preview release through the Vertex AI Gemini API and Vertex AI Studio. The model introduces new features and enhanced core capabilities:

Multimodal Live API: This new API helps create real-time vision and audio streaming applications with tool use.

Let’s Try Gemini 2.0 Flash

Task 1. Generating Content with Gemini 2.0

You can use the Gemini 2.0 API to generate content by providing a prompt. Here’s how to do it using the Google Gen AI SDK:

Setup

First, install the SDK:

pip install google-genai

Then, use the SDK in Python:

from google import genai

# Initialize the client for Vertex AI
client = genai.Client(
    vertexai=True, project='YOUR_CLOUD_PROJECT', location='us-central1'
)

# Generate content using the Gemini 2.0 model
response = client.models.generate_content(
    model='gemini-2.0-flash-exp', contents='How does AI work?'
)

# Print the generated content
print(response.text)

Output:

Alright, let's dive into how AI works. It's a broad topic, but we can break it down
into key concepts.
The Core Idea: Learning from Data
At its heart, most AI today operates on the principle of learning from data. Instead
of being explicitly programmed with rules for every situation, AI systems are
designed to identify patterns, make predictions, and learn from examples. Think of
it like teaching a child by showing them lots of pictures and labeling them.

Key Concepts and Techniques
Here's a breakdown of some of the core elements involved:
Data:
The Fuel: AI algorithms are hungry for data. The more data they have, the better
they can learn and perform.
Variety: Data can come in many forms: text, images, audio, video, numerical data,
and more.
Quality: The quality of the data is crucial. Noisy, biased, or incomplete data can
lead to poor AI performance.
Algorithms:
The Brains: Algorithms are the set of instructions that AI systems follow to process
data and learn.
Different Types: There are many different types of algorithms, each suited for
different tasks:
Supervised Learning: The algorithm learns from labeled data (e.g., "this is a cat,"
"this is a dog"). It's like being shown the answer key.
Unsupervised Learning: The algorithm learns from unlabeled data, trying to find
patterns and structure on its own. Think of grouping similar items without being
told what the categories are.
Reinforcement Learning: The algorithm learns by trial and error, receiving rewards
or penalties for its actions. This is common in game-playing AI.
Machine Learning (ML):
The Learning Process: ML is the primary method that powers much of AI today. It
encompasses various techniques for enabling computers to learn from data without
explicit programming.
Common Techniques:
Linear Regression: Predicting a numerical output based on a linear relationship with
input variables (e.g., house price based on size).
Logistic Regression: Predicting a categorical output (e.g., spam or not spam).
Decision Trees: Creating tree-like structures to classify or predict outcomes based
on a series of decisions.
Support Vector Machines (SVMs): Finding the optimal boundary to separate different
classes of data.
Clustering Algorithms: Grouping similar data points together (e.g., customer
segmentation).
Neural Networks: Complex interconnected networks of nodes (inspired by the human
brain) that are particularly powerful for complex pattern recognition.
Deep Learning (DL):
A Subset of ML: Deep learning is a specific type of machine learning that uses
artificial neural networks with multiple layers (hence "deep").
Powerful Feature Extraction: Deep learning excels at automatically learning
hierarchical features from raw data, reducing the need for manual feature
engineering.
Applications: Used in tasks like image recognition, natural language processing, and
speech synthesis.
Examples of Deep Learning Architectures:
Convolutional Neural Networks (CNNs): Used for image and video analysis.
Recurrent Neural Networks (RNNs): Used for sequence data like text and time series.
Transformers: Powerful neural network architecture used for natural language
processing.
Training:
The Learning Phase: During training, the AI algorithm adjusts its internal
parameters based on the data it's fed, attempting to minimize errors.
Iterations: Training often involves multiple iterations over the data.
Validation: Data is often split into training and validation sets to avoid
overfitting (where the model performs well on the training data but poorly on new
data).
Inference:
Using the Learned Model: Once the model is trained, it can be used to make
predictions or classifications on new, unseen data.
Simplified Analogy
Imagine you want to teach a computer to identify cats.
Data: You provide thousands of pictures of cats (and maybe some non-cat pictures
too, labeled correctly).
Algorithm: You choose a neural network algorithm suitable for image recognition.
Training: The algorithm looks at the pictures, learns patterns (edges, shapes,
colors), and adjusts its internal parameters to distinguish cats from other objects.
Inference: Now, when you show the trained AI a new picture, it can (hopefully)
correctly identify whether there's a cat in it.
Beyond the Basics
It's worth noting that the field of AI is constantly evolving, and other key areas
include:
Natural Language Processing (NLP): Enabling computers to understand, interpret, and
generate human language.
Computer Vision: Enabling computers to "see" and interpret images and videos.
Robotics: Combining AI with physical robots to perform tasks in the real world.
Explainable AI (XAI): Making AI decisions more transparent and understandable.
Ethical Considerations: Addressing issues like bias, privacy, and the societal
impact of AI.
In a Nutshell
AI works by leveraging large amounts of data, powerful algorithms, and learning
techniques to enable computers to perform tasks that typically require human
intelligence. It's a rapidly advancing field with a wide range of applications and
potential to transform various aspects of our lives.
Let me know if you have any specific areas you'd like to explore further!

Task 2. Multimodal Live API Example (Real-time Interaction)

The Multimodal Live API allows you to interact with the model using voice, video, and text. Below is an example of a simple text-to-text interaction where you ask a question and receive a response:

import asyncio

from google import genai

# Initialize the client for the Live API
client = genai.Client()

# Define the model ID and configuration for text responses
model_id = "gemini-2.0-flash-exp"
config = {"response_modalities": ["TEXT"]}


async def main():
    # Start a real-time session
    async with client.aio.live.connect(model=model_id, config=config) as session:
        message = "Hello? Gemini, are you there?"
        print("> ", message, "\n")

        # Send the message and mark the end of the user's turn
        await session.send(input=message, end_of_turn=True)

        # Receive and print responses, skipping messages that carry no text
        async for response in session.receive():
            if response.text:
                print(response.text)


# "async with" is only valid inside a coroutine, so run one with asyncio
asyncio.run(main())

Output:

Yes,

I am here.

How can I help you today?

This code demonstrates a real-time conversation using the Multimodal Live API, where you send a message, and the model responds interactively.
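
As the sample output shows, the reply arrives in several chunks rather than as one string. If you want the full reply as a single value, one option is to collect the chunks as they stream in. The sketch below illustrates the idea offline: `fake_receive` is a hypothetical stand-in for `session.receive()` that yields objects with a `text` attribute, and the chunk boundaries are made up for the example.

```python
import asyncio
from types import SimpleNamespace
from typing import AsyncIterator


async def fake_receive() -> AsyncIterator[SimpleNamespace]:
    """Stand-in for session.receive(); yields chunks like the sample output."""
    for text in ["Yes,", " I am here.", None, " How can I help you today?"]:
        yield SimpleNamespace(text=text)


async def collect_reply(responses: AsyncIterator[SimpleNamespace]) -> str:
    """Concatenate streamed text chunks, skipping messages without text."""
    parts = []
    async for response in responses:
        if response.text:
            parts.append(response.text)
    return "".join(parts)


reply = asyncio.run(collect_reply(fake_receive()))
print(reply)
```

Printing each chunk immediately gives the most responsive experience, while collecting them is handy when the full text needs to be logged or post-processed.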

Task 3. Using Google Search as a Tool

To improve the accuracy and recency of responses, you can use Google Search as a tool. Here’s how to implement Search as a Tool:

from google import genai
from google.genai.types import Tool, GenerateContentConfig, GoogleSearch

# Initialize the client
client = genai.Client()

# Define the Search tool
google_search_tool = Tool(
    google_search=GoogleSearch()
)

# Generate content using Gemini 2.0, enhanced with Google Search
response = client.models.generate_content(
    model="gemini-2.0-flash-exp",
    contents="When is the next total solar eclipse in the United States?",
    config=GenerateContentConfig(
        tools=[google_search_tool],
        response_modalities=["TEXT"]
    )
)

# Print the response, including search grounding
for each in response.candidates[0].content.parts:
    print(each.text)

# Access grounding metadata for further information
print(response.candidates[0].grounding_metadata.search_entry_point.rendered_content)

Output:

The next total solar eclipse visible in the United States will occur on April 8, 
2024.
<https://www.timeanddate.com/eclipse/solar/2024-april-8>The next total solar eclipse
in the US will be on April 8, 2024, and will be visible across the eastern half of
the United States. It will be the first coast-to-coast total eclipse visible in the
US in seven years. It will enter the US in Texas, travel through Oklahoma,
Arkansas, Missouri, Illinois, Kentucky, Indiana, Ohio, Pennsylvania, New York,
Vermont, and New Hampshire. Then it will exit the US through Maine.

In this example, the model uses Google Search to fetch current information, which improves its ability to answer questions about specific events or topics. Grounded answers depend on the search results returned at request time, so your output may differ from the sample above.
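
The last line of the example reaches several attributes deep into the response. As a defensive sketch, and assuming that grounding metadata can be missing (for instance, if the model answers without searching), a small helper can return `None` instead of raising. The helper name and the `SimpleNamespace` stand-in objects are hypothetical; only the attribute path is taken from the example above.

```python
from types import SimpleNamespace
from typing import Any, Optional


def rendered_search_html(response: Any) -> Optional[str]:
    """Return search_entry_point.rendered_content, or None if it is absent."""
    candidates = getattr(response, "candidates", None) or []
    if not candidates:
        return None
    metadata = getattr(candidates[0], "grounding_metadata", None)
    entry_point = getattr(metadata, "search_entry_point", None)
    return getattr(entry_point, "rendered_content", None)


# Hypothetical stand-ins for responses with and without grounding.
grounded = SimpleNamespace(candidates=[SimpleNamespace(
    grounding_metadata=SimpleNamespace(
        search_entry_point=SimpleNamespace(rendered_content="<div>chips</div>")))])
ungrounded = SimpleNamespace(candidates=[SimpleNamespace(grounding_metadata=None)])

print(rendered_search_html(grounded))
print(rendered_search_html(ungrounded))
```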

Task 4. Bounding Box Detection in Images

For object detection and localization within images or video frames, Gemini 2.0 supports bounding box detection. Here’s how you can use it:

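import json
import urllib.request

from google import genai
from google.genai import types

# Initialize the client (credentials are read from the environment)
client = genai.Client()

# Specify the model ID and download the image bytes (placeholder URL)
model_id = "gemini-2.0-flash-exp"
image_url = "https://example.com/image.jpg"
with urllib.request.urlopen(image_url) as f:
    image_bytes = f.read()

# Ask for boxes as JSON; the response has no dedicated bounding-box field,
# so the coordinates come back as text that we parse ourselves
prompt = (
    "Detect the objects in this image. Return a JSON list where each entry has "
    '"label" and "box_2d" as [y_min, x_min, y_max, x_max] normalized to 0-1000.'
)
response = client.models.generate_content(
    model=model_id,
    contents=[types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"), prompt],
    config=types.GenerateContentConfig(response_mime_type="application/json"),
)

# Output each label with its box coordinates [y_min, x_min, y_max, x_max]
for each in json.loads(response.text):
    print(each["label"], each["box_2d"])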

This code detects objects within an image and returns bounding boxes with coordinates that can be used for further analysis or visualization.
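
To draw or crop with the returned boxes, the coordinates usually need to be converted to pixels. The sketch below assumes the boxes use the `[y_min, x_min, y_max, x_max]` order noted in the example and are normalized to a 0-1000 grid, which is how Gemini's documentation describes box output. Treat the scale as an assumption and adjust it if your responses differ. The function name is illustrative.

```python
def to_pixel_box(box, image_width, image_height, scale=1000):
    """Convert [y_min, x_min, y_max, x_max] on a 0..scale grid
    to pixel coordinates (left, top, right, bottom)."""
    y_min, x_min, y_max, x_max = box
    left = round(x_min / scale * image_width)
    top = round(y_min / scale * image_height)
    right = round(x_max / scale * image_width)
    bottom = round(y_max / scale * image_height)
    return left, top, right, bottom


# A box covering the middle band of a 640x480 image.
pixel_box = to_pixel_box([250, 100, 750, 500], image_width=640, image_height=480)
print(pixel_box)  # (64, 120, 320, 360)
```

The `(left, top, right, bottom)` order matches what common imaging libraries expect for rectangles, so the result can be passed on for drawing or cropping.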

Notes

  • Image and Audio Generation: Currently in private experimental access (allowlist), so you may need special permissions to use image generation or text-to-speech features.
  • Real-Time Interaction: The Multimodal Live API allows real-time voice and video interactions but limits session durations to 2 minutes.
  • Google Search Integration: With Search as a Tool, you can enhance model responses with up-to-date information retrieved from the web.

These examples demonstrate the flexibility and power of the Gemini 2.0 Flash model for handling multimodal tasks and providing advanced agentic experiences. Be sure to check the official documentation for the latest updates and features.

Responsible Development in the Agentic Era

As AI technology advances, Google DeepMind remains committed to safety and responsibility. Measures include:

  • Collaborating with the Responsibility and Safety Committee to identify and mitigate risks.
  • Enhancing red-teaming approaches to optimize models for safety.
  • Implementing privacy controls, such as session deletion, to protect user data.
  • Ensuring AI agents prioritize user instructions over external malicious inputs.

Looking Ahead

The release of Gemini 2.0 Flash and the series of agentic prototypes represent an exciting milestone in AI. As researchers further explore these possibilities, Google DeepMind actively advances AI responsibly and shapes the future of the Gemini era.

Conclusion

Gemini 2.0 represents a significant leap forward in agentic AI, ushering in a new era of intelligent, interactive systems. With its advanced multimodal capabilities, improved reasoning, and the ability to execute complex tasks, Gemini 2.0 sets a new benchmark for AI performance. The launch of Gemini 2.0 Flash, along with its experimental features, offers developers powerful tools to create innovative applications across diverse domains. As Google DeepMind continues to prioritize safety and responsibility, Gemini 2.0 lays the foundation for a future in which intelligent agents seamlessly assist in both everyday tasks and specialized applications, from gaming to web navigation.

Hi, I am Janvi, a passionate data science enthusiast currently working at Analytics Vidhya. My journey into the world of data began with a deep curiosity about how we can extract meaningful insights from complex datasets.
