Google’s Gemma 3: Features, Benchmarks, Performance and Implementation

Shaik Hamzah | Last Updated: 13 Mar, 2025

Google’s commitment to making AI accessible leaps forward with Gemma 3, the latest addition to the Gemma family of open models. After an impressive first year—marked by over 100 million downloads and more than 60,000 community-created variants—the Gemmaverse continues to expand.

With Gemma 3, developers gain access to state-of-the-art, lightweight AI models that run efficiently on a variety of devices, from smartphones to high-end workstations. Built on the same technological foundations as Google’s powerful Gemini 2.0 models, Gemma 3 is designed for speed, portability, and responsible AI development. It also comes in a range of sizes (1B, 4B, 12B, and 27B), letting you choose the best model for your specific hardware and performance needs. Intriguing, right?

This article digs into Gemma 3’s capabilities and implementation, the introduction of ShieldGemma 2 for AI safety, and how developers can integrate these tools into their workflows.

What is Gemma 3?

Gemma 3 is Google’s latest leap in open AI. It is a family of dense models that comes in four distinct sizes – 1B, 4B, 12B, and 27B parameters – with both base (pre-trained) and instruction-tuned variants. Key highlights include:

  • Context Window:
    • 1B model: 32K tokens
    • 4B, 12B, 27B models: 128K tokens
  • Multimodality:
    • 1B variant: Text-only
    • 4B, 12B, 27B variants: Capable of processing both images and text using the SigLIP image encoder
  • Multilingual Support:
    • English only for 1B
    • Over 140 languages for larger models
  • Integration:
  • Models are hosted on the Hugging Face Hub and are seamlessly integrated with the Transformers library, making experimentation and deployment simple.

A Leap Forward in Open Models

Gemma 3 models are well-suited for various text generation and image-understanding tasks, including question answering, summarization, and reasoning. Built on the same research that powers the Gemini 2.0 models, Gemma 3 is Google’s most advanced, portable, and responsibly developed open model collection yet. Available in various sizes (1B, 4B, 12B, and 27B), it gives developers the flexibility to select the best option for their hardware and performance requirements. Whether deployed on a smartphone, a laptop, or a workstation, Gemma 3 is designed to run fast directly on devices.

Gemma 3 model overview | Source: Hugging Face

Cutting-Edge Capabilities

Gemma 3 isn’t just about size; it’s packed with features that empower developers to build next-generation AI applications:

  • Unmatched Performance: Gemma 3 delivers state-of-the-art performance for its size. In preliminary evaluations, it has outperformed models like Llama 3-405B, DeepSeek-V3, and o3-mini, allowing you to create engaging user experiences using just a single GPU or TPU host.
  • Multilingual Prowess: With out-of-the-box support for over 35 languages and pre-trained support for more than 140 languages, Gemma 3 helps you build applications that speak to a global audience.
  • Advanced Reasoning & Multimodality: Analyze images, text, and short videos seamlessly. The model introduces vision understanding via a tailored SigLIP encoder, enabling a broad range of interactive applications.
  • Expanded Context Window: A massive 128K-token context window allows your applications to process and understand vast amounts of data in one go.
  • Innovative Function Calling: Built-in support for function calling and structured outputs lets developers automate complex workflows with ease.
  • Efficiency Through Quantization: Official quantized versions (available on Hugging Face) reduce model size and computational demands without sacrificing accuracy; a loading sketch follows this list.
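To give a feel for quantization-style savings in practice, here is a minimal loading sketch. Note that it uses on-the-fly 4-bit quantization via bitsandbytes rather than Google’s official QAT checkpoints, and it reuses the community model id that appears later in this article; treat it as an illustration, not the official recipe.

import torch
from transformers import BitsAndBytesConfig, Gemma3ForConditionalGeneration

# quantize the bf16 weights to 4-bit at load time (not the official QAT checkpoint)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # store weights in 4-bit, compute in bf16
)

model = Gemma3ForConditionalGeneration.from_pretrained(
    "unsloth/gemma-3-4b-it",        # community checkpoint used later in this article
    quantization_config=bnb_config,
    device_map="auto",
)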

Technical Enhancements in Gemma 3

Gemma 3 builds on the success of its predecessor by focusing on three core enhancements: longer context length, multimodality, and multilinguality. Let’s dive into what makes Gemma 3 a technical marvel.

Longer Context Length

  • Scaling Without Re-training from Scratch: Models are initially pre-trained with 32K sequences. For the 4B, 12B, and 27B variants, the context length is efficiently scaled to 128K tokens post pre-training, saving significant compute.
  • Enhanced Positional Embeddings: The RoPE (Rotary Positional Embedding) base frequency is upgraded from 10K in Gemma 2 to 1M in Gemma 3 and then scaled by a factor of 8. This enables the models to maintain high performance even with extended context.
  • Optimized KV Cache Management: By interleaving multiple local attention layers (with a sliding window of 1024 tokens) between global layers (at a 5:1 ratio), Gemma 3 dramatically reduces the KV cache memory overhead during inference, from around 60% in global-only setups to less than 15%; a back-of-the-envelope calculation follows the figure below.
KV Caching | Source – Link
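To make the savings concrete, here is a back-of-the-envelope sketch. The hyperparameters (layer count, KV heads, head dimension) are illustrative assumptions rather than the published Gemma 3 config; the point is the ratio between the two layouts.

def kv_cache_gib(layers, context, kv_heads=8, head_dim=128, bytes_per=2):
    # 2x for K and V; bytes_per=2 assumes bf16 cache entries
    return 2 * layers * kv_heads * head_dim * context * bytes_per / 2**30

total_layers, context = 48, 128_000     # assumed depth, 128K-token context
global_layers = total_layers // 6       # one global layer per five local layers
local_layers = total_layers - global_layers

all_global = kv_cache_gib(total_layers, context)
interleaved = (kv_cache_gib(global_layers, context)
               + kv_cache_gib(local_layers, 1024))  # local layers cache only their 1024-token window

print(f"all-global: {all_global:.1f} GiB, interleaved: {interleaved:.1f} GiB")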

Multimodality

  • Vision Encoder Integration: Gemma 3 leverages the SigLIP image encoder to process images. All images are resized to a fixed 896×896 resolution for consistency. To handle non-square aspect ratios and high-resolution inputs, an adaptive “pan and scan” algorithm crops and resizes images on the fly, ensuring that critical visual details are preserved.
  • Distinct Attention Mechanisms: While text tokens use one-way (causal) attention, image tokens receive bidirectional attention. This allows the model to build a complete and unrestricted understanding of visual inputs while maintaining efficient text processing; see the mask sketch after the figure below.
Multimodality | Source – Link
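A toy sketch of that mixed attention pattern: the mask starts out causal, then opens up bidirectionally among image tokens (a single image is assumed, for simplicity). Real implementations build this inside the attention layers; this only illustrates the pattern.

import numpy as np

def build_mask(token_types):
    # token_types: list of "text" or "img"; True = attention allowed
    n = len(token_types)
    mask = np.tril(np.ones((n, n), dtype=bool))   # causal baseline for all tokens
    img_idx = [i for i, t in enumerate(token_types) if t == "img"]
    for q in img_idx:
        for k in img_idx:
            mask[q, k] = True                     # image tokens see each other both ways
    return mask

print(build_mask(["text", "img", "img", "text"]).astype(int))
# [[1 0 0 0]
#  [1 1 1 0]
#  [1 1 1 0]
#  [1 1 1 1]]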

Multilinguality

  • Expanded Data and Tokenizer Improvements: Gemma 3’s training dataset now includes double the amount of multilingual content compared to Gemma 2. The same SentencePiece tokenizer (with 262K entries) is used, but it now encodes Chinese, Japanese, and Korean with improved fidelity, empowering the larger variants to support over 140 languages; a quick tokenizer check follows the figure below.
Multilinguality | Source – Link
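A quick way to poke at the tokenizer is to load it and count tokens for a few languages. The snippet below uses the community checkpoint referenced later in this article; exact token counts will vary by input.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("unsloth/gemma-3-4b-it")
print(tok.vocab_size)   # expect a vocabulary on the order of 262K entries

for text in ["Hello, world!", "こんにちは世界", "안녕하세요 세계", "你好，世界"]:
    ids = tok(text, add_special_tokens=False).input_ids
    print(f"{text!r} -> {len(ids)} tokens")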

Architectural Enhancements: What’s New in Gemma 3

Gemma 3 comes with significant architectural updates that address key challenges, especially when handling long contexts and multimodal inputs. Here’s what’s new:

  • Optimized Attention Mechanism: To support an extended context length of 128K tokens (with the 1B model at 32K tokens), Gemma 3 re-engineers its transformer architecture. By increasing the ratio of local to global attention layers to 5:1, the design ensures that only the global layers handle long-range dependencies while local layers operate over a shorter span (1024 tokens). This change drastically reduces the KV-cache memory overhead during inference—from a 60% increase in “global only” configurations to less than 15% with the new design.
  • Enhanced Positional Encoding: Gemma 3 upgrades the RoPE (Rotary Positional Embedding) for global self-attention layers by increasing the base frequency from 10K to 1M while keeping it at 10K for local layers. This adjustment enables better scaling for long-context scenarios without compromising performance; a frequency sketch follows the figure below.
RoPE | Source – Link
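The effect of the larger base is easy to visualize: raising it stretches the longest rotary wavelengths, so distant positions remain distinguishable. The sketch below assumes a 128-dimensional head purely for illustration.

import numpy as np

def longest_rope_wavelength(base, head_dim=128):
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    return (2 * np.pi / inv_freq).max()   # wavelength of the slowest rotary dimension

for base in (10_000, 1_000_000):
    print(f"base={base:>9,}: longest wavelength ≈ {longest_rope_wavelength(base):,.0f} positions")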
  • Improved Norm Techniques: Moving beyond the soft-capping method used in Gemma 2, the new architecture incorporates QK-norm to stabilize the attention scores and combines Grouped-Query Attention (GQA) with both post-norm and pre-norm RMSNorm to ensure consistency and efficiency during training (a minimal QK-norm sketch follows this list).
    • QK-Norm for Attention Scores: Stabilizes the model’s attention weights, reducing inconsistencies seen in prior iterations.
    • Grouped-Query Attention (GQA): Enhances training efficiency and output reliability.
  • Vision Modality Integration: Gemma 3 expands into the multimodal arena by incorporating a vision encoder based on SigLIP. This encoder processes images as sequences of soft tokens, while a Pan & Scan (P&S) method optimizes image input by adaptively cropping and resizing non-standard aspect ratios, ensuring that the visual details remain intact.
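As promised above, here is a minimal QK-norm sketch: RMS-normalize queries and keys per head before the dot product so the attention logits stay in a stable range. Shapes are illustrative, and the learned scale parameters of a real RMSNorm are omitted.

import torch

def rms_norm(x, eps=1e-6):
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def qk_norm_scores(q, k, head_dim):
    q, k = rms_norm(q), rms_norm(k)           # QK-norm in place of soft-capping
    return (q @ k.transpose(-2, -1)) / head_dim**0.5

q = torch.randn(1, 8, 16, 128)   # (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 16, 128)
print(qk_norm_scores(q, k, 128).shape)   # torch.Size([1, 8, 16, 16])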

These architectural changes not only boost performance but also significantly enhance efficiency, enabling Gemma 3 to handle longer contexts and integrate image data seamlessly, all while reducing memory overhead.

Benchmarking Success

Recent performance comparisons on the Chatbot Arena have positioned Gemma 3 27B IT among the top contenders. On the leaderboard, Gemma 3 27B IT stands out with a score of 1338, competing closely with, and in some cases outperforming, other leading models. For example:

  • Early Grok-3 registers an overall score of 1402, but Gemma 3’s performance in challenging categories such as Instruction Following and Multi-Turn interactions remains remarkably robust.
  • Gemini-2.0 Flash Thinking and Gemini-2.0 Pro variants post scores in the 1380–1400 range, while Gemma 3 offers balanced performance across multiple testing dimensions.
  • ChatGPT-4o and DeepSeek R1 have competitive scores, but Gemma 3 excels in maintaining consistency even with a smaller model size, showcasing its efficiency and versatility.

For a deeper dive into the performance metrics, including ranks and arena scores across various test scenarios, check out the Chatbot Arena Leaderboard on Hugging Face.

Performance Metrics Breakdown

In addition to its impressive overall Elo score, Gemma 3-27B-IT excels in various subcategories of the Chatbot Arena. The bar chart below illustrates how the model performs on metrics such as Hard Prompts, Math, Coding, Creative Writing, and more. Notably, Gemma 3-27B-IT showcases strong performance in Creative Writing (1348) and Multi-Turn dialogues (1336), reflecting its ability to maintain coherent, context-rich conversations.

Performance metrics for Gemma 3-27B-IT across Chatbot Arena categories

Gemma 3 27B-IT is not only a top contender in head-to-head Chatbot Arena evaluations but also shines in creative writing tasks on other comparison leaderboards. According to the latest EQ-Bench result for creative writing, Gemma 3 27B-IT currently holds 2nd place on the leaderboard. Although the evaluation was based on only one iteration (owing to slow performance on OpenRouter), the early results are highly encouraging. The EQ-Bench team plans to benchmark the 12B variant soon, and early expectations suggest promising performance across other creative domains.

Creative writing benchmark | Source – Link

LMSYS Elo Scores vs. Parameter Size

LMSYS Elo scores vs. parameter size | Source – Link

In the chart above, each point represents a model’s parameter count (x-axis) and its corresponding Elo score (y-axis). Notice how Gemma 3-27B IT hits a “Pareto Sweet Spot,” offering high Elo performance with a relatively smaller model size compared to others like Qwen 2.5-72B, DeepSeek R1, and DeepSeek V3.

Beyond these head-to-head matchups, Gemma 3 also excels across a variety of standardized benchmarks. The table below compares the performance of Gemma 3 to earlier Gemma versions and Gemini models on tasks such as MMLU-Pro, LiveCodeBench, Bird-SQL, and more.

Performance Across Multiple Benchmarks

Performance across multiple benchmarks | Source – Link

In this table, you can see how Gemma 3 stands out on tasks like MATH and FACTS Grounding while showing competitive results on Bird-SQL and GPQA Diamond. Although SimpleQA scores may appear modest, Gemma 3’s overall performance highlights its balanced approach to language understanding, code generation, and factual grounding.

Gemma 2 vs. Gemma 3 benchmark comparison | Source – Link

These visuals underscore Gemma 3’s ability to balance performance and efficiency, particularly the 27B variant, which provides state-of-the-art capabilities without the massive computational requirements of some competing models.

Also read: Gemma 3 vs DeepSeek-R1: Is Google’s New 27B Model a Tough Competition to the 671B Giant?

A Responsible Approach to AI Development

With greater AI capabilities comes the responsibility to ensure safe and ethical deployment. Gemma 3 has undergone rigorous testing to maintain Google’s high safety standards:

  • Comprehensive risk assessments tailored to model capability.
  • Fine-tuning and benchmark evaluations aligned with Google’s safety policies.
  • Specific evaluations on STEM-related content to assess risks associated with misuse in potentially harmful applications.

Google aims to set a new industry standard for open models.

Rigorous Safety Protocols

Innovation goes hand in hand with responsibility. Gemma 3’s development was guided by rigorous safety protocols, including extensive data governance, fine-tuning, and robust benchmark evaluations. Special evaluations focusing on its STEM capabilities confirm a low risk of misuse. Additionally, Google launched ShieldGemma 2, a 4B-parameter image safety checker built on the Gemma 3 foundation, which categorizes and mitigates potentially unsafe image content.

Seamless Integration with Your Favorite Tools

Gemma 3 is engineered to fit effortlessly into your existing workflows:

  • Developer-Friendly Ecosystem: Support for tools like Hugging Face Transformers, Ollama, JAX, Keras, PyTorch, and more means you can experiment and integrate with ease.
  • Optimized for Multiple Platforms: Whether you’re working with NVIDIA GPUs, Google Cloud TPUs, AMD GPUs via the ROCm stack, or local environments, Gemma 3’s performance is maximized.
  • Flexible Deployment Options: With options ranging from Vertex AI and Cloud Run to the Google GenAI API and local setups, deploying Gemma 3 is both flexible and straightforward.

Exploring the Gemmaverse

Beyond the model itself lies the Gemmaverse, a thriving ecosystem of community-created models and tools that continue to push the boundaries of AI innovation. From AI Singapore’s SEA-LION v3 breaking down language barriers to INSAIT’s BgGPT supporting diverse languages, the Gemmaverse is a testament to collaborative progress. Moreover, the Gemma 3 Academic Program offers researchers Google Cloud credits to fuel further breakthroughs.

Get Started with Gemma 3

Ready to explore the full potential of Gemma 3? Here’s how you can dive in:

  • Instant Exploration:
    Try Gemma 3 at full precision directly in your browser via Google AI Studio, no setup required.
  • API Access:
    Get an API key from Google AI Studio and integrate Gemma 3 into your applications using the Google GenAI SDK; a minimal sketch follows this list.
  • Download and Customize:
    Access the models through platforms like Hugging Face, Ollama, or Kaggle and fine-tune them to suit your project needs.
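For the API route, a minimal sketch with the google-genai Python SDK (pip install google-genai) looks like the following. The model identifier here is an assumption; check Google AI Studio for the exact names available to your key.

from google import genai

client = genai.Client(api_key="YOUR_API_KEY")   # key from Google AI Studio

response = client.models.generate_content(
    model="gemma-3-27b-it",   # assumed identifier -- verify in AI Studio
    contents="Summarize the key features of Gemma 3 in three bullet points.",
)
print(response.text)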

Gemma 3 marks a significant milestone in our journey to democratize high-quality AI. Its blend of performance, efficiency, and safety is set to inspire a new wave of innovation. Whether you’re an experienced developer or just starting your AI journey, Gemma 3 offers the tools you need to build the future of intelligent applications.

How to Run Gemma 3 Locally with Ollama?

Leverage the power of Gemma 3 right from your local machine using Ollama. Follow these steps:

  1. Install Ollama:
    Download and install Ollama from the official website. This lightweight framework allows you to run AI models locally with ease.
  2. Pull the Gemma 3 Model:
    Once Ollama is installed, use the command-line interface to pull the desired Gemma 3 variant. For example: ollama pull gemma3:4b
  3. Run the Model:
    Start the model locally by executing: ollama run gemma3:4b
    You can then interact with Gemma 3 directly from your terminal or through any local interface provided by Ollama (see the API sketch below).
  4. Customize & Experiment:
    Adjust settings or integrate with your preferred tools for a seamless local deployment experience.
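Beyond the terminal, Ollama also exposes a local REST API (on port 11434 by default), so you can script against the model. A minimal sketch, assuming Ollama is running and the model has been pulled as above:

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:4b",
        "prompt": "Explain KV caching in two sentences.",
        "stream": False,   # return a single JSON object instead of a token stream
    },
    timeout=120,
)
print(resp.json()["response"])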

How to Run Gemma 3 on Your System or via Colab with Hugging Face?

For those who prefer a more flexible setup or want to take advantage of GPU acceleration, you can run Gemma 3 on your system or use Google Colab with Hugging Face’s support:

1. Set Up Your Environment

  • Local System: Ensure you have Python installed along with necessary libraries.
  • Google Colab: Open a new notebook and enable GPU acceleration from the runtime settings.

2. Install Dependencies

Use pip to install a Gemma 3-ready build of the Hugging Face Transformers library (the dedicated Gemma 3 preview tag is shown below; stable releases from v4.50.0 onward also include Gemma 3 support):

!pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

3. Load Gemma 3 from Hugging Face

In your script or Colab notebook, load the model and tokenizer with the following code snippet:

import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from IPython.display import Markdown, display

# load the processor (tokenizer + image preprocessing) and the model
processor = AutoProcessor.from_pretrained("unsloth/gemma-3-4b-it")
model = Gemma3ForConditionalGeneration.from_pretrained(
    "unsloth/gemma-3-4b-it",
    device_map="auto",           # place layers on GPU/CPU automatically
    torch_dtype=torch.bfloat16,  # half-precision weights to save memory
)

4. Run and Experiment

With the model loaded, start generating text or processing images. You can fine-tune parameters, integrate with your applications, or experiment with different input modalities.

Input
# download img
!curl "https://vitapet.com/media/emhk5nz5/cat-playing-vs-fighting-1240x640.jpg" -o cats.jpg

# prompt LLM and get response
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "./cats.jpg"},
            {"type": "text", "text": """Extract the key details in this images, also guess what might be the reason for this action?"""}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

input_len = inputs["input_ids"].shape[-1]
generation = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
display(Markdown(decoded))

Output

Here's a breakdown of the key details in the image and a guess at the reason for the action:

Key Details:

Two Kittens: The image features two young kittens.
Orange Kitten: One kitten is mid-air, leaping dramatically with its paws outstretched. It's a warm orange color with tabby markings.
Brown Kitten: The other kitten is on the ground, moving quickly and looking slightly startled. It has a brown and white tabby pattern.
White Background: The kittens are set against a plain white background, which isolates them and makes them the focus.
Action: The orange kitten is in the middle of a jump, seemingly reacting to the movement of the brown kitten.
Possible Reason for the Action:

It's highly likely that these kittens are engaged in playful wrestling or chasing. Kittens, especially young ones, often engage in this type of behavior as a way to:

Exercise: It's a great way for them to burn energy.
Socialize: They're learning about boundaries and play interactions.
Bond: Play is a key part of kitten bonding.
Explore: They're investigating each other and their environment.
It's a common and adorable kitten behavior!

Would you like me to describe any specific aspect of the image in more detail?

Example 2

Input
# download img
!curl "https://static.standard.co.uk/2025/03/08/17/40/Screenshot-(34).png" -o sidemen.png

# prompt LLM and get response
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "./sidemen.png"},
            {"type": "text", "text": """What is going on in this image?"""}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

input_len = inputs["input_ids"].shape[-1]
generation = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
display(Markdown(decoded))

Output

Here's a breakdown of what's happening in the image:

The Scene:

The image captures a moment of intense celebration. A group of men, all wearing red shirts with "FASTABLES" printed on them, are holding a large trophy aloft. They are surrounded by a shower of golden confetti.

Key Details:

The Trophy: The trophy is the focal point, suggesting a significant victory.
Celebration: The players are shouting, jumping, and clearly overjoyed. Their expressions show immense excitement and pride.
Confetti: The confetti indicates a momentous occasion and a celebratory atmosphere.
Background: In the blurred background, you can see other people (likely spectators) and what appears to be event staff.
Text: There's a small text overlay at the bottom: "TO DONATE PLEASE VISIT WWW.SIDEMENFC.COM". This suggests the team is associated with a charity or non-profit organization.
Likely Context:

Based on the team's shirts and the celebratory atmosphere, this image likely depicts a soccer (football) team winning a championship or major tournament.

Team:

The team is SideMen FC.

Do you want me to elaborate on any specific aspect of the image, such as the team's history or the significance of the trophy?

5. Utilize Hugging Face Resources:

Benefit from the vast Hugging Face community, documentation, and example notebooks to further customize and optimize your use of Gemma 3.

Here’s the full code in the Notebook: Gemma-Code

Optimizing Inference for Gemma 3

When using Gemma 3-27B-IT, it’s essential to configure the right sampling parameters to get the best results. According to insights from the Gemma team, optimal settings include:

  • Temperature: 1.0
  • Top-k: 64
  • Top-p: 0.95

Additionally, be cautious of double BOS (Beginning of Sequence) tokens, which can accidentally degrade output quality. For more detailed explanations and community discussions, check out this helpful post by danielhanchen on Reddit.

By fine-tuning these parameters and handling tokenization carefully, you can unlock Gemma 3’s full potential across a variety of tasks — from creative writing to complex coding challenges.
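Putting both tips together, here is a sketch that reuses the processor and model loaded earlier in this article: it applies the recommended sampling settings and asserts that the prompt does not begin with two BOS tokens (apply_chat_template already prepends one, so tokenizing the templated text a second time with special tokens is the usual culprit).

# build inputs as before
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

# guard against an accidental double BOS at the start of the prompt
ids = inputs["input_ids"][0]
bos = processor.tokenizer.bos_token_id
assert not (ids[0] == bos and ids[1] == bos), "double BOS detected"

# recommended sampling settings for Gemma 3-27B-IT
generation = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,     # sampling must be enabled for the settings below to apply
    temperature=1.0,
    top_k=64,
    top_p=0.95,
)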

Some important links:

  1. GGUFs – Optimized GGUF model files for Gemma 3.
  2. Transformers – Official Hugging Face Transformers integration.
  3. MLX – Native support for Apple MLX (coming soon).
  4. Blogpost – Overview and insights into Gemma 3.
  5. Transformers Release – Latest updates in the Transformers library.
  6. Tech Report – In-depth technical details on Gemma 3.

Notes on the Release

Evals:

  • MMLU-Pro: Gemma 3-27B-IT scores 67.5, close to Gemini 1.5 Pro’s 75.8.
  • Chatbot Arena: Gemma 3-27B-IT achieves an Elo score of 1338, outperforming larger models like LLaMA 3 405B (1257) and Qwen2.5-70B (1257).
  • Comparative Performance: Gemma 3-4B-IT is competitive with Gemma 2-27B-IT.

Multimodal:

  • Vision Understanding: Utilizes a tailored SigLIP vision encoder that processes images as sequences of soft tokens.
  • Pan & Scan (P&S): Implements an adaptive windowing algorithm to segment non-square images into 896×896 crops, enhancing performance on high-resolution images.

Long Context:

  • Extended Token Support: Models support up to 128K tokens (with the 1B variant supporting 32K).
  • Optimized Attention: Employs a 5:1 ratio of local to global attention layers to mitigate KV-cache memory explosion.
  • Attention Span: Local layers handle a 1024-token span, while global layers manage the extended context.

Memory Efficiency:

  • Reduced Overhead: The 5:1 attention ratio reduces KV-cache memory overhead from 60% (global-only) to less than 15%.
  • Quantization: Uses Quantization Aware Training (QAT) to offer models in int4, int4 (per-block), and switched fp8 formats, significantly lowering the memory footprint.

Training and Distillation:

  • Extensive Pre-training: The 27B model is pre-trained on 14T tokens, with an expanded multilingual dataset.
  • Knowledge Distillation: Employs a strategy with 256 logits per token, weighted by teacher probabilities (a rough sketch follows this list).
  • Enhanced Post-training: Focuses on improving math, reasoning, and multilingual abilities, outperforming Gemma 2.
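As noted above, here is a rough sketch of what per-token distillation over a 256-logit subset could look like: the student is pushed toward the teacher’s renormalized probabilities on the teacher’s top-256 vocabulary entries. This illustrates the general technique only; the paper’s exact sampling and weighting scheme is not reproduced here.

import torch
import torch.nn.functional as F

def sampled_distill_loss(student_logits, teacher_logits, k=256):
    top_p, top_idx = teacher_logits.softmax(-1).topk(k, dim=-1)  # teacher's top-k probabilities
    top_p = top_p / top_p.sum(-1, keepdim=True)                  # renormalize over the subset
    student_logp = F.log_softmax(student_logits, -1).gather(-1, top_idx)
    return -(top_p * student_logp).sum(-1).mean()                # teacher-weighted cross-entropy

student = torch.randn(4, 262_144)   # (tokens, vocab) -- illustrative shapes
teacher = torch.randn(4, 262_144)
print(sampled_distill_loss(student, teacher))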

Vision Encoder Performance:

  • Higher Resolution Advantage: Encoders operating at 896×896 outperform those at lower resolutions (e.g., 256×256) on tasks like DocVQA (59.8 vs. 31.9).
  • Boosted Performance: Pan & Scan improves text recognition tasks (e.g., a +8.2 point improvement on DocVQA for the 4B model).

Long Context Scaling:

  • Efficient Scaling: Models are pre-trained on 32K sequences and then scaled to 128K tokens using RoPE rescaling with a factor of 8.
  • Context Limit: While performance drops rapidly beyond 128K tokens, the models generalize exceptionally well within this range.

Conclusion

Gemma 3 represents a major leap in open AI technology, pushing the boundaries of what is possible in a lightweight, accessible model. By integrating techniques like enhanced multimodal processing with a tailored SigLIP vision encoder, extended context lengths up to 128K tokens, and a 5:1 local-to-global attention ratio, Gemma 3 not only achieves state-of-the-art performance for its size but also dramatically improves memory efficiency. Its advanced training and distillation approaches have narrowed the performance gap with larger, closed-source models, making high-quality AI accessible to developers and researchers alike. This release sets a new benchmark in the democratization of AI, empowering users with a versatile and efficient tool for diverse applications.

