What is the Best Way to Use Gemma LLM?

Ajay Last Updated : 08 Aug, 2024

11 min read

Introduction

Large language models (LLMs) are increasingly becoming powerful tools for understanding and generating human language. These models have achieved state-of-the-art results on different natural language processing tasks, including text summarization, machine translation, question answering, and dialogue generation. LLMs have even shown promise in more specialized domains, like healthcare, finance, and law.

Google has been at the forefront of LLM research and development, releasing a series of open models that have pushed the boundaries of what is possible with this technology. These models include BERT, T5, and T5X, which have been widely adopted by researchers and practitioners alike. In this Guide, we introduce Gemma, a new family of open LLMs developed by Google.

In this article, we will look at how to use Gemma, a useful tool that includes Gemma LLM and Gemma NLP. By using the Gemma model, developers can easily add these features to their projects. Adjusting the Gemma model llm helps improve its performance for different tasks, making it a great resource for many applications.

Learning Objectives

Understand Gemma’s architecture and key features.
Explore Gemma’s training process and techniques.
Evaluate Gemma’s performance across NLP benchmarks.
Learn to use Gemma for inference tasks.
Recognize the importance of responsible deployment for Gemma.

This article was published as a part of the Data Science Blogathon.

What is Gemma?
Gemma – Model Architecture
How was Gemma Trained?
Benchmarks and Performance Metrics
Getting Started with Gemma
Frequently Asked Questions

What is Gemma?

Gemma is a family of open language models based on Google’s Gemini models, trained on up to 6T tokens of text. These are considered to be the lighter versions of Gemini models. The Gemma family consists of two sizes: a 7 billion parameter model for efficient deployment on GPU and TPU, and a 2 billion parameter model for CPU and on-device applications. Gemma exhibits strong generalist capabilities in text domains and state-of-the-art understanding and reasoning skills at scale. It achieves better performance compared to other open models of similar or larger scales across different domains, including question answering, commonsense reasoning, mathematics and science, and coding. For both the models, the pre-trained, finetune checkpoints and open-source codebase for inference and serving are released by the Google Team.

Gemma builds upon recent advancements in sequence models, transformers, deep learning, and large-scale training in a distributed manner. It continues Google’s history of releasing open models and ecosystems, following Word2Vec, Transformer, BERT, T5, and T5X. The responsible release of Gemma aims to improve the safety of frontier models, provide equitable access to this technology, give the path to rigorous evaluation and analysis of current techniques, and foster the development of future innovations. However, thorough safety testing specific to each Use Case is crucial before deploying or using Gemma.

Gemma – Model Architecture

Gemma follows the architecture of a decoder-only transformer that was introduced way back in 2017. Both the Gamma 2B and the 7B models have a vocabulary size of 256k. Both models even have a context length of 8192 tokens. The Gemma even includes the recent advancements made in the transformers’ architecture including:

Multi-Query Attention: The 7B model uses multi-head attention, while the 2B model implements multi-query attention (with num_kv_heads=1). This choice is based on performance improvements that were shown at each scale through ablation studies.
RoPE Embeddings: Instead of absolute positional embeddings, both models employ rotary positional embeddings in each layer. Additionally, embedding sharing across inputs and outputs minimizes model size.
GeGLU Activations: The regular ReLU activation function is replaced by the GeGLU activation function, giving good performance.
Normalizer Location: Gemma deviates from the goto practice by normalizing both the input and output of each transformer sub-layer, using RMSNorm for the normalization method.

How was Gemma Trained?

Gemma 2B and 7B models were trained on 2T and 6T tokens, respectively, of primarily-English data sourced from Web Docs, mathematics, and code. Unlike Gemini models, which include multimodal elements and are optimized for multilingual tasks, Gemma models focus is on processing English text. The training data underwent a careful filtering process to remove Unwanted or Unsafe Content, including personal information and sensitive data. This filtering involved both heuristic methods and model-based classifiers to ensure the quality and safety of the dataset.

Gemma 2B and 7B models underwent supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to further refine their performance. The supervised fine-tuning involved a mix of text-only, English-only synthetic, and human-generated prompt-response pairs. Data mixtures for fine-tuning were carefully selected based on LM-based side-by-side evaluations, with different Prompt sets designed to highlight specific capabilities like the instruction following, factuality, creativity, and safety.

Even, synthetic data underwent several stages of filtering to remove examples containing personal information or toxic outputs, following the approach established by Gemini for improving model performance without compromising safety. Finally, reinforcement learning from human feedback involved collecting pairs of preferences from human raters and training a reward function under the Bradley-Terry model. This function was then optimized using a type of REINFORCE to further refine the models’ performance and mitigate potential issues like reward hacking.

Also Watch this Video of Google Gemma Tutorial and How to use:

Benchmarks and Performance Metrics

Looking at the results, Gemma outperforms Mistral on five out of six benchmarks, with the sole exception being HellaSwag, where they get similar accuracy. This dominance is clearly evident in tasks like ARC-c and TruthfulQA, where Gemma surpasses Mistral by nearly 2% and 2.5% in accuracy and F1 score, respectively. Even on MMLU, where Perplexity scores are lower is better, Gemma achieves a prominently lower Perplexity, indicating a better grip of language patterns. These results solidify Gemma’s position in being a powerful language model, capable of handling complex NLP tasks with good accuracy and efficiency.

Getting Started with Gemma

In this section, we will get started with Gemma. We will be working with Google Colab because it comes with a free GPU. Before we get started, we need to accept Google’s Terms and Conditions to download the model.

Step 1: Opening Gemma

Click on this link to go to Gemma on HuggingFace. You will be presented with something like the below:

Step 2: Click on Acknowledge License

If you click on Acknowledge License , then you will see a page as below.

Click on Authorize. Done we are now ready to download the model. Before, let’s generate a new HuggingFace Token. For this, you can go to the HuggingFace Settings and Generate a new Token, this token will be useful because we need it to authorize inside Google Colab to download the Google Gemma Large Language Model.

Step 3: Installing Libraries

To get started, we first need to install the following libraries.

!pip install -U accelerate bitsandbytes transformers huggingface_hub

accelerate: Allows distributed training and mixed-precision training for faster and more efficient model training. The accelerate library even helps for faster inference of the Large Language Models.
bitsandbytes: Allows quantization of model weights to 4-bit or 8-bit precision, reducing memory footprint and computation requirements. Because we are dealing with a 7Billion Parameter model, which requires around 30-40GB of GPU VRAM, we need to quantize it to fit in the Colab GPU.
transformers: Provide pre-trained language models, tokenizers, and training tools for natural language processing tasks. We work with this library to download the Gemma model and start inferring it.
huggingface_hub: Facilitates access to the Hugging Face Hub, a platform for sharing and seeing language models and datasets. We need this library to login to huggingface so that we can verify that we are authorized to download the Google Gemma Large Language Model

The -U option after the install indicates that we are fetching the latest updated versions of all the libraries.

Step 4: Typing Important Command

Now, type the below command

!huggingface-cli login

The above command will ask you to provide the HuggingFace Token, which we can get from the HuggingFace website. Give this token and press the enter button and you will receive a Login Successful message. Now let’s move on to coding

# Import necessary classes for model loading and quantization
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# Configure model quantization to 4-bit for memory and computation efficiency
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

# Load the tokenizer for the Gemma 7B Italian model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")

# Load the Gemma 7B Italian model itself, with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it",
                                             quantization_config=quantization_config)

AutoTokenizer: This class dynamically loads the pre-trained tokenizer associated with the given model, ensuring compatibility and avoiding manual config.
AutoModelForCausalLM: Similar to the tokenizer, this class automatically loads the pre-trained Causal Language Model architecture based on the provided model identifier.
quantization_config = BitsAndBytesConfig(load_in_4bit=True): This line creates a config object for quantization, telling that the model’s weights should be pushed in 4-bit precision instead of the original 32-bit. This to a great extent reduces memory consumption and potentially speeds up computations, making the model more efficient for resource-constrained environments.
tokenizer = AutoTokenizer.from_pretrained(“google/gemma-7b-it”): This line loads the pre-trained tokenizer specifically designed for the “google/gemma-7b-it” model. This tokenizer knows how to break down text into separate Tokens that the model can understand and process.
model = AutoModelForCausalLM.from_pretrained(“google/gemma-7b-it”, quantization_config=quantization_config): This line loads the actual “google/gemma-7b-it” model, but with the crucial addition of the quantization_config object. This ensures that the model weights are created in the 4-bit format that we have discussed earlier, adding the benefits of quantization.

Our Gemma Large Language Model is downloaded, converted into a 4-bit quantized model, and loaded into the GPU.

Step 5: Inferencing the model

Now let’s try inferencing the model.

# Define input text:
input_text = "List the key points about Responsible AI"

# Tokenize the input text:
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

# Generate text using the model:
outputs = model.generate(
    **input_ids,  # Pass tokenized input as keyword argument
    max_length=512,  # Limit output length to 512 tokens
)

# Decode the generated text:
print(tokenizer.decode(outputs[0]))

Define Input Text: The code starts by assigning the Prompt “List the key aspects of Responsible AI” to the input_text variable.
Tokenize Input: The tokenizer object associated with the downloaded model is used to convert the text into numerical tokens that the model can understand. The return_tensors=”pt” line tells about the conversion to a PyTorch tensor for efficient GPU processing. The resulting tensor of token IDs is then moved to the GPU using to(“cuda”) if available.
Generate Text: The model.generate function is called with the tokenized input (input_ids) and a maximum output length of 512 tokens. This instructs the model to generate text based on the provided Prompt, respecting the given length limit.
Decode and Convert: The generated text, represented in the format of a sequence of token IDs, is decoded back into human-readable text using the tokenizer.decode function. Finally, the decoded text is printed out.

Step 6: Response Generation

Running the code has generated the following response

The model has generated a fair response to the query provided. It has highlighted all the key aspects that go into creating a Responsible AI. This is really a relevant and accurate answer to the question asked. Let’s the AI by asking a common sense question.

input_text = "How many eggs can a Whale lay in its lifetime?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids,max_length=512)
print(tokenizer.decode(outputs[0]))

input_text = "How many smartphones can a human eat ?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids,max_length=512)
print(tokenizer.decode(outputs[0]))

So far, so good. The model possess good common sense abilities. It is able to identify what’s wrong in the sentence and output the same, which is seen in the pics above. Let’s try asking some math questions.

input_text = "I have 3 apples and 2 oranges. I ate 2 organes. How many apples do I have?"
input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
outputs = model.generate(**input_ids,max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

Seems like the model struggled to answer this simple tricky math question. Let’s try do some Prompt Engineering here. Let’s add additional info in the Prompt and run it like the below:

input_text = "I have 3 apples and 2 oranges. \
I ate 2 oranges. How many apples do I have? \
Think Step by Step. For each step, re-evaluate your answer"
input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
outputs = model.generate(**input_ids,max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

Wow, a simple tweak in the Prompt and the model answered correctly. It began thinking incrementally that is step by step. And for each step, it starts re-evaluating its answer, if it’s right or wrong. And finally, it has steered to the right answer. Let’s try asking the model to write a simple Hello World program in Python.

input_text = "Write a hello world program"
input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
outputs = model.generate(**input_ids,max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

Conclusion

Gemma, Google’s latest addition to its suite of open language models, presents advancement in the field of natural language processing. With its strong generalist capabilities and state-of-the-art understanding and reasoning skills, Gemma outperforms other open models across different domains including question answering, commonsense reasoning, mathematics and science, and coding tasks. Built upon recent advancements in sequence models, transformers, and large-scale training techniques, Gemma provides improved performance and efficiency, making it a powerful tool for researchers and practitioners alike. However, responsible deployment and thorough safety testing specific to each problem are compulsory before integrating Gemma into production systems.

Hope you like this article! We will talk about how to use Gemma, a handy tool that includes Gemma LLM and Gemma NLP. With the Gemma model, developers can easily add these features to their projects. By tweaking the Gemma model llm, you can make it work better for different tasks, making it a valuable resource for many uses.

Key Takeaways

Gemma is a family of open language models developed by Google, based on the Gemini models but lighter in scale.
It comes in two sizes: a 7 billion parameter model for GPU and TPU deployment, and a 2 billion parameter model for CPU and on-device applications.
Gemma exhibits strong generalist capabilities and excels in different domains including question answering, commonsense reasoning, mathematics and science, and coding.
The model architecture includes advancements like multi-query attention, RoPE embeddings, GeGLU activations, and RMSNorm for normalization.
Training data for Gemma underwent filtering to ensure quality, and models underwent supervised fine-tuning and reinforcement learning from human feedback.
Performance benchmarks show Gemma’s superiority over other models, mainly in tasks like ARC-c and TruthfulQA.
Getting started with Gemma involves installing necessary libraries, logging into Hugging Face, and loading the model for inference.
Gemma shows impressive capabilities in generating text, answering questions, and even writing simple programming tasks.

Frequently Asked Questions

Q1. What is Gemma?

A. Gemma is a family of open language models developed by Google, providing strong generalist capabilities and state-of-the-art understanding and reasoning skills in different domains.

Q2. How does Gemma differ from old Google models like BERT and T5?

A. Gemma builds upon recent advancements in sequence models, transformers, and large-scale training, providing improved performance and efficiency compared to old models.

Q3. What training data was used for Gemma?

A. Gemma models were trained on primarily English data sourced from Web Docs mathematics, and code, with careful filtering to remove Unwanted or Unsafe Content.

Q4. How can I get started with using Gemma?

A. You can start using Gemma by installing the necessary libraries, logging into Hugging Face, and loading the model for inference in platforms like Google Colab.

Q5. What performance benchmarks have shown Gemma’s superiority?

A. Benchmarks comparing Gemma with other models, like the Mistral, across different NLP tasks showcase Gemma’s impressive capabilities, mainly in tasks like ARC-c and TruthfulQA.

Q6. Does Gemma support multilingual tasks like the Gemini models?

A. No, Gemma models are mainly trained on processing English text and do not include multimodal elements or support multilingual tasks like the Gemini models.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Ajay

I work as a Developer in the field of Data Science. I constantly spend time learning new things be it related to AI, DataSceine, and CyberSecurity. Deep learning and machine learning are two topics that I find particularly fascinating, and Python is my preferred language for programming. Cyber Security is another field that I'm touching upon recently. I have experience with large-scale data analysis, and I have a solid grasp of a variety of deep learning and machine learning approaches, including neural networks, regression models, and natural language processing. I'm eager to take on new challenges and make a meaningful contribution to the industry, so I'm constantly seeking for ways to enlarge and deepen my knowledge and skills in the subject.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Introduction to Generative AI

Introduction to Generative AI applications

No-code Generative AI app development

Code-focused Generative AI App Development

Introduction to Responsible AI

LLMS

Prompt Engineering

Finetuning LLMs

Training LLMs from Scratch

Langchain

RAG

LlamaIndex

Stable Diffusion

What is the Best Way to Use Gemma LLM?

Introduction

Learning Objectives

Table of contents

What is Gemma?

Gemma – Model Architecture

How was Gemma Trained?

Benchmarks and Performance Metrics

Getting Started with Gemma

Step 1: Opening Gemma

Step 2: Click on Acknowledge License

Step 3: Installing Libraries

Step 4: Typing Important Command

Step 5: Inferencing the model

Step 6: Response Generation

Conclusion

Key Takeaways

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp