Large language models (LLMs) are increasingly becoming powerful tools for understanding and generating human language. These models have achieved state-of-the-art results on different natural language processing tasks, including text summarization, machine translation, question answering, and dialogue generation. LLMs have even shown promise in more specialized domains, like healthcare, finance, and law.
Google has been at the forefront of LLM research and development, releasing a series of open models that have pushed the boundaries of what is possible with this technology. These models include BERT, T5, and T5X, which have been widely adopted by researchers and practitioners alike. In this Guide, we introduce Gemma, a new family of open LLMs developed by Google.
In this article, we will look at how to use Gemma, a useful tool that includes Gemma LLM and Gemma NLP. By using the Gemma model, developers can easily add these features to their projects. Adjusting the Gemma model llm helps improve its performance for different tasks, making it a great resource for many applications.
This article was published as a part of the Data Science Blogathon.
Gemma is a family of open language models based on Google’s Gemini models, trained on up to 6T tokens of text. These are considered to be the lighter versions of Gemini models. The Gemma family consists of two sizes: a 7 billion parameter model for efficient deployment on GPU and TPU, and a 2 billion parameter model for CPU and on-device applications. Gemma exhibits strong generalist capabilities in text domains and state-of-the-art understanding and reasoning skills at scale. It achieves better performance compared to other open models of similar or larger scales across different domains, including question answering, commonsense reasoning, mathematics and science, and coding. For both the models, the pre-trained, finetune checkpoints and open-source codebase for inference and serving are released by the Google Team.
Gemma builds upon recent advancements in sequence models, transformers, deep learning, and large-scale training in a distributed manner. It continues Google’s history of releasing open models and ecosystems, following Word2Vec, Transformer, BERT, T5, and T5X. The responsible release of Gemma aims to improve the safety of frontier models, provide equitable access to this technology, give the path to rigorous evaluation and analysis of current techniques, and foster the development of future innovations. However, thorough safety testing specific to each Use Case is crucial before deploying or using Gemma.
Gemma follows the architecture of a decoder-only transformer that was introduced way back in 2017. Both the Gamma 2B and the 7B models have a vocabulary size of 256k. Both models even have a context length of 8192 tokens. The Gemma even includes the recent advancements made in the transformers’ architecture including:
Gemma 2B and 7B models were trained on 2T and 6T tokens, respectively, of primarily-English data sourced from Web Docs, mathematics, and code. Unlike Gemini models, which include multimodal elements and are optimized for multilingual tasks, Gemma models focus is on processing English text. The training data underwent a careful filtering process to remove Unwanted or Unsafe Content, including personal information and sensitive data. This filtering involved both heuristic methods and model-based classifiers to ensure the quality and safety of the dataset.
Gemma 2B and 7B models underwent supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to further refine their performance. The supervised fine-tuning involved a mix of text-only, English-only synthetic, and human-generated prompt-response pairs. Data mixtures for fine-tuning were carefully selected based on LM-based side-by-side evaluations, with different Prompt sets designed to highlight specific capabilities like the instruction following, factuality, creativity, and safety.
Even, synthetic data underwent several stages of filtering to remove examples containing personal information or toxic outputs, following the approach established by Gemini for improving model performance without compromising safety. Finally, reinforcement learning from human feedback involved collecting pairs of preferences from human raters and training a reward function under the Bradley-Terry model. This function was then optimized using a type of REINFORCE to further refine the models’ performance and mitigate potential issues like reward hacking.
Also Watch this Video of Google Gemma Tutorial and How to use:
Looking at the results, Gemma outperforms Mistral on five out of six benchmarks, with the sole exception being HellaSwag, where they get similar accuracy. This dominance is clearly evident in tasks like ARC-c and TruthfulQA, where Gemma surpasses Mistral by nearly 2% and 2.5% in accuracy and F1 score, respectively. Even on MMLU, where Perplexity scores are lower is better, Gemma achieves a prominently lower Perplexity, indicating a better grip of language patterns. These results solidify Gemma’s position in being a powerful language model, capable of handling complex NLP tasks with good accuracy and efficiency.
In this section, we will get started with Gemma. We will be working with Google Colab because it comes with a free GPU. Before we get started, we need to accept Google’s Terms and Conditions to download the model.
Click on this link to go to Gemma on HuggingFace. You will be presented with something like the below:
If you click on Acknowledge License , then you will see a page as below.
Click on Authorize. Done we are now ready to download the model. Before, let’s generate a new HuggingFace Token. For this, you can go to the HuggingFace Settings and Generate a new Token, this token will be useful because we need it to authorize inside Google Colab to download the Google Gemma Large Language Model.
To get started, we first need to install the following libraries.
!pip install -U accelerate bitsandbytes transformers huggingface_hub
The -U option after the install indicates that we are fetching the latest updated versions of all the libraries.
Now, type the below command
!huggingface-cli login
The above command will ask you to provide the HuggingFace Token, which we can get from the HuggingFace website. Give this token and press the enter button and you will receive a Login Successful message. Now let’s move on to coding
# Import necessary classes for model loading and quantization
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
# Configure model quantization to 4-bit for memory and computation efficiency
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
# Load the tokenizer for the Gemma 7B Italian model
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")
# Load the Gemma 7B Italian model itself, with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained("google/gemma-7b-it",
quantization_config=quantization_config)
Our Gemma Large Language Model is downloaded, converted into a 4-bit quantized model, and loaded into the GPU.
Now let’s try inferencing the model.
# Define input text:
input_text = "List the key points about Responsible AI"
# Tokenize the input text:
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
# Generate text using the model:
outputs = model.generate(
**input_ids, # Pass tokenized input as keyword argument
max_length=512, # Limit output length to 512 tokens
)
# Decode the generated text:
print(tokenizer.decode(outputs[0]))
Running the code has generated the following response
The model has generated a fair response to the query provided. It has highlighted all the key aspects that go into creating a Responsible AI. This is really a relevant and accurate answer to the question asked. Let’s the AI by asking a common sense question.
input_text = "How many eggs can a Whale lay in its lifetime?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids,max_length=512)
print(tokenizer.decode(outputs[0]))
input_text = "How many smartphones can a human eat ?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids,max_length=512)
print(tokenizer.decode(outputs[0]))
So far, so good. The model possess good common sense abilities. It is able to identify what’s wrong in the sentence and output the same, which is seen in the pics above. Let’s try asking some math questions.
input_text = "I have 3 apples and 2 oranges. I ate 2 organes. How many apples do I have?"
input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
outputs = model.generate(**input_ids,max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
Seems like the model struggled to answer this simple tricky math question. Let’s try do some Prompt Engineering here. Let’s add additional info in the Prompt and run it like the below:
input_text = "I have 3 apples and 2 oranges. \
I ate 2 oranges. How many apples do I have? \
Think Step by Step. For each step, re-evaluate your answer"
input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
outputs = model.generate(**input_ids,max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
Wow, a simple tweak in the Prompt and the model answered correctly. It began thinking incrementally that is step by step. And for each step, it starts re-evaluating its answer, if it’s right or wrong. And finally, it has steered to the right answer. Let’s try asking the model to write a simple Hello World program in Python.
input_text = "Write a hello world program"
input_ids = tokenizer(input_text, return_tensors="pt").to('cuda')
outputs = model.generate(**input_ids,max_new_tokens=512)
print(tokenizer.decode(outputs[0]))
Gemma, Google’s latest addition to its suite of open language models, presents advancement in the field of natural language processing. With its strong generalist capabilities and state-of-the-art understanding and reasoning skills, Gemma outperforms other open models across different domains including question answering, commonsense reasoning, mathematics and science, and coding tasks. Built upon recent advancements in sequence models, transformers, and large-scale training techniques, Gemma provides improved performance and efficiency, making it a powerful tool for researchers and practitioners alike. However, responsible deployment and thorough safety testing specific to each problem are compulsory before integrating Gemma into production systems.
Hope you like this article! We will talk about how to use Gemma, a handy tool that includes Gemma LLM and Gemma NLP. With the Gemma model, developers can easily add these features to their projects. By tweaking the Gemma model llm, you can make it work better for different tasks, making it a valuable resource for many uses.
A. Gemma is a family of open language models developed by Google, providing strong generalist capabilities and state-of-the-art understanding and reasoning skills in different domains.
A. Gemma builds upon recent advancements in sequence models, transformers, and large-scale training, providing improved performance and efficiency compared to old models.
A. Gemma models were trained on primarily English data sourced from Web Docs mathematics, and code, with careful filtering to remove Unwanted or Unsafe Content.
A. You can start using Gemma by installing the necessary libraries, logging into Hugging Face, and loading the model for inference in platforms like Google Colab.
A. Benchmarks comparing Gemma with other models, like the Mistral, across different NLP tasks showcase Gemma’s impressive capabilities, mainly in tasks like ARC-c and TruthfulQA.
A. No, Gemma models are mainly trained on processing English text and do not include multimodal elements or support multilingual tasks like the Gemini models.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.