Large Language Models (LLMs) have permeated numerous applications, supplanting smaller transformer models like BERT and rule-based systems in many Natural Language Processing (NLP) tasks. LLMs are versatile and, owing to their extensive pre-training, can handle tasks such as text classification, summarization, sentiment analysis, and topic modelling. However, despite this breadth, an off-the-shelf LLM often lags in accuracy behind a smaller model that has been fine-tuned for a specific task.
To address this limitation, one effective strategy is to fine-tune a pre-trained LLM so that it excels at a specific task, which frequently closes this accuracy gap. Notably, Google's Gemini, among other large models, now lets users fine-tune it on their own training data. In this guide, we will walk through the process of fine-tuning Gemini models for a specific problem, as well as how to curate a dataset using resources from HuggingFace.
Gemini comes in two versions: Pro and Ultra. Within the Pro line, there are Gemini 1.0 Pro and the newer Gemini 1.5 Pro. These models from Google compete with other advanced models like ChatGPT and Claude. Gemini models are accessible to everyone through the AI Studio UI and a free API.
Recently, Google announced fine-tuning support for Gemini models, which means anyone can adjust a Gemini model to suit their needs, either through the AI Studio UI or through the API. Fine-tuning means giving Gemini our own data so it behaves the way we want. Under the hood, Google uses Parameter Efficient Tuning (PET), which quickly adjusts only a small, important subset of the Gemini model's parameters, making it adaptable to different tasks.
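Google has not published the exact PET recipe it applies inside Gemini, so the snippet below is only a toy illustration of the general idea: keep the pretrained weights frozen and train a small low-rank adapter on top of them. The names here (AdapterLayer, the 768-dimensional layer, the rank of 4) are hypothetical and chosen just for the example.

# Toy sketch of the parameter-efficient idea behind PET / LoRA-style tuning.
# This is NOT Google's implementation; it only shows why so few parameters
# need to be trained when the base weights stay frozen.
import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    def __init__(self, base_linear: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base_linear
        self.base.requires_grad_(False)  # freeze the pretrained weights
        self.down = nn.Linear(base_linear.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base_linear.out_features, bias=False)

    def forward(self, x):
        # Frozen base output plus a small trainable low-rank correction
        return self.base(x) + self.up(self.down(x))

layer = AdapterLayer(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,}")  # a tiny fraction of the total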
Before we begin fine-tuning the model, we will start by installing the necessary libraries. We will be working in Colab for this guide. The following Python modules are needed to get started:
!pip install -q google-generativeai datasets
Running the above command will download and install the Google Generative AI and Datasets libraries in our Python environment.
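As a quick optional check that the installation worked, you can import both packages and print their versions (the __version__ attributes exist in current releases of both libraries, but treat that as an assumption about your installed versions):

# If both imports succeed, the libraries are available in this runtime
import google.generativeai as genai
import datasets

print(genai.__version__, datasets.__version__)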
In the next step, we need to set up OAuth for this tutorial. OAuth is necessary so that the data we send to Google for fine-tuning Gemini stays secure. To set up OAuth, follow this link, then download the client_secret.json after creating the OAuth client. Save the contents of client_secret.json in the Colab Secrets under the name CLIENT_SECRET and run the code below:
import os

if 'COLAB_RELEASE_TAG' in os.environ:
    from google.colab import userdata
    import pathlib

    # Write the OAuth client secret stored in Colab Secrets to a local file
    pathlib.Path('client_secret.json').write_text(userdata.get('CLIENT_SECRET'))

    # Use `--no-browser` in Colab
    !gcloud auth application-default login --no-browser --client-id-file client_secret.json --scopes='https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/generative-language.tuning'
else:
    !gcloud auth application-default login --client-id-file client_secret.json --scopes='https://www.googleapis.com/auth/cloud-platform,https://www.googleapis.com/auth/generative-language.tuning'
When the above cell runs, it prints a command; copy it, paste it into a terminal (CMD) on your local system, and run it. You will then be redirected to the web browser to log in with the email you set up OAuth with. After logging in, the terminal prints a URL; paste that URL back into the waiting prompt in Colab and press Enter. With that, the OAuth setup with Google is complete.
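As an optional sanity check, assuming the google-auth package is available (it is pulled in as a dependency of the client libraries), you can confirm that Application Default Credentials were created before moving on:

# Verify that Application Default Credentials are now discoverable
import google.auth

credentials, project = google.auth.default()
print("Found credentials; associated project:", project)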
First, we will download the dataset that we will use for fine-tuning the Gemini model. For this, we work with the datasets library. The code is:
from datasets import load_dataset
dataset = load_dataset("ai4privacy/pii-masking-200k")
print(dataset)
We see that the dataset contains 209,261 rows of training data and no test split. Each row contains several columns: masked_text, unmasked_text, privacy_mask, span_labels, bio_labels, and tokenised_text. A sample of the data is shown below:
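To see what one example looks like before any preprocessing, we can print a single row; the column names below come straight from the dataset itself:

# Print the two columns we care about from the first training example
sample = dataset['train'][0]
print("Unmasked:", sample['unmasked_text'])
print("Masked:  ", sample['masked_text'])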
In this sample, we can see both the masked and unmasked sentences. In the masked sentence, elements such as the person's name and the vehicle number are replaced with specific tags. To prepare the data for further processing, we now need to do some preprocessing. Below is the code for this step:
# Convert the training split to a pandas DataFrame and keep the first 2,000 rows
df = dataset['train'].to_pandas()
df = df[['unmasked_text', 'masked_text']][:2000]
# The unmasked text is the model input; the masked text is the target output
df.columns = ['input', 'output']
The next step is to format our data. To do this, we will be creating a formatter function:
def formatter(x):
    text = f"""\
Given the information below, mask the personal identifiable information.
Input:
{x['input']}
Output:
"""
    return text
df['text_input'] = df.apply(formatter,axis=1)
print(df['text_input'][0])
Running the code creates a new column called text_input that contains the formatted prompt for each row, including the input field. Let's look at one element of the dataframe:
We can see that text_input contains, for each row, an instruction at the start telling the model to mask the PII, followed by the input data and the word Output, after which the model must generate the masked text. Next, we divide the dataframe into train and test sets:
# Keep only the prompt and target columns, then split:
# first 1,900 rows for training, the remaining 100 for testing
df = df[['text_input', 'output']]
df_train = df.iloc[:1900, :]
df_test = df.iloc[1900:, :]
Running this code splits our data into train and test sets. With that, the data pre-processing is complete.
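A quick sanity check on the split sizes (we kept 2,000 rows in total, so we expect 1,900 training rows and 100 test rows):

# Confirm the sizes of the two splits
print(df_train.shape, df_test.shape)  # expected: (1900, 2) (100, 2)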
Follow the steps mentioned below to fine-tune your Gemini Model:
In this section, we will go through the process of Tuning the Gemini Model. For this, we will work with the following code:
import google.generativeai as genai

bm_name = "models/gemini-1.0-pro-001"   # base Gemini model to tune
name = 'pii-model'                      # id for our tuned model

operation = genai.create_tuned_model(
    source_model=bm_name,
    training_data=df_train,   # DataFrame with text_input and output columns
    id=name,
    epoch_count=2,
    batch_size=4,
    learning_rate=0.001,
)
We are done setting up the parameters. Running this code creates the tuning job and returns an operation object; the training of the Gemini LLM starts in the background. We can fetch the model being tuned with the following code:
model = genai.get_tuned_model(f'tunedModels/{name}')
print(model)
Here, we use the .get_tuned_model() function from the genai library, passing the name of the model we defined, to retrieve the model while it is being tuned. Then, we print the model, as shown in the image below:
The model is of type TunedModel, and in the printout we can see the different parameters that we defined for it.
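Optionally, we can also list the tuned models on the account to confirm that the new tuning job shows up. The list_tuned_models() helper is part of the google.generativeai package; the name and state fields below are what it exposed at the time of writing, so treat them as an assumption about your installed version:

# List all tuned models under this account and their current state
for m in genai.list_tuned_models():
    print(m.name, m.state)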
We can even get the state and the metadata of the tuned model through the following code:
print(operation.metadata)
It displays the total number of steps, 950, which is exactly what we expect: we have 1,900 rows of training data, each step consumes a batch of 4 rows, so one epoch takes 1900/4 = 475 steps, and 2 epochs means 2 * 475 = 950 steps.
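The same arithmetic in code, just as a sanity check:

rows, batch_size, epochs = 1900, 4, 2
steps_per_epoch = rows // batch_size   # 475 steps per epoch
print(steps_per_epoch * epochs)        # 950 total steps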
The code below displays a status bar showing what percentage of the training has finished and an estimate of how long the remaining training will take:
import time
for status in operation.wait_bar():
    time.sleep(30)
The above code creates a progress bar; once it completes, our tuning process has ended.
The operation object also contains snapshots of the training, including evaluation metrics such as the mean loss per epoch. We can visualize these with the following code:
import pandas as pd
import seaborn as sns
model = operation.result()
snapshots = pd.DataFrame(model.tuning_task.snapshots)
sns.lineplot(data=snapshots, x = 'epoch', y='mean_loss')
Running the code will result in the following graph:
In this graph, we can see that the loss dropped from about 3 to below 0.5 in just 2 epochs of training. With that, the training of the Gemini model is complete.
In this section, we will test our model on the test data. To load the tuned model, we use the following code:
model = genai.GenerativeModel(model_name=f'tunedModels/{name}')
The above code loads the tuned model that we just trained on the Personal Identifiable Information data. Now we will test it with some examples from the test data that we set aside. Let's print a random text_input and its corresponding output from the test set:
print(df_test['text_input'][1900])
print(df_test['output'][1900])
Above we can see a random text_input and the output taken from the test set. Now we will pass this text_input to the model and observe the output generated:
text = df_test['text_input'][1900]
res = model.generate_content(text)
print(res.text)
We see that the model successfully masked the Personal Identifiable Information in the given text_input, and the generated output exactly matches the output from the test set. Now let us try this with a few more examples:
print(df_test['text_input'][1969])
print(df_test['output'][1969])
text = df_test['text_input'][1969]
res = model.generate_content(text)
print(res.text)
print(df_test['text_input'][1987])
print(df_test['output'][1987])
text = df_test['text_input'][1987]
res = model.generate_content(text)
print(res.text)
print(df_test['text_input'][1933])
print(df_test['output'][1933])
text = df_test['text_input'][1933]
res = model.generate_content(text)
print(res.text)
For all the examples above, the fine-tuned model performs well. It learned from the training data and applies the masking correctly to hide sensitive personal information. We have now seen, end to end, how to curate a dataset for fine-tuning and how to fine-tune the Gemini model on it, and the results look very promising.
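If you want to go beyond spot checks, the sketch below computes a rough exact-match accuracy over a small slice of the test set. It assumes the model and df_test objects from the previous steps are still in scope, keeps the loop short to respect API rate limits, and uses strict string matching, so it is only a quick sanity check rather than a formal evaluation.

# Rough exact-match evaluation on a small sample of the held-out data
subset = df_test.head(20)
correct = 0
for _, row in subset.iterrows():
    res = model.generate_content(row['text_input'])
    if res.text.strip() == row['output'].strip():
        correct += 1
print(f"Exact matches: {correct}/{len(subset)}")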
In conclusion, this guide has provided a comprehensive walkthrough of fine-tuning Google's Gemini models for masking personally identifiable information (PII). We began with Google's announcement of the fine-tuning capability for Gemini models, highlighting the need for fine-tuning to achieve task-specific accuracy. Through the practical steps outlined in the guide, including dataset preparation, fine-tuning the Gemini model, and testing its performance, users can harness the power of large language models for PII masking tasks.
Here are the key takeaways from this guide:
Q. What is Parameter Efficient Tuning (PET)?
A. Parameter Efficient Tuning (PET) is a fine-tuning technique that updates only a small set of the model's parameters. Google employs it to quickly fine-tune important layers in the Gemini model, efficiently adapting the model to the user's data and improving its performance on specific tasks.
Q. What parameters are involved in tuning a Gemini model?
A. Tuning a Gemini model involves providing parameters like the base model name, epoch count, batch size, and learning rate. These parameters influence the training process and ultimately affect the model's performance.
Q. How can users monitor the training progress of a fine-tuned Gemini model?
A. Users can monitor the training progress of a fine-tuned Gemini model through status updates, visualizations of metrics like mean loss per epoch, and by observing snapshots of the training process.
Q. What setup is needed before fine-tuning a Gemini model?
A. Before fine-tuning a Gemini model, users need to install the necessary libraries like google-generativeai and datasets. Additionally, setting up OAuth for data security and formatting the dataset for training are important steps.
Q. Where can a fine-tuned Gemini model for PII masking be applied?
A. A fine-tuned Gemini model can be applied in different domains where PII masking is necessary, like data anonymization, privacy preservation in NLP applications, and compliance with data protection regulations like the GDPR.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.