Language models have transformed how we interact with data, enabling applications like chatbots, sentiment analysis, and even automated content generation. However, most discussions revolve around large-scale models like GPT-3 or GPT-4, which require significant computational resources and vast datasets. While these models are powerful, they are not always practical for domain-specific tasks or deployment in resource-constrained environments. This is where small language models come into play.
This blog will walk you through the process of training a small language model using the Dataset from Hugging Face, focusing on creating a tailored model for predicting diseases based on symptoms.
A small language model refers to a scaled-down version of large models, optimized to balance performance and efficiency. Examples include DistilGPT-2, ALBERT, and DistilBERT.
These models:
In this tutorial, we will demonstrate how to fine-tune a small language model, specifically DistilGPT-2, to handle a medical task: predicting diseases based on symptoms using the Symptoms and Disease Dataset from Hugging Face. By the end, you’ll understand how small language models can be applied effectively to solve real-world problems in a focused manner.
The Symptoms and Disease Dataset provides mappings of medical instructions or symptom descriptions to their corresponding diseases. This dataset is well-suited for training models to predict diseases or answer medical queries based on symptom descriptions.
Example Entries:
Instruction | Disease |
---|---|
What are the symptoms of hypertensive disease? | The following are the symptoms of hypertensive disease: pain chest, shortness of breath, dizziness, asthenia, fall, syncope, vertigo, sweating increased, palpitation, nausea, angina pectoris, pressure chest |
What are the symptoms of diabetes? | The following are the symptoms of diabetes: polyuria, polydypsia, shortness of breath, pain chest, asthenia, nausea, orthopnea, rale, sweating increased, unresponsiveness, mental status changes, vertigo, vomiting, labored breathing |
This structured dataset enables a small language model to learn relationships between symptoms and diseases effectively.
This guide provides a practical demonstration of training a small language model using DistilGPT-2 for predicting diseases based on symptoms. Below is the step-by-step explanation of the code with implementation details.
Let’s dive into the steps.
Ensure you have the necessary libraries installed:
!pip install torch torchtext transformers sentencepiece pandas tqdm datasets
The following libraries are imported to set up the environment for training a small language model:
from datasets import load_dataset, DatasetDict, Dataset
import pandas as pd
import ast
import datasets
from tqdm import tqdm
import time
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split
We’ll use the Symptoms and Disease Dataset from Hugging Face and convert it into a format suitable for training.
# Load the dataset
dataset = load_dataset("prognosis/symptoms_disease_v1")
dataset
# Convert to a pandas dataframe
updated_data = [{'Input': item['instruction'], 'Disease': item['output']} for item in dataset['train']]
df = pd.DataFrame(updated_data)
df.head(5)
if torch.cuda.is_available():
device = torch.device('cuda')
else:
# If Apple Silicon, set to 'mps' - otherwise 'cpu' (not advised)
try:
device = torch.device('mps')
except Exception:
device = torch.device('cpu')
Device Selection:
# The tokenizer turns texts to numbers (and vice-versa)
tokenizer = GPT2Tokenizer.from_pretrained('distilgpt2')
# The transformer
model = GPT2LMHeadModel.from_pretrained('distilgpt2').to(device)
model
The GPT2Tokenizer from Hugging Face is loaded using from_pretrained(‘distilgpt2’). This tokenizer:
The DistilGPT-2 language model is loaded with GPT2LMHeadModel.from_pretrained(‘distilgpt2’). This is a smaller, efficient version of GPT-2 designed for language tasks like text generation. The model is moved to the appropriate hardware device (GPU, MPS, or CPU) for efficient computation.
The LanguageDataset class is designed to:
# Dataset Prep
class LanguageDataset(Dataset):
"""
An extension of the Dataset object to:
- Make training loop cleaner
- Make ingestion easier from pandas df's
"""
def __init__(self, df, tokenizer):
self.labels = df.columns
self.data = df.to_dict(orient='records')
self.tokenizer = tokenizer
x = self.fittest_max_length(df) # Fix here
self.max_length = x
def __len__(self):
return len(self.data)
def __getitem__(self, idx):
x = self.data[idx][self.labels[0]]
y = self.data[idx][self.labels[1]]
text = f"{x} | {y}"
tokens = self.tokenizer.encode_plus(text, return_tensors='pt', max_length=128, padding='max_length', truncation=True)
return tokens
def fittest_max_length(self, df): # Fix here
"""
Smallest power of two larger than the longest term in the data set.
Important to set up max length to speed training time.
"""
max_length = max(len(max(df[self.labels[0]], key=len)), len(max(df[self.labels[1]], key=len)))
x = 2
while x < max_length: x = x * 2
return x
# Cast the Huggingface data set as a LanguageDataset we defined above
data_sample = LanguageDataset(df, tokenizer)
This step defines and initializes a custom PyTorch Dataset to handle the tokenization and formatting of a text-based dataset, preparing it for training with DistilGPT-2. It simplifies ingestion, ensures consistency in input size, and is tailored for efficient processing by the model.
train_size = int(0.8 * len(data_sample))
valid_size = len(data_sample) - train_size
train_data, valid_data = random_split(data_sample, [train_size, valid_size])
Divides the dataset into two subsets:
# Make the iterators
train_loader = DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=BATCH_SIZE)
DataLoaders feed data in manageable batches during training and validation.
train_loader:
valid_loader:
# Set the number of epochs
num_epochs = 2
# Model params
BATCH_SIZE = 8
# Training parameters
batch_size = BATCH_SIZE
model_name = 'distilgpt2'
gpu = 0
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
optimizer = optim.Adam(model.parameters(), lr=5e-4)
tokenizer.pad_token = tokenizer.eos_token
# Init a results dataframe
results = pd.DataFrame(columns=['epoch', 'transformer', 'batch_size', 'gpu',
'training_loss', 'validation_loss', 'epoch_duration_sec'])
Epochs and Batch Size:
Model and GPU Tracking:
Loss Function:
Optimizer:
Results Logging:
This step sets up the key parameters, components, and tracking mechanisms required for the training process. It ensures the training loop is configured with appropriate values and prepares a structure for logging the results.
# The training loop
for epoch in range(num_epochs):
start_time = time.time() # Start the timer for the epoch
# Training
## This line tells the model we're in 'learning mode'
model.train()
epoch_training_loss = 0
train_iterator = tqdm(train_loader, desc=f"Training Epoch {epoch+1}/{num_epochs} Batch Size: {batch_size}, Transformer: {model_name}")
for batch in train_iterator:
optimizer.zero_grad()
inputs = batch['input_ids'].squeeze(1).to(device)
targets = inputs.clone()
outputs = model(input_ids=inputs, labels=targets)
loss = outputs.loss
loss.backward()
optimizer.step()
train_iterator.set_postfix({'Training Loss': loss.item()})
epoch_training_loss += loss.item()
avg_epoch_training_loss = epoch_training_loss / len(train_iterator)
# Validation
# Validation
model.eval()
epoch_validation_loss = 0
total_loss = 0
valid_iterator = tqdm(valid_loader, desc=f"Validation Epoch {epoch+1}/{num_epochs}")
with torch.no_grad():
for batch in valid_iterator:
inputs = batch['input_ids'].squeeze(1).to(device)
targets = inputs.clone()
outputs = model(input_ids=inputs, labels=targets)
loss = outputs.loss
total_loss += loss.item() # Convert tensor to scalar
valid_iterator.set_postfix({'Validation Loss': loss.item()})
epoch_validation_loss += loss.item()
avg_epoch_validation_loss = epoch_validation_loss / len(valid_loader)
end_time = time.time() # End the timer for the epoch
epoch_duration_sec = end_time - start_time # Calculate the duration in seconds
new_row = {'transformer': model_name,
'batch_size': batch_size,
'gpu': gpu,
'epoch': epoch+1,
'training_loss': avg_epoch_training_loss,
'validation_loss': avg_epoch_validation_loss,
'epoch_duration_sec': epoch_duration_sec} # Add epoch_duration to the dataframe
results.loc[len(results)] = new_row
print(f"Epoch: {epoch+1}, Validation Loss: {total_loss/len(valid_loader)}")
Epoch Timer:
Training Phase:
Validation Phase:
Performance Logging:
Progress Display:
This step defines the core training and validation loop for the model, handling the forward pass, backpropagation, weight updates, and validation to evaluate model performance.
# Define the input string
input_str = "What are the symptoms of Chicken pox?"
# Encode the input string with padding and attention mask
encoded_input = tokenizer.encode_plus(
input_str,
return_tensors='pt',
padding=True,
truncation=True,
max_length=50 # Adjust max_length as needed
)
# Move tensors to the appropriate device
input_ids = encoded_input['input_ids'].to(device)
attention_mask = encoded_input['attention_mask'].to(device)
# Set the pad_token_id to the tokenizer's eos_token_id
pad_token_id = tokenizer.eos_token_id
# Generate the output
output = model.generate(
input_ids,
attention_mask=attention_mask,
max_length=50, # Adjust max_length as needed
num_return_sequences=1,
do_sample=True,
top_k=8,
top_p=0.95,
temperature=0.5,
repetition_penalty=1.2,
pad_token_id=pad_token_id
)
# Decode and print the output
decoded_output = tokenizer.decode(output[0], skip_special_tokens=True)
print(decoded_output)
This step qualitatively tests the model’s performance by providing a sample query and evaluating its generated response. It helps validate the model’s ability to produce relevant and meaningful outputs.
You can refer this for details.
Fine-tuning DistilGPT-2, a compact version of GPT-2, tailors the model to specific tasks, enhancing its performance in targeted applications. Here’s a comparison of DistilGPT-2’s capabilities before and after fine-tuning:
from transformers import GPT2Tokenizer, GPT2LMHeadModel
# Load pre-trained DistilGPT-2 tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("distilgpt2")
model = GPT2LMHeadModel.from_pretrained("distilgpt2")
# Set the padding token to the end-of-sequence token (common practice for GPT-2-based models)
tokenizer.pad_token = tokenizer.eos_token
# Define the input query
input_query = "What are the symptoms of Chicken pox?"
# Tokenize the input query
input_tokens = tokenizer.encode_plus(
input_query,
return_tensors="pt",
padding=True,
truncation=True,
max_length=50 # Adjust max_length if needed
)
# Generate response using the pre-trained model
output_tokens = model.generate(
input_ids=input_tokens["input_ids"],
attention_mask=input_tokens["attention_mask"],
max_length=50, # Adjust max_length if needed
num_return_sequences=1,
do_sample=True, # Sampling adds randomness for diverse outputs
top_k=8, # Keep top 8 most probable tokens at each step
top_p=0.95, # Consider tokens with a cumulative probability of 0.95
temperature=0.7, # Adjust temperature for response diversity
repetition_penalty=1.2, # Penalize repetitive token generations
pad_token_id=tokenizer.pad_token_id # Handle padding gracefully
)
# Decode the generated output to human-readable text
decoded_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
# Print the results
print("Pre-Fine-Tuning Response:")
print(decoded_output)
The response from the pre-fine-tuned DistilGPT-2 model highlights its general-purpose nature. While it’s coherent and grammatically correct, it lacks specific, accurate information about the symptoms of chickenpox. This behavior is expected because the pre-trained model hasn’t been exposed to domain-specific knowledge about diseases or symptoms.
Once fine-tuned on the dataset “Symptoms and Disease Dataset,” the model :
In summary, fine-tuning DistilGPT-2 transforms it from a general-purpose language model into a specialized tool, enhancing its performance and accuracy in specific domains while retaining its inherent efficiency.
Small language models, such as DistilGPT-2, are a powerful and efficient alternative to large-scale models for domain-specific tasks. Through this tutorial, we demonstrated how to fine-tune DistilGPT-2 using the Symptoms and Disease Dataset, focusing on building a lightweight yet effective model for medical query answering. The process involved data preparation, training, validation, and response generation, showcasing the practical applications of small models in real-world scenarios.
The success of this approach lies in its balance between computational efficiency and performance, making small language models an excellent choice for resource-constrained environments or specialized use cases.
A. A small language model, like DistilGPT-2, is a compact version of large models designed to balance performance and efficiency. It requires fewer computational resources, making it ideal for resource-constrained environments and domain-specific tasks.
A. Small models are faster, cost-effective, and easier to fine-tune on specific datasets. They are ideal when large-scale general-purpose capabilities are unnecessary, such as in applications requiring domain-specific expertise.
A. Fine-tuning is the process of adapting a pre-trained model to a specific task or domain by training it on a curated dataset. It improves the model’s performance for specialized tasks, such as predicting diseases from symptoms.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.