Vision Language Models (VLMs) are a subset of multimodal AI that integrate visual and linguistic inputs: they process both images and text to produce textual responses. Unlike Large Language Models, which operate on text alone, VLMs exhibit strong zero-shot capabilities and generalize well to tasks they were not specifically trained for. They are versatile across a range of applications, including identifying objects in images, answering questions, and understanding document content. Many of these models can also reason about spatial relationships within an image, generating precise location markers or delineating regions for particular objects. For further insight into Vision Language Models and their architecture, explore the additional resources linked here.
In this blog, we will fine-tune Alibaba's Qwen2-VL 7B Vision Language Model on a custom healthcare dataset of radiology images and question-answer pairs.
This article was published as a part of the Data Science Blogathon.
Vision language models are generally described as a type of multimodal models capable of learning from both images and text. These generative models accept image and text inputs and produce text outputs. Large vision language models exhibit strong zero-shot capabilities, generalize effectively, and are compatible with various types of images, including documents and web pages. Their applications encompass chatting about images, image recognition based on instructions, visual question answering, document understanding, and image captioning, among others.
Certain vision language models are also adept at capturing spatial properties within an image. They can generate bounding boxes or segmentation masks when instructed to detect or segment specific subjects, and they can localize different entities or respond to queries about their relative or absolute positions. The existing array of large vision language models is diverse in terms of the data they were trained on, how they encode images, and their overall capabilities.
Visual question answering is a task in artificial intelligence where the goal is to generate a correct answer to a question about a given image. A VQA model needs to understand both the visual content of the image and the semantics of the natural language question. This requires the model to perform a combination of image recognition and natural language processing.
For example, given an image of a dog sitting on a sofa and the question “What is the dog sitting on?”, the VQA model must first detect and recognize the objects in the image—identifying the dog and the sofa. It then needs to parse the question, understanding that the query is about the relationship between the dog and its surrounding environment. By combining these insights, the model can generate the answer “sofa.”
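To make this concrete, here is a minimal zero-shot VQA sketch using the Hugging Face transformers pipeline; the ViLT checkpoint and the image path are placeholders chosen purely for illustration:

from transformers import pipeline

# Load a general-purpose VQA model (placeholder checkpoint, for illustration only)
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# "dog_on_sofa.jpg" is a hypothetical local image of a dog sitting on a sofa
result = vqa(image="dog_on_sofa.jpg", question="What is the dog sitting on?")
print(result[0]["answer"])  # expected to be something like "sofa" or "couch"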
With the advent of Large Language Models (LLMs) for question answering, content generation, summarization, and more, various industries have started leveraging LLMs for their business use cases by coupling them with a RAG (Retrieval Augmented Generation) layer that searches and retrieves from vector databases storing textual content as embeddings. Since most internet data is text, there is rarely a need to train or fine-tune LLMs except for very complex use cases: they are trained on vast amounts of internet data and are highly adept at understanding almost any form of text without a transfer learning step.
But let's take a minute and ask the same question about images: are internet images domain specific? No. Most internet images are general-purpose, and Vision Language Models are therefore trained on general-purpose imagery, which makes it hard for them to perform well on targeted use cases in healthcare, manufacturing, finance, and similar domains, where the images are poles apart in structure and composition from general-purpose images (say, the images in ImageNet and other benchmark datasets). Hence, fine-tuning VLMs for custom use cases has become an increasingly common approach for companies that want to leverage the power of these pretrained VLMs on business-specific problems and to extract and generate information not only from text, but from visual elements too.
In essence, the process of fine-tuning is a strategic approach to model optimization, ensuring that the model not only fits the task at hand with greater accuracy but also operates with enhanced efficiency.
Unsloth is a framework for efficient fine-tuning of large language and vision language models at scale. Its optimized kernels and memory-efficient LoRA/QLoRA support make it a go-to choice for model fine-tuning among ML Engineers and Data Scientists.
Below we will look into the detailed steps using 4-bit quantized Qwen2 7B VL model:
To kick off our hands-on journey, we begin by importing the necessary libraries and modules to set up our deep learning environment.
import torch
import os
from tqdm import tqdm
from datasets import load_dataset
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
Now we move on to define key constants that will be used throughout our training process. TRAIN_SET, TEST_SET, and VAL_SET are set to “Train“, “Test“, and “Valid” respectively. These constants will help us reference specific data splits in our dataset, ensuring that we’re training on the right data and evaluating our model’s performance accurately.
We also define hyperparameters specific to the LoRA (Low-Rank Adaptation) architecture, which are ‘LORA_RANK‘ and ‘LORA_ALPHA‘, both set to 16. ‘LORA_RANK’ determines the rank of the low-rank matrices, while ‘LORA_ALPHA’ specifies the scale of the adaptation. Additionally, we have set ‘LORA_DROPOUT’ to 0, as we’re not applying dropout in the LoRA layers during fine-tuning.
To keep track of our experiments and model training, we set environment variables for Weights & Biases (wandb), a popular tool for experiment tracking, model optimization, and dataset versioning. By setting the ‘WANDB_PROJECT’ variable to “qwen2-vl-finetuning-logs”, we specify the project namespace in wandb where all our logs and outputs will be stored. The ‘WANDB_LOG_MODEL‘ variable is set to “checkpoint”, which instructs wandb to log model checkpoints, allowing us to monitor the model’s performance over time and resume training if necessary. These environment configurations are necessary for a manageable and reproducible training workflow.
TRAIN_SET = "Train"
TEST_SET = "Test"
VAL_SET = "Valid"
LORA_RANK = 16
LORA_ALPHA = 16
LORA_DROPOUT = 0
os.environ["WANDB_PROJECT"] = "qwen2-vl-finetuning-logs"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"
In this step, we initialize our model and tokenizer using the FastVisionModel.from_pretrained method. We specify the pre-trained model we wish to use, in this case, “unsloth/Qwen2-VL-7B-Instruct-bnb-4bit“. The use_gradient_checkpointing parameter is set to “unsloth“, which enables gradient checkpointing to optimize memory usage during training. Gradient checkpointing is particularly useful when working with large models or when limited GPU memory is available.
By executing this code, we load both the model weights and the associated tokenizer, setting us up for the subsequent fine-tuning process.
For educational purposes and to expedite our training process, we opt to load a quantized 4-bit version of our model. Quantization reduces the precision of the model’s weights, which can lead to faster inference times and decreased memory usage without significantly impacting performance, making it ideal for learning scenarios and quick experimentation.
model, tokenizer = FastVisionModel.from_pretrained(
"unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
use_gradient_checkpointing="unsloth",
)
On running this cell, you should see output similar to the image below:
In the provided code snippet, we configure a model for Parameter-Efficient Fine-Tuning (PEFT) using the Low-Rank Adaptation (LoRA) technique. LoRA is a resource-efficient method for adapting large pre-trained models to new tasks. Vision-language models are typically pre-trained on large datasets, learning representations that transfer well to various downstream tasks. However, fine-tuning all parameters in these large models is computationally expensive and may lead to overfitting, especially with limited domain-specific data.
LoRA addresses this by adding low-rank matrices that approximate updates to the original weight matrices of the model, capturing the new task's requirements with minimal additional parameters. Read more about it here!
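As a rough, purely illustrative sketch (not Unsloth's or PEFT's actual implementation), a LoRA-adapted linear layer keeps the pretrained weight frozen and adds a trainable low-rank update scaled by alpha / r:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA wrapper: output = base(x) + (alpha / r) * x @ A.T @ B.T"""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen
        # Low-rank factors: A projects down to rank r, B projects back up
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

With r=16 on a hypothetical 4096 x 4096 projection, the adapter trains roughly 2 x 16 x 4096 = 131,072 parameters instead of the 16.7 million in the full matrix, which is what makes LoRA so parameter-efficient.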
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers=True, # False if not finetuning vision layers
finetune_language_layers=True, # False if not finetuning language layers
finetune_attention_modules=True, # False if not finetuning attention layers
finetune_mlp_modules=True, # False if not finetuning MLP layers
r=LORA_RANK, # The larger, the higher the accuracy, but might overfit
lora_alpha=LORA_ALPHA, # Recommended alpha == r at least
lora_dropout=LORA_DROPOUT,
bias="none",
random_state=3407,
use_rslora=False, # We support rank stabilized LoRA
loftq_config=None, # And LoftQ
# target_modules = "all-linear", # Optional now! Can specify a list if needed
)
Let’s break down each of the parameters in the code snippet provided for the FastVisionModel.get_peft_model method, which is used to configure the model for PEFT using LoRA:
finetune_vision_layers=True: Enables the vision layers of the model to be fine-tuned, allowing them to adapt to new visual data that may differ significantly from the data seen during pre-training. This is especially beneficial for tasks involving domain-specific imagery.
finetune_language_layers=True: Updates the language-processing layers, helping the model better understand and generate responses for linguistic nuances in the new task. This is crucial for fine-tuning the model's textual output.
finetune_attention_modules=True: Fine-tunes the attention modules, which play a key role in understanding relationships between input elements. By refining these modules, the model can better identify task-relevant features and dependencies.
finetune_mlp_modules=True: Adapts the multi-layer perceptron (MLP) components of the model. These layers process outputs from the attention modules, and fine-tuning them ensures better alignment with the specific requirements of the new task.
r=LORA_RANK: Sets the rank for the low-rank matrices introduced by LoRA, influencing the number of trainable parameters. Higher values can enhance accuracy but risk overfitting, making this a key parameter for balancing performance.
lora_alpha=LORA_ALPHA: Determines the scaling factor for LoRA weights, controlling how much they influence the model's behavior. Larger values lead to more significant deviations from the pre-trained model.
lora_dropout=LORA_DROPOUT: Applies dropout regularization to the LoRA layers, reducing overfitting risks during fine-tuning and improving generalization.
bias="none": Indicates that biases in the LoRA layers are not adjusted during fine-tuning, simplifying the training process.
random_state=3407: Ensures reproducibility by fixing the random seed for consistent results.
use_rslora=False: Disables Rank Stabilized LoRA (RS-LoRA), favoring standard LoRA for simplicity.
loftq_config=None: Skips LoftQ, as the model already uses a 4-bit quantized Qwen setup.
target_modules="all-linear": Indicates that LoRA fine-tuning is applied to all linear layers, offering flexibility for customization.
This step involves loading the MEDPIX-ShortQA dataset using the load_dataset function, which retrieves the training, testing, and validation sets for model training and evaluation.
The MEDPIX-ShortQA dataset consists of radiology images paired with short questions and answers. It is designed to train models for medical image diagnosis. The dataset includes image IDs, case IDs, and metadata, along with image width in pixels. It is structured to help develop AI models that interpret radiological images and answer related medical questions. This supports radiologists and healthcare professionals in their work.
train_dataset = load_dataset("adishourya/MEDPIX-ShortQA", split=TRAIN_SET)
test_dataset = load_dataset("adishourya/MEDPIX-ShortQA", split=TEST_SET)
val_dataset = load_dataset("adishourya/MEDPIX-ShortQA", split=VAL_SET)
Dataset preview (output on running the above cell):
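If you prefer to inspect a sample programmatically, a couple of quick prints work too; the question, answer, and image_id columns referenced below are the ones used later in this article:

print(train_dataset)  # number of rows and the column names

sample = train_dataset[0]
print(sample["question"])        # the short question about the scan
print(sample["answer"])          # the reference diagnostic answer
print(type(sample["image_id"]))  # the radiology image associated with this QA pair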
Nothing fancy here! In this step, we define a function convert_to_conversation that transforms our MEDPIX-ShortQA dataset samples into a conversation format. This format is more suitable for training conversational AI models. Each sample is converted into a structured dialogue with a “user” asking a question accompanied by an “image” of a radiology scan, and the “assistant” providing the medical diagnosis as an answer.
Next, by iterating over the training, testing, and validation datasets, we transform each sample into a structured conversation:
def convert_to_conversation(sample):
conversation = [
{
"role": "user",
"content": [
{"type": "text", "text": sample["question"]},
{"type": "image", "image": sample["image_id"]},
],
},
{"role": "assistant", "content": [{"type": "text", "text": sample["answer"]}]},
]
return {"messages": conversation}
train_set = [convert_to_conversation(sample) for sample in train_dataset]
test_set = [convert_to_conversation(sample) for sample in test_dataset]
val_set = [convert_to_conversation(sample) for sample in val_dataset]
Let's take a look for better understanding! Run the cell below, and you should get output similar to that shown in the image below.
train_set[0] #look below for output!
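The exact text will depend on the first sample in your split, but the structure should look roughly like this (the question, answer, and image below are placeholders, not the actual first record):

# {
#     "messages": [
#         {
#             "role": "user",
#             "content": [
#                 {"type": "text", "text": "<question about the scan>"},
#                 {"type": "image", "image": <image taken from the image_id column>},
#             ],
#         },
#         {
#             "role": "assistant",
#             "content": [{"type": "text", "text": "<short diagnostic answer>"}],
#         },
#     ]
# }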
In this step, we focus on evaluating our Qwen2 VL model in a zero-shot setting, which means we test the model's pretrained weights without any additional training or fine-tuning. To do this, we define the function run_test_set, which performs inference on a given dataset: it iterates over the samples one at a time and uses the pre-trained model and tokenizer to generate a response to each question.
def run_test_set(dataset, batch_size=8):  # note: batch_size is accepted but unused; samples are processed one at a time
FastVisionModel.for_inference(model)
ground_truths, responses = [], []
for sample in tqdm(
dataset,
desc="Running inference on test set",
bar_format="{l_bar}{bar:10}{r_bar}{bar:-10b}",
):
image = sample["messages"][0]["content"][1]["image"]
question = sample["messages"][0]["content"][0]["text"]
answer = sample["messages"][1]["content"][0]["text"]
messages = [
{
"role": "user",
"content": [{"type": "image"}, {"type": "text", "text": question}],
}
]
input_text = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
)
inputs = tokenizer(
image,
input_text,
add_special_tokens=False,
return_tensors="pt",
).to("cuda")
with torch.no_grad():
generated_ids = model.generate(
**inputs, max_new_tokens=128, use_cache=True, temperature=0.5, min_p=0.1
)
generated_ids_trimmed = [
out_ids[len(in_ids) :]
for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(
generated_ids_trimmed,
skip_special_tokens=True,
clean_up_tokenization_spaces=False,
)[0]
responses.append(response)
ground_truths.append(answer)
torch.cuda.empty_cache()
return ground_truths, responses
Now, let's run the inference using the below cell!
ground_truths, responses = run_test_set(test_set, batch_size=8)
In this step, we will evaluate the performance of our Vision Language Model (VLM) on the test set in a zero-shot setting. We have chosen BERTScore, a metric that evaluates the quality of generated text using BERT embeddings. BERTScore computes precision, recall, and F1 scores that reflect the semantic similarity between the generated text and the reference text.
from bert_score import score
P, R, F1 = score(responses, ground_truths, lang="en", verbose=True, nthreads=10)
print(
f"""
Precision: {P.mean().cpu().numpy()}
Recall: {R.mean().cpu().numpy()}
F1 Score: {F1.mean().cpu().numpy()}
"""
)
In zero-shot mode, we are using only the model's pretrained weights on our target task, which is answering questions about radiology scans and other medical imagery. As we discussed earlier, VLMs are pretrained on general-purpose images of animals, vehicles, places, landscapes, and so on.
Hence, relying on the pretrained weights alone for our targeted use case won't yield great performance, which is clear from the scores I got by running the above cell:
| Precision | Recall | F1-Score |
|-----------|--------|----------|
| 0.7786    | 0.7943 | 0.7863   |
It is important to first check the zero-shot capabilities of the chosen model before starting the transfer learning phase. This practice highlights the model’s performance in its pre-trained setting. It also serves as a benchmark, showing how well the model handles complex domain-specific use cases.
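Optionally, you can keep the zero-shot means around so that the post-fine-tuning scores can be compared against them directly later on:

# Store the zero-shot baseline for comparison with the fine-tuned model
zero_shot_scores = {
    "precision": P.mean().item(),
    "recall": R.mean().item(),
    "f1": F1.mean().item(),
}
print(zero_shot_scores)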
In this step, we prepare to fine-tune the Qwen2 VL model. The code snippet below demonstrates the setup required to initiate training using the SFTTrainer from the TRL library, together with Unsloth's vision data collator.
First, we put the model in training mode, which enables gradient computation and the dropout layers used during training but not during inference. Then we create an instance of SFTTrainer (Supervised Fine-tuning Trainer), which manages the training process, from data collation to model optimization and logging.
FastVisionModel.for_training(model) # Enable for training!
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
data_collator=UnslothVisionDataCollator(model, tokenizer), # Must use!
train_dataset=train_set,
eval_dataset=val_set,
args=SFTConfig(
do_train=True,
do_eval=True,
per_device_train_batch_size=4,
gradient_accumulation_steps=8,
save_total_limit=1,
warmup_steps=5,
# max_steps = 30,
num_train_epochs=2, # Set this instead of max_steps for full training runs
learning_rate=2e-4,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
logging_steps=100,
optim="adamw_8bit",
weight_decay=0.01,
lr_scheduler_type="linear",
seed=3407,
output_dir="outputs",
save_strategy="steps",
save_steps=100,
report_to=["wandb"],
# For vision finetuning:
remove_unused_columns=False,
dataset_text_field="",
dataset_kwargs={"skip_prepare_dataset": True},
dataset_num_proc=4,
max_seq_length=2048,
),
)
As we can see in the code above, the SFTTrainer takes several parameters: the model and tokenizer, an UnslothVisionDataCollator (which batches the image-text conversations correctly for vision fine-tuning), the training and validation sets, and an SFTConfig object holding the training arguments.
The SFTConfig includes, among others, per_device_train_batch_size and gradient_accumulation_steps (which together give an effective batch size of 32), num_train_epochs and the learning rate, fp16/bf16 flags chosen automatically from hardware support, the 8-bit AdamW optimizer with weight decay and a linear learning-rate scheduler, checkpointing and wandb reporting options, and vision-specific settings such as remove_unused_columns=False, skip_prepare_dataset, and max_seq_length.
The trainer wrapper encapsulates the training logic, and the training process is started by calling trainer.train().
Note: Ensure that all necessary custom classes and methods (FastVisionModel, SFTTrainer, UnslothVisionDataCollator, SFTConfig) are imported from the correct libraries. After configuring and initiating the trainer, begin the training process. You can then monitor the results using the logging and reporting tools specified in your configuration.
Additionally, use the below cell to check memory usage with PyTorch's CUDA utility functions.
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
Output should look like the below image:
The below code snippet runs the training by using a trainer object and stores the statistics in trainer_stats.
trainer_stats = trainer.train()
Output should look similar to the below image:
The table in the output image above shows the training loss at various steps. The loss decreases gradually, which is expected and indicates that the model is learning and improving over time.
Additionally, you will see Weights & Biases (wandb) logging messages indicating that the checkpoint at a given step has been saved and added to an artifact for experiment tracking and versioning.
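SFTTrainer inherits the standard Hugging Face Trainer behaviour, so if a run is interrupted you can resume from the most recent checkpoint saved in the output directory:

# Optional: resume an interrupted run from the latest checkpoint in output_dir ("outputs")
trainer_stats = trainer.train(resume_from_checkpoint=True)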
Use the below snippet to check the final memory and time stats! (optional)
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
Output should look similar to the below image:
The function run_test_set is designed to evaluate a trained FastVisionModel on a given dataset.
def run_test_set(dataset):
FastVisionModel.for_inference(model)
ground_truths, responses = [], []
for sample in tqdm(dataset, desc="Running inference on test set",bar_format="{l_bar}{bar:10}{r_bar}{bar:-10b}",):
image = sample["messages"][0]["content"][1]["image"]
question = sample["messages"][0]["content"][0]["text"]
answer = sample["messages"][1]["content"][0]["text"]
messages = [
{
"role": "user",
"content": [{"type": "image"}, {"type": "text", "text": question}],
}
]
input_text = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
)
inputs = tokenizer(
image,
input_text,
add_special_tokens=False,
return_tensors="pt",
).to("cuda")
generated_ids = model.generate(
**inputs, max_new_tokens=128, use_cache=True, temperature=0.5, min_p=0.1
)
generated_ids_trimmed = [
out_ids[len(in_ids) :]
for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(
generated_ids_trimmed,
skip_special_tokens=True,
clean_up_tokenization_spaces=False,
)[0]
responses.append(response)
ground_truths.append(answer)
return ground_truths, responses
The snippet above involves the following steps: switching the model to inference mode, iterating over the test samples and pulling out the image, question, and reference answer from each conversation, building a chat-formatted prompt with apply_chat_template, tokenizing the image and text together, generating up to 128 new tokens, trimming the prompt tokens from the generated ids, and decoding the remainder into a text response.
Finally, the function returns two lists: ground_truths, containing the correct answers from the dataset, and responses, containing the model’s generated responses. These can be used to evaluate the model’s performance on the test set by comparing the generated responses to the ground truths.
Use the below snippet to begin running inference on the test set!
ground_truths, responses = run_test_set(test_set)
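Before computing any aggregate metric, it can help to eyeball a few predictions against their references; a quick, purely illustrative loop is enough:

# Print a few (ground truth, model response) pairs for a quick sanity check
for gt, resp in list(zip(ground_truths, responses))[:3]:
    print("Ground truth:", gt)
    print("Model response:", resp)
    print("-" * 40)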
Great job on coming this far! It’s time to print the metrics now and check how the model is performing!
This step involves evaluating the quality of generated responses by the fine-tuned Qwen2 Vision Language Model (VLM) using BERTScore. BERTScore leverages the contextual embeddings from pre-trained BERT models to calculate the similarity between two pieces of text.
Let's use the model to generate a response for an image and question pair from the test set.
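Here is a minimal sketch of what that looks like for a single sample, reusing the same chat template and generation settings as run_test_set (index 0 is an arbitrary choice):

FastVisionModel.for_inference(model)  # already done inside run_test_set; harmless to repeat

sample = test_set[0]
image = sample["messages"][0]["content"][1]["image"]
question = sample["messages"][0]["content"][0]["text"]

messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": question}]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")

generated_ids = model.generate(
    **inputs, max_new_tokens=128, use_cache=True, temperature=0.5, min_p=0.1
)
response = tokenizer.batch_decode(
    [generated_ids[0][inputs.input_ids.shape[1]:]], skip_special_tokens=True
)[0]
print("Question:", question)
print("Model response:", response)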
In this example, the scan shows a black mass in the left part of the brain, which the model was able to identify and describe in its response!
Now let's use BERTScore, just like last time, to print the metrics!
from bert_score import score
P, R, F1 = score(responses, ground_truths, lang="en", verbose=True, nthreads=10)
print(
f"""
Precision: {P.mean().cpu().numpy()}
Recall: {R.mean().cpu().numpy()}
F1 Score: {F1.mean().cpu().numpy()}
"""
)
Refer to the image below for the results.
The fine-tuned model performs significantly better than the earlier zero-shot predictions, which had scores of around 78%. Precision and recall have now improved to approximately 87%. This demonstrates how fine-tuning VLMs on targeted datasets enhances their performance. It makes the model more reliable and effective in solving real-world challenges, such as those in healthcare, as shown in this article.
In conclusion, fine-tuning Vision Language Models (VLMs) like Qwen2 is a major advancement in AI, especially for processing multimodal data. The high precision, recall, and F1 scores show the model’s ability to generate responses closely aligned with human-generated ground truths, demonstrating the effectiveness of fine-tuning.
Fine-tuning allows models to go beyond their initial pre-training, enabling adaptation to the specific nuances and complexities of new domains. This adaptability is vital for industries like life sciences, finance, retail, and manufacturing, where documents often contain a mix of text and visual information that must be interpreted together to derive accurate and meaningful insights.
For more discussions, ideas or improvements and suggestions on this topic, please connect with me on my LinkedIn, and feel free to visit my GitHub Repo for accessing the entire code used in this article!
Thank You and Happy Learning! 🙂
Q. What does it mean to fine-tune a Vision Language Model?
A. Fine-tuning involves adapting a pre-trained VLM to a specific dataset or task, improving its performance on domain-specific challenges by training on relevant data.
Q. What tasks can Vision Language Models perform?
A. VLMs can perform tasks such as image recognition, visual question answering, document understanding, and captioning, all of which require the integration of text and images.
Q. How does fine-tuning help on domain-specific use cases?
A. Fine-tuning allows the model to better understand domain-specific nuances in both images and text, enhancing its ability to provide accurate and contextually relevant responses.
Q. Why are Vision Language Models important for industries?
A. They are crucial for industries like healthcare, finance, and manufacturing, as they can process both images and text, enabling more accurate and insightful results for domain-specific use cases.