Finetuning Qwen2 7B VLM Using Unsloth for Radiology VQA

Sukanya Bag | Last Updated: 16 Jan, 2025 | 16 min read

Models that integrate visual and linguistic inputs, known as Vision Language Models (VLMs), are a subset of Multimodal AI: they process both visual and textual data to produce textual responses. Their proficiency lies in their ability to perform tasks without prior task-specific training (zero-shot learning) and to generalize well, unlike Large Language Models, which work with text as their only modality. VLMs are versatile across a range of applications, including identifying objects in images, responding to queries, and comprehending the content of documents. Moreover, these models can discern spatial relationships within images, enabling them to generate precise location markers or delineate regions for particular objects. For further insight into Vision Language Models and their structural design, explore additional information here.

In this blog, we will leverage Alibaba's Qwen2 7B Vision Language Model, fine-tuning it on a custom healthcare dataset of radiology images and question-answer pairs.

Learning Objectives

  • Understand the role and capabilities of Vision Language Models in processing both visual and textual data.
  • Learn about Visual Question Answering (VQA) and how it combines image recognition with natural language processing.
  • Explore the need for fine-tuning VLMs on custom datasets for domain-specific applications like healthcare or finance.
  • Gain insights into leveraging fine-tuned Qwen2 7B VLM for precise tasks on multimodal datasets.
  • Discover the benefits and implementation of fine-tuning VLMs to improve performance on specialized use cases.

This article was published as a part of the Data Science Blogathon.

Introduction to Vision Language Models

Vision language models are generally described as a type of multimodal model capable of learning from both images and text. These generative models accept image and text inputs and produce text outputs. Large vision language models exhibit strong zero-shot capabilities, generalize effectively, and work with various types of images, including documents and web pages. Their applications encompass chatting about images, image recognition based on instructions, visual question answering, document understanding, and image captioning, among others.

Certain vision language models are also adept at capturing spatial properties within an image. They can generate bounding boxes or segmentation masks when instructed to detect or segment specific subjects, and they can localize different entities or respond to queries about their relative or absolute positions. The existing array of large vision language models is diverse in terms of the data they were trained on, how they encode images, and their overall capabilities.

What is Visual Question Answering?

Visual question answering is a task in artificial intelligence where the goal is to generate a correct answer to a question about a given image. A VQA model needs to understand both the visual content of the image and the semantics of the natural language question. This requires the model to perform a combination of image recognition and natural language processing.

For example, given an image of a dog sitting on a sofa and the question “What is the dog sitting on?”, the VQA model must first detect and recognize the objects in the image—identifying the dog and the sofa. It then needs to parse the question, understanding that the query is about the relationship between the dog and its surrounding environment. By combining these insights, the model can generate the answer “sofa.”
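To make this concrete, here is a minimal sketch of zero-shot VQA using the Hugging Face pipeline API. The ViLT checkpoint and the image path are illustrative assumptions for this example only, not the Qwen2 model fine-tuned later in this article:

from transformers import pipeline

# Load a small general-purpose VQA model (illustrative checkpoint choice)
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

# Ask a question about a local image of a dog on a sofa (hypothetical file path)
result = vqa(image="dog_on_sofa.jpg", question="What is the dog sitting on?")
print(result[0]["answer"])  # expected to be something like "couch"

Classification-style VQA models like ViLT pick an answer from a fixed vocabulary, whereas generative VLMs such as Qwen2 VL produce free-form text, which is what we need for radiology question answering.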

Importance of Fine-Tuning VLMs for Domain-Specific Applications

With the advent of Large Language Models (LLMs) for question answering, content generation, summarization, and more, various industries have started leveraging LLMs for their business use cases by coupling them with a Retrieval Augmented Generation (RAG) layer that searches and retrieves from vector databases storing textual content as embeddings. Since most internet data is text, there is rarely a need to train or fine-tune LLMs except for very complex use cases: they are trained on vast amounts of internet data and are highly adept at understanding almost any form of text without a transfer learning step.

But let's take a minute and think about the same for images: are internet images domain specific? No. Most internet images are general purpose, and Vision Language Models are therefore trained on such general-purpose images, which makes it hard for them to perform well on targeted use cases in healthcare, manufacturing, finance, etc., where the images are poles apart in structure and composition from general-purpose images (say, those in ImageNet and other benchmark datasets). Hence, fine-tuning VLMs for custom use cases has become an increasingly common approach for companies that want to apply the power of these pretrained VLMs to business-specific use cases, extracting and generating information not only from text but from visual elements too.

Key Instances where Model Fine-tuning is Crucial

  • Domain-Specific Adjustment: Fine-tuning tailors models to function optimally within a particular domain, taking into account its unique language, style, or data.
  • Task-Focused Customization: This process involves leveraging a model’s capabilities so it excels at a specific task, making it adept at handling the nuances and requirements of that task.
  • Efficiency in Resource Use: By fine-tuning, models are optimized to use computational resources more effectively, thereby enhancing performance without unnecessary resource expenditure.

In essence, the process of fine-tuning is a strategic approach to model optimization, ensuring that the model not only fits the task at hand with greater accuracy but also operates with enhanced efficiency.

What is Unsloth?

Unsloth is a framework for efficient fine-tuning of large language and vision language models at scale. Given below are a few highlights of Unsloth that make it a go-to choice for model fine-tuning among ML Engineers and Data Scientists:

  • Enhanced Fine-Tuning Framework: Delivers a refined system for tuning both vision-language models (VLMs) and large language models (LLMs), boasting training times that are up to 30 times quicker alongside a 60% reduction in memory consumption.
  • Cross-Hardware Compatibility: Accommodates a variety of hardware configurations such as NVIDIA, AMD, and Intel GPUs. This is achieved through the use of advanced weight optimization strategies that significantly improve memory usage efficiency.
  • Faster Inference Time: Unsloth natively provides 2x faster inference for fine-tuned models. All QLoRA, LoRA, and non-LoRA inference paths are 2x faster, with no code changes or new dependencies required.

Code Implementation Using the 4-bit Quantized Qwen2 7B VL Model

Below, we walk through the detailed steps using the 4-bit quantized Qwen2 7B VL model:
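If you are following along in a Colab or Jupyter notebook, make sure the required packages are installed first. This is a minimal setup sketch; exact versions may vary with your environment, and Unsloth is assumed to pull in compatible versions of transformers, trl, datasets, and bitsandbytes:

!pip install unsloth       # fine-tuning framework used throughout this article
!pip install bert_score    # used later for evaluation
!pip install wandb         # used for experiment tracking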

Step1: Import all the necessary dependencies

To kick off our hands-on journey, we begin by importing the necessary libraries and modules to set up our deep learning environment.

import torch
import os
from tqdm import tqdm

from datasets import load_dataset
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

Step2: Configuration and Environment Variables

Now we move on to define key constants that will be used throughout our training process. TRAIN_SET, TEST_SET, and VAL_SET are set to "Train", "Test", and "Valid" respectively. These constants will help us reference specific data splits in our dataset, ensuring that we're training on the right data and evaluating our model's performance accurately.

We also define hyperparameters specific to the LoRA (Low-Rank Adaptation) architecture: LORA_RANK and LORA_ALPHA, both set to 16. LORA_RANK determines the rank of the low-rank matrices, while LORA_ALPHA specifies the scale of the adaptation. Additionally, we set LORA_DROPOUT to 0, as we're not applying dropout in the LoRA layers during fine-tuning.

To keep track of our experiments and model training, we set environment variables for Weights & Biases (wandb), a popular tool for experiment tracking, model optimization, and dataset versioning. By setting the WANDB_PROJECT variable to "qwen2-vl-finetuning-logs", we specify the project namespace in wandb where all our logs and outputs will be stored. The WANDB_LOG_MODEL variable is set to "checkpoint", which instructs wandb to log model checkpoints, allowing us to monitor the model's performance over time and resume training if necessary. These environment configurations are necessary for a manageable and reproducible training workflow.

TRAIN_SET = "Train"
TEST_SET = "Test"
VAL_SET = "Valid"

LORA_RANK = 16
LORA_ALPHA = 16
LORA_DROPOUT = 0

os.environ["WANDB_PROJECT"] = "qwen2-vl-finetuning-logs"
os.environ["WANDB_LOG_MODEL"] = "checkpoint"

Step3: Loading the Qwen2 VL 7B model and tokenizer

In this step, we initialize our model and tokenizer using the FastVisionModel.from_pretrained method. We specify the pre-trained model we wish to use, in this case, “unsloth/Qwen2-VL-7B-Instruct-bnb-4bit“. The use_gradient_checkpointing parameter is set to “unsloth“, which enables gradient checkpointing to optimize memory usage during training. Gradient checkpointing is particularly useful when working with large models or when limited GPU memory is available.

By executing this code, we load both the model weights and the associated tokenizer, setting us up for the subsequent fine-tuning process.

Note

For educational purposes and to expedite our training process, we opt to load a quantized 4-bit version of our model. Quantization reduces the precision of the model’s weights, which can lead to faster inference times and decreased memory usage without significantly impacting performance, making it ideal for learning scenarios and quick experimentation.

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2-VL-7B-Instruct-bnb-4bit",
    use_gradient_checkpointing="unsloth",
)

Running this cell prints output confirming that the 4-bit model weights and the tokenizer have been downloaded and loaded.

In the provided code snippet, we configure a model for Parameter-Efficient Fine-Tuning (PEFT) using the Low-Rank Adaptation (LoRA) technique. LoRA is a resource-efficient method for adapting large pre-trained models to new tasks. Vision-language models are typically pre-trained on large datasets, learning representations that transfer well to various downstream tasks. However, fine-tuning all parameters in these large models is computationally expensive and may lead to overfitting, especially with limited domain-specific data.

LoRA addresses this by adding low-rank matrices that approximate updates to the model's original weight matrices, capturing the new task's requirements with minimal additional parameters. You can read more about it here.

model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,  # False if not finetuning vision layers
    finetune_language_layers=True,  # False if not finetuning language layers
    finetune_attention_modules=True,  # False if not finetuning attention layers
    finetune_mlp_modules=True,  # False if not finetuning MLP layers
    r=LORA_RANK,  # The larger, the higher the accuracy, but might overfit
    lora_alpha=LORA_ALPHA,  # Recommended alpha == r at least
    lora_dropout=LORA_DROPOUT,
    bias="none",
    random_state=3407,
    use_rslora=False,  # We support rank stabilized LoRA
    loftq_config=None,  # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)

Understanding the Parameters

Let’s break down each of the parameters in the code snippet provided for the FastVisionModel.get_peft_model method, which is used to configure the model for PEFT using LoRA:

  • finetune_vision_layers=True: Enables the vision layers of the model to be fine-tuned, allowing them to adapt to new visual data that may differ significantly from the data seen during pre-training. This is especially beneficial for tasks involving domain-specific imagery.
  • finetune_language_layers=True: Updates the language-processing layers, helping the model better understand and generate responses for linguistic nuances in the new task. This is crucial for fine-tuning the model’s textual output.
  • finetune_attention_modules=True: Fine-tunes the attention modules, which play a key role in understanding relationships between input elements. By refining these modules, the model can better identify task-relevant features and dependencies.
  • finetune_mlp_modules=True: Adapts the multi-layer perceptron (MLP) components of the model. These layers process outputs from attention modules, and their fine-tuning ensures better alignment with the specific requirements of the new task.
  • r=LORA_RANK: Sets the rank of the low-rank matrices introduced by LoRA, which determines the number of trainable parameters. Higher values can enhance accuracy but risk overfitting, making this a key parameter for balancing performance (see the small numerical sketch after this list).
  • lora_alpha=LORA_ALPHA: Determines the scaling factor for the LoRA weights, controlling how much they influence the model's behavior. Larger values lead to more significant deviations from the pre-trained model.
  • lora_dropout=LORA_DROPOUT: Applies dropout regularization to LoRA layers, reducing overfitting risks during fine-tuning and improving model generalization.
  • bias="none": Indicates that biases in the LoRA layers are not adjusted during fine-tuning, simplifying the training process.
  • random_state=3407: Ensures reproducibility by fixing the random seed for consistent results.
  • use_rslora=False: Disables Rank Stabilized LoRA (RS-LoRA), favoring standard LoRA for simplicity.
  • loftq_config=None: Skips LoftQ as the model already uses a 4-bit quantized Qwen setup.
  • target_modules="all-linear" (commented out above): When specified, applies LoRA to all linear layers; you can also pass an explicit list of module names for finer control.
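To build intuition for how r and lora_alpha interact, here is a tiny, self-contained numerical sketch of the LoRA update. The shapes and values are purely illustrative and do not correspond to the actual Qwen2 layers:

import torch

d, k, r = 1024, 1024, 16         # illustrative weight shape and LoRA rank
alpha = 16                       # LoRA scaling factor (here alpha == r, as recommended above)

W = torch.randn(d, k)            # frozen pre-trained weight, never updated
A = torch.randn(r, k) * 0.01     # trainable low-rank factor
B = torch.zeros(d, r)            # trainable factor, zero-initialised so training starts exactly at W

delta_W = (alpha / r) * (B @ A)  # low-rank update, scaled by alpha / r
W_adapted = W + delta_W          # effective weight the adapted model uses

# Only A and B are trained: far fewer parameters than the full d x k matrix
trainable = A.numel() + B.numel()
print(W_adapted.shape, f"trainable fraction = {trainable / W.numel():.4f}")

With r = 16 and alpha = 16, the scaling factor alpha / r is 1, so the adapter update is applied at full strength; increasing alpha relative to r amplifies the adapter's influence on the frozen weights.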

Step4: Loading the Dataset

This step involves loading the MEDPIX-ShortQA dataset using the load_dataset function, which retrieves the training, testing, and validation sets for model training and evaluation.

The MEDPIX-ShortQA dataset consists of radiology images paired with short questions and answers. It is designed to train models for medical image diagnosis. The dataset includes image IDs, case IDs, and metadata, along with image width in pixels. It is structured to help develop AI models that interpret radiological images and answer related medical questions. This supports radiologists and healthcare professionals in their work.

train_dataset = load_dataset("adishourya/MEDPIX-ShortQA", split=TRAIN_SET)
test_dataset = load_dataset("adishourya/MEDPIX-ShortQA", split=TEST_SET)
val_dataset = load_dataset("adishourya/MEDPIX-ShortQA", split=VAL_SET)

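To preview the data, you can print the split objects and inspect a sample. A quick sketch; the question and answer field names below match how they are used in the next step:

print(train_dataset)        # split size and column names
sample = train_dataset[0]
print(sample["question"])   # a short radiology question
print(sample["answer"])     # the reference answer / diagnosis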

Step5: Define chat template and convert dataset

Nothing fancy here! In this step, we define a function convert_to_conversation that transforms our MEDPIX-ShortQA dataset samples into a conversation format. This format is more suitable for training conversational AI models. Each sample is converted into a structured dialogue with a “user” asking a question accompanied by an “image” of a radiology scan, and the “assistant” providing the medical diagnosis as an answer. 

Next, by iterating over the training, testing, and validation datasets, we transform each sample into a structured conversation:

def convert_to_conversation(sample):
    conversation = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": sample["question"]},
                {"type": "image", "image": sample["image_id"]},
            ],
        },
        {"role": "assistant", "content": [{"type": "text", "text": sample["answer"]}]},
    ]
    return {"messages": conversation}
train_set = [convert_to_conversation(sample) for sample in train_dataset]
test_set = [convert_to_conversation(sample) for sample in test_dataset]
val_set = [convert_to_conversation(sample) for sample in val_dataset]

Let's take a look for better understanding! Run the cell below and you will see the first sample as a dictionary with a "messages" key holding the user turn (the question text plus the radiology image) and the assistant turn (the answer).

train_set[0]  # inspect the first converted sample

Step6: Running Zero-shot Inference on Few Samples

In this step, we focus on evaluating our Qwen2 VL model in a zero-shot setting, which means we test the model's pretrained weights without any additional training or fine-tuning. To do this, we define the function run_test_set, which performs inference on a given dataset: it iterates over the samples one at a time and uses the pre-trained model and tokenizer to generate a response to each question.

def run_test_set(dataset):
    FastVisionModel.for_inference(model)
    ground_truths, responses = [], []

    for sample in tqdm(
        dataset,
        desc="Running inference on test set",
        bar_format="{l_bar}{bar:10}{r_bar}{bar:-10b}",
    ):
        image = sample["messages"][0]["content"][1]["image"]
        question = sample["messages"][0]["content"][0]["text"]
        answer = sample["messages"][1]["content"][0]["text"]

        messages = [
            {
                "role": "user",
                "content": [{"type": "image"}, {"type": "text", "text": question}],
            }
        ]
        input_text = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
        )
        inputs = tokenizer(
            image,
            input_text,
            add_special_tokens=False,
            return_tensors="pt",
        ).to("cuda")
        with torch.no_grad():
            generated_ids = model.generate(
                **inputs, max_new_tokens=128, use_cache=True, temperature=0.5, min_p=0.1
            )
        generated_ids_trimmed = [
            out_ids[len(in_ids) :]
            for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        response = tokenizer.batch_decode(
            generated_ids_trimmed,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False,
        )[0]
        responses.append(response)
        ground_truths.append(answer)
        torch.cuda.empty_cache()
    return ground_truths, responses

Now, let's run the inference using the cell below!

ground_truths, responses = run_test_set(test_set)

Step7: Evaluating Results on Test Set in Zero Shot Setting

In this step, we evaluate the performance of our Vision Language Model (VLM) on the test set in a zero-shot setting. We use BERTScore, a metric that evaluates the quality of generated text using BERT embeddings: it computes precision, recall, and an F1 score that reflect the semantic similarity between the generated text and the reference text.

from bert_score import score

P, R, F1 = score(responses, ground_truths, lang="en", verbose=True, nthreads=10)

print(
    f"""
Precision: {P.mean().cpu().numpy()}
Recall: {R.mean().cpu().numpy()}
F1 Score: {F1.mean().cpu().numpy()}
"""
)

In zero-shot mode, we rely solely on the model's pretrained weights to perform our target task: answering questions about radiology scans and other medical imagery. As we discussed earlier, VLMs are pretrained on general-purpose images of animals, vehicles, places, landscapes, and so on.

Hence, using only the pretrained weights for our targeted use case does not yield great performance, as is clear from the scores obtained by running the above cell:

  • Precision: 0.7786
  • Recall: 0.7943
  • F1-Score: 0.7863

It is important to check the zero-shot capabilities of the chosen model before starting the transfer learning phase. This establishes a baseline for how the model performs with its pretrained weights alone, against which we can measure the gains from fine-tuning on the complex, domain-specific use case.

Step8: Initiating the Training/Finetuning the VLM

In this step, we prepare to train, or rather fine-tune, the Qwen2 VL model. The code snippet below sets up the training process using the SFTTrainer from Hugging Face's TRL library together with Unsloth's vision data collator.

First, we put the model into training mode with FastVisionModel.for_training, which enables gradient computation and the dropout layers used during training but not during inference. Then we create an instance of SFTTrainer (Supervised Fine-Tuning Trainer), which is responsible for managing the training process, from data collation to model optimization and logging.

FastVisionModel.for_training(model)  # Enable for training!

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),  # Must use!
    train_dataset=train_set,
    eval_dataset=val_set,
    args=SFTConfig(
        do_train=True,
        do_eval=True,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        save_total_limit=1,
        warmup_steps=5,
        # max_steps = 30,
        num_train_epochs=2,  # Set this instead of max_steps for full training runs
        learning_rate=2e-4,
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        logging_steps=100,
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
        save_strategy="steps",
        save_steps=100,
        report_to=["wandb"],
        # For vision finetuning:
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        dataset_num_proc=4,
        max_seq_length=2048,
    ),
)

As we can see in the code above, the SFTTrainer takes several parameters. Let's go through each of them:

  • model: The model you’re training. Here, it is Qwen2 7B Vision Language Model.
  • tokenizer: The tokenizer for pre-processing text data. Here we are using Qwen model’s tokenizer itself.
  • data_collator: An instance of UnslothVisionDataCollator that handles batching and preparing data for the model during training.
  • train_dataset and eval_dataset: The datasets for training and evaluation.
  • args: An instance of SFTConfig that contains various training arguments and hyperparameters.

SFTConfig Class Parameters

The SFTConfig class includes parameters such as:

  • do_train and do_eval: Flags to indicate whether training and evaluation should be performed.
  • Batch size, learning rate, and other optimization-related settings.
  • logging_steps and output_dir: Settings for logging and saving model checkpoints.
  • report_to: A list of services to which training progress should be reported (e.g., Weights & Biases).
  • Settings specific to vision fine-tuning, like max_seq_length, remove_unused_columns and dataset_kwargs.

The trainer wrapper encapsulates the training logic and can be used to start the training process by calling a method like trainer.train().

Note: Ensure that all necessary custom classes and methods (FastVisionModel, SFTTrainer, UnslothVisionDataCollator, SFTConfig) are imported from the correct libraries. After configuring and initiating the trainer, begin the training process. You can then monitor the results using the logging and reporting tools specified in your configuration.

Additionally, use the cell below to check the current GPU memory usage with PyTorch's CUDA utility functions.

# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

The output reports the GPU name, its total memory, and how much memory is currently reserved.

The code snippet below runs the training using the trainer object and stores the training statistics in trainer_stats:

trainer_stats = trainer.train()

The training run logs a table of the training loss at regular intervals. The loss gradually decreases over the course of training, which is expected and indicates that the model is learning and improving its performance over time.

Additionally, you will see Weights & Biases (wandb) log messages indicating that the checkpoint at a given step has been saved and added to an artifact for experiment tracking and versioning.
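Because checkpoints are written to the outputs directory every 100 steps (save_steps=100), an interrupted run can be resumed from the latest checkpoint. A small sketch using the standard Hugging Face Trainer argument, which SFTTrainer inherits:

# Resume training from the most recent checkpoint found in output_dir ("outputs")
trainer_stats = trainer.train(resume_from_checkpoint=True)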

Checking Final Memory and Time Stats

Use the snippet below to check the final memory and time stats (optional).

# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

The output reports the total training time in seconds and minutes, along with peak reserved memory, both overall and for LoRA training specifically, in GB and as a percentage of the GPU's total memory.

Step9: Test the Finetuned Qwen Model on Test Set

The function run_test_set is designed to evaluate a trained FastVisionModel on a given dataset.

def run_test_set(dataset):
    FastVisionModel.for_inference(model)
    ground_truths, responses = [], []

    for sample in tqdm(
        dataset,
        desc="Running inference on test set",
        bar_format="{l_bar}{bar:10}{r_bar}{bar:-10b}",
    ):
        image = sample["messages"][0]["content"][1]["image"]
        question = sample["messages"][0]["content"][0]["text"]
        answer = sample["messages"][1]["content"][0]["text"]

        messages = [
            {
                "role": "user",
                "content": [{"type": "image"}, {"type": "text", "text": question}],
            }
        ]
        input_text = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
        )
        inputs = tokenizer(
            image,
            input_text,
            add_special_tokens=False,
            return_tensors="pt",
        ).to("cuda")

        generated_ids = model.generate(
            **inputs, max_new_tokens=128, use_cache=True, temperature=0.5, min_p=0.1
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids) :]
            for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        response = tokenizer.batch_decode(
            generated_ids_trimmed,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False,
        )[0]
        responses.append(response)
        ground_truths.append(answer)
    return ground_truths, responses

The snippet above involves the following steps:

  • Prepare the model for inference by calling FastVisionModel.for_inference(model).
  • Initialize two empty lists: ground_truths to store the correct answers and responses to store the model’s generated responses.
  • Iterate over each sample in the dataset using a progress bar (tqdm) to provide feedback on the inference process.
  • For each sample, extract the image, the question text, and the ground truth answer text.
  • Construct the input messages in the format expected by the model, combining the image and the question text.
  • Apply the tokenizer to these messages using a chat template with the addition of a generation prompt, if required.
  • Tokenize the combined image and text input and move the tensor to the GPU for inference (to(“cuda”)).
  • Generate a response from the model using the generate method with specified parameters. This ensures that only new tokens are considered in the generated response by trimming the input tokens.
  • Decode the generated token IDs back into text, ignoring special tokens, and append the result to the responses list.
  • Also, append the ground truth answer to the ground_truths list.

Finally, the function returns two lists: ground_truths, containing the correct answers from the dataset, and responses, containing the model’s generated responses. These can be used to evaluate the model’s performance on the test set by comparing the generated responses to the ground truths.

Use the snippet below to run inference on the test set!

ground_truths, responses = run_test_set(test_set)
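Before computing aggregate metrics, it can be useful to eyeball a few generated answers against their ground truths. A quick inspection sketch:

# Print the first three model answers next to their reference answers
for gt, resp in list(zip(ground_truths, responses))[:3]:
    print("Ground truth :", gt)
    print("Model answer :", resp)
    print("-" * 60)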

Great job on coming this far! It’s time to print the metrics now and check how the model is performing!

Step10: Observations and Results on Finetuned Qwen2 VLM (Evaluation)

This step involves evaluating the quality of generated responses by the fine-tuned Qwen2 Vision Language Model (VLM) using BERTScore. BERTScore leverages the contextual embeddings from pre-trained BERT models to calculate the similarity between two pieces of text.

Let's first use the fine-tuned model to generate a response for an image and question pair from the test set. For one such sample, a scan showing a black mass in the left part of the brain, the model was able to identify and describe the mass in its response!

Now let's use BERTScore, just like last time, to print the metrics!

from bert_score import score

P, R, F1 = score(responses, ground_truths, lang="en", verbose=True, nthreads=10)
print(
    f"""
Precision: {P.mean().cpu().numpy()}
Recall: {R.mean().cpu().numpy()}
F1 Score: {F1.mean().cpu().numpy()}
"""
)


The fine-tuned model performs significantly better than the earlier zero-shot predictions, which had scores of around 78%. Precision and recall have now improved to approximately 87%. This demonstrates how fine-tuning VLMs on targeted datasets enhances their performance. It makes the model more reliable and effective in solving real-world challenges, such as those in healthcare, as shown in this article.
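If you want to reuse the fine-tuned model later, you can also persist the LoRA adapters and tokenizer. A minimal sketch: the directory name is arbitrary, and because get_peft_model wraps the base model with PEFT adapters, save_pretrained is assumed to store only the lightweight adapter weights rather than the full 7B model. Check the Unsloth documentation for the exact reload pattern for vision models.

# Save the trained LoRA adapters and the tokenizer for later reuse
model.save_pretrained("qwen2-vl-radiology-lora")
tokenizer.save_pretrained("qwen2-vl-radiology-lora")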

Conclusion

In conclusion, fine-tuning Vision Language Models (VLMs) like Qwen2 is a major advancement in AI, especially for processing multimodal data. The high precision, recall, and F1 scores show the model’s ability to generate responses closely aligned with human-generated ground truths, demonstrating the effectiveness of fine-tuning.

Fine-tuning allows models to go beyond their initial pre-training, enabling adaptation to the specific nuances and complexities of new domains. This adaptability is vital for industries like life sciences, finance, retail, and manufacturing, where documents often contain a mix of text and visual information that must be interpreted together to derive accurate and meaningful insights.

For more discussion, ideas, improvements, or suggestions on this topic, please connect with me on LinkedIn, and feel free to visit my GitHub repo to access the entire code used in this article!

Thank You and Happy Learning! 🙂

Key Takeaways

  • Qwen2 VLM’s fine-tuning shows strong semantic understanding, reflected in high BERTScore metrics.
  • Fine-tuning enables Qwen2 VLM to adapt effectively to domain-specific datasets across industries.
  • Fine-tuning boosts model accuracy beyond the zero-shot baseline for specialized tasks.
  • Fine-tuning validates transfer learning’s efficiency, reducing costs and time for custom models.
  • The fine-tuning approach is scalable, ensuring consistent model improvements across industries.
  • Fine-tuned VLMs excel in analyzing text and visuals for insights across multimodal datasets.

Frequently Asked Questions

Q1. What is fine-tuning in the context of VLMs?

A. Fine-tuning involves adapting a pre-trained VLM to a specific dataset or task, improving its performance on domain-specific challenges by training on relevant data.

Q2. What types of tasks can VLMs handle?

A. VLMs can perform tasks such as image recognition, visual question answering, document understanding, and captioning, all of which require the integration of text and images.

Q3. How does fine-tuning benefit VLMs?

A. Fine-tuning allows the model to better understand domain-specific nuances in both images and text, enhancing its ability to provide accurate and contextually relevant responses.

Q4. Why are VLMs important for domain-specific tasks?

A. They are crucial for industries like healthcare, finance, and manufacturing, as they can process both images and text, enabling more accurate and insightful results for domain-specific use cases.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

An ace multi-skilled programmer whose major area of work and interest lies in Software Development, Data Science, and Machine Learning. A proactive and detail-oriented individual who loves data storytelling, and is curious and passionate to solve complex value-oriented business problems with Data Science and Machine Learning to deliver robust machine learning pipelines that ensure maximum impact.

In my free time, I focus on creating Data Science and AI/ML content, providing 1:1 mentorships, career guidance and interview preparation tips, with a sole focus on teaching complex topics the easier way, to help people make a successful career transition to Data Science with the right skillset!
