Boost Model Evaluation with Custom Metrics in LLaMA-Factory

Rushikesh Chavan | Last Updated: 05 Nov, 2024

In this guide, I’ll walk you through the process of adding a custom evaluation metric to LLaMA-Factory. LLaMA-Factory is a versatile tool that enables users to fine-tune large language models (LLMs) with ease, thanks to its user-friendly WebUI and comprehensive set of scripts for training, deploying, and evaluating models. A key feature of LLaMA-Factory is LLaMA Board, an integrated dashboard that also displays evaluation metrics, providing valuable insights into model performance. While standard metrics are available by default, the ability to add custom metrics allows us to evaluate models in ways that are directly relevant to our specific use cases.

We’ll also cover the steps to create, integrate, and visualize a custom metric on LLaMA Board. By following this guide, you’ll be able to monitor additional metrics tailored to your needs, whether you’re interested in domain-specific accuracy, nuanced error types, or user-centered evaluations. This customization empowers you to assess model performance more effectively, ensuring it aligns with your application’s unique goals. Let’s dive in!

Learning Outcomes

  • Understand how to define and integrate a custom evaluation metric in LLaMA-Factory.
  • Gain practical skills in modifying metric.py to include custom metrics.
  • Learn to visualize custom metrics on LLaMA Board for enhanced model insights.
  • Acquire knowledge on tailoring model evaluations to align with specific project needs.
  • Explore ways to monitor domain-specific model performance using personalized metrics.

This article was published as a part of the Data Science Blogathon.

What is LLaMA-Factory?

LLaMA-Factory, developed by hiyouga, is an open-source project enabling users to fine-tune language models through a user-friendly WebUI interface. It offers a full suite of tools and scripts for fine-tuning, building chatbots, serving, and benchmarking LLMs.

Designed with beginners and non-technical users in mind, LLaMA-Factory simplifies the process of fine-tuning open-source LLMs on custom datasets, eliminating the need to grasp complex AI concepts. Users can simply select a model, upload their dataset, and adjust a few settings to start the training.

Upon completion, the web application also allows for testing the model, providing a quick and efficient way to fine-tune LLMs on a local machine.

While standard metrics provide valuable insights into a fine-tuned model’s general performance, customized metrics offer a way to directly evaluate a model’s effectiveness in your specific use case. By tailoring metrics, you can better gauge how well the model meets unique requirements that generic metrics might overlook. Custom metrics are invaluable because they offer the flexibility to create and track measures specifically aligned with practical needs, enabling continuous improvement based on relevant, measurable criteria. This approach allows for a targeted focus on domain-specific accuracy, weighted importance, and user experience alignment.

Getting Started with LLaMA-Factory

For this example, we’ll use a Python environment. Ensure you have Python 3.8 or higher and the necessary dependencies installed as per the repository requirements.

Installation

We will first install all the requirements.

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
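Before moving on, it is worth confirming that the CLI installed correctly. Assuming your checkout exposes the version subcommand, the following should print the installed LLaMA-Factory version:

llamafactory-cli version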

Fine-Tuning with LLaMA Board GUI (powered by Gradio)

llamafactory-cli webui

Note: You can find the official setup guide in more detail in the LLaMA-Factory repository on GitHub.

Understanding Evaluation Metrics in LLaMA-Factory

LLaMA-Factory ships with default evaluation metrics such as BLEU and ROUGE scores, which are essential for assessing model performance. This section explains what they measure and why you might want to customize metrics beyond them.

BLEU score

BLEU (Bilingual Evaluation Understudy) score is a metric used to evaluate the quality of text generated by machine translation models by comparing it to a reference (or human-translated) text. The BLEU score primarily assesses how similar the generated translation is to one or more reference translations.
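For intuition, here is a minimal sketch of a sentence-level BLEU calculation using NLTK. This is purely illustrative and separate from how LLaMA-Factory computes the score internally; the example sentences are made up.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

# BLEU-4 (default weights) with smoothing so short sentences do not collapse to zero
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method3)
print(f"BLEU-4: {score:.3f}")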

ROUGE score

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score is a set of metrics used to evaluate the quality of text summaries by comparing them to reference summaries. It is widely used for summarization tasks, and it measures the overlap of words and phrases between the generated and reference texts.
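Similarly, a quick way to experiment with ROUGE outside of LLaMA-Factory is the rouge-score package. Using it here is an assumption for illustration only; it is not required by this guide.

from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score("the cat sat on the mat",        # reference summary
                      "the cat is sitting on the mat") # generated summary
print(scores["rouge1"].fmeasure, scores["rougeL"].fmeasure)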

These metrics are available by default, but you can also add customized metrics tailored to your specific use case.

Prerequisites for Adding a Custom Metric

This guide assumes that LLaMA-Factory is already set up on your machine. If not, please refer to the LLaMA-Factory documentation for installation and setup.

In the example that follows, the metric function simply returns a random value between 0 and 1 to simulate an accuracy score. You can replace this with your own evaluation logic to calculate and return an accuracy value (or any other metric) based on your specific requirements. This flexibility lets you define custom evaluation criteria that better reflect your use case.

Defining Your Custom Metric

To begin, let’s create a Python file called custom_metric.py inside src/llamafactory/train/sft/ (alongside metric.py, so the relative import used later resolves) and define our custom metric function within it.

In this example, our custom metric is called x_score. This metric will take preds (predicted values) and labels (ground truth values) as inputs and return a score based on your custom logic.

import random

def cal_x_score(preds, labels):
    """
    Calculate a custom metric score.

    Parameters:
    preds -- list of predicted values
    labels -- list of ground truth values

    Returns:
    score -- a random value or a custom calculation as per your requirement
    """
    # Custom metric calculation logic goes here
    
    # Example: return a random score between 0 and 1
    return random.uniform(0, 1)

You may replace the random score with your specific calculation logic.
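For example, a minimal sketch of a non-random x_score could compute exact-match accuracy. This assumes preds and labels arrive as lists of decoded strings; in the real pipeline you may need to decode token IDs first.

def cal_x_score(preds, labels):
    """Fraction of predictions that exactly match their reference after stripping whitespace."""
    if not labels:
        return 0.0
    matches = sum(1 for p, l in zip(preds, labels) if p.strip() == l.strip())
    return matches / len(labels)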

Modifying sft/metric.py to Integrate the Custom Metric

To ensure that LLaMA Board recognizes our new metric, we’ll need to integrate it into the metric computation pipeline within src/llamafactory/train/sft/metric.py.

Add Your Metric to the Score Dictionary:

  • Locate the ComputeSimilarity class within sft/metric.py.
  • Update self.score_dict to include your new metric as follows:
self.score_dict = {
    "rouge-1": [],
    "rouge-2": [],
    "bleu-4": [],
    "x_score": []  # Add your custom metric here
}

Calculate and Append the Custom Metric in the __call__ Method: 

  • Within the __call__ method, compute your custom metric and add it to the score_dict. Here’s an example of how to do that:
from .custom_metric import cal_x_score  # place this import at the top of sft/metric.py

def __call__(self, preds, labels):
    # Calculate the custom metric score
    custom_score = cal_x_score(preds, labels)
    # Append the score (scaled to a percentage) under "x_score" in the score dictionary
    self.score_dict["x_score"].append(custom_score * 100)

This integration step is essential for the custom metric to appear on LLaMA Board.
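To see how the pieces fit together, here is a simplified, schematic sketch of a ComputeSimilarity-style class with the custom metric wired in. The actual class in sft/metric.py also handles tokenizer decoding and the per-sample ROUGE/BLEU computation, so treat this as an outline rather than a drop-in replacement.

import numpy as np
from .custom_metric import cal_x_score

class ComputeSimilarity:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.score_dict = {"rouge-1": [], "rouge-2": [], "bleu-4": [], "x_score": []}

    def __call__(self, preds, labels):
        # ... existing logic: decode preds/labels with the tokenizer and
        # append per-sample ROUGE and BLEU scores to self.score_dict ...

        # Custom metric: scale to a percentage so it is comparable with the defaults
        self.score_dict["x_score"].append(cal_x_score(preds, labels) * 100)

        # Average each metric over the evaluation set before reporting
        return {k: float(np.mean(v)) for k, v in self.score_dict.items() if v}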

LLaMA Board Evaluate tab: final result with the custom metric

The metric now appears on LLaMA Board as predict_x_score (evaluation-run metrics are reported with a predict_ prefix), showing a score of 93.75% for this model and validation dataset. This integration provides a straightforward way to assess each fine-tuned model directly within the evaluation pipeline.

Conclusion

After setting up your custom metric, you should see it on LLaMA Board once the evaluation pipeline runs. The custom metric’s score will update with each evaluation.
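If you run evaluations from the command line instead of the WebUI, the custom metric is exercised by the same predict-style run. The config name and keys below are hypothetical placeholders, not from this article; substitute your own file and values.

# predict_config.yaml (hypothetical example; adapt to your model and dataset)
#   stage: sft
#   do_predict: true
#   predict_with_generate: true
#   eval_dataset: <your validation dataset>

llamafactory-cli train predict_config.yaml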

With these steps, you’ve successfully integrated a custom evaluation metric into LLaMA-Factory! This process gives you the flexibility to go beyond default metrics, tailoring model evaluations to meet the unique needs of your project. By defining and implementing metrics specific to your use case, you gain more meaningful insights into model performance, highlighting strengths and areas for improvement in ways that matter most to your goals.

Adding custom metrics also enables a continuous improvement loop. As you fine-tune and train models on new data or modify parameters, these personalized metrics offer a consistent way to assess progress. Whether your focus is on domain-specific accuracy, user experience alignment, or nuanced scoring methods, LLaMA Board provides a visual and quantitative way to compare and track these outcomes over time.

By enhancing model evaluation with customized metrics, LLaMA-Factory allows you to make data-driven decisions, refine models with precision, and better align the results with real-world applications. This customization capability empowers you to create models that perform effectively, optimize toward relevant goals, and provide added value in practical deployments.

Key Takeaways

  • Custom metrics in LLaMA-Factory enhance model evaluations by aligning them with unique project needs.
  • LLaMA Board allows for easy visualization of custom metrics, providing deeper insights into model performance.
  • Modifying metric.py enables seamless integration of custom evaluation criteria.
  • Personalized metrics support continuous improvement, adapting evaluations to evolving model goals.
  • Tailoring metrics empowers data-driven decisions, optimizing models for real-world applications.

Frequently Asked Questions

Q1. What is LLaMA-Factory?

A. LLaMA-Factory is an open-source tool for fine-tuning large language models through a user-friendly WebUI, with features for training, deploying, and evaluating models.

Q2. Why add a custom evaluation metric?

A. Custom metrics allow you to assess model performance based on criteria specific to your use case, providing insights that standard metrics may not capture.

Q3. How do I create a custom metric?

A. Define your metric in a Python file, specifying the logic for how it should calculate performance based on your data.

Q4. Where do I integrate the custom metric in LLaMA-Factory?

A. Add your metric to the sft/metric.py file and update the score dictionary and computation pipeline to include it.

Q5. Will my custom metric appear on LLaMA Board?

A. Yes, once you integrate your custom metric, LLaMA Board displays it, allowing you to visualize its results alongside other metrics.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Hello everyone! I'm Rushikesh, a Machine Learning Engineer at BNY Mellon with a strong interest in data science. I graduated from PICT with a degree in Computer Science, and since then, I've worked on cutting-edge backend technologies and various data science projects, including both traditional ML and GenAI initiatives.
