In this guide, I’ll walk you through the process of adding a custom evaluation metric to LLaMA-Factory. LLaMA-Factory is a versatile tool that enables users to fine-tune large language models (LLMs) with ease, thanks to its user-friendly WebUI and comprehensive set of scripts for training, deploying, and evaluating models. A key feature of LLaMA-Factory is LLaMA Board, an integrated dashboard that also displays evaluation metrics, providing valuable insights into model performance. While standard metrics are available by default, the ability to add custom metrics allows us to evaluate models in ways that are directly relevant to our specific use cases.
We’ll cover the steps to create, integrate, and visualize a custom metric on LLaMA Board, modifying LLaMA-Factory’s metric.py along the way. By following this guide, you’ll be able to monitor additional metrics tailored to your needs, whether you’re interested in domain-specific accuracy, nuanced error types, or user-centered evaluations. This customization empowers you to assess model performance more effectively, ensuring it aligns with your application’s unique goals. Let’s dive in!
LLaMA-Factory, developed by hiyouga, is an open-source project enabling users to fine-tune language models through a user-friendly WebUI interface. It offers a full suite of tools and scripts for fine-tuning, building chatbots, serving, and benchmarking LLMs.
Designed with beginners and non-technical users in mind, LLaMA-Factory simplifies the process of fine-tuning open-source LLMs on custom datasets, eliminating the need to grasp complex AI concepts. Users can simply select a model, upload their dataset, and adjust a few settings to start training.
Upon completion, the web application also allows for testing the model, providing a quick and efficient way to fine-tune LLMs on a local machine.
While standard metrics provide valuable insights into a fine-tuned model’s general performance, customized metrics offer a way to directly evaluate a model’s effectiveness in your specific use case. By tailoring metrics, you can better gauge how well the model meets unique requirements that generic metrics might overlook. Custom metrics are invaluable because they offer the flexibility to create and track measures specifically aligned with practical needs, enabling continuous improvement based on relevant, measurable criteria. This approach allows for a targeted focus on domain-specific accuracy, weighted importance, and user experience alignment.
For this example, we’ll use a Python environment. Ensure you have Python 3.8 or higher and the necessary dependencies installed as per the repository requirements.
We will first install all the requirements.
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"
llamafactory-cli webui
Note: You can find the official setup guide in more detail in the LLaMA-Factory repository on GitHub.
Learn about the default evaluation metrics provided by LLaMA-Factory, such as BLEU and ROUGE scores, and why they are essential for assessing model performance. This section also introduces the value of customizing metrics.
BLEU (Bilingual Evaluation Understudy) score is a metric used to evaluate the quality of text generated by machine translation models by comparing it to a reference (or human-translated) text. The BLEU score primarily assesses how similar the generated translation is to one or more reference translations.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score is a set of metrics used to evaluate the quality of text summaries by comparing them to reference summaries. It is widely used for summarization tasks, and it measures the overlap of words and phrases between the generated and reference texts.
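To build intuition for what these scores measure, here is a minimal, stand-alone sketch of computing BLEU-4 and ROUGE for a single prediction. It assumes the nltk and rouge packages are installed (pip install nltk rouge); LLaMA-Factory computes these metrics internally during evaluation, so this snippet is purely illustrative.

# Illustrative only: LLaMA-Factory computes BLEU/ROUGE internally during evaluation.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge import Rouge

prediction = "the cat sat on the mat"
reference = "a cat was sitting on the mat"

# BLEU-4: n-gram precision of the prediction against the reference,
# smoothed so that short sentences do not collapse to zero.
bleu_4 = sentence_bleu(
    [reference.split()],
    prediction.split(),
    smoothing_function=SmoothingFunction().method3,
)

# ROUGE: recall-oriented n-gram and longest-common-subsequence overlap.
rouge_scores = Rouge().get_scores(prediction, reference)[0]

print(f"BLEU-4:  {bleu_4:.4f}")
print(f"ROUGE-1: {rouge_scores['rouge-1']['f']:.4f}")
print(f"ROUGE-L: {rouge_scores['rouge-l']['f']:.4f}")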
These metrics are available by default, but you can also add customized metrics tailored to your specific use case.
This guide assumes that LLaMA-Factory is already set up on your machine. If not, please refer to the LLaMA-Factory documentation for installation and setup.
In the example below, the function returns a random value between 0 and 1 to simulate an accuracy score. However, you can replace this with your own evaluation logic to calculate and return an accuracy value (or any other metric) based on your specific requirements. This flexibility allows you to define custom evaluation criteria that better reflect your use case.
To begin, let’s create a Python file called custom_metric.py and define our custom metric function within it.
In this example, our custom metric is called x_score. This metric will take preds (predicted values) and labels (ground truth values) as inputs and return a score based on your custom logic.
import random


def cal_x_score(preds, labels):
    """
    Calculate a custom metric score.

    Parameters:
        preds -- list of predicted values
        labels -- list of ground truth values

    Returns:
        score -- a random value or a custom calculation as per your requirement
    """
    # Custom metric calculation logic goes here
    # Example: return a random score between 0 and 1
    return random.uniform(0, 1)
You may replace the random score with your specific calculation logic.
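For instance, if exact-match accuracy is what matters in your domain, the same function could be rewritten as below. This is a hypothetical sketch that assumes preds and labels arrive as equal-length lists of decoded strings; adapt the comparison to whatever structure your pipeline actually passes in.

def cal_x_score(preds, labels):
    """Hypothetical example: case-insensitive exact-match accuracy."""
    if not preds:
        return 0.0
    matches = sum(
        1 for pred, label in zip(preds, labels)
        if pred.strip().lower() == label.strip().lower()
    )
    return matches / len(preds)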
To ensure that LLaMA Board recognizes our new metric, we’ll need to integrate it into the metric computation pipeline within src/llamafactory/train/sft/metric.py.
Add Your Metric to the Score Dictionary:
self.score_dict = {
    "rouge-1": [],
    "rouge-2": [],
    "bleu-4": [],
    "x_score": [],  # Add your custom metric here
}
Calculate and Append the Custom Metric in the __call__ Method:
from .custom_metric import cal_x_score

def __call__(self, preds, labels):
    # Calculate the custom metric score
    custom_score = cal_x_score(preds, labels)
    # Append the score (scaled to a percentage) under "x_score" in the score dictionary
    self.score_dict["x_score"].append(custom_score * 100)
This integration step is essential for the custom metric to appear on LLaMA Board.
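For orientation, here is a simplified, self-contained sketch of how the two pieces above might fit together in a ComputeMetrics-style class. It is not a copy of LLaMA-Factory’s actual metric.py (the real implementation also decodes token IDs and computes BLEU/ROUGE); the class name and return shape are assumptions used only to show where the score dictionary and the custom metric call sit.

from .custom_metric import cal_x_score  # same relative import as above


class ComputeCustomMetrics:
    """Simplified sketch only; not LLaMA-Factory's real metric class."""

    def __init__(self):
        # One list per metric; scores accumulate across evaluation batches.
        self.score_dict = {
            "rouge-1": [],
            "rouge-2": [],
            "bleu-4": [],
            "x_score": [],
        }

    def __call__(self, preds, labels):
        # ... BLEU/ROUGE computation for each prediction would happen here ...
        custom_score = cal_x_score(preds, labels)
        self.score_dict["x_score"].append(custom_score * 100)
        # Averaged scores are what ultimately get logged and shown on LLaMA Board.
        return {
            key: sum(values) / len(values)
            for key, values in self.score_dict.items()
            if values
        }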
The predict_x_score metric now appears successfully, showing an accuracy of 93.75% for this model and validation dataset. This integration provides a straightforward way for you to assess each fine-tuned model directly within the evaluation pipeline.
After setting up your custom metric, you should see it in LLaMA Board once you run the evaluation pipeline, and its scores will update with each evaluation run.
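Before launching a full evaluation run, it can save time to sanity-check the metric function on a couple of toy examples, for instance from the directory that contains custom_metric.py (the example values below are purely illustrative):

# Quick local sanity check for the custom metric on toy data.
from custom_metric import cal_x_score

preds = ["Paris", "42", "a blue whale"]
labels = ["Paris", "42", "the blue whale"]

score = cal_x_score(preds, labels)
assert 0.0 <= score <= 1.0, "the metric should return a value between 0 and 1"
print(f"x_score on toy data: {score:.4f}")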
With these steps, you’ve successfully integrated a custom evaluation metric into LLaMA-Factory! This process gives you the flexibility to go beyond default metrics, tailoring model evaluations to meet the unique needs of your project. By defining and implementing metrics specific to your use case, you gain more meaningful insights into model performance, highlighting strengths and areas for improvement in ways that matter most to your goals.
Adding custom metrics also enables a continuous improvement loop. As you fine-tune and train models on new data or modify parameters, these personalized metrics offer a consistent way to assess progress. Whether your focus is on domain-specific accuracy, user experience alignment, or nuanced scoring methods, LLaMA Board provides a visual and quantitative way to compare and track these outcomes over time.
By enhancing model evaluation with customized metrics, LLaMA-Factory allows you to make data-driven decisions, refine models with precision, and better align the results with real-world applications. This customization capability empowers you to create models that perform effectively, optimize toward relevant goals, and provide added value in practical deployments.
Modifying metric.py enables seamless integration of custom evaluation criteria.

Q. What is LLaMA-Factory?
A. LLaMA-Factory is an open-source tool for fine-tuning large language models through a user-friendly WebUI, with features for training, deploying, and evaluating models.

Q. Why add custom evaluation metrics?
A. Custom metrics allow you to assess model performance based on criteria specific to your use case, providing insights that standard metrics may not capture.

Q. How do I define a custom metric?
A. Define your metric in a Python file, specifying the logic for how it should calculate performance based on your data.

Q. How do I integrate a custom metric into LLaMA-Factory?
A. Add your metric to the sft/metric.py file and update the score dictionary and computation pipeline to include it.

Q. Will my custom metric appear on LLaMA Board?
A. Yes, once you integrate your custom metric, LLaMA Board displays it, allowing you to visualize its results alongside other metrics.