Boost Model Evaluation with Custom Metrics in LLaMA-Factory

rushi_02 Last Updated : 05 Nov, 2024
6 min read

In this guide, I’ll walk you through the process of adding a custom evaluation metric to LLaMA-Factory. LLaMA-Factory is a versatile tool that enables users to fine-tune large language models (LLMs) with ease, thanks to its user-friendly WebUI and comprehensive set of scripts for training, deploying, and evaluating models. A key feature of LLaMA-Factory is LLaMA Board, an integrated dashboard that also displays evaluation metrics, providing valuable insights into model performance. While standard metrics are available by default, the ability to add custom metrics allows us to evaluate models in ways that are directly relevant to our specific use cases.

We’ll also cover the steps to create, integrate, and visualize a custom metric on LLaMA Board. By following this guide, you’ll be able to monitor additional metrics tailored to your needs, whether you’re interested in domain-specific accuracy, nuanced error types, or user-centered evaluations. This customization empowers you to assess model performance more effectively, ensuring it aligns with your application’s unique goals. Let’s dive in!

Learning Outcomes

  • Understand how to define and integrate a custom evaluation metric in LLaMA-Factory.
  • Gain practical skills in modifying metric.py to include custom metrics.
  • Learn to visualize custom metrics on LLaMA Board for enhanced model insights.
  • Acquire knowledge on tailoring model evaluations to align with specific project needs.
  • Explore ways to monitor domain-specific model performance using personalized metrics.

This article was published as a part of the Data Science Blogathon.

What is LLaMA-Factory?

LLaMA-Factory, developed by hiyouga, is an open-source project enabling users to fine-tune language models through a user-friendly WebUI interface. It offers a full suite of tools and scripts for fine-tuning, building chatbots, serving, and benchmarking LLMs.

Designed with beginners and non-technical users in mind, LLaMA-Factory simplifies the process of fine-tuning open-source LLMs on custom datasets, eliminating the need to grasp complex AI concepts. Users can simply select a model, upload their dataset, and adjust a few settings to start the training.

Once training is complete, the web application also lets you test the model, making it a quick and efficient way to fine-tune and evaluate LLMs on a local machine.

While standard metrics provide valuable insight into a fine-tuned model’s general performance, custom metrics let you evaluate the model directly against your specific use case. By tailoring metrics, you can better gauge how well the model meets requirements that generic metrics might overlook. Custom metrics are invaluable because they offer the flexibility to define and track measures aligned with practical needs, enabling continuous improvement based on relevant, measurable criteria. This allows for a targeted focus on domain-specific accuracy, the error types that matter most to you, and alignment with the user experience.

Getting Started with LLaMA-Factory

For this example, we’ll use a Python environment. Ensure you have Python 3.8 or higher and the necessary dependencies installed as per the repository requirements.
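If you want to confirm that your interpreter meets this requirement before installing, a quick check (a trivial, optional sketch; nothing LLaMA-Factory-specific):

import sys

# Fail fast if the active interpreter is older than Python 3.8
assert sys.version_info >= (3, 8), f"Python 3.8+ required, found {sys.version.split()[0]}"
print("Python version OK:", sys.version.split()[0])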

Installation

We will first install all the requirements.

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics]"

Fine-Tuning with LLaMA Board GUI (powered by Gradio)

llamafactory-cli webui

Note: You can find the official setup guide in more detail in the LLaMA-Factory repository on GitHub.

Understanding Evaluation Metrics in LLaMA-Factory

Learn about the default evaluation metrics provided by LLaMA-Factory, such as BLEU and ROUGE scores, and why they are essential for assessing model performance. This section also introduces the value of customizing metrics.

BLEU score

BLEU (Bilingual Evaluation Understudy) score is a metric used to evaluate the quality of text generated by machine translation models by comparing it to a reference (or human-translated) text. The BLEU score primarily assesses how similar the generated translation is to one or more reference translations.
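To make this concrete, here is a minimal sketch of a smoothed, sentence-level BLEU-4 computation with NLTK, which is close in spirit to what LLaMA-Factory’s default metric code does. The whitespace tokenization is only for illustration; the real pipeline tokenizes with the model’s tokenizer.

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()

# BLEU-4: equal weights over 1- to 4-gram precision, smoothed for short texts
score = sentence_bleu(
    [reference],
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method3,
)
print(f"BLEU-4: {score:.4f}")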

ROUGE score

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score is a set of metrics used to evaluate the quality of text summaries by comparing them to reference summaries. It is widely used for summarization tasks, and it measures the overlap of words and phrases between the generated and reference texts.
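As a quick illustration, the rouge package (pip install rouge) reports the same rouge-1, rouge-2, and rouge-l scores that appear on LLaMA Board; LLaMA-Factory’s own metric code uses a library with an analogous API, so the numbers follow the same structure.

from rouge import Rouge

hypothesis = "the quick brown fox jumped over the lazy dog"
reference = "a quick brown fox jumps over the lazy dog"

scores = Rouge().get_scores(hypothesis, reference)[0]
# Each entry reports recall (r), precision (p), and F1 (f)
for key in ("rouge-1", "rouge-2", "rouge-l"):
    print(key, round(scores[key]["f"], 4))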

These metrics are available by default, but you can also add customized metrics tailored to your specific use case.

Prerequisites for Adding a Custom Metric

This guide assumes that LLaMA-Factory is already set up on your machine. If not, please refer to the LLaMA-Factory documentation for installation and setup.

In the example that follows, the metric function simply returns a random value between 0 and 1 to simulate an accuracy score. However, you can replace this with your own evaluation logic to calculate and return an accuracy value (or any other metric) based on your specific requirements. This flexibility allows you to define custom evaluation criteria that better reflect your use case.

Defining Your Custom Metric

To begin, let’s create a Python file called custom_metric.py inside src/llamafactory/train/sft/ (alongside metric.py, so the relative import used later resolves) and define our custom metric function within it.

In this example, our custom metric is called x_score. This metric will take preds (predicted values) and labels (ground truth values) as inputs and return a score based on your custom logic.

import random

def cal_x_score(preds, labels):
    """
    Calculate a custom metric score.

    Parameters:
    preds -- list of predicted values
    labels -- list of ground truth values

    Returns:
    score -- a random value or a custom calculation as per your requirement
    """
    # Custom metric calculation logic goes here
    
    # Example: return a random score between 0 and 1
    return random.uniform(0, 1)

You may replace the random score with your specific calculation logic.
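For instance, if your use case calls for exact-match accuracy on decoded outputs, a replacement might look like the sketch below. It assumes preds and labels are equally long lists of already-decoded strings; adapt the comparison to whatever correctness criterion your task needs.

def cal_x_score(preds, labels):
    """
    Example replacement: exact-match accuracy over decoded predictions.

    Assumes `preds` and `labels` are lists of strings of equal length.
    """
    if not preds:
        return 0.0
    matches = sum(
        1 for pred, label in zip(preds, labels)
        if pred.strip().lower() == label.strip().lower()
    )
    return matches / len(preds)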

Modifying sft/metric.py to Integrate the Custom Metric

To ensure that LLaMA Board recognizes our new metric, we’ll need to integrate it into the metric computation pipeline in src/llamafactory/train/sft/metric.py.

Add Your Metric to the Score Dictionary:

  • Locate the ComputeSimilarity class in sft/metric.py.
  • Update self.score_dict to include the new metric as follows:
self.score_dict = {
    "rouge-1": [],
    "rouge-2": [],
    "bleu-4": [],
    "x_score": []  # Add your custom metric here
}

Calculate and Append the Custom Metric in the __call__ Method: 

  • Within the __call__ method, compute your custom metric and add it to the score_dict. Here’s an example of how to do that:
from .custom_metric import cal_x_score

def __call__(self, preds, labels):
    # ... existing BLEU/ROUGE computation remains unchanged ...

    # Compute the custom metric and append it to the score dictionary
    custom_score = cal_x_score(preds, labels)
    self.score_dict["x_score"].append(custom_score * 100)  # store as a percentage

This integration step is essential for the custom metric to appear on LLaMA Board.
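For orientation, here is a hedged sketch of how these pieces typically sit inside the ComputeSimilarity class. The exact body of __call__ differs between LLaMA-Factory versions (it usually decodes token IDs with the tokenizer, computes BLEU and ROUGE per sample, and finally returns the averaged score_dict), so treat this as an outline to adapt rather than a drop-in replacement.

import numpy as np

from .custom_metric import cal_x_score

class ComputeSimilarity:  # simplified outline; the real class also holds a tokenizer
    def __call__(self, preds, labels):
        self.score_dict = {"rouge-1": [], "rouge-2": [], "bleu-4": [], "x_score": []}

        # ... existing per-sample loop: decode preds/labels with self.tokenizer,
        # compute ROUGE and BLEU, and append them to self.score_dict ...

        # Append the custom metric, scaled to a percentage like the built-ins
        self.score_dict["x_score"].append(cal_x_score(preds, labels) * 100)

        # LLaMA Board reports the mean of each list (skip lists left empty
        # because the existing loop is elided in this outline)
        return {k: float(np.mean(v)) for k, v in self.score_dict.items() if v}

Because each score is appended during evaluation and averaged at the end, the custom metric is reported on the same 0–100 scale as the built-in BLEU and ROUGE values.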

(Figure: LLaMA Board Evaluate tab showing the final result with the custom metric.)

On LLaMA Board, the custom metric now appears as predict_x_score, showing a score of 93.75% for this model and validation dataset. This integration provides a straightforward way to assess each fine-tuned model directly within the evaluation pipeline.

Conclusion

After setting up your custom metric, you should see it in LLaMA Board after running the evaluation pipeline. The extra metric scores will update for each evaluation.

With these steps, you’ve successfully integrated a custom evaluation metric into LLaMA-Factory! This process gives you the flexibility to go beyond default metrics, tailoring model evaluations to meet the unique needs of your project. By defining and implementing metrics specific to your use case, you gain more meaningful insights into model performance, highlighting strengths and areas for improvement in ways that matter most to your goals.

Adding custom metrics also enables a continuous improvement loop. As you fine-tune and train models on new data or modify parameters, these personalized metrics offer a consistent way to assess progress. Whether your focus is on domain-specific accuracy, user experience alignment, or nuanced scoring methods, LLaMA Board provides a visual and quantitative way to compare and track these outcomes over time.

By enhancing model evaluation with customized metrics, LLaMA-Factory allows you to make data-driven decisions, refine models with precision, and better align the results with real-world applications. This customization capability empowers you to create models that perform effectively, optimize toward relevant goals, and provide added value in practical deployments.

Key Takeaways

  • Custom metrics in LLaMA-Factory enhance model evaluations by aligning them with unique project needs.
  • LLaMA Board allows for easy visualization of custom metrics, providing deeper insights into model performance.
  • Modifying metric.py enables seamless integration of custom evaluation criteria.
  • Personalized metrics support continuous improvement, adapting evaluations to evolving model goals.
  • Tailoring metrics empowers data-driven decisions, optimizing models for real-world applications.

Frequently Asked Questions

Q1. What is LLaMA-Factory?

A. LLaMA-Factory is an open-source tool for fine-tuning large language models through a user-friendly WebUI, with features for training, deploying, and evaluating models.

Q2. Why add a custom evaluation metric?

A. Custom metrics allow you to assess model performance based on criteria specific to your use case, providing insights that standard metrics may not capture.

Q3. How do I create a custom metric?

A. Define your metric in a Python file, specifying the logic for how it should calculate performance based on your data.

Q4. Where do I integrate the custom metric in LLaMA-Factory?

A. Add your metric to the sft/metric.py file and update the score dictionary and computation pipeline to include it.

Q5. Will my custom metric appear on LLaMA Board?

A. Yes, once you integrate your custom metric, LLaMA Board displays it, allowing you to visualize its results alongside other metrics.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
