Language models are usually trained on extensive amounts of textual data, which enables them to generate natural-sounding, human-like responses. They can also perform various language-related tasks such as translation, text summarization, text generation, and question answering. Evaluating language models is crucial for validating their performance and quality, and for ensuring that they produce high-quality text. This is particularly important for applications where the generated text influences decision-making or provides information to users.
There are various ways to evaluate language models, such as human evaluation, feedback from end users, LLM-based evaluation, academic benchmarks (like GLUE and SQuAD), and standard quantitative metrics. In this article, we will take a deep dive into standard quantitative metrics such as BLEU, ROUGE, and METEOR. Quantitative metrics have long been pivotal in understanding language models and their capabilities: from precision and recall to BLEU and ROUGE scores, they offer a quantitative evaluation of model effectiveness. Let’s look at each traditional metric in turn.
BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. It evaluates how closely the machine-translated text aligns with a collection of high-quality reference translations. The BLEU score ranges from 0 to 1, with 0 indicating no overlap between the machine-translated output and the reference translations (i.e. a low-quality translation), and 1 indicating perfect overlap with the reference translations (i.e. a high-quality translation). It is an easy-to-understand and inexpensive-to-compute measure. Mathematically, the BLEU score is defined as:
BLEU = BP * exp( Σ (w_n * log p_n) ), where the sum runs over n = 1 to N
Here BP is the brevity penalty (defined below), p_n is the clipped n-gram precision, and w_n is the weight assigned to each n-gram order (typically uniform, w_n = 1/N).
The BLEU score is calculated by comparing the n-grams in the machine-translated text to those in the reference text. N-grams refer to sequences of words, where “n” indicates the number of words in the sequence.
Let’s understand the BLEU score calculation using the following example:
Candidate sentence: They cancelled the match because it was raining.
Target sentence: They cancelled the match because of bad weather.
Here, the candidate sentence is the sentence predicted by the language model, and the target sentence is the reference sentence. To compute the geometric average precision, let’s first compute the precision scores from 1-grams to 4-grams.
Precision 1-gram
Predicted sentence 1-grams: [‘They’, ‘cancelled’, ‘the’, ‘match’, ‘because’, ‘it’, ‘was’, ‘raining’]
Of these 8 unigrams, 5 (‘They’, ‘cancelled’, ‘the’, ‘match’, ‘because’) also appear in the target sentence.
Precision 1-gram = 5/8 = 0.625
Precision 2-gram
Predicted sentence 2-grams: [‘They cancelled’, ‘cancelled the’, ‘the match’, ‘match because’, ‘because it’, ‘it was’, ‘was raining’]
Precision 2-gram = 4/7 = 0.5714
Precision 3-gram
Predicted sentence 3-grams: [‘They cancelled the’, ‘cancelled the match’, ‘the match because’, ‘match because it’, ‘because it was’, ‘it was raining’]
Precision 3-gram = 3/6 = 0.5
Precision 4-gram
Predicted sentence 4-grams: [‘They cancelled the match’, ‘cancelled the match because’, ‘the match because it’, ‘match because it was’, ‘because it was raining’]
Precision 4-gram = 2/5 = 0.4
The geometric average precision, with possibly different weights for the different n-gram orders, is computed as:
Geometric Average Precision = exp( Σ (w_n * log p_n) ), for n = 1 to N
Here p_n is the precision for n-grams and w_n is the weight for each order. For N = 4 (up to 4-grams) with uniform weights w_n = 1/4:
Geometric Average Precision = (0.625 * 0.5714 * 0.5 * 0.4)^(1/4) ≈ 0.5169
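For reference, here is a minimal plain-Python sketch that reproduces the clipped n-gram precisions and the geometric average precision above. The ngrams and clipped_precision helpers are written purely for illustration and are not part of any library.

from collections import Counter
import math

candidate = "They cancelled the match because it was raining".split()
reference = "They cancelled the match because of bad weather".split()

def ngrams(tokens, n):
    # All contiguous n-grams of the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(cand, ref, n):
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    # Each candidate n-gram counts only up to the number of times it occurs in the reference
    overlap = sum(min(count, ref_counts[ng]) for ng, count in cand_counts.items())
    return overlap / sum(cand_counts.values())

precisions = [clipped_precision(candidate, reference, n) for n in range(1, 5)]
print([round(p, 4) for p in precisions])   # [0.625, 0.5714, 0.5, 0.4]

# Geometric average precision with uniform weights w_n = 1/4
gap = math.exp(sum(0.25 * math.log(p) for p in precisions))
print(round(gap, 4))                       # 0.5169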
Imagine a scenario where the language model predicts only one word, such as “cancelled,” resulting in a clipped precision of 1 (1/1). This can be misleading, as it encourages the model to predict fewer words in order to achieve a high score.
To address this issue, a brevity penalty is used, which penalizes machine translations that are too short compared to the reference sentence:
Brevity Penalty (BP) = 1 if c > r, and exp(1 − r/c) if c ≤ r
where c is the predicted length, i.e. the number of words in the predicted sentence, and r is the target length, i.e. the number of words in the target sentence.
Here, the candidate and the target both contain 8 words (c = r = 8), so Brevity Penalty = 1.
So BLEU(4) = 0.5169*1 = 0.5169
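A quick check of the brevity penalty and the final score for this example, as a sketch using the standard BLEU definitions:

import math

c, r = 8, 8   # number of words in the predicted and target sentences
# Brevity penalty: 1 if the candidate is longer than the reference, otherwise exp(1 - r/c)
brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)

gap = 0.5169  # geometric average precision computed above
print(round(brevity_penalty * gap, 4))   # 0.5169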
There are various implementations of the BLEU score in Python under different libraries. We will use the Hugging Face evaluate library, which simplifies the process of evaluating and comparing language model outputs.
Installation
!pip install evaluate
import evaluate

# Load the BLEU metric
bleu = evaluate.load("bleu")

predictions = ["They cancelled the match because it was raining"]
references = ["They cancelled the match because of bad weather"]

# Compute BLEU for the candidate against the reference
results = bleu.compute(predictions=predictions, references=references)
print(results)
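The compute call returns a dictionary of results; with the evaluate implementation of BLEU this typically includes the overall bleu score along with the per-n-gram precisions, the brevity_penalty, and length statistics for the candidate and reference.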
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score comprises a set of metrics used to evaluate text summarization (most commonly) and machine translation tasks. It was designed to evaluate the quality of machine-generated summaries by comparing them against reference summaries. It measures the similarity between the machine-generated summary and the reference summaries by examining the overlapping n-grams. ROUGE metrics range from 0 to 1, where higher scores signify greater similarity between the automatically generated summary and the reference, whereas a score closer to zero suggests poor similarity between the candidate and the references.
ROUGE-N: Measures the overlap of n-grams between the system and reference summaries. For example,
ROUGE-1 assesses the overlap of unigrams (individual words), whereas ROUGE-2 examines the overlap of bigrams (pairs of two consecutive words).
ROUGE-L: It is based on the length of the Longest Common Subsequence (LCS) between the candidate text and the reference text. It doesn’t require consecutive matches but instead considers in-sequence matches, reflecting the word order at the sentence level (see the LCS sketch after this list).
ROUGE-Lsum: It divides the text into sentences using newlines and calculates the LCS for each pair of
sentences. It then combines all LCS scores into a unified metric. This method is suitable for situations where both the candidate and reference summaries consist of multiple sentences.
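Since the worked example below only covers ROUGE-1 and ROUGE-2, here is a minimal plain-Python sketch of the LCS-based ROUGE-L computation, using the same candidate and reference as that example. The lcs_length helper is written purely for illustration.

def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence over word tokens
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

candidate = "He was extremely happy last night".split()
reference = "He was happy last night".split()

lcs = lcs_length(candidate, reference)            # 5 ("He was happy last night")
precision = lcs / len(candidate)                  # 5/6
recall = lcs / len(reference)                     # 5/5
rouge_l = 2 * precision * recall / (precision + recall)
print(round(rouge_l, 4))                          # 0.9091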
ROUGE is essentially the F1 score derived from the precision and recall of n-grams. Precision (in the context of ROUGE) represents the proportion of n-grams in the prediction that also appear in the reference.
Recall (in the context of ROUGE) is the proportion of reference n-grams that are also captured by the
model-generated summary.
Let’s understand the ROUGE score calculation with the help of the example below:
Candidate/Predicted Summary: He was extremely happy last night.
Reference/Target Summary: He was happy last night.
Predicted 1-grams: [‘He’, ‘was’, ‘extremely’, ‘happy’, ‘last’, ‘night’]
Reference 1-grams: [‘He’, ‘was’, ‘happy’, ‘last’, ‘night’]
Overlapping 1-grams: [‘He’, ‘was’, ‘happy’, ‘last’, ‘night’]
Precision 1-gram = 5/6 = 0.83
Recall 1-gram = 5/5 = 1
ROUGE1 = (2*0.83*1) / (0.83+1) = 0.9090
Predicted 2-grams: [‘He was’, ‘was extremely’, ‘extremely happy’, ‘happy last’, ‘last night’]
Reference 2-grams: [‘He was’, ‘was happy’, ‘happy last’, ‘last night’]
Overlapping 2-grams: [‘He was’, ‘happy last’, ‘last night’]
Precision 2-gram = 3/5 = 0.6
Recall 2-gram = 3/4 = 0.75
ROUGE2 = (2*0.6*0.75) / (0.6+0.75) = 0.6666
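These ROUGE-1 and ROUGE-2 numbers can be reproduced with a short plain-Python sketch; the ngrams and rouge_n helpers below are just illustrative, not part of any library.

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[ng]) for ng, count in cand_counts.items())
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    # F1 score of n-gram precision and recall
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

candidate = "He was extremely happy last night".split()
reference = "He was happy last night".split()

print(round(rouge_n(candidate, reference, 1), 4))  # 0.9091 (ROUGE-1)
print(round(rouge_n(candidate, reference, 2), 4))  # 0.6667 (ROUGE-2)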
import evaluate

# Load the ROUGE metric
rouge = evaluate.load('rouge')

predictions = ["He was extremely happy last night"]
references = ["He was happy last night"]

# Returns rouge1, rouge2, rougeL, and rougeLsum scores
results = rouge.compute(predictions=predictions, references=references)
print(results)
METEOR (Metric for Evaluation of Translation with Explicit Ordering) score is a metric used to assess the quality of generated text by evaluating the alignment between the generated text and the reference text. It is computed using the harmonic mean of precision and recall, with recall being weighted more than precision. METEOR also incorporates a chunk penalty (a measure of fragmentation), which is intended to directly assess how well-ordered the matched words in the machine translation are compared to the reference.
METEOR generalizes unigram matching between the machine-generated translation and the reference translations: unigrams can be matched based on their exact surface forms, stemmed forms, and synonyms/meanings. The score ranges from 0 to 1, where a higher score indicates better alignment between the model-translated text and the reference text.
Let’s understand the METEOR score calculation using the following example:
Candidate/Predicted: The dog is hiding under the table.
Reference/Target: The dog is under the table.
Weighted F-score
Let’s first compute the weighted F-score, a weighted harmonic mean of unigram precision (P) and unigram recall (R):
F_mean = (P * R) / (α * P + (1 − α) * R)
where the α parameter controls the relative weights of precision and recall, with a default value of 0.9.
Predicted 1-grams: [‘The’, ‘dog’, ‘is’, ‘hiding’, ‘under’, ‘the’, ‘table’]
Reference 1-grams: [‘The’, ‘dog’, ‘is’, ‘under’, ‘the’, ‘table’]
Overlapping 1-grams: [‘The’, ‘dog’, ‘is’, ‘under’, ‘the’, ‘table’]
Precision 1-gram = 6/7 = 0.8571
Recall 1-gram = 6/6 = 1
So weighted F-score = (0.8571 * 1) / (0.9 * 0.8571 + 0.1 * 1) = 0.9836
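A quick numeric check of this value, as a sketch using the standard METEOR parameterization with α = 0.9:

precision = 6 / 7   # matched unigrams / unigrams in the candidate
recall = 6 / 6      # matched unigrams / unigrams in the reference
alpha = 0.9         # default weight, favouring recall over precision

# Weighted harmonic mean used by METEOR
f_mean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
print(round(f_mean, 4))  # 0.9836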
To ensure the correct word order, a penalty function is incorporated that rewards the longest matches and penalizes more fragmented matches. The penalty function is defined as:
Penalty = γ * (c / m)^β
where β is the parameter that controls the shape of the penalty as a function of fragmentation, with a default value of 3, and γ determines the relative weight assigned to the fragmentation penalty, with a default value of 0.5.
Here, “c” is the number of matching chunks in the candidate, i.e. contiguous runs of matched unigrams; in this example there are 2 chunks: {‘the dog is’, ‘under the table’}. “m” is the number of matched unigrams in the candidate, which is 6.
So Penalty = 0.5 * (2/6)^3 = 0.0185
METEOR = (1 – Penalty) * Weighted F-score = (1 – 0.0185) * 0.9836 = 0.965
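And a quick check of the penalty and the final METEOR value for this example, as a sketch using the default parameters β = 3 and γ = 0.5:

f_mean = 0.9836       # weighted F-score from the previous step
c, m = 2, 6           # number of chunks and number of matched unigrams
gamma, beta = 0.5, 3  # default fragmentation parameters

penalty = gamma * (c / m) ** beta
meteor = (1 - penalty) * f_mean
print(round(penalty, 4))  # 0.0185
print(round(meteor, 4))   # 0.9654, i.e. about 0.965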
import evaluate

# Load the METEOR metric
meteor = evaluate.load('meteor')

predictions = ["The dog is hiding under the table"]
references = ["The dog is under the table"]

# Returns a dictionary containing the meteor score
results = meteor.compute(predictions=predictions, references=references)
print(results)
In this article, we discussed various quantitative metrics for evaluating a language model’s output, and walked through how each of them is computed, both mathematically and in code.
Q1. Why is the Brevity Penalty used in the BLEU score?
A. The Brevity Penalty addresses the potential issue of overly short translations produced by language models. Without it, a model could artificially inflate its score by predicting fewer words, which might not accurately reflect the quality of the translation. The penalty penalizes translations that are significantly shorter than the reference sentence.
Q2. What does the ROUGE implementation in the evaluate library return?
A. The built-in implementation of the ROUGE score inside the evaluate library returns rouge1, rouge2, rougeL, and rougeLsum.
Q3. Which of these metrics make use of recall?
A. ROUGE and METEOR make use of recall in their calculations, with METEOR assigning more weight to recall.