Language models are usually trained on extensive amounts of textual data, which enables them to generate natural-sounding, human-like responses. They can also perform various language-related tasks such as translation, text summarization, text generation, and question answering. Evaluating language models is crucial for validating their performance and quality, and for ensuring that they produce high-quality text. This is particularly important for applications where the generated text influences decision-making or provides information to users.
There are various ways to evaluate language models, such as human evaluation, feedback from end users, LLM-based evaluation, academic benchmarks (like GLUE and SQuAD), and standard quantitative metrics. In this article, we will take a deep dive into standard quantitative metrics such as BLEU, ROUGE, and METEOR. Quantitative metrics have long been pivotal in understanding language models and their capabilities: from precision and recall to BLEU and ROUGE scores, they offer a quantitative evaluation of model effectiveness. Let’s look at each traditional metric in turn.
BLEU (BiLingual Evaluation Understudy) is a metric for automatically evaluating machine-translated text. It evaluates how closely the machine-translated text aligns with a collection of high-quality reference translations. The BLEU score ranges from 0 to 1, with 0 indicating no overlap between the machine-translated output and the reference translations (i.e. a low-quality translation), and 1 indicating perfect overlap with the reference translations (i.e. a high-quality translation). It is an easy-to-understand and inexpensive-to-compute measure. Mathematically, the BLEU score is defined as:
BLEU = BP * exp( Σ (w_n * log p_n) ), where the sum runs over n = 1 to N
Here BP is the brevity penalty (defined below), p_n is the clipped n-gram precision, and w_n is the weight assigned to each n-gram order (typically uniform, w_n = 1/N).
The BLEU score is calculated by comparing the n-grams in the machine-translated text to those in the reference text. N-grams refer to sequences of words, where “n” indicates the number of words in the sequence.
Let’s understand the BLEU score calculation using the following example:
Candidate sentence: They cancelled the match because it was raining.
Target sentence: They cancelled the match because of bad weather.
Here, the candidate sentence is the sentence predicted by the language model, and the target sentence is the reference sentence. To compute the geometric average precision, let’s first compute the precision scores from 1-grams to 4-grams.
Precision 1-gram
Predicted sentence 1-grams: [‘They’, ‘cancelled’, ‘the’, ‘match’, ‘because’, ‘it’, ‘was’, ‘raining’]
Of these 8 unigrams, 5 (‘They’, ‘cancelled’, ‘the’, ‘match’, ‘because’) also appear in the target sentence.
Precision 1-gram = 5/8 = 0.625
Precision 2-gram
Predicted sentence 2-grams: [‘They cancelled’, ‘cancelled the’, ‘the match’, ‘match because’, ‘because it’, ‘it was’, ‘was raining’]
Precision 2-gram = 4/7 = 0.5714
Precision 3-gram
Predicted sentence 3-grams: [‘They cancelled the’, ‘cancelled the match’, ‘the match because’, ‘match because it’, ‘because it was’, ‘it was raining’]
Precision 3-gram = 3/6 = 0.5
Precision 4-gram
Predicted sentence 4-grams: [‘They cancelled the match’, ‘cancelled the match because’, ‘the match because it’, ‘match because it was’, ‘because it was raining’]
Precision 4-gram = 2/5 = 0.4
The geometric average precision, with possibly different weights for the different n-gram orders, is computed as:
Geometric Average Precision = exp( Σ (w_n * log p_n) ), for n = 1 to N
Here p_n is the precision for n-grams and w_n is the weight for each order. For N = 4 (up to 4-grams) with uniform weights w_n = 1/4:
Geometric Average Precision = (0.625 * 0.5714 * 0.5 * 0.4)^(1/4) ≈ 0.5169
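For reference, here is a minimal plain-Python sketch that reproduces the clipped n-gram precisions and the geometric average precision above. The ngrams and clipped_precision helpers are written purely for illustration and are not part of any library.

from collections import Counter
import math

candidate = "They cancelled the match because it was raining".split()
reference = "They cancelled the match because of bad weather".split()

def ngrams(tokens, n):
    # All contiguous n-grams of the token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(cand, ref, n):
    cand_counts = Counter(ngrams(cand, n))
    ref_counts = Counter(ngrams(ref, n))
    # Each candidate n-gram counts only up to the number of times it occurs in the reference
    overlap = sum(min(count, ref_counts[ng]) for ng, count in cand_counts.items())
    return overlap / sum(cand_counts.values())

precisions = [clipped_precision(candidate, reference, n) for n in range(1, 5)]
print([round(p, 4) for p in precisions])   # [0.625, 0.5714, 0.5, 0.4]

# Geometric average precision with uniform weights w_n = 1/4
gap = math.exp(sum(0.25 * math.log(p) for p in precisions))
print(round(gap, 4))                       # 0.5169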
Imagine a scenario where the language model predicts only one word, such as “cancelled,” resulting in a clipped precision of 1 (1/1). This can be misleading, as it encourages the model to predict fewer words in order to achieve a high score.
To address this issue, a brevity penalty is used, which penalizes machine translations that are too short compared to the reference sentence:
Brevity Penalty (BP) = 1 if c > r, and exp(1 − r/c) if c ≤ r
where c is the predicted length, i.e. the number of words in the predicted sentence, and r is the target length, i.e. the number of words in the target sentence.
Here, the candidate and the target both contain 8 words (c = r = 8), so Brevity Penalty = 1.
So BLEU(4) = 0.5169*1 = 0.5169
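A quick check of the brevity penalty and the final score for this example, as a sketch using the standard BLEU definitions:

import math

c, r = 8, 8   # number of words in the predicted and target sentences
# Brevity penalty: 1 if the candidate is longer than the reference, otherwise exp(1 - r/c)
brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)

gap = 0.5169  # geometric average precision computed above
print(round(brevity_penalty * gap, 4))   # 0.5169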
There are various implementations of the BLEU score in Python under different libraries. We will use the Hugging Face evaluate library, which simplifies the process of evaluating and comparing language model outputs.
Installation
!pip install evaluate
import evaluate

# Load the BLEU metric
bleu = evaluate.load("bleu")

predictions = ["They cancelled the match because it was raining"]
references = ["They cancelled the match because of bad weather"]

# Compute BLEU for the candidate against the reference
results = bleu.compute(predictions=predictions, references=references)
print(results)
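The compute call returns a dictionary of results; with the evaluate implementation of BLEU this typically includes the overall bleu score along with the per-n-gram precisions, the brevity_penalty, and length statistics for the candidate and reference.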
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score comprises a set of metrics used to evaluate text summarization (most commonly) and machine translation tasks. It was designed to evaluate the quality of machine-generated summaries by comparing them against reference summaries. It measures the similarity between the machine-generated summary and the reference summaries by examining the overlapping n-grams. ROUGE metrics range from 0 to 1, where higher scores signify greater similarity between the automatically generated summary and the reference, whereas a score closer to zero suggests poor similarity between the candidate and the references.
ROUGE-N: Measures the overlap of n-grams between the system and reference summaries. For example,
ROUGE-1 assesses the overlap of unigrams (individual words), whereas ROUGE-2 examines the overlap of bigrams (pairs of two consecutive words).
ROUGE-L: It is based on the length of the Longest Common Subsequence (LCS) between the candidate text and the reference text. It doesn’t require consecutive matches but instead considers in-sequence matches, reflecting the word order at the sentence level (see the LCS sketch after this list).
ROUGE-Lsum: It divides the text into sentences using newlines and calculates the LCS for each pair of
sentences. It then combines all LCS scores into a unified metric. This method is suitable for situations where both the candidate and reference summaries consist of multiple sentences.
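Since the worked example below only covers ROUGE-1 and ROUGE-2, here is a minimal plain-Python sketch of the LCS-based ROUGE-L computation, using the same candidate and reference as that example. The lcs_length helper is written purely for illustration.

def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence over word tokens
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

candidate = "He was extremely happy last night".split()
reference = "He was happy last night".split()

lcs = lcs_length(candidate, reference)            # 5 ("He was happy last night")
precision = lcs / len(candidate)                  # 5/6
recall = lcs / len(reference)                     # 5/5
rouge_l = 2 * precision * recall / (precision + recall)
print(round(rouge_l, 4))                          # 0.9091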
ROUGE is essentially the F1 score derived from the precision and recall of n-grams. Precision (in the context of ROUGE) represents the proportion of n-grams in the prediction that also appear in the reference.
Recall (in the context of ROUGE) is the proportion of reference n-grams that are also captured by the
model-generated summary.
Let’s understand the ROUGE score calculation with the help of the example below:
Candidate/Predicted Summary: He was extremely happy last night.
Reference/Target Summary: He was happy last night.
Predicted 1-grams: [‘He’, ‘was’, ‘extremely’, ‘happy’, ‘last’, ‘night’]
Reference 1-grams: [‘He’, ‘was’, ‘happy’, ‘last’, ‘night’]
Overlapping 1-grams: [‘He’, ‘was’, ‘happy’, ‘last’, ‘night’]
Precision 1-gram = 5/6 = 0.83
Recall 1-gram = 5/5 = 1
ROUGE1 = (2*0.83*1) / (0.83+1) = 0.9090
Predicted 2-grams: [‘He was’, ‘was extremely’, ‘extremely happy’, ‘happy last’, ‘last night’]
Reference 2-grams: [‘He was’, ‘was happy’, ‘happy last’, ‘last night’]
Overlapping 2-grams: [‘He was’, ‘happy last’, ‘last night’]
Precision 2-gram = 3/5 = 0.6
Recall 2-gram = 3/4 = 0.75
ROUGE2 = (2*0.6*0.75) / (0.6+0.75) = 0.6666
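These ROUGE-1 and ROUGE-2 numbers can be reproduced with a short plain-Python sketch; the ngrams and rouge_n helpers below are just illustrative, not part of any library.

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n):
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[ng]) for ng, count in cand_counts.items())
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    # F1 score of n-gram precision and recall
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

candidate = "He was extremely happy last night".split()
reference = "He was happy last night".split()

print(round(rouge_n(candidate, reference, 1), 4))  # 0.9091 (ROUGE-1)
print(round(rouge_n(candidate, reference, 2), 4))  # 0.6667 (ROUGE-2)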
import evaluate

# Load the ROUGE metric
rouge = evaluate.load('rouge')

predictions = ["He was extremely happy last night"]
references = ["He was happy last night"]

# Returns rouge1, rouge2, rougeL, and rougeLsum scores
results = rouge.compute(predictions=predictions, references=references)
print(results)
METEOR (Metric for Evaluation of Translation with Explicit Ordering) score is a metric used to assess the quality of generated text by evaluating the alignment between the generated text and the reference text. It is computed using the harmonic mean of precision and recall, with recall being weighted more than precision. METEOR also incorporates a chunk penalty (a measure of fragmentation), which is intended to directly assess how well-ordered the matched words in the machine translation are compared to the reference.
METEOR generalizes unigram matching between the machine-generated translation and the reference translations: unigrams can be matched based on their exact surface forms, stemmed forms, and synonyms/meanings. The score ranges from 0 to 1, where a higher score indicates better alignment between the model-translated text and the reference text.
Let’s understand the METEOR score calculation using the following example:
Candidate/Predicted: The dog is hiding under the table.
Reference/Target: The dog is under the table.
Weighted F-score
Let’s first compute the weighted F-score, a weighted harmonic mean of unigram precision (P) and unigram recall (R):
F_mean = (P * R) / (α * P + (1 − α) * R)
where the α parameter controls the relative weights of precision and recall, with a default value of 0.9.
Predicted 1-grams: [‘The’, ‘dog’, ‘is’, ‘hiding’, ‘under’, ‘the’, ‘table’]
Reference 1-grams: [‘The’, ‘dog’, ‘is’, ‘under’, ‘the’, ‘table’]
Overlapping 1-grams: [‘The’, ‘dog’, ‘is’, ‘under’, ‘the’, ‘table’]
Precision 1-gram = 6/7 = 0.8571
Recall 1-gram = 6/6 = 1
So weighted F-score = (0.8571 * 1) / (0.9 * 0.8571 + 0.1 * 1) = 0.9836
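A quick numeric check of this value, as a sketch using the standard METEOR parameterization with α = 0.9:

precision = 6 / 7   # matched unigrams / unigrams in the candidate
recall = 6 / 6      # matched unigrams / unigrams in the reference
alpha = 0.9         # default weight, favouring recall over precision

# Weighted harmonic mean used by METEOR
f_mean = (precision * recall) / (alpha * precision + (1 - alpha) * recall)
print(round(f_mean, 4))  # 0.9836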
To ensure the correct word order, a penalty function is incorporated that rewards the longest matches and penalizes more fragmented matches. The penalty function is defined as:
Penalty = γ * (c / m)^β
where β is the parameter that controls the shape of the penalty as a function of fragmentation, with a default value of 3, and γ determines the relative weight assigned to the fragmentation penalty, with a default value of 0.5.
Here, “c” is the number of matching chunks in the candidate, i.e. contiguous runs of matched unigrams; in this example there are 2 chunks: {‘the dog is’, ‘under the table’}. “m” is the number of matched unigrams in the candidate, which is 6.
So Penalty = 0.5 * (2/6)^3 = 0.0185
METEOR = (1 – Penalty) * Weighted F-score = (1 – 0.0185) * 0.9836 = 0.965
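And a quick check of the penalty and the final METEOR value for this example, as a sketch using the default parameters β = 3 and γ = 0.5:

f_mean = 0.9836       # weighted F-score from the previous step
c, m = 2, 6           # number of chunks and number of matched unigrams
gamma, beta = 0.5, 3  # default fragmentation parameters

penalty = gamma * (c / m) ** beta
meteor = (1 - penalty) * f_mean
print(round(penalty, 4))  # 0.0185
print(round(meteor, 4))   # 0.9654, i.e. about 0.965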
import evaluate

# Load the METEOR metric
meteor = evaluate.load('meteor')

predictions = ["The dog is hiding under the table"]
references = ["The dog is under the table"]

# Returns a dictionary containing the meteor score
results = meteor.compute(predictions=predictions, references=references)
print(results)
In this article, we discussed various quantitative metrics for evaluating a language model’s output, and walked through how each of them is computed, both mathematically and in code.
Q1. Why is the Brevity Penalty used in the BLEU score?
A. The Brevity Penalty addresses the potential issue of overly short translations produced by language models. Without it, a model could artificially inflate its score by predicting fewer words, which might not accurately reflect the quality of the translation. The penalty penalizes translations that are significantly shorter than the reference sentence.
Q2. What does the ROUGE implementation in the evaluate library return?
A. The built-in implementation of the ROUGE score inside the evaluate library returns rouge1, rouge2, rougeL, and rougeLsum.
Q3. Which of these metrics make use of recall?
A. ROUGE and METEOR make use of recall in their calculations, with METEOR assigning more weight to recall.