We all rely on LLMs for everyday tasks, but quantifying how well they actually perform is a genuine challenge. Conventional metrics such as BLEU, ROUGE, and METEOR often fail to capture the real meaning of a text: they focus on matching overlapping words rather than understanding the concept behind them. BERTScore changes this by using BERT embeddings to assess text quality with a far better grasp of meaning and context.
Whether you’re training a chatbot, building a translation system, or generating summaries, BERTScore helps you evaluate your models more reliably. It recognizes when two sentences convey the same thing despite using different words, something older metrics miss entirely. As we dive into how BERTScore operates, you’ll see how this evaluation approach ties automated measurement to human intuition and changes the way we test and refine today’s sophisticated language models.
BERTScore is a neural evaluation metric for text generation that uses contextual embeddings from pre-trained language models like BERT to calculate similarity scores between candidate and reference texts. Unlike traditional n-gram-based metrics, BERTScore can identify semantic equivalence even when different words are used, making it useful for evaluating language tasks where multiple valid outputs exist.
Formulated by Zhang et al. and presented in their 2019 paper “BERTScore: Evaluating Text Generation with BERT,” this score has gained rapid acceptance within the NLP community due to its high correlation with human evaluation across a range of text generation tasks.
BERTScore’s architecture is elegantly simple yet powerful, consisting of three main components:
1. Contextual embedding generation: candidate and reference texts are tokenized and passed through a pre-trained model (such as BERT or RoBERTa) to obtain context-aware token embeddings.
2. Token-level matching: cosine similarity is computed between every candidate token and every reference token, and each token is greedily matched to its most similar counterpart.
3. Score aggregation: the matched similarities are combined into precision, recall, and F1 scores, optionally with IDF weighting and baseline rescaling.
The beauty of BERTScore is that it leverages the contextual understanding of pre-trained models without requiring additional training for the evaluation task.
BERTScore can be customized using several parameters to suit specific evaluation needs:
| Parameter | Description | Default |
|---|---|---|
| model_type | Pre-trained model to use (e.g., ‘bert-base-uncased’) | ‘roberta-large’ |
| num_layers | Which layer’s embeddings to use | 17 (for roberta-large) |
| idf | Whether to use IDF weighting for token importance | False |
| rescale_with_baseline | Whether to rescale scores based on a baseline | False |
| baseline_path | Path to baseline scores | None |
| lang | Language of the texts being compared | ‘en’ |
| use_fast_tokenizer | Whether to use HuggingFace’s fast tokenizers | False |
These parameters allow researchers to fine-tune BERTScore for different languages, domains, and evaluation requirements.
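To see how several of these parameters work together, here is a minimal sketch using the library’s BERTScorer class, which loads the model once and reuses it across calls. The example texts are placeholders, and the exact parameter behavior (idf_sents as the corpus for estimating IDF weights, rescale_with_baseline for mapping scores onto a more interpretable range) is assumed from the package’s documented options.
from bert_score import BERTScorer

references = ["The cat sat on the mat."]
candidates = ["A cat was sitting on a mat."]

# Load the model once; idf=True weights rare tokens more heavily, and
# rescale_with_baseline=True rescales raw scores using precomputed
# baselines for the chosen language.
scorer = BERTScorer(
    lang="en",
    idf=True,
    idf_sents=references,  # sentences used to estimate IDF weights
    rescale_with_baseline=True,
)

P, R, F1 = scorer.score(candidates, references)
print(f"F1: {F1.mean().item():.4f}")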
BERTScore evaluates the similarity between generated text and reference text through a token-level matching process using contextual embeddings. Here is a step-by-step breakdown of how it operates:
1. Tokenization: both the candidate and the reference text are tokenized with the pre-trained model’s tokenizer.
2. Embedding: every token is converted into a contextual embedding by the pre-trained model.
3. Similarity computation: cosine similarity is computed between each candidate token and each reference token.
4. Greedy matching: each token is matched to the most similar token in the other text.
5. Aggregation: the matched similarities are averaged into precision (over candidate tokens), recall (over reference tokens), and their harmonic mean, F1, with optional IDF weighting and baseline rescaling.
This approach allows BERTScore to capture semantic equivalence even when different words are used to express the same meaning, making it more robust than lexical matching metrics for evaluating modern text generation systems.
Let’s implement BERTScore step by step to understand how it works in practice.
First, install the necessary packages:
# Install the bert-score package
pip install bert-score
Here’s how to calculate BERTScore between candidate and reference texts:
import bert_score
# Define reference and candidate texts
references = ["The cat sat on the mat.", "The feline rested on the floor covering."]
candidates = ["A cat was sitting on a mat.", "The cat was on the mat."]
# Calculate BERTScore
P, R, F1 = bert_score.score(
    candidates,
    references,
    lang="en",
    model_type="roberta-large",
    num_layers=17,
    verbose=True
)

# Print results
for i, (p, r, f) in enumerate(zip(P, R, F1)):
    print(f"Example {i+1}:")
    print(f"  Precision: {p.item():.4f}")
    print(f"  Recall: {r.item():.4f}")
    print(f"  F1: {f.item():.4f}")
    print()
Output:
This demonstrates how BERTScore captures semantic similarity even when different phrasings are used.
The core of BERTScore lies in how it leverages contextual embeddings and cosine similarity. Let’s break down the process:
1. Generating Contextual Embeddings: The first step, and the one that sets BERTScore apart from traditional n-gram-based metrics, is the generation of contextual embeddings. Unlike static word embeddings (such as Word2Vec or GloVe), contextual embeddings take the surrounding context into account when assigning meaning to a word, which makes them well suited for evaluating semantic similarity.
import torch
from transformers import AutoTokenizer, AutoModel
def get_bert_embeddings(texts, model_name="bert-base-uncased"):
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Move model to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # Process texts in batch
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}

    # Get model output
    with torch.no_grad():
        outputs = model(**encoded_input)

    # Use embeddings from the last layer
    embeddings = outputs.last_hidden_state

    # Remove padding tokens
    attention_mask = encoded_input['attention_mask']
    embeddings = [emb[mask.bool()] for emb, mask in zip(embeddings, attention_mask)]

    return embeddings
# Example usage
texts = ["The cat sat on the mat.", "A cat was sitting on a mat."]
embeddings = get_bert_embeddings(texts)
print(f"Number of texts: {len(embeddings)}")
print(f"Shape of first text embeddings: {embeddings[0].shape}")
Output:
2. Computing Cosine Similarity: Once contextual embeddings have been generated for the reference and candidate texts, BERTScore uses cosine similarity, a metric that measures how aligned two vectors are in the embedding space regardless of their magnitude, to calculate the semantic similarity between tokens.
Now, let’s implement the cosine similarity calculation between tokens:
def token_cosine_similarity(embeddings1, embeddings2):
    # Normalize embeddings for cosine similarity
    embeddings1_norm = embeddings1 / embeddings1.norm(dim=1, keepdim=True)
    embeddings2_norm = embeddings2 / embeddings2.norm(dim=1, keepdim=True)

    similarity_matrix = torch.matmul(embeddings1_norm, embeddings2_norm.transpose(0, 1))
    return similarity_matrix
# Example usage with our previously generated embeddings
sim_matrix = token_cosine_similarity(embeddings[0], embeddings[1])
print(f"Shape of similarity matrix: {sim_matrix.shape}")
print("Similarity matrix (token-to-token):")
print(sim_matrix)
Output:
Let’s implement the core BERTScore calculation from scratch to understand the mathematics behind it:
BERTScore calculates three metrics:
1. Precision: How many tokens in the candidate text match tokens in the reference?
2. Recall: How many tokens in the reference text are covered by the candidate?
3. F1: The harmonic mean of precision and recall
These quantities correspond to the formulas from the original paper:

$$P_{\mathrm{BERT}} = \frac{1}{|\hat{x}|}\sum_{\hat{x}_j \in \hat{x}} \max_{x_i \in x} x_i^{\top}\hat{x}_j \qquad R_{\mathrm{BERT}} = \frac{1}{|x|}\sum_{x_i \in x} \max_{\hat{x}_j \in \hat{x}} x_i^{\top}\hat{x}_j \qquad F_{\mathrm{BERT}} = 2\,\frac{P_{\mathrm{BERT}}\cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}$$

Where:
- $x = \langle x_1, \ldots, x_k \rangle$ is the sequence of contextual embeddings of the reference tokens and $\hat{x} = \langle \hat{x}_1, \ldots, \hat{x}_l \rangle$ that of the candidate tokens
- $x_i^{\top}\hat{x}_j$ is the cosine similarity between tokens $x_i$ and $\hat{x}_j$ (the embeddings are pre-normalized to unit length)
def calculate_bertscore(candidate_embeddings, reference_embeddings):
    # Compute similarity matrix
    sim_matrix = token_cosine_similarity(candidate_embeddings, reference_embeddings)

    # Compute precision (max similarity for each candidate token)
    precision = sim_matrix.max(dim=1)[0].mean().item()

    # Compute recall (max similarity for each reference token)
    recall = sim_matrix.max(dim=0)[0].mean().item()

    # Compute F1
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0

    return precision, recall, f1
# Example
cand_emb = embeddings[0] # "The cat sat on the mat."
ref_emb = embeddings[1] # "A cat was sitting on a mat."
precision, recall, f1 = calculate_bertscore(cand_emb, ref_emb)
print(f"Custom BERTScore calculation:")
print(f" Precision: {precision:.4f}")
print(f" Recall: {recall:.4f}")
print(f" F1: {f1:.4f}")
Output:
This implementation demonstrates the core algorithm behind BERTScore. The actual library includes additional optimizations, IDF weighting options, and baseline rescaling.
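As a rough illustration of how IDF weighting could be layered onto this from-scratch version, the sketch below replaces the plain mean of maximum similarities with a weighted average. The function name calculate_bertscore_idf and the per-token weight tensors are hypothetical; it reuses token_cosine_similarity from earlier and only approximates the weighting described in the original paper.
import torch

def calculate_bertscore_idf(candidate_embeddings, reference_embeddings,
                            candidate_weights, reference_weights):
    # Hypothetical IDF-weighted variant: *_weights are 1-D tensors holding one
    # importance weight (e.g., an IDF value) per token
    sim_matrix = token_cosine_similarity(candidate_embeddings, reference_embeddings)
    cand_best = sim_matrix.max(dim=1)[0]  # best reference match per candidate token
    ref_best = sim_matrix.max(dim=0)[0]   # best candidate match per reference token

    # Weighted averages replace the plain means of the unweighted version
    precision = (cand_best * candidate_weights).sum() / candidate_weights.sum()
    recall = (ref_best * reference_weights).sum() / reference_weights.sum()
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()

# With uniform weights this reduces to the unweighted calculation above
uniform_cand = torch.ones(cand_emb.shape[0], device=cand_emb.device)
uniform_ref = torch.ones(ref_emb.shape[0], device=ref_emb.device)
print(calculate_bertscore_idf(cand_emb, ref_emb, uniform_cand, uniform_ref))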
| Advantages | Limitations |
|---|---|
| Captures semantic similarity beyond lexical overlap | Computationally more intensive than n-gram metrics |
| Correlates better with human judgments | Performance depends on the quality of underlying embeddings |
| Works well across different tasks and domains | May not capture structural or logical coherence |
| No training required specifically for evaluation | Can be sensitive to the choice of BERT layer and model |
| Handles synonyms and paraphrases naturally | Less interpretable than explicit matching metrics |
| Language-agnostic (with appropriate models) | Requires GPU for efficient processing of large datasets |
| Can be customized with different embedding models | Not designed to evaluate factual correctness |
| Effectively handles multiple valid references (see the sketch after this table) | May struggle with highly creative or unusual text |
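Regarding the last row, here is a brief sketch of multi-reference scoring. It assumes the bert-score package accepts a list of reference lists (one list of acceptable references per candidate) and reports the score against the best-matching reference; the sentences are placeholders.
import bert_score

# One candidate paired with several acceptable references
candidates = ["The cat was on the mat."]
references = [["The cat sat on the mat.", "A cat was lying on the rug."]]

P, R, F1 = bert_score.score(candidates, references, lang="en")
print(f"F1 against the best-matching reference: {F1.item():.4f}")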
BERTScore has found wide application across numerous NLP tasks, including machine translation, text summarization, and chatbot or dialogue evaluation, wherever several differently worded outputs can be equally valid.
How does BERTScore stack up against other popular evaluation metrics?
| Metric | Basis | Strengths | Weaknesses | Human Correlation |
|---|---|---|---|---|
| BLEU | N-gram precision | Fast, interpretable | Surface-level, position-insensitive | Moderate |
| ROUGE | N-gram recall | Good for summarization | Misses semantic equivalence | Moderate |
| METEOR | Enhanced lexical matching | Handles synonyms | Still primarily lexical | Moderate-High |
| BERTScore | Contextual embeddings | Semantic understanding | Computationally intensive | High |
| BLEURT | Learned metric (fine-tuned) | Task-specific | Requires training | Very High |
| LLM-as-Judge | Direct LLM evaluation | Comprehensive | Black box, expensive | Very High |
BERTScore offers a balance between sophistication and practicality, capturing semantic similarity without requiring task-specific training.
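To make the contrast concrete, the sketch below scores a single paraphrase pair with both a smoothed sentence-level BLEU (via NLTK) and BERTScore. Because the two sentences share few n-grams, BLEU will typically be close to zero while the BERTScore F1 remains high; the sentences and the choice of smoothing method are illustrative assumptions.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import bert_score

reference = "The feline rested on the floor covering."
candidate = "The cat was lying on the rug."

# Sentence-level BLEU over whitespace tokens, smoothed to avoid zero n-gram counts
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# BERTScore based on contextual embeddings
_, _, F1 = bert_score.score([candidate], [reference], lang="en")

print(f"BLEU: {bleu:.4f}")
print(f"BERTScore F1: {F1.item():.4f}")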
BERTScore represents a significant advancement in text generation evaluation, leveraging the semantic understanding of contextual embeddings. Its ability to capture meaning beyond surface-level lexical matches makes it valuable for evaluating modern language models, where creativity and variation in outputs are both expected and desired.
While no single metric can perfectly assess text quality, BERTScore provides a reliable framework that aligns well with human judgments across diverse tasks and produces consistent results. Combined with traditional metrics and human analysis, it enables deeper insight into language generation capabilities.
As language models continue to evolve, tools like BERTScore are essential for identifying model strengths and weaknesses and for improving the overall quality of natural language generation systems.