BERTScore: A Contextual Metric for LLM Evaluation

Riya Bansal. Last Updated: 08 Apr, 2025

We all depend on LLMs for our everyday activities, but quantifying how well they actually perform is a gigantic challenge. Conventional metrics such as BLEU, ROUGE, and METEOR tend to fail at comprehending the real meaning of text: they are too keen on matching similar words instead of grasping the concept behind them. BERTScore reverses this by applying BERT embeddings to assess the quality of text with a better understanding of meaning and context.

Whether you're building a chatbot, a translation system, or a summarizer, BERTScore helps you evaluate your models more effectively. It recognizes when two sentences convey the same thing despite using different words, something older metrics completely miss. As we dive into how BERTScore operates, you'll learn how this evaluation approach ties together computed measurement and human intuition, and how it is changing the way we test and refine today's sophisticated language models.

What is BERTScore?

BERTScore is a neural evaluation metric for text generation that uses contextual embeddings from pre-trained language models like BERT to calculate similarity scores between candidate and reference texts. Unlike traditional n-gram-based metrics, BERTScore can identify semantic equivalence even when different words are used, making it useful for evaluating language tasks where multiple valid outputs exist.

Formulated by Zhang et al. and presented in their 2019 paper “BERTScore: Evaluating Text Generation with BERT,” this score has gained rapid acceptance within the NLP community due to its high correlation with human evaluation across a range of text generation tasks.

BERTScore Architecture

BERTScore’s architecture is elegantly simple yet powerful, consisting of three main components:

  1. Embedding Generation: Each token in both reference and candidate texts is embedded using a pre-trained contextual embedding model (typically BERT).
  2. Token Matching: The algorithm computes pairwise cosine similarities between all tokens in the reference and candidate texts, creating a similarity matrix.
  3. Score Aggregation: These similarity scores are aggregated into precision, recall, and F1 measures that represent how well the candidate text matches the reference.

The beauty of BERTScore is that it leverages the contextual understanding of pre-trained models without requiring additional training for the evaluation task.

How to Use BERTScore? 

BERTScore can be customized using several parameters to suit specific evaluation needs:

| Parameter | Description | Default |
|---|---|---|
| model_type | Pre-trained model to use (e.g., 'bert-base-uncased') | 'roberta-large' |
| num_layers | Which layer's embeddings to use | 17 (for roberta-large) |
| idf | Whether to use IDF weighting for token importance | False |
| rescale_with_baseline | Whether to rescale scores based on a baseline | False |
| baseline_path | Path to baseline scores | None |
| lang | Language of the texts being compared | 'en' |
| use_fast_tokenizer | Whether to use HuggingFace's fast tokenizers | False |

These parameters allow researchers to fine-tune BERTScore for different languages, domains, and evaluation requirements.
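As a minimal sketch of these options in action, the snippet below passes a few of them to the score function from the bert-score package. With idf=True, the IDF weights are computed from the supplied references, and rescale_with_baseline=True maps raw scores onto a more interpretable range; the sentences are illustrative and exact values vary by model version.

from bert_score import score

candidates = ["The cat was on the mat.", "A feline rested on the rug."]
references = ["The cat sat on the mat.", "The cat sat on the mat."]

# IDF weighting and baseline rescaling enabled; lang="en" selects the
# default English model (roberta-large) and its precomputed baseline
P, R, F1 = score(
    candidates,
    references,
    lang="en",
    idf=True,
    rescale_with_baseline=True,
)
print(f"Mean F1: {F1.mean().item():.4f}")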

How Does BERTScore Work?

BERTScore evaluates the similarity between generated text and reference text through a token-level matching process using contextual embeddings. Here is a step-by-step breakdown of how it operates:

(Figure: the BERTScore pipeline. Source: the BERTScore paper.)
  1. Tokenization: Both candidate (generated) and reference texts are tokenized using the tokenizer corresponding to the pre-trained model being used (e.g., BERT, RoBERTa).
  2. Contextual Embedding: Each token is then embedded using a pre-trained contextual model. Importantly, these embeddings capture the meaning of words in context rather than static word representations. For example, the word “bank” would have different embeddings in “river bank” versus “financial bank.”
  3. Cosine Similarity Computation: For each token in the candidate text, BERTScore computes its cosine similarity with every token in the reference text, creating a similarity matrix.
  4. Greedy Matching:
    • For precision: Each candidate token is matched with the most similar reference token
    • For recall: Each reference token is matched with the most similar candidate token
  5. Importance Weighting (Optional): Tokens can be weighted by their inverse document frequency (IDF) to emphasize content words over function words.
  6. Score Aggregation:
    • Precision is calculated as the average of the maximum similarity scores for each candidate token
    • Recall is calculated as the average of the maximum similarity scores for each reference token
    • F1 combines precision and recall using the harmonic mean formula
  7. Score Normalization (Optional): Raw scores can be rescaled based on baseline scores to make them more interpretable.

This approach allows BERTScore to capture semantic equivalence even when different words are used to express the same meaning, making it more robust than lexical matching metrics for evaluating modern text generation systems.
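To make the greedy matching and aggregation steps concrete, here is a toy example with a hand-built similarity matrix; the numbers are illustrative, not produced by a real model.

import torch

# 3 candidate tokens x 4 reference tokens of pairwise cosine similarities
sim = torch.tensor([
    [0.92, 0.10, 0.30, 0.05],  # candidate token 1 vs each reference token
    [0.15, 0.88, 0.20, 0.10],  # candidate token 2
    [0.07, 0.12, 0.81, 0.25],  # candidate token 3
])

precision = sim.max(dim=1)[0].mean().item()  # best match per candidate token (rows)
recall = sim.max(dim=0)[0].mean().item()     # best match per reference token (columns)
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.4f}, R={recall:.4f}, F1={f1:.4f}")

Note how the fourth reference token has no strong match in the candidate, so recall is pulled down while precision stays high: the candidate says nothing wrong, but it fails to cover everything the reference says.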

Implementation in Python

Let’s implement BERTScore step by step to understand how it works in practice.

1. Setup and Installation

First, install the necessary packages:

# Install the bert-score package
pip install bert-score

2. Basic Implementation

Here’s how to calculate BERTScore between candidate and reference texts:

import bert_score

# Define reference and candidate texts
references = ["The cat sat on the mat.", "The feline rested on the floor covering."]
candidates = ["A cat was sitting on a mat.", "The cat was on the mat."]

# Calculate BERTScore
P, R, F1 = bert_score.score(
    candidates,
    references,
    lang="en",
    model_type="roberta-large",
    num_layers=17,
    verbose=True
)

# Print results
for i, (p, r, f) in enumerate(zip(P, R, F1)):
    print(f"Example {i+1}:")
    print(f"  Precision: {p.item():.4f}")
    print(f"  Recall: {r.item():.4f}")
    print(f"  F1: {f.item():.4f}")
    print()

Running this prints per-example precision, recall, and F1 scores, demonstrating how BERTScore captures semantic similarity even when different phrasings are used.

BERT Embeddings and Cosine Similarity

The core of BERTScore lies in how it leverages contextual embeddings and cosine similarity. Let’s break down the process:

1. Generating Contextual Embeddings: BERTScore is a genuine alternative to traditional n-gram-based measures because it is built on contextual embedding generation. Unlike static word embeddings (such as Word2Vec or GloVe), contextual embeddings account for the surrounding context when assigning meaning to a word, which makes them well suited to semantic similarity evaluation.

import torch
from transformers import AutoTokenizer, AutoModel

def get_bert_embeddings(texts, model_name="bert-base-uncased"):
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Move model to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # Process texts in batch
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}

    # Get model output
    with torch.no_grad():
        outputs = model(**encoded_input)

    # Use embeddings from the last layer
    embeddings = outputs.last_hidden_state

    # Remove padding tokens
    attention_mask = encoded_input['attention_mask']
    embeddings = [emb[mask.bool()] for emb, mask in zip(embeddings, attention_mask)]

    return embeddings

# Example usage
texts = ["The cat sat on the mat.", "A cat was sitting on a mat."]
embeddings = get_bert_embeddings(texts)
print(f"Number of texts: {len(embeddings)}")
print(f"Shape of first text embeddings: {embeddings[0].shape}")

Running this prints the number of texts and the shape of the first text's token embeddings (number of tokens × 768 for bert-base-uncased).
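To see the "bank" example from earlier in action, here is a minimal sketch reusing the get_bert_embeddings helper above; the sentences and token indices are illustrative and depend on the tokenizer.

import torch.nn.functional as F

sents = [
    "He sat on the river bank.",           # geographic sense
    "She opened an account at the bank.",  # financial sense
    "She paid the money into the bank.",   # financial sense
]
embs = get_bert_embeddings(sents)

# With bert-base-uncased, "bank" is the third-to-last token in each sentence
# (before "." and [SEP]); verify with the tokenizer if you change the text
banks = [e[-3] for e in embs]

print(f"river vs financial: {F.cosine_similarity(banks[0], banks[1], dim=0).item():.4f}")
print(f"financial vs financial: {F.cosine_similarity(banks[1], banks[2], dim=0).item():.4f}")

The same surface word gets noticeably more similar embeddings when it is used in the same sense, which is exactly the property BERTScore exploits.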

2. Computing Cosine Similarity: Once contextual embeddings have been generated for the reference and candidate texts, BERTScore calculates the semantic similarity between tokens using cosine similarity, a metric that measures how aligned two vectors are in the embedding space regardless of their magnitude.
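For two token embeddings $\mathbf{u}$ and $\mathbf{v}$, the cosine similarity is the dot product divided by the product of the norms:

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^{\top}\mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$$

Because the implementation below first normalizes every embedding to unit length, a single matrix multiplication then yields all pairwise cosine similarities at once.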

Now, let’s implement the cosine similarity calculation between tokens:

def token_cosine_similarity(embeddings1, embeddings2):
    # Normalize embeddings for cosine similarity
    embeddings1_norm = embeddings1 / embeddings1.norm(dim=1, keepdim=True)
    embeddings2_norm = embeddings2 / embeddings2.norm(dim=1, keepdim=True)

    # Pairwise dot products of unit vectors are cosine similarities
    similarity_matrix = torch.matmul(embeddings1_norm, embeddings2_norm.transpose(0, 1))
    return similarity_matrix

# Example usage with our previously generated embeddings
sim_matrix = token_cosine_similarity(embeddings[0], embeddings[1])
print(f"Shape of similarity matrix: {sim_matrix.shape}")
print("Similarity matrix (token-to-token):")
print(sim_matrix)

This prints a matrix whose entry (i, j) is the cosine similarity between token i of the first text and token j of the second.

BERTScore: Precision, Recall, and F1

Let’s implement the core BERTScore calculation from scratch to understand the mathematics behind it:

Mathematical Formulation

BERTScore calculates three metrics:

1. Precision: how well each token in the candidate text is matched by some token in the reference

2. Recall: how well each token in the reference text is covered by some token in the candidate

3. F1: the harmonic mean of precision and recall

For a candidate $x = \langle x_1, \dots, x_m \rangle$ and reference $y = \langle y_1, \dots, y_n \rangle$:

$$P_{\text{BERT}} = \frac{1}{m} \sum_{x_i \in x} \max_{y_j \in y} \cos(x_i, y_j)$$

$$R_{\text{BERT}} = \frac{1}{n} \sum_{y_j \in y} \max_{x_i \in x} \cos(x_i, y_j)$$

$$F_{\text{BERT}} = 2 \cdot \frac{P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$$

Where:

  • x and y are the candidate and reference texts, respectively
  • x_i and y_j are the contextual token embeddings

Implementation

def calculate_bertscore(candidate_embeddings, reference_embeddings):
    # Compute similarity matrix
    sim_matrix = token_cosine_similarity(candidate_embeddings, reference_embeddings)

    # Compute precision (max similarity for each candidate token)
    precision = sim_matrix.max(dim=1)[0].mean().item()

    # Compute recall (max similarity for each reference token)
    recall = sim_matrix.max(dim=0)[0].mean().item()

    # Compute F1
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0

    return precision, recall, f1

# Example
cand_emb = embeddings[0]  # "The cat sat on the mat."
ref_emb = embeddings[1]   # "A cat was sitting on a mat."

precision, recall, f1 = calculate_bertscore(cand_emb, ref_emb)
print("Custom BERTScore calculation:")
print(f"  Precision: {precision:.4f}")
print(f"  Recall: {recall:.4f}")
print(f"  F1: {f1:.4f}")

Running this prints precision, recall, and F1 computed from scratch for the two example sentences.

This implementation demonstrates the core algorithm behind BERTScore. The actual library includes additional optimizations, IDF weighting options, and baseline rescaling.
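As a hedged sketch of the IDF weighting option mentioned above, assuming a similarity matrix like the one computed earlier and one illustrative weight per candidate token (the real library derives IDF weights from the reference corpus):

import torch

def idf_weighted_precision(sim_matrix, cand_idf):
    # cand_idf: 1-D tensor with one IDF weight per candidate token
    max_sims = sim_matrix.max(dim=1)[0]  # best match per candidate token
    return (cand_idf * max_sims).sum().item() / cand_idf.sum().item()

# Toy usage: down-weight the first token as if it were a common function word
weights = torch.tensor([0.2] + [1.0] * (sim_matrix.shape[0] - 1))
print(f"IDF-weighted precision: {idf_weighted_precision(sim_matrix, weights):.4f}")

The effect is that matches on rare, content-bearing tokens count for more than matches on ubiquitous function words.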

Advantages and Limitations

| Advantages | Limitations |
|---|---|
| Captures semantic similarity beyond lexical overlap | Computationally more intensive than n-gram metrics |
| Correlates better with human judgments | Performance depends on the quality of underlying embeddings |
| Works well across different tasks and domains | May not capture structural or logical coherence |
| No training required specifically for evaluation | Can be sensitive to the choice of BERT layer and model |
| Handles synonyms and paraphrases naturally | Less interpretable than explicit matching metrics |
| Language-agnostic (with appropriate models) | Requires GPU for efficient processing of large datasets |
| Can be customized with different embedding models | Not designed to evaluate factual correctness |
| Effectively handles multiple valid references | May struggle with highly creative or unusual text |

Practical Applications

BERTScore has found wide application across numerous NLP tasks:

  1. Machine Translation: BERTScore helps evaluate translations by focusing on meaning preservation rather than exact wording, which is particularly valuable given the different valid ways to translate a sentence.
  2. Summarization: When evaluating summaries, BERTScore can identify when different phrasings capture the same key information, making it more flexible than ROUGE for assessing summary quality.
  3. Dialog Systems: For conversational AI, BERTScore can evaluate response appropriateness by measuring semantic similarity to reference responses, even when the wording differs significantly.
  4. Text Simplification: BERTScore can assess whether simplifications maintain the original meaning while using different vocabulary, a task where lexical overlap metrics often fall short.
  5. Content Creation: When evaluating AI-generated creative content, BERTScore can measure how well the generation captures the intended themes or information without requiring exact matching.

Comparison with Other Metrics

How does BERTScore stack up against other popular evaluation metrics?

| Metric | Basis | Strengths | Weaknesses | Human Correlation |
|---|---|---|---|---|
| BLEU | N-gram precision | Fast, interpretable | Surface-level, position-insensitive | Moderate |
| ROUGE | N-gram recall | Good for summarization | Misses semantic equivalence | Moderate |
| METEOR | Enhanced lexical matching | Handles synonyms | Still primarily lexical | Moderate-High |
| BERTScore | Contextual embeddings | Semantic understanding | Computationally intensive | High |
| BLEURT | Learned metric (fine-tuned) | Task-specific | Requires training | Very High |
| LLM-as-Judge | Direct LLM evaluation | Comprehensive | Black box, expensive | Very High |

BERTScore offers a balance between sophistication and practicality, capturing semantic similarity without requiring task-specific training.
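To see this trade-off in one concrete example, the hedged sketch below scores a paraphrase with both the sacrebleu and bert-score packages (assuming both are installed; exact numbers vary by version):

import sacrebleu
from bert_score import score

reference = "The quick brown fox jumps over the lazy dog."
paraphrase = "A fast auburn fox leaps above the idle hound."

bleu = sacrebleu.sentence_bleu(paraphrase, [reference]).score
_, _, F1 = score([paraphrase], [reference], lang="en")

print(f"BLEU: {bleu:.1f}")               # low: almost no n-gram overlap
print(f"BERTScore F1: {F1.item():.4f}")  # much higher: the meaning is preserved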

Conclusion

BERTScore represents a significant advancement in the evaluation of text generation, leveraging the semantic understanding capabilities of contextual embeddings. Its ability to capture meaning beyond surface-level lexical matches makes it valuable for evaluating modern language models, where creativity and variation in outputs are both expected and desired.

While no single metric can perfectly assess text quality, BERTScore provides a reliable framework that aligns with human evaluation across diverse tasks and produces consistent results. Combined with traditional metrics and human analysis, it enables deeper insight into language generation capabilities.

As language models evolve, tools like BERTScore become essential for identifying model strengths and weaknesses and for improving the overall quality of natural language generation systems.
