BERTScore: A Contextual Metric for LLM Evaluation

Riya Bansal. Last Updated: 08 Apr, 2025

We all depend on LLMs for our everyday activities, but quantifying how well they actually perform is a gigantic challenge. Conventional metrics such as BLEU, ROUGE, and METEOR tend to fail at comprehending the real meaning of text: they are too keen on matching similar words instead of grasping the concept behind them. BERTScore reverses this by applying BERT embeddings to assess the quality of text with a better understanding of meaning and context.

Whether you're building a chatbot, a translation system, or a summarizer, BERTScore helps you evaluate your models more effectively. It recognizes when two sentences convey the same thing despite using different words, something older metrics completely miss. As we dive into how BERTScore operates, you'll learn how this evaluation approach ties together computed measurement and human intuition, and how it is changing the way we test and refine today's sophisticated language models.

What is BERTScore?

BERTScore is a neural evaluation metric for text generation that uses contextual embeddings from pre-trained language models like BERT to calculate similarity scores between candidate and reference texts. Unlike traditional n-gram-based metrics, BERTScore can identify semantic equivalence even when different words are used, making it useful for evaluating language tasks where multiple valid outputs exist.

Formulated by Zhang et al. and presented in their 2019 paper “BERTScore: Evaluating Text Generation with BERT,” this score has gained rapid acceptance within the NLP community due to its high correlation with human evaluation across a range of text generation tasks.

BERTScore Architecture

BERTScore’s architecture is elegantly simple yet powerful, consisting of three main components:

  1. Embedding Generation: Each token in both reference and candidate texts is embedded using a pre-trained contextual embedding model (typically BERT).
  2. Token Matching: The algorithm computes pairwise cosine similarities between all tokens in the reference and candidate texts, creating a similarity matrix.
  3. Score Aggregation: These similarity scores are aggregated into precision, recall, and F1 measures that represent how well the candidate text matches the reference.

The beauty of BERTScore is that it leverages the contextual understanding of pre-trained models without requiring additional training for the evaluation task.

How to Use BERTScore? 

BERTScore can be customized using several parameters to suit specific evaluation needs:

| Parameter | Description | Default |
|---|---|---|
| model_type | Pre-trained model to use (e.g., 'bert-base-uncased') | 'roberta-large' |
| num_layers | Which layer's embeddings to use | 17 (for roberta-large) |
| idf | Whether to use IDF weighting for token importance | False |
| rescale_with_baseline | Whether to rescale scores based on a baseline | False |
| baseline_path | Path to baseline scores | None |
| lang | Language of the texts being compared | 'en' |
| use_fast_tokenizer | Whether to use HuggingFace's fast tokenizers | False |

These parameters allow researchers to fine-tune BERTScore for different languages, domains, and evaluation requirements.
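As a minimal sketch of these options in action, the snippet below passes a few of them to the score function from the bert-score package. With idf=True, the IDF weights are computed from the supplied references, and rescale_with_baseline=True maps raw scores onto a more interpretable range; the sentences are illustrative and exact values vary by model version.

from bert_score import score

candidates = ["The cat was on the mat.", "A feline rested on the rug."]
references = ["The cat sat on the mat.", "The cat sat on the mat."]

# IDF weighting and baseline rescaling enabled; lang="en" selects the
# default English model (roberta-large) and its precomputed baseline
P, R, F1 = score(
    candidates,
    references,
    lang="en",
    idf=True,
    rescale_with_baseline=True,
)
print(f"Mean F1: {F1.mean().item():.4f}")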

How Does BERTScore Work?

BERTScore evaluates the similarity between generated text and reference text through a token-level matching process using contextual embeddings. Here is a step-by-step breakdown of how it operates:

(Figure: the BERTScore pipeline. Source: the BERTScore paper.)
  1. Tokenization: Both candidate (generated) and reference texts are tokenized using the tokenizer corresponding to the pre-trained model being used (e.g., BERT, RoBERTa).
  2. Contextual Embedding: Each token is then embedded using a pre-trained contextual model. Importantly, these embeddings capture the meaning of words in context rather than static word representations. For example, the word “bank” would have different embeddings in “river bank” versus “financial bank.”
  3. Cosine Similarity Computation: For each token in the candidate text, BERTScore computes its cosine similarity with every token in the reference text, creating a similarity matrix.
  4. Greedy Matching:
    • For precision: Each candidate token is matched with the most similar reference token
    • For recall: Each reference token is matched with the most similar candidate token
  5. Importance Weighting (Optional): Tokens can be weighted by their inverse document frequency (IDF) to emphasize content words over function words.
  6. Score Aggregation:
    • Precision is calculated as the average of the maximum similarity scores for each candidate token
    • Recall is calculated as the average of the maximum similarity scores for each reference token
    • F1 combines precision and recall using the harmonic mean formula
  7. Score Normalization (Optional): Raw scores can be rescaled based on baseline scores to make them more interpretable.

This approach allows BERTScore to capture semantic equivalence even when different words are used to express the same meaning, making it more robust than lexical matching metrics for evaluating modern text generation systems.
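To make the greedy matching and aggregation steps concrete, here is a toy example with a hand-built similarity matrix; the numbers are illustrative, not produced by a real model.

import torch

# 3 candidate tokens x 4 reference tokens of pairwise cosine similarities
sim = torch.tensor([
    [0.92, 0.10, 0.30, 0.05],  # candidate token 1 vs each reference token
    [0.15, 0.88, 0.20, 0.10],  # candidate token 2
    [0.07, 0.12, 0.81, 0.25],  # candidate token 3
])

precision = sim.max(dim=1)[0].mean().item()  # best match per candidate token (rows)
recall = sim.max(dim=0)[0].mean().item()     # best match per reference token (columns)
f1 = 2 * precision * recall / (precision + recall)
print(f"P={precision:.4f}, R={recall:.4f}, F1={f1:.4f}")

Note how the fourth reference token has no strong match in the candidate, so recall is pulled down while precision stays high: the candidate says nothing wrong, but it fails to cover everything the reference says.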

Implementation in Python

Let’s implement BERTScore step by step to understand how it works in practice.

1. Setup and Installation

First, install the necessary packages:

# Install the bert-score package
pip install bert-score

2. Basic Implementation

Here’s how to calculate BERTScore between candidate and reference texts:

import bert_score

# Define reference and candidate texts
references = ["The cat sat on the mat.", "The feline rested on the floor covering."]
candidates = ["A cat was sitting on a mat.", "The cat was on the mat."]

# Calculate BERTScore
P, R, F1 = bert_score.score(
    candidates,
    references,
    lang="en",
    model_type="roberta-large",
    num_layers=17,
    verbose=True
)

# Print results
for i, (p, r, f) in enumerate(zip(P, R, F1)):
    print(f"Example {i+1}:")
    print(f"  Precision: {p.item():.4f}")
    print(f"  Recall: {r.item():.4f}")
    print(f"  F1: {f.item():.4f}")
    print()

Running this prints per-example precision, recall, and F1 scores, demonstrating how BERTScore captures semantic similarity even when different phrasings are used.

BERT Embeddings and Cosine Similarity

The core of BERTScore lies in how it leverages contextual embeddings and cosine similarity. Let’s break down the process:

1. Generating Contextual Embeddings: BERTScore is a genuine alternative to traditional n-gram-based measures because it is built on contextual embedding generation. Unlike static word embeddings (such as Word2Vec or GloVe), contextual embeddings account for the surrounding context when assigning meaning to a word, which makes them well suited to semantic similarity evaluation.

import torch
from transformers import AutoTokenizer, AutoModel

def get_bert_embeddings(texts, model_name="bert-base-uncased"):
    # Load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # Move model to GPU if available
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # Process texts in batch
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    encoded_input = {k: v.to(device) for k, v in encoded_input.items()}

    # Get model output
    with torch.no_grad():
        outputs = model(**encoded_input)

    # Use embeddings from the last layer
    embeddings = outputs.last_hidden_state

    # Remove padding tokens
    attention_mask = encoded_input['attention_mask']
    embeddings = [emb[mask.bool()] for emb, mask in zip(embeddings, attention_mask)]

    return embeddings

# Example usage
texts = ["The cat sat on the mat.", "A cat was sitting on a mat."]
embeddings = get_bert_embeddings(texts)
print(f"Number of texts: {len(embeddings)}")
print(f"Shape of first text embeddings: {embeddings[0].shape}")

Running this prints the number of texts and the shape of the first text's token embeddings (number of tokens × 768 for bert-base-uncased).
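To see the "bank" example from earlier in action, here is a minimal sketch reusing the get_bert_embeddings helper above; the sentences and token indices are illustrative and depend on the tokenizer.

import torch.nn.functional as F

sents = [
    "He sat on the river bank.",           # geographic sense
    "She opened an account at the bank.",  # financial sense
    "She paid the money into the bank.",   # financial sense
]
embs = get_bert_embeddings(sents)

# With bert-base-uncased, "bank" is the third-to-last token in each sentence
# (before "." and [SEP]); verify with the tokenizer if you change the text
banks = [e[-3] for e in embs]

print(f"river vs financial: {F.cosine_similarity(banks[0], banks[1], dim=0).item():.4f}")
print(f"financial vs financial: {F.cosine_similarity(banks[1], banks[2], dim=0).item():.4f}")

The same surface word gets noticeably more similar embeddings when it is used in the same sense, which is exactly the property BERTScore exploits.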

2. Computing Cosine Similarity: Once contextual embeddings have been generated for the reference and candidate texts, BERTScore calculates the semantic similarity between tokens using cosine similarity, a metric that measures how aligned two vectors are in the embedding space regardless of their magnitude.
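For two token embeddings $\mathbf{u}$ and $\mathbf{v}$, the cosine similarity is the dot product divided by the product of the norms:

$$\cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u}^{\top}\mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$$

Because the implementation below first normalizes every embedding to unit length, a single matrix multiplication then yields all pairwise cosine similarities at once.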

Now, let’s implement the cosine similarity calculation between tokens:

def token_cosine_similarity(embeddings1, embeddings2):
    # Normalize embeddings for cosine similarity
    embeddings1_norm = embeddings1 / embeddings1.norm(dim=1, keepdim=True)
    embeddings2_norm = embeddings2 / embeddings2.norm(dim=1, keepdim=True)

    # Pairwise dot products of unit vectors are cosine similarities
    similarity_matrix = torch.matmul(embeddings1_norm, embeddings2_norm.transpose(0, 1))
    return similarity_matrix

# Example usage with our previously generated embeddings
sim_matrix = token_cosine_similarity(embeddings[0], embeddings[1])
print(f"Shape of similarity matrix: {sim_matrix.shape}")
print("Similarity matrix (token-to-token):")
print(sim_matrix)

This prints a matrix whose entry (i, j) is the cosine similarity between token i of the first text and token j of the second.

BERTScore: Precision, Recall, and F1

Let’s implement the core BERTScore calculation from scratch to understand the mathematics behind it:

Mathematical Formulation

BERTScore calculates three metrics:

1. Precision: how well each token in the candidate text is matched by some token in the reference

2. Recall: how well each token in the reference text is covered by some token in the candidate

3. F1: the harmonic mean of precision and recall

For a candidate $x = \langle x_1, \dots, x_m \rangle$ and reference $y = \langle y_1, \dots, y_n \rangle$:

$$P_{\text{BERT}} = \frac{1}{m} \sum_{x_i \in x} \max_{y_j \in y} \cos(x_i, y_j)$$

$$R_{\text{BERT}} = \frac{1}{n} \sum_{y_j \in y} \max_{x_i \in x} \cos(x_i, y_j)$$

$$F_{\text{BERT}} = 2 \cdot \frac{P_{\text{BERT}} \cdot R_{\text{BERT}}}{P_{\text{BERT}} + R_{\text{BERT}}}$$

Where:

  • x and y are the candidate and reference texts, respectively
  • x_i and y_j are the contextual token embeddings

Implementation

def calculate_bertscore(candidate_embeddings, reference_embeddings):
    # Compute similarity matrix
    sim_matrix = token_cosine_similarity(candidate_embeddings, reference_embeddings)

    # Compute precision (max similarity for each candidate token)
    precision = sim_matrix.max(dim=1)[0].mean().item()

    # Compute recall (max similarity for each reference token)
    recall = sim_matrix.max(dim=0)[0].mean().item()

    # Compute F1
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0

    return precision, recall, f1

# Example
cand_emb = embeddings[0]  # "The cat sat on the mat."
ref_emb = embeddings[1]   # "A cat was sitting on a mat."

precision, recall, f1 = calculate_bertscore(cand_emb, ref_emb)
print("Custom BERTScore calculation:")
print(f"  Precision: {precision:.4f}")
print(f"  Recall: {recall:.4f}")
print(f"  F1: {f1:.4f}")

Running this prints precision, recall, and F1 computed from scratch for the two example sentences.

This implementation demonstrates the core algorithm behind BERTScore. The actual library includes additional optimizations, IDF weighting options, and baseline rescaling.
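As a hedged sketch of the IDF weighting option mentioned above, assuming a similarity matrix like the one computed earlier and one illustrative weight per candidate token (the real library derives IDF weights from the reference corpus):

import torch

def idf_weighted_precision(sim_matrix, cand_idf):
    # cand_idf: 1-D tensor with one IDF weight per candidate token
    max_sims = sim_matrix.max(dim=1)[0]  # best match per candidate token
    return (cand_idf * max_sims).sum().item() / cand_idf.sum().item()

# Toy usage: down-weight the first token as if it were a common function word
weights = torch.tensor([0.2] + [1.0] * (sim_matrix.shape[0] - 1))
print(f"IDF-weighted precision: {idf_weighted_precision(sim_matrix, weights):.4f}")

The effect is that matches on rare, content-bearing tokens count for more than matches on ubiquitous function words.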

Advantages and Limitations

| Advantages | Limitations |
|---|---|
| Captures semantic similarity beyond lexical overlap | Computationally more intensive than n-gram metrics |
| Correlates better with human judgments | Performance depends on the quality of underlying embeddings |
| Works well across different tasks and domains | May not capture structural or logical coherence |
| No training required specifically for evaluation | Can be sensitive to the choice of BERT layer and model |
| Handles synonyms and paraphrases naturally | Less interpretable than explicit matching metrics |
| Language-agnostic (with appropriate models) | Requires GPU for efficient processing of large datasets |
| Can be customized with different embedding models | Not designed to evaluate factual correctness |
| Effectively handles multiple valid references | May struggle with highly creative or unusual text |

Practical Applications

BERTScore has found wide application across numerous NLP tasks:

  1. Machine Translation: BERTScore helps evaluate translations by focusing on meaning preservation rather than exact wording, which is particularly valuable given the different valid ways to translate a sentence.
  2. Summarization: When evaluating summaries, BERTScore can identify when different phrasings capture the same key information, making it more flexible than ROUGE for assessing summary quality.
  3. Dialog Systems: For conversational AI, BERTScore can evaluate response appropriateness by measuring semantic similarity to reference responses, even when the wording differs significantly.
  4. Text Simplification: BERTScore can assess whether simplifications maintain the original meaning while using different vocabulary, a task where lexical overlap metrics often fall short.
  5. Content Creation: When evaluating AI-generated creative content, BERTScore can measure how well the generation captures the intended themes or information without requiring exact matching.

Comparison with Other Metrics

How does BERTScore stack up against other popular evaluation metrics?

| Metric | Basis | Strengths | Weaknesses | Human Correlation |
|---|---|---|---|---|
| BLEU | N-gram precision | Fast, interpretable | Surface-level, position-insensitive | Moderate |
| ROUGE | N-gram recall | Good for summarization | Misses semantic equivalence | Moderate |
| METEOR | Enhanced lexical matching | Handles synonyms | Still primarily lexical | Moderate-High |
| BERTScore | Contextual embeddings | Semantic understanding | Computationally intensive | High |
| BLEURT | Learned metric (fine-tuned) | Task-specific | Requires training | Very High |
| LLM-as-Judge | Direct LLM evaluation | Comprehensive | Black box, expensive | Very High |

BERTScore offers a balance between sophistication and practicality, capturing semantic similarity without requiring task-specific training.
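To see this trade-off in one concrete example, the hedged sketch below scores a paraphrase with both the sacrebleu and bert-score packages (assuming both are installed; exact numbers vary by version):

import sacrebleu
from bert_score import score

reference = "The quick brown fox jumps over the lazy dog."
paraphrase = "A fast auburn fox leaps above the idle hound."

bleu = sacrebleu.sentence_bleu(paraphrase, [reference]).score
_, _, F1 = score([paraphrase], [reference], lang="en")

print(f"BLEU: {bleu:.1f}")               # low: almost no n-gram overlap
print(f"BERTScore F1: {F1.item():.4f}")  # much higher: the meaning is preserved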

Conclusion

BERTScore represents a significant advancement in the evaluation of text generation, leveraging the semantic understanding capabilities of contextual embeddings. Its ability to capture meaning beyond surface-level lexical matches makes it valuable for evaluating modern language models, where creativity and variation in outputs are both expected and desired.

While no single metric can perfectly assess text quality, BERTScore provides a reliable framework that aligns with human evaluation across diverse tasks and produces consistent results. Combined with traditional metrics and human analysis, it enables deeper insight into language generation capabilities.

As language models evolve, tools like BERTScore become essential for identifying model strengths and weaknesses and for improving the overall quality of natural language generation systems.
