Evaluating Large Language Models (LLMs) is essential for understanding their performance, reliability, and applicability in various contexts. This evaluation process involves assessing models against established benchmarks and metrics to ensure they generate accurate, coherent, and contextually relevant responses, ultimately enhancing their utility in real-world applications. As LLMs continue to evolve, robust evaluation methodologies are crucial for maintaining their effectiveness and addressing challenges such as bias and safety. DeepEval is one such framework built for this purpose.
DeepEval is an open-source evaluation framework designed to assess Large Language Model (LLM) performance. It provides a comprehensive suite of metrics and features, including the ability to generate synthetic datasets, perform real-time evaluations, and integrate seamlessly with testing frameworks like pytest. By facilitating easy customization and iteration on LLM applications, DeepEval enhances the reliability and effectiveness of AI models in various contexts.
DeepEval serves as a comprehensive platform for evaluating LLM performance, offering a user-friendly interface and extensive functionality. It enables developers to create unit tests for model outputs, ensuring that LLMs meet specific performance criteria. The framework operates entirely on local infrastructure, which enhances security and flexibility while facilitating real-time production monitoring and advanced synthetic dataset generation.
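As a rough, hedged sketch of synthetic dataset generation: the Synthesizer class and the generate_goldens_from_docs method below follow the DeepEval documentation at the time of writing, but the exact API may differ between versions, and the document path is a placeholder.

from deepeval.synthesizer import Synthesizer

# 'docs/faq.pdf' is a placeholder path; point this at your own documents
synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(document_paths=["docs/faq.pdf"])
# The generated input/expected-output pairs ("goldens") can then be inspected or reused
print(synthesizer.synthetic_goldens)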
DeepEval provides over 14 research-backed metrics tailored for different evaluation scenarios, including G-Eval, answer relevancy, faithfulness, summarization, toxicity, and conversational metrics such as knowledge retention and conversation completeness.
Users can easily develop their own custom evaluation metrics to suit specific needs. This flexibility allows for tailored assessments that can adapt to various contexts and requirements.
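As an illustration, here is a minimal sketch of what a custom metric might look like, assuming DeepEval's BaseMetric interface; the metric name, character budget, and threshold below are purely illustrative.

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    # Illustrative custom metric: passes if the output stays under a character budget
    def __init__(self, max_chars: int = 500, threshold: float = 0.5):
        self.max_chars = max_chars
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        self.score = 1.0 if len(test_case.actual_output) <= self.max_chars else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"

Once defined, such a metric can be passed to measure() or evaluate() just like any built-in metric.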
DeepEval supports evaluations using any LLM, including those from OpenAI. This capability ensures that users can benchmark their models against popular standards like MMLU and HumanEval, making it easier to transition between different LLM providers or configurations.
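As a rough sketch of what benchmarking might look like (the benchmark classes and task names below are assumptions based on the DeepEval documentation, and your_custom_llm is a hypothetical wrapper implementing DeepEval's custom-LLM interface, DeepEvalBaseLLM):

from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

# 'your_custom_llm' is a hypothetical DeepEvalBaseLLM wrapper around your model
benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE, MMLUTask.ASTRONOMY],
    n_shots=3
)
benchmark.evaluate(model=your_custom_llm)
print(benchmark.overall_score)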
The framework facilitates real-time monitoring of LLM performance in production environments. It also offers comprehensive benchmarking capabilities, allowing users to evaluate their models against established datasets efficiently.
With its Pytest-like architecture, DeepEval simplifies the testing process into just a few lines of code. This ease of use enables developers to quickly implement tests without extensive setup or configuration.
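For example, a minimal pytest-style test file might look like the following sketch (the file name, question, and threshold are illustrative):

# test_llm.py
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of Spain?",
        # Replace with the actual output from your LLM application
        actual_output="The capital of Spain is Madrid."
    )
    # Fails the test if the metric score falls below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

The file can then be executed with deepeval test run test_llm.py, much like a regular pytest suite.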
DeepEval includes functionality for batch evaluations, significantly speeding up the benchmarking process when implemented with custom LLMs. This feature is particularly useful for large-scale evaluations where time efficiency is crucial.
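As a small illustration, a batch run simply passes lists of test cases and metrics to evaluate(); the test cases below are placeholders.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Placeholder test cases; in practice these hold your model's real outputs
test_cases = [
    LLMTestCase(input="What is the capital of Spain?", actual_output="Madrid."),
    LLMTestCase(input="What is the capital of France?", actual_output="Paris."),
]
# Scores every test case against every metric in a single pass
evaluate(test_cases, [AnswerRelevancyMetric(threshold=0.7)])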
We will evaluate the outputs of the Falcon 3 3B model using DeepEval, pulling the model with Ollama and running the evaluation on Google Colab.
!pip install deepeval==2.1.5
!sudo apt update
!sudo apt install -y pciutils
!pip install langchain-ollama
!curl -fsSL https://ollama.com/install.sh | sh
!pip install ollama==0.4.2
import threading
import subprocess
import time

# Start the Ollama server in a background thread so the notebook can keep running
def run_ollama_serve():
    subprocess.Popen(["ollama", "serve"])

thread = threading.Thread(target=run_ollama_serve)
thread.start()
time.sleep(5)  # give the server a few seconds to start
!ollama pull falcon3:3b
import os
os.environ['OPENAI_API_KEY'] = ''  # paste your OpenAI API key here
We will use the GPT-4 model here as the evaluator for the LLM's answers. Below, we query the model, capture its generated output, and then measure different metrics on it.
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from IPython.display import Markdown
template = """Question: {question}
Answer: Let's think step by step."""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="falcon3:3b")
chain = prompt | model
query = 'How is Gurgaon Connected to Noida?'
# Prepare input for invocation
input_data = {"question": query}

# Invoke the chain with the input data and print the response
actual_output = chain.invoke(input_data)
print(actual_output)
We will then measure the Answer Relevancy Metric. The answer relevancy metric measures how relevant the actual_output of your LLM application is compared to the provided input. This is an important metric in RAG evaluations as well.
The AnswerRelevancyMetric first uses an LLM to extract all statements made in the actual_output, before using the same LLM to classify whether each statement is relevant to the input.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input=query,
    actual_output=actual_output
)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
As seen from the output above, the Answer Relevancy Metric comes out to be 1 here because the output from the Falcon 3 3B model is in alignment with the asked query.
G-Eval is a framework that uses LLMs with chain-of-thought (CoT) reasoning to evaluate LLM outputs based on ANY custom criteria. It is a two-step algorithm that first generates a series of evaluation steps from your criteria using chain of thought, and then uses those generated steps to determine the final score. When you provide evaluation_steps, the GEval metric skips the first step and uses the provided steps to determine the final score instead.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)
from deepeval.test_case import LLMTestCase
...
query="The dog chased the cat up the tree, who ran up the tree?"
# Prepare input for invocation
input_data = {
"question": query}
# Invoke the chain with input data and display the response in Markdown format
actual_output = chain.invoke(input_data)
print(actual_output)
test_case = LLMTestCase(
input=query,
actual_output=actual_output,
expected_output="The cat."
)
correctness_metric.measure(test_case)
print(correctness_metric.score)
print(correctness_metric.reason)
As we can see, the correctness metric score comes out to be very low here because the model’s output contains the wrong answer “dog”, when it should ideally have been “cat”.
The prompt alignment metric measures whether your LLM application is able to generate actual_outputs that align with any instructions specified in your prompt template.
from deepeval import evaluate
from deepeval.metrics import PromptAlignmentMetric
from deepeval.test_case import LLMTestCase
# QUERYING THE MODEL
template = """Question: {question}
Answer: Answer in Upper case."""

prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="falcon3:3b")
chain = prompt | model

query = "What is the capital of Spain?"

# Prepare input for invocation
input_data = {"question": query}

# Invoke the chain with the input data and display the response in Markdown format
actual_output = chain.invoke(input_data)
display(Markdown(actual_output))

# MEASURING PROMPT ALIGNMENT
metric = PromptAlignmentMetric(
    prompt_instructions=["Reply in all uppercase"],
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input=query,
    # Replace this with the actual output from your LLM application
    actual_output=actual_output
)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
As we can see, the Prompt Alignment metric score comes out to be 0 here because the model’s output doesn’t contain the answer “Madrid” in uppercase, as was instructed.
The JSON correctness metric measures whether your LLM application is able to generate actual_outputs with the correct JSON schema. The JsonCorrectnessMetric does not use an LLM for evaluation; instead, it uses the provided expected_schema to determine whether the actual_output can be loaded into the schema.
from pydantic import BaseModel

# The expected schema: a JSON object with a single 'name' string field
class ExampleSchema(BaseModel):
    name: str
from deepeval import evaluate
from deepeval.metrics import JsonCorrectnessMetric
from deepeval.test_case import LLMTestCase
# QUERYING THE MODEL
template = """Question: {question}
Answer: Let's think step by step."""

prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="falcon3:3b")
chain = prompt | model

query = "Output me a random Json with the 'name' key"

# Prepare input for invocation
input_data = {"question": query}

# Invoke the chain with the input data and print the response
actual_output = chain.invoke(input_data)
print(actual_output)
# MEASURING THE METRIC
metric = JsonCorrectnessMetric(
    expected_schema=ExampleSchema,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input=query,
    # Replace this with the actual output from your LLM application
    actual_output=actual_output
)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
Output From Falcon 3 3B Model
{
"name": "John Doe"
}
Metric Score & Reason
0
The generated Json is not valid because it does not meet the expected json
schema. It lacks the 'required' array in the properties of 'name'. The
property of 'name' does not have a 'title' field.
As we can see, the metric score comes out to be 0 here because the model’s output does not fully conform to the predefined JSON schema.
The summarization metric uses LLMs to determine whether your LLM (application) is generating factually correct summaries while including the necessary details from the original text.
The Summarization Metric score is calculated as the minimum of two sub-scores, an alignment score (is the summary free of contradictory or hallucinated information?) and a coverage score (does the summary include the necessary information from the original text?), i.e. summarization score = min(alignment score, coverage score).
# This is the original text to be summarized
text = """
Rice is the staple food of Bengal. Bhortas (lit. "mashed") are a really common type of food used as an additive to rice. There are several types of bhortas, such as ilish bhorta, shutki bhorta, begoon bhorta and more. Fish and other seafood are also important because Bengal is a riverine region.
Some fishes like puti (Puntius species) are fermented. Fish curry is prepared with fish alone or in combination with vegetables. Shutki maach is made using the age-old method of preservation where the food item is dried in the sun and air, thus removing the water content. This allows for preservation that can make the fish last for months, even years, in Bangladesh.
"""

template = """Question: {question}
Answer: Let's think step by step."""

prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="falcon3:3b")
chain = prompt | model

query = "Summarize the text for me %s" % (text)

# Prepare input for invocation
input_data = {"question": query}

# Invoke the chain with the input data and print the response
actual_output = chain.invoke(input_data)
print(actual_output)
Output (Summary) From Model
Rice, along with Bhortas (mashed) dishes, are staples in Bengal. Fish curry
and age-old preservation methods like Shutki maach highlight the region's
seafood culture.
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...
test_case = LLMTestCase(input=text, actual_output=actual_output)
metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4"
)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
# or evaluate test cases in bulk
evaluate([test_case], [metric])
As we can see, the metric score comes out to be 0.4 here because the model’s output, a summary of the original text, misses several key points present in the original text.
In conclusion, DeepEval stands out as a powerful and flexible platform for evaluating LLMs, offering a range of features that streamline the testing and benchmarking process. Its comprehensive suite of metrics, support for custom evaluations, and integration with any LLM make it an invaluable tool for developers aiming to optimize model performance. With capabilities like real-time monitoring, simplified testing, and batch evaluation, DeepEval ensures efficient and reliable assessments, enhancing both security and flexibility in production environments.
Q1. What is DeepEval?
Ans. DeepEval is a comprehensive platform designed to evaluate LLM (Large Language Model) performance. It offers a user-friendly interface, a wide range of evaluation metrics, and supports real-time monitoring of model outputs. It enables developers to create unit tests for model outputs to ensure they meet specific performance criteria.
Q2. What metrics does DeepEval provide?
Ans. DeepEval provides over 14 research-backed metrics for diverse evaluation scenarios. Key metrics include G-Eval for chain-of-thought-based custom criteria, Faithfulness for accuracy, Toxicity for harmful content detection, Answer Relevancy for response alignment with user expectations, and various Conversational Metrics for dialogue evaluation, such as Knowledge Retention and Conversation Completeness.
Q3. Can I create custom evaluation metrics with DeepEval?
Ans. Yes, DeepEval allows users to develop custom evaluation metrics tailored to their specific needs. This flexibility enables developers to assess models based on unique criteria or requirements, providing a more personalized evaluation process.
Q4. Is DeepEval compatible with any LLM?
Ans. Yes, DeepEval is compatible with any LLM, including popular models from OpenAI. It allows users to benchmark their models against recognized standards like MMLU and HumanEval, making it easy to switch between different LLM providers or configurations.
Q5. How does DeepEval simplify the testing process?
Ans. DeepEval simplifies the testing process with a Pytest-like architecture, enabling developers to implement tests with just a few lines of code. Additionally, it supports batch evaluations, which speeds up the benchmarking process, especially for large-scale assessments.