A Guide on Effective LLM Assessment with DeepEval

Nibedita Dutta Last Updated : 24 Jan, 2025
10 min read

Evaluating Large Language Models (LLMs) is essential for understanding their performance, reliability, and applicability in various contexts. This evaluation process involves assessing models against established benchmarks and metrics to ensure they generate accurate, coherent, and contextually relevant responses, ultimately enhancing their utility in real-world applications. As LLMs continue to evolve, robust evaluation frameworks such as DeepEval are crucial for maintaining their effectiveness and addressing challenges such as bias and safety.

DeepEval is an open-source evaluation framework designed to assess Large Language Model (LLM) performance. It provides a comprehensive suite of metrics and features, including the ability to generate synthetic datasets, perform real-time evaluations, and integrate seamlessly with testing frameworks like pytest. By facilitating easy customization and iteration on LLM applications, DeepEval enhances the reliability and effectiveness of AI models in various contexts.

Learning Objectives

  • Overview of DeepEval as a comprehensive framework for evaluating large language models (LLMs).
  • Examination of the core functionalities that make DeepEval an effective evaluation tool.
  • Detailed discussion on the various metrics available for LLM assessment.
  • Application of DeepEval to analyze the performance of the Falcon 3 3B model.
  • Focus on key evaluation metrics.

This article was published as a part of the Data Science Blogathon.

What is DeepEval?

DeepEval serves as a comprehensive platform for evaluating LLM performance, offering a user-friendly interface and extensive functionality. It enables developers to create unit tests for model outputs, ensuring that LLMs meet specific performance criteria. The framework operates entirely on local infrastructure, which enhances security and flexibility while facilitating real-time production monitoring and advanced synthetic dataset generation.

Key Features of DeepEval

Some metrics in the DeepEval framework

1. Extensive Metric Suite

DeepEval provides over 14 research-backed metrics tailored for different evaluation scenarios. These metrics include:

  • G-Eval: A versatile metric that utilizes chain-of-thought reasoning to evaluate outputs based on custom criteria.
  • Faithfulness: Measures the accuracy and reliability of the information provided by the model.
  • Toxicity: Assesses the likelihood of harmful or offensive content in the generated text.
  • Answer Relevancy: Evaluates how well the model’s responses align with user expectations.
  • Conversational Metrics: These metrics, such as Knowledge Retention and Conversation Completeness, are designed specifically for evaluating dialogues rather than individual outputs.
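
To get a feel for the API, here is a minimal sketch of how a few of these metrics are instantiated from deepeval.metrics (the thresholds and judge model below are illustrative choices, not library defaults):

from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ToxicityMetric,
)

# Each metric takes a pass/fail threshold and the LLM used as the judge
answer_relevancy = AnswerRelevancyMetric(threshold=0.7, model="gpt-4", include_reason=True)
faithfulness = FaithfulnessMetric(threshold=0.7, model="gpt-4")
toxicity = ToxicityMetric(threshold=0.5, model="gpt-4")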

2. Custom Metric Development

Users can easily develop their own custom evaluation metrics to suit specific needs. This flexibility allows for tailored assessments that can adapt to various contexts and requirements.
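
A custom metric is typically written by subclassing DeepEval’s BaseMetric and implementing measure() and is_successful(). The sketch below follows the pattern shown in DeepEval’s documentation; LengthMetric and min_length are hypothetical names used purely for illustration:

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    # Hypothetical metric: passes if the output is at least min_length characters long
    def __init__(self, min_length: int = 50):
        self.threshold = min_length

    def measure(self, test_case: LLMTestCase) -> float:
        self.score = len(test_case.actual_output)
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # The async version simply reuses the synchronous logic here
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"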

3. Integration with LLMs

DeepEval supports evaluations using any LLM, including those from OpenAI. This capability ensures that users can benchmark their models against popular standards like MMLU and HumanEval, making it easier to transition between different LLM providers or configurations.
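
As a rough sketch of how benchmarking looks in code (the task selection and n_shots value are illustrative, and my_llm is a placeholder for a model wrapped in DeepEval’s DeepEvalBaseLLM interface):

from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

# my_llm is assumed to be your model wrapped with DeepEval's DeepEvalBaseLLM interface
benchmark = MMLU(tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE], n_shots=3)
benchmark.evaluate(model=my_llm)
print(benchmark.overall_score)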

4. Real-Time Monitoring and Benchmarking

The framework facilitates real-time monitoring of LLM performance in production environments. It also offers comprehensive benchmarking capabilities, allowing users to evaluate their models against established datasets efficiently.

5. Simplified Testing Process

With its Pytest-like architecture, DeepEval simplifies the testing process into just a few lines of code. This ease of use enables developers to quickly implement tests without extensive setup or configuration.
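
A minimal sketch of what such a test file can look like (the inputs here are made up; the file would typically be executed with the deepeval test run command):

# test_example.py
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of Spain?",
        actual_output="The capital of Spain is Madrid.",  # replace with your LLM's output
    )
    # Fails the test if the metric score falls below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])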

6. Batch Evaluation Support

DeepEval includes functionality for batch evaluations, significantly speeding up the benchmarking process when implemented with custom LLMs. This feature is particularly useful for large-scale evaluations where time efficiency is crucial.
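
In code, batch evaluation is a matter of passing lists of test cases and metrics to evaluate(), as in this minimal sketch (the test cases are made up for illustration):

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_cases = [
    LLMTestCase(input="What is the capital of Spain?",
                actual_output="Madrid is the capital of Spain."),
    LLMTestCase(input="Who wrote Hamlet?",
                actual_output="Hamlet was written by William Shakespeare."),
]
# Every test case is scored against every metric in a single call
evaluate(test_cases, [AnswerRelevancyMetric(threshold=0.7)])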

Also Read: How to Evaluate a Large Language Model (LLM)?

Hands-On Guide on Evaluation of LLM Model Using DeepEval

We will evaluate the Falcon 3 3B model’s outputs using DeepEval. We will use Ollama to pull the model and then run the evaluation on Google Colab.

Step 1. Installing Necessary Libraries

!pip install deepeval==2.1.5
!sudo apt update
!sudo apt install -y pciutils
!pip install langchain-ollama
!curl -fsSL https://ollama.com/install.sh | sh
!pip install ollama==0.4.2

Step 2. Enabling Threading to Run the Ollama Server on Google Colab

import threading
import subprocess
import time

def run_ollama_serve():
  subprocess.Popen(["ollama", "serve"])

thread = threading.Thread(target=run_ollama_serve)
thread.start()
time.sleep(5)

Step 3. Pulling the Ollama Model & Defining the OpenAI API Key

!ollama pull falcon3:3b
import os
os.environ['OPENAI_API_KEY'] = ''

We will use GPT-4 as the judge model here to evaluate the answers from the LLM.

Step 4. Querying the Model & Measuring Different Metrics

Below, we will query the model and measure different metrics.

Answer Relevancy Metric

We start with querying our model and getting the output generated from it.

from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from IPython.display import Markdown

template = """Question: {question}

Answer: Let's think step by step."""

prompt = ChatPromptTemplate.from_template(template)

model = OllamaLLM(model="falcon3:3b")

chain = prompt | model
query = 'How is Gurgaon Connected to Noida?'
#Prepare input for invocation
input_data = {
    "question": query  }

#Invoke the chain with input data and display the response in Markdown format
actual_output = chain.invoke(input_data)
print(actual_output)
Output

We will then measure the Answer Relevancy Metric. The answer relevancy metric measures how relevant the actual_output of your LLM application is compared to the provided input. This is an important metric in RAG evaluations as well.


The AnswerRelevancyMetric first uses an LLM to extract all statements made in the actual_output, before using the same LLM to classify whether each statement is relevant to the input.  

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input=query ,
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
Output

As seen from the output above, the Answer Relevancy Metric score is 1 here because the output from the Falcon 3 3B model aligns with the query asked.

G-Eval Metric

G-Eval is a framework that uses LLMs with chain-of-thought (CoT) reasoning to evaluate LLM outputs based on any custom criteria. G-Eval is a two-step algorithm:

  1. First, it generates a series of evaluation_steps using chain-of-thought (CoT) reasoning based on the given criteria.
  2. Second, it uses the generated steps to determine the final score using the parameters presented in an LLMTestCase.

When you provide evaluation_steps, the GEval metric skips the first step and uses the provided steps to determine the final score instead.

Defining the Custom Criteria & Evaluation Steps

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # NOTE: when evaluation_steps is provided, GEval uses them directly and skips generating steps from the criteria
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

Measuring the Metric With the Output From the Previously Defined Falcon 3 3B Model

from deepeval.test_case import LLMTestCase
...
query="The dog chased the cat up the tree, who ran up the tree?"
# Prepare input for invocation
input_data = {
    "question": query}

# Invoke the chain with input data and display the response in Markdown format
actual_output = chain.invoke(input_data)
print(actual_output)

test_case = LLMTestCase(
    input=query,
    actual_output=actual_output,
    expected_output="The cat."
)

correctness_metric.measure(test_case)
print(correctness_metric.score)
print(correctness_metric.reason)
Output

As we can see, the correctness metric score is very low here because the model’s output contains the wrong answer “dog”, which ideally should have been “cat”.

Prompt Alignment Metric

The prompt alignment metric measures whether your LLM application is able to generate actual_outputs that align with any instructions specified in your prompt template.

from deepeval import evaluate
from deepeval.metrics import PromptAlignmentMetric
from deepeval.test_case import LLMTestCase

#QUERYING THE MODEL
template = """Question: {question}

Answer: Answer in Upper case."""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="falcon3:3b")
chain = prompt | model
query = "What is capital of Spain?"
# Prepare input for invocation
input_data = {
    "question": query}
# Invoke the chain with input data and display the response in Markdown format
actual_output = chain.invoke(input_data)
display(Markdown(actual_output))

#MEASURING THE PROMPT ALIGNMENT METRIC
metric = PromptAlignmentMetric(
    prompt_instructions=["Reply in all uppercase"],
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input=query,
    # Replace this with the actual output from your LLM application
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)
Output

As we can see, the Prompt Alignment metric score is 0 here because the model’s output does not give the answer “Madrid” in uppercase as instructed.

JSON Correctness Metric

The JSON correctness metric measures whether your LLM application is able to generate actual_outputs that conform to the correct JSON schema.


The JSON Correctness Metric does not use an LLM for evaluation; instead, it uses the provided expected_schema to determine whether the actual_output can be loaded into that schema.

Defining the Desired Output Schema

from pydantic import BaseModel

class ExampleSchema(BaseModel):
    name: str

Querying Our Model & Measuring the Metric

from deepeval import evaluate
from deepeval.metrics import JsonCorrectnessMetric
from deepeval.test_case import LLMTestCase

#QUERYING THE MODEL
template = """Question: {question}

Answer:  Let's think step by step."""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="falcon3:3b")
chain = prompt | model
query ="Output me a random Json with the 'name' key"
# Prepare input for invocation
input_data = {
    "question": query}
# Invoke the chain with input data and display the response in Markdown format
actual_output = chain.invoke(input_data)
print(actual_output)

#MEASURING THE METRIC
metric = JsonCorrectnessMetric(
    expected_schema=ExampleSchema,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="Output me a random Json with the 'name' key",
    # Replace this with the actual output from your LLM application
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

Output From Falcon 3 3B Model

{
"name": "John Doe"
}

Metric Score & Reason

0
The generated Json is not valid because it does not meet the expected json
schema. It lacks the 'required' array in the properties of 'name'. The
property of 'name' does not have a 'title' field.

As we can see, the metric score is 0 here because the model’s output does not fully conform to the predefined JSON schema.

Summarization Metric

The summarization metric uses LLMs to determine whether your LLM (application) is generating factually correct summaries while including the necessary details from the original text.

The Summarization Metric score is calculated as the minimum of two component scores:

summarization_score = min(alignment_score, coverage_score)
  • alignment_score determines whether the summary contains hallucinated or contradictory information to the original text.
  • coverage_score determines whether the summary contains the necessary information from the original text.
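
For instance, if the judge model assigns an alignment_score of 0.9 but a coverage_score of 0.4 because the summary omits several details, the overall summarization score would be min(0.9, 0.4) = 0.4.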

Querying Our Model & Generating Model’s Output

# This is the original text to be summarized
text = """
Rice is the staple food of Bengal. Bhortas (lit-"mashed") are a really common type of food used as an additive too rice. there are several types of Bhortas such as Ilish bhorta shutki bhorta, begoon bhorta and more. Fish and other seafood are also important because Bengal is a reverrine region.

Some fishes like puti (Puntius species) are fermented. Fish curry is prepared with fish alone or in combination with vegetables.Shutki maach is made using the age-old method of preservation where the food item is dried in the sun and air, thus removing the water content. This allows for preservation that can make the fish last for months, even years in Bangladesh
"""

template = """Question: {question}

Answer:  Let's think step by step."""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="falcon3:3b")
chain = prompt | model
query ="Summarize the text for me %s"%(text)
# Prepare input for invocation
input_data = {
    "question": query}
# Invoke the chain with input data and display the response in Markdown format
actual_output = chain.invoke(input_data)
print(actual_output)

Output (Summary) From Model

Rice, along with Bhortas (mashed) dishes, are staples in Bengal. Fish curry
and age-old preservation methods like Shutki maach highlight the region's
seafood culture.

Measuring the Metric

from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...

test_case = LLMTestCase(input=text, actual_output=actual_output)
metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4"

)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])
Output

As we can see, the metric score is 0.4 here because the model’s summary omits many of the key points present in the original text.

Also read: Making Sure Super-Smart AI Plays Nice: Testing Knowledge, Goals, and Safety

Conclusions

In conclusion, DeepEval stands out as a powerful and flexible platform for evaluating LLMs, offering a range of features that streamline the testing and benchmarking process. Its comprehensive suite of metrics, support for custom evaluations, and integration with any LLM make it an invaluable tool for developers aiming to optimize model performance. With capabilities like real-time monitoring, simplified testing, and batch evaluation, DeepEval ensures efficient and reliable assessments, enhancing both security and flexibility in production environments.

Key Takeaways

  1. Comprehensive Evaluation Platform: DeepEval provides a robust platform for evaluating LLM performance, offering a user-friendly interface, real-time monitoring, and advanced dataset generation—all running on local infrastructure for enhanced security and flexibility.
  2. Extensive Metric Suite: The framework includes over 14 research-backed metrics, such as G-Eval, Faithfulness, Toxicity, and conversational metrics, designed to address a wide variety of evaluation scenarios and provide thorough insights into model performance.
  3. Customizable Metrics: DeepEval allows users to develop custom evaluation metrics tailored to specific needs, making it adaptable to diverse contexts and enabling personalized assessments.
  4. Integration with Multiple LLMs: The platform supports evaluations across any LLM, including those from OpenAI, facilitating benchmarking against popular standards like MMLU and HumanEval, and offering seamless transitions between different LLM configurations.
  5. Efficient Testing and Batch Evaluation: With a simplified testing process (Pytest-like architecture) and batch evaluation support, DeepEval makes it easier to implement tests quickly and efficiently, especially for large-scale evaluations where time efficiency is essential.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is DeepEval and how does it help in evaluating LLMs?

Ans. DeepEval is a comprehensive platform designed to evaluate LLM (Large Language Model) performance. It offers a user-friendly interface, a wide range of evaluation metrics, and supports real-time monitoring of model outputs. It enables developers to create unit tests for model outputs to ensure they meet specific performance criteria.

Q2. What evaluation metrics does DeepEval offer?

Ans. DeepEval provides over 14 research-backed metrics for diverse evaluation scenarios. Key metrics include G-Eval for chain-of-thought-based evaluation against custom criteria, Faithfulness for accuracy, Toxicity for harmful content detection, Answer Relevancy for response alignment with user expectations, and various Conversational Metrics for dialogue evaluation, such as Knowledge Retention and Conversation Completeness.

Q3. Can I create custom evaluation metrics with DeepEval?

Ans. Yes, DeepEval allows users to develop custom evaluation metrics tailored to their specific needs. This flexibility enables developers to assess models based on unique criteria or requirements, providing a more personalized evaluation process.

Q4. Does DeepEval support integration with all LLMs?

Ans. Yes, DeepEval is compatible with any LLM, including popular models from OpenAI. It allows users to benchmark their models against recognized standards like MMLU and HumanEval, making it easy to switch between different LLM providers or configurations.

Q5. How does DeepEval simplify the testing process?

Ans. DeepEval simplifies the testing process with a Pytest-like architecture, enabling developers to implement tests with just a few lines of code. Additionally, it supports batch evaluations, which speeds up the benchmarking process, especially for large-scale assessments.

Nibedita completed her master’s in Chemical Engineering from IIT Kharagpur in 2014 and is currently working as a Senior Data Scientist. In her current capacity, she works on building intelligent ML-based solutions to improve business processes.
