Evaluating Large Language Models (LLMs) is essential for understanding their performance, reliability, and applicability in various contexts. This evaluation process involves assessing models against established benchmarks and metrics to ensure they generate accurate, coherent, and contextually relevant responses, ultimately enhancing their utility in real-world applications. As LLMs continue to evolve, robust evaluation methodologies are crucial for maintaining their effectiveness and addressing challenges such as bias and safety. DeepEval is one such framework built for this purpose.
DeepEval is an open-source evaluation framework designed to assess Large Language Model (LLM) performance. It provides a comprehensive suite of metrics and features, including the ability to generate synthetic datasets, perform real-time evaluations, and integrate seamlessly with testing frameworks like pytest. By facilitating easy customization and iteration on LLM applications, DeepEval enhances the reliability and effectiveness of AI models in various contexts.
DeepEval serves as a comprehensive platform for evaluating LLM performance, offering a user-friendly interface and extensive functionality. It enables developers to create unit tests for model outputs, ensuring that LLMs meet specific performance criteria. The framework operates entirely on local infrastructure, which enhances security and flexibility while facilitating real-time production monitoring and advanced synthetic dataset generation.
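As a rough, hedged sketch of synthetic dataset generation: the Synthesizer class and the generate_goldens_from_docs method below follow the DeepEval documentation at the time of writing, but the exact API may differ between versions, and the document path is a placeholder.

from deepeval.synthesizer import Synthesizer

# 'docs/faq.pdf' is a placeholder path; point this at your own documents
synthesizer = Synthesizer()
synthesizer.generate_goldens_from_docs(document_paths=["docs/faq.pdf"])
# The generated input/expected-output pairs ("goldens") can then be inspected or reused
print(synthesizer.synthetic_goldens)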
DeepEval provides over 14 research-backed metrics tailored for different evaluation scenarios, including G-Eval, answer relevancy, faithfulness, summarization, toxicity, and conversational metrics such as knowledge retention and conversation completeness.
Users can easily develop their own custom evaluation metrics to suit specific needs. This flexibility allows for tailored assessments that can adapt to various contexts and requirements.
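As an illustration, here is a minimal sketch of what a custom metric might look like, assuming DeepEval's BaseMetric interface; the metric name, character budget, and threshold below are purely illustrative.

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class LengthMetric(BaseMetric):
    # Illustrative custom metric: passes if the output stays under a character budget
    def __init__(self, max_chars: int = 500, threshold: float = 0.5):
        self.max_chars = max_chars
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        self.score = 1.0 if len(test_case.actual_output) <= self.max_chars else 0.0
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Length"

Once defined, such a metric can be passed to measure() or evaluate() just like any built-in metric.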
DeepEval supports evaluations using any LLM, including those from OpenAI. This capability ensures that users can benchmark their models against popular standards like MMLU and HumanEval, making it easier to transition between different LLM providers or configurations.
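As a rough sketch of what benchmarking might look like (the benchmark classes and task names below are assumptions based on the DeepEval documentation, and your_custom_llm is a hypothetical wrapper implementing DeepEval's custom-LLM interface, DeepEvalBaseLLM):

from deepeval.benchmarks import MMLU
from deepeval.benchmarks.tasks import MMLUTask

# 'your_custom_llm' is a hypothetical DeepEvalBaseLLM wrapper around your model
benchmark = MMLU(
    tasks=[MMLUTask.HIGH_SCHOOL_COMPUTER_SCIENCE, MMLUTask.ASTRONOMY],
    n_shots=3
)
benchmark.evaluate(model=your_custom_llm)
print(benchmark.overall_score)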
The framework facilitates real-time monitoring of LLM performance in production environments. It also offers comprehensive benchmarking capabilities, allowing users to evaluate their models against established datasets efficiently.
With its Pytest-like architecture, DeepEval simplifies the testing process into just a few lines of code. This ease of use enables developers to quickly implement tests without extensive setup or configuration.
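For example, a minimal pytest-style test file might look like the following sketch (the file name, question, and threshold are illustrative):

# test_llm.py
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevancy():
    test_case = LLMTestCase(
        input="What is the capital of Spain?",
        # Replace with the actual output from your LLM application
        actual_output="The capital of Spain is Madrid."
    )
    # Fails the test if the metric score falls below the threshold
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])

The file can then be executed with deepeval test run test_llm.py, much like a regular pytest suite.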
DeepEval includes functionality for batch evaluations, significantly speeding up the benchmarking process when implemented with custom LLMs. This feature is particularly useful for large-scale evaluations where time efficiency is crucial.
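As a small illustration, a batch run simply passes lists of test cases and metrics to evaluate(); the test cases below are placeholders.

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Placeholder test cases; in practice these hold your model's real outputs
test_cases = [
    LLMTestCase(input="What is the capital of Spain?", actual_output="Madrid."),
    LLMTestCase(input="What is the capital of France?", actual_output="Paris."),
]
# Scores every test case against every metric in a single pass
evaluate(test_cases, [AnswerRelevancyMetric(threshold=0.7)])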
We will evaluate the outputs of the Falcon 3 3B model using DeepEval, pulling the model with Ollama and running the evaluation on Google Colab.
!pip install deepeval==2.1.5
!sudo apt update
!sudo apt install -y pciutils
!pip install langchain-ollama
!curl -fsSL https://ollama.com/install.sh | sh
!pip install ollama==0.4.2
import threading
import subprocess
import time

# Start the Ollama server in a background thread so the notebook can keep running
def run_ollama_serve():
    subprocess.Popen(["ollama", "serve"])

thread = threading.Thread(target=run_ollama_serve)
thread.start()
time.sleep(5)  # give the server a few seconds to start
!ollama pull falcon3:3b
import os
os.environ['OPENAI_API_KEY'] = ''  # paste your OpenAI API key here
We will use the GPT-4 model here as the evaluator for the LLM's answers. Below, we query the model, capture its generated output, and then measure different metrics on it.
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama.llms import OllamaLLM
from IPython.display import Markdown
template = """Question: {question}
Answer: Let's think step by step."""
prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="falcon3:3b")
chain = prompt | model
query = 'How is Gurgaon Connected to Noida?'
# Prepare input for invocation
input_data = {"question": query}

# Invoke the chain with the input data and print the response
actual_output = chain.invoke(input_data)
print(actual_output)
We will then measure the Answer Relevancy Metric. The answer relevancy metric measures how relevant the actual_output of your LLM application is compared to the provided input. This is an important metric in RAG evaluations as well.
The AnswerRelevancyMetric first uses an LLM to extract all statements made in the actual_output, before using the same LLM to classify whether each statement is relevant to the input.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input=query,
    actual_output=actual_output
)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
As seen from the output above, the Answer Relevancy Metric comes out to be 1 here because the output from the Falcon 3 3B model is in alignment with the asked query.
G-Eval is a framework that uses LLMs with chain-of-thought (CoT) reasoning to evaluate LLM outputs based on ANY custom criteria. It is a two-step algorithm that first generates a series of evaluation steps from your criteria using chain of thought, and then uses those generated steps to determine the final score. When you provide evaluation_steps, the GEval metric skips the first step and uses the provided steps to determine the final score instead.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
correctness_metric = GEval(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct based on the expected output.",
    # NOTE: you can only provide either criteria or evaluation_steps, and not both
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradicts any facts in 'expected output'",
        "You should also heavily penalize omission of detail",
        "Vague language, or contradicting OPINIONS, are OK"
    ],
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)
from deepeval.test_case import LLMTestCase
...
query="The dog chased the cat up the tree, who ran up the tree?"
# Prepare input for invocation
input_data = {
"question": query}
# Invoke the chain with input data and display the response in Markdown format
actual_output = chain.invoke(input_data)
print(actual_output)
test_case = LLMTestCase(
input=query,
actual_output=actual_output,
expected_output="The cat."
)
correctness_metric.measure(test_case)
print(correctness_metric.score)
print(correctness_metric.reason)
As we can see, the correctness metric score comes out to be very low here because the model’s output contains the wrong answer “dog”, when it should ideally have been “cat”.
The prompt alignment metric measures whether your LLM application is able to generate actual_outputs that align with any instructions specified in your prompt template.
from deepeval import evaluate
from deepeval.metrics import PromptAlignmentMetric
from deepeval.test_case import LLMTestCase
# QUERYING THE MODEL
template = """Question: {question}
Answer: Answer in Upper case."""

prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="falcon3:3b")
chain = prompt | model

query = "What is the capital of Spain?"

# Prepare input for invocation
input_data = {"question": query}

# Invoke the chain with the input data and display the response in Markdown format
actual_output = chain.invoke(input_data)
display(Markdown(actual_output))

# MEASURING PROMPT ALIGNMENT
metric = PromptAlignmentMetric(
    prompt_instructions=["Reply in all uppercase"],
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input=query,
    # Replace this with the actual output from your LLM application
    actual_output=actual_output
)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
As we can see, the Prompt Alignment metric score comes out to be 0 here because the model’s output doesn’t contain the answer “Madrid” in uppercase, as was instructed.
The JSON correctness metric measures whether your LLM application is able to generate actual_outputs with the correct JSON schema. The JsonCorrectnessMetric does not use an LLM for evaluation; instead, it uses the provided expected_schema to determine whether the actual_output can be loaded into the schema.
from pydantic import BaseModel

# The expected schema: a JSON object with a single 'name' string field
class ExampleSchema(BaseModel):
    name: str
from deepeval import evaluate
from deepeval.metrics import JsonCorrectnessMetric
from deepeval.test_case import LLMTestCase
# QUERYING THE MODEL
template = """Question: {question}
Answer: Let's think step by step."""

prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="falcon3:3b")
chain = prompt | model

query = "Output me a random Json with the 'name' key"

# Prepare input for invocation
input_data = {"question": query}

# Invoke the chain with the input data and print the response
actual_output = chain.invoke(input_data)
print(actual_output)
# MEASURING THE METRIC
metric = JsonCorrectnessMetric(
    expected_schema=ExampleSchema,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input=query,
    # Replace this with the actual output from your LLM application
    actual_output=actual_output
)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
Output From Falcon 3 3B Model
{
"name": "John Doe"
}
Metric Score & Reason
0
The generated Json is not valid because it does not meet the expected json
schema. It lacks the 'required' array in the properties of 'name'. The
property of 'name' does not have a 'title' field.
As we can see, the metric score comes out to be 0 here because the model’s output does not fully conform to the predefined JSON schema.
The summarization metric uses LLMs to determine whether your LLM (application) is generating factually correct summaries while including the necessary details from the original text.
The Summarization Metric score is calculated as the minimum of two sub-scores, an alignment score (is the summary free of contradictory or hallucinated information?) and a coverage score (does the summary include the necessary information from the original text?), i.e. summarization score = min(alignment score, coverage score).
# This is the original text to be summarized
text = """
Rice is the staple food of Bengal. Bhortas (lit. "mashed") are a really common type of food used as an additive to rice. There are several types of bhortas, such as ilish bhorta, shutki bhorta, begoon bhorta and more. Fish and other seafood are also important because Bengal is a riverine region.
Some fishes like puti (Puntius species) are fermented. Fish curry is prepared with fish alone or in combination with vegetables. Shutki maach is made using the age-old method of preservation where the food item is dried in the sun and air, thus removing the water content. This allows for preservation that can make the fish last for months, even years, in Bangladesh.
"""

template = """Question: {question}
Answer: Let's think step by step."""

prompt = ChatPromptTemplate.from_template(template)
model = OllamaLLM(model="falcon3:3b")
chain = prompt | model

query = "Summarize the text for me %s" % (text)

# Prepare input for invocation
input_data = {"question": query}

# Invoke the chain with the input data and print the response
actual_output = chain.invoke(input_data)
print(actual_output)
Output (Summary) From Model
Rice, along with Bhortas (mashed) dishes, are staples in Bengal. Fish curry
and age-old preservation methods like Shutki maach highlight the region's
seafood culture.
from deepeval import evaluate
from deepeval.metrics import SummarizationMetric
from deepeval.test_case import LLMTestCase
...
test_case = LLMTestCase(input=text, actual_output=actual_output)
metric = SummarizationMetric(
    threshold=0.5,
    model="gpt-4"
)
metric.measure(test_case)
print(metric.score)
print(metric.reason)
# or evaluate test cases in bulk
evaluate([test_case], [metric])
As we can see, the metric score comes out to be 0.4 here because the model’s output, a summary of the original text, misses several key points present in the original text.
In conclusion, DeepEval stands out as a powerful and flexible platform for evaluating LLMs, offering a range of features that streamline the testing and benchmarking process. Its comprehensive suite of metrics, support for custom evaluations, and integration with any LLM make it an invaluable tool for developers aiming to optimize model performance. With capabilities like real-time monitoring, simplified testing, and batch evaluation, DeepEval ensures efficient and reliable assessments, enhancing both security and flexibility in production environments.
Q1. What is DeepEval?
Ans. DeepEval is a comprehensive platform designed to evaluate LLM (Large Language Model) performance. It offers a user-friendly interface, a wide range of evaluation metrics, and supports real-time monitoring of model outputs. It enables developers to create unit tests for model outputs to ensure they meet specific performance criteria.
Q2. What metrics does DeepEval provide?
Ans. DeepEval provides over 14 research-backed metrics for diverse evaluation scenarios. Key metrics include G-Eval for chain-of-thought-based custom criteria, Faithfulness for accuracy, Toxicity for harmful content detection, Answer Relevancy for response alignment with user expectations, and various Conversational Metrics for dialogue evaluation, such as Knowledge Retention and Conversation Completeness.
Q3. Can I create custom evaluation metrics with DeepEval?
Ans. Yes, DeepEval allows users to develop custom evaluation metrics tailored to their specific needs. This flexibility enables developers to assess models based on unique criteria or requirements, providing a more personalized evaluation process.
Q4. Is DeepEval compatible with any LLM?
Ans. Yes, DeepEval is compatible with any LLM, including popular models from OpenAI. It allows users to benchmark their models against recognized standards like MMLU and HumanEval, making it easy to switch between different LLM providers or configurations.
Q5. How does DeepEval simplify the testing process?
Ans. DeepEval simplifies the testing process with a Pytest-like architecture, enabling developers to implement tests with just a few lines of code. Additionally, it supports batch evaluations, which speeds up the benchmarking process, especially for large-scale assessments.