Imagine you are about to build a Python package that could change the way developers and data analysts evaluate their models. The journey begins with a straightforward idea: a flexible RAG evaluation tool that can handle a variety of metrics and edge cases. As you work through this post, you will go from initializing the package with Poetry to building a solid evaluator class, testing the code, calculating BLEU and ROUGE scores, and publishing the result online. By the end, you will have a working tool ready for public use, along with a deeper understanding of Python packaging and open-source contributions.
Now that we have the requirements, we can start by initializing a new Python package using Poetry. The reason for choosing Poetry is that it combines dependency management, virtual environments, building, and publishing into a single tool.
Install Poetry with the following command, which works on almost every OS:
curl -sSL https://install.python-poetry.org | python3 -
Then we can create a new repository with the boilerplate using the following command.
poetry new package_name
There will be a few generic prompts; you can press Enter to accept the defaults. You will then land in a folder structure similar to this:
poetry-demo
├── pyproject.toml
├── README.md
├── poetry_demo
│   └── __init__.py
└── tests
    └── __init__.py
Though the structure is just fine, we will use the `src` layout instead of the `flat` layout, as discussed in the official Python packaging documentation. We shall follow the `src` layout in the rest of the blog.
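Based on the module paths used later in this post (src/evaluator_blog/evaluator.py and src/evaluator_blog/metrics/), the project tree under the `src` layout looks roughly like this (the exact project name is up to you):
evaluator-blog
├── pyproject.toml
├── README.md
├── src
│   └── evaluator_blog
│       ├── __init__.py
│       ├── evaluator.py
│       └── metrics
│           ├── __init__.py
│           ├── bleu.py
│           └── rouge.py
└── tests
    └── __init__.py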
The heart of our package is the source code that powers the evaluator. It defines the base class that every metric we add will inherit from, so this class has to be the most robust piece of the library and deserves the most care during construction. It holds the logic for basic initialization, a method that returns the metric's result, and helper methods that shape user input into a readily consumable form.
Each of these methods must have a clear scope and proper data types. Data types matter here because Python is dynamically typed: misuse of a variable only surfaces as an error at runtime, since there is no compiler performing type checks. Test suites help catch these subtle errors, and consistent type hints make the code easier to reason about.
Now that we know what the evaluator class must contain and why it matters so much, we are left with its implementation. We build it on top of ABC, the Abstract Base Class provided by Python's `abc` module, because it lets us declare abstract methods that every concrete metric class is forced to implement. Let's define the inputs and outputs of the evaluator class.
# src/evaluator_blog/evaluator.py
import warnings
from typing import Union, List
from abc import ABC, abstractmethod


class BaseEvaluator(ABC):
    def __init__(self, candidates: List, references: List) -> None:
        self.candidates = candidates
        self.references = references

    @staticmethod
    def padding(
        candidates: List[str], references: List[str]
    ) -> Union[List[str], List[str]]:
        """Pad the shorter list with empty strings so both have equal length.

        Args:
            candidates (List[str]): The responses generated by the LLM
            references (List[str]): The responses to be measured against

        Returns:
            Union[List[str], List[str]]: `candidates` and `references` of equal length
        """
        _msg = str(
            """
            The length of references and candidates (hypothesis) are not the same.
            """
        )
        warnings.warn(_msg)
        max_length = max(len(candidates), len(references))
        candidates.extend([""] * (max_length - len(candidates)))
        references.extend([""] * (max_length - len(references)))
        return candidates, references

    @staticmethod
    def list_to_string(l: List) -> str:
        assert (
            len(l) >= 1
        ), "Ensure the length of the message is greater than or equal to 1"
        return str(l[0])

    @abstractmethod
    def get_score(self) -> float:
        """
        Method to calculate the final result of the score function.

        Returns:
            Floating point value of the chosen evaluation metric.
        """
Here we can see that the `__init__()` method accepts the parameters that every evaluator metric requires, namely `candidates` and `references`.
Then comes the padding needed to ensure `candidates` and `references` have the same length. It is defined as a static method because it needs no instance state, so the `staticmethod` decorator wraps the required logic.
Finally, `get_score()` carries the `abstractmethod` decorator, meaning every class that inherits from the base evaluator must implement this method.
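As a quick illustration of that enforcement, here is a small sketch (the `DummyMetric` and `ExactMatch` classes are hypothetical examples, not part of the package, and the import assumes the package has been installed, e.g. via `poetry install`, so that it is importable as `evaluator_blog`):
from evaluator_blog.evaluator import BaseEvaluator

class DummyMetric(BaseEvaluator):
    pass

# Raises TypeError: Can't instantiate abstract class DummyMetric with abstract method get_score
# DummyMetric(candidates=["hi"], references=["hi"])

class ExactMatch(BaseEvaluator):
    def get_score(self) -> float:
        # Fraction of candidates that exactly match their reference
        matches = sum(c == r for c, r in zip(self.candidates, self.references))
        return matches / max(len(self.candidates), 1)

print(ExactMatch(["hello"], ["hello"]).get_score())  # 1.0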
Now comes the heart of the library: the evaluation metrics themselves. For the actual calculation we rely on established libraries that perform the task and report the metric score. We mainly use `candidates`, i.e. the LLM-generated responses, and `references`, i.e. the ground truth, and compute the value accordingly. For simplicity we calculate the BLEU and ROUGE scores, but the same pattern extends to any other metric available.
BLEU, short for Bilingual Evaluation Understudy, is one of the common evaluation metrics for machine translation (the candidates). It is quick, inexpensive, and language-independent, with only marginal error compared to manual evaluation. It measures how close a machine translation is to professional human translations (the references) and returns a score in the range 0 to 1, where values close to 1 indicate a close match. It works on n-grams (chunks of n consecutive words) taken from the candidates. For example, unigrams (1-grams) consider every individual word from the candidates and references and return a normalized count known as the precision score.
However, plain precision breaks down when the same word appears many times in the candidate: each occurrence counts toward the score, which is clearly wrong. BLEU therefore uses a modified precision score that clips each word's match count to the number of times it appears in the reference, then normalizes by the number of words in the candidate. Another catch is that unigram precision ignores word order, so BLEU combines multiple n-gram orders and reports precision for 1-grams through 4-grams along with other parameters.
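To make the clipping concrete, here is a tiny hand-rolled sketch of modified unigram precision (for illustration only; the actual implementation below delegates to NLTK):
from collections import Counter

def modified_unigram_precision(candidate: str, reference: str) -> float:
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    # Each candidate word is credited at most as many times as it appears
    # in the reference (the clipping step), then normalized by the total
    # number of candidate words.
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    return clipped / max(sum(cand_counts.values()), 1)

# "the" appears 7 times in the candidate but is clipped to 2, its count in the reference
print(modified_unigram_precision(
    "the the the the the the the",
    "the cat is on the mat",
))  # 2/7 ≈ 0.2857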
# src/evaluator_blog/metrics/bleu.py
from typing import List, Callable, Optional

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

from ..evaluator import BaseEvaluator

"""
BLEU implementation from NLTK
"""


class BLEUScore(BaseEvaluator):
    def __init__(
        self,
        candidates: List[str],
        references: List[str],
        weights: Optional[List[float]] = None,
        smoothing_function: Optional[Callable] = None,
        auto_reweigh: Optional[bool] = False,
    ) -> None:
        """
        Calculate BLEU score (Bilingual Evaluation Understudy) from
        Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002.
        "BLEU: a method for automatic evaluation of machine translation."
        In Proceedings of ACL. https://aclanthology.org/P02-1040.pdf

        Args:
            weights (Optional[List[float]], optional): The weights applied to each n-gram precision. Defaults to None.
            smoothing_function (Optional[Callable], optional): A callable that adjusts the probability mass distribution of words to overcome the sparsity of training data. Defaults to None.
            auto_reweigh (Optional[bool], optional): Uniformly re-weigh based on maximum hypothesis length if the largest order of n-grams < 4 and weights are left at the default. Defaults to False.
        """
        super().__init__(candidates, references)

        # Check if `weights` is provided, otherwise default to unigram-only weights
        if weights is None:
            self.weights = [1, 0, 0, 0]
        else:
            self.weights = weights

        # Check if `smoothing_function` is provided
        # If `None`, default to method0 (no smoothing)
        if smoothing_function is None:
            self.smoothing_function = SmoothingFunction().method0
        else:
            self.smoothing_function = smoothing_function

        # If `auto_reweigh` is set, enable it
        self.auto_reweigh = auto_reweigh

    def get_score(self) -> float:
        """
        Calculate the BLEU score for the given candidates and references.

        Returns:
            float: The calculated BLEU score.
        """
        # Check if the lengths of candidates and references are equal
        if len(self.candidates) != len(self.references):
            self.candidates, self.references = self.padding(
                self.candidates, self.references
            )

        # `corpus_bleu` expects tokenized input: a list of token lists for the
        # hypotheses and a list of lists of token lists for the references,
        # so split the raw strings on whitespace as a simple default.
        return corpus_bleu(
            list_of_references=[[reference.split()] for reference in self.references],
            hypotheses=[candidate.split() for candidate in self.candidates],
            weights=self.weights,
            smoothing_function=self.smoothing_function,
            auto_reweigh=self.auto_reweigh,
        )
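A minimal usage sketch of the class above, assuming the package has been installed (for example via `poetry install`) and is importable as `evaluator_blog`:
from evaluator_blog.metrics.bleu import BLEUScore

bleu = BLEUScore(
    candidates=["the cat sat on the mat"],  # LLM-generated response
    references=["the cat is on the mat"],   # ground truth
)
print(bleu.get_score())  # unigram-weighted BLEU, a float between 0 and 1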
ROUGE, short for Recall-Oriented Understudy for Gisting Evaluation, is one of the common metrics for comparing model-generated summaries against one or more human-written summaries. In its simplest form, it compares the n-grams of the machine-generated summary with those of the human summary; the share of reference n-grams that are covered is the ROUGE-N recall score. To check how much of the machine-generated summary is actually relevant to the human one, we can also compute the precision score, and with both precision and recall in hand we can derive the F1 score. It is usually recommended to report several values of `n`. A small variant, ROUGE-L, takes word order into account by computing the LCS (longest common subsequence) between the two texts, from which precision and recall follow in the same way. Its slight advantage is that it respects sentence-level structure and therefore produces more relevant results.
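To make the recall/precision distinction concrete, here is a tiny hand-rolled ROUGE-1 sketch (for illustration only; the actual implementation below delegates to the rouge_score library):
from collections import Counter

def rouge_1(candidate: str, reference: str) -> dict:
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())               # unigrams shared by both
    recall = overlap / max(sum(ref.values()), 1)       # overlap / reference length
    precision = overlap / max(sum(cand.values()), 1)   # overlap / candidate length
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_1("the cat sat on the mat", "the cat is on the mat"))
# overlap = 5 ("the" twice, "cat", "on", "mat"), so precision = recall = 5/6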
# src/evaluator_blog/metrics/rouge.py
import warnings
from typing import List, Union, Dict, Callable, Tuple, Optional

from rouge_score import rouge_scorer

from ..evaluator import BaseEvaluator


class RougeScore(BaseEvaluator):
    def __init__(
        self,
        candidates: List,
        references: List,
        rouge_types: Optional[Union[str, Tuple[str]]] = [
            "rouge1",
            "rouge2",
            "rougeL",
            "rougeLsum",
        ],
        use_stemmer: Optional[bool] = False,
        split_summaries: Optional[bool] = False,
        tokenizer: Optional[Callable] = None,
    ) -> None:
        super().__init__(candidates, references)

        # Default `rouge_types` is all of them, else the user-specified subset
        if isinstance(rouge_types, str):
            self.rouge_types = [rouge_types]
        else:
            self.rouge_types = rouge_types

        # Enable `use_stemmer` to strip word suffixes and improve matching
        self.use_stemmer = use_stemmer

        # If enabled, adds newlines between sentences for `rougeLsum`
        self.split_summaries = split_summaries

        # Use the user-defined `tokenizer` if given, else fall back to the `rouge_scorer` default
        # https://github.com/google-research/google-research/blob/master/rouge/rouge_scorer.py#L83
        if tokenizer:
            self.tokenizer = tokenizer
        else:
            self.tokenizer = None
            _msg = str(
                """
                Utilizing the default tokenizer
                """
            )
            warnings.warn(_msg)

    def get_score(self) -> Dict:
        """
        Returns:
            Dict: Mapping of each ROUGE type to its precision/recall/F1 Score object
        """
        scorer = rouge_scorer.RougeScorer(
            rouge_types=self.rouge_types,
            use_stemmer=self.use_stemmer,
            tokenizer=self.tokenizer,
            split_summaries=self.split_summaries,
        )
        # `score(target, prediction)` expects the reference first and the candidate second
        return scorer.score(
            self.list_to_string(self.references),
            self.list_to_string(self.candidates),
        )
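A minimal usage sketch, again assuming the package is installed and importable as `evaluator_blog`; each entry of the returned dictionary is a Score tuple carrying precision, recall, and fmeasure:
from evaluator_blog.metrics.rouge import RougeScore

rouge = RougeScore(
    candidates=["the cat sat on the mat"],
    references=["the cat is on the mat"],
    rouge_types=["rouge1", "rougeL"],
    use_stemmer=True,
)
print(rouge.get_score())
# {'rouge1': Score(precision=..., recall=..., fmeasure=...), 'rougeL': Score(...)}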
Now that the source files are ready, we must verify that the code works before actual use. That is where the testing phase comes in. Following Python library convention, we place all tests inside a folder named `tests/`, which makes it immediately obvious to other developers where the tests live. Type checking and error handling catch the first round of issues, but to cover edge cases and exceptions we turn to frameworks such as unittest and pytest. Here we set up the basic tests using the `unittest` library.
The key terms to know with respect to `unittest` are the test case and the test suite.
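Here is a minimal test module to put under `tests/`, assuming the import paths used earlier in this post; an identical candidate/reference pair should yield a perfect unigram BLEU of 1.0, and padding should equalize the list lengths:
# tests/test_metrics.py
import unittest

from evaluator_blog.evaluator import BaseEvaluator
from evaluator_blog.metrics.bleu import BLEUScore


class TestEvaluators(unittest.TestCase):
    def test_identical_sentences_score_one(self):
        # An identical candidate/reference pair gives a perfect unigram BLEU
        bleu = BLEUScore(candidates=["the cat sat"], references=["the cat sat"])
        self.assertAlmostEqual(bleu.get_score(), 1.0)

    def test_padding_equalizes_lengths(self):
        candidates, references = BaseEvaluator.padding(["a", "b"], ["a"])
        self.assertEqual(len(candidates), len(references))


if __name__ == "__main__":
    unittest.main()
Running python3 -m unittest discover tests from the project root executes every test case it finds.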
A wheel is essentially the built Python package that gets installed when we run `pip install <package_name>`. Its contents are stored in a `.whl` file placed under `dist/`, alongside the source distribution (`.tar.gz`): there is a built distribution (`.whl`) and a source distribution (`.gz`). Since we are using Poetry, we can build both with the build command:
poetry build
It generates the wheel and the zipped source archive inside the `dist/` folder at the root of the project.
dist/
├── package_name-0.0.1-py3-none-any.whl
└── package_name-0.0.1.tar.gz
Alternatively, the equivalent plain-Python approach is to install the `build` package and run the build command from the root of the project:
python3 -m pip install --upgrade build
python3 -m build
Let us now look into the source and binary distributions in a little more detail.
`sdist` is the source distribution of the package. It contains the source code and the metadata needed by external tools like pip or Poetry to build it, and it is built before `bdist`. If `pip` cannot find a suitable built distribution, the source distribution acts as a fallback: pip builds a wheel from it and then installs the package along with its requirements.
`bdist` contains the files that only need to be moved to the correct location on the target machine. The best-supported format is `.whl`. Note that it does not contain compiled Python files.
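If you are curious about what ended up inside each artifact, you can list their contents without installing anything (the file names below reuse the placeholder package name from earlier):
python3 -m zipfile -l dist/package_name-0.0.1-py3-none-any.whl
tar tzf dist/package_name-0.0.1.tar.gz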
While open-sourcing the package to the external world, it is always advisable to include a license that states the extent to which your code can be reused. When creating a repository on GitHub, you have the option to select a license there, which creates a `LICENSE` file with the usage terms. If you are unsure which license to choose, an external license-chooser resource is a perfect one to come to the rescue.
Now that we have everything in place, we can publish the package to the external world. The publish command abstracts all of these steps into a single command.
If you are unsure how the package will behave, or simply want to test the release process, it is advisable to publish to test.pypi.org first rather than uploading directly to the official repository. This gives us the flexibility to verify the package before sharing it with everyone.
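With Poetry this amounts to registering the test repository once and pointing the publish command at it (the repository name test-pypi below is arbitrary, and an account or API token on test.pypi.org is required):
poetry config repositories.test-pypi https://test.pypi.org/legacy/
poetry publish --build -r test-pypi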
The official Python Package Index (PyPI) hosts the packages published by the Python community and lets authors and organizations share their work through a central repository. All it takes to publish your package to the world is this single command:
poetry publish --build --username $PYPI_USERNAME --password $PYPI_PASSWORD
By the end of this article, you have published a Python package that is ready to be used by millions. We initialized a new package with Poetry, implemented the use case, wrote tests, built the distributions, and published them to the PyPI repository. Beyond the immediate value, this exercise also helps you understand how the various open-source Python packages you rely on are structured. Last but not least, this is just the beginning: the package can be made as extensible as you like, and existing open-source Python packages and distributions are a great source of inspiration.
Q1. What does this article help you build?
A. The article helps you create and publish a Python package, focusing on a RAG evaluator tool that the community can use for various evaluation metrics.
Q2. Why use Poetry for packaging?
A. Poetry simplifies dependency management and packaging by integrating version management, virtual environments, and publishing tasks into a single tool, making development and distribution easier.
Q3. Which evaluation metrics are covered?
A. The article details how to calculate BLEU and ROUGE scores, which are commonly used metrics for assessing the quality of machine-generated text in comparison to reference texts.
Q4. How do you test the package?
A. You can test your package using frameworks like unittest or pytest to ensure the code works as expected and handles edge cases, providing confidence before publishing.
Q5. How do you publish the package?
A. Build your package using poetry or build, test it on test.pypi.org, and then publish it to the official pypi.org repository using the poetry publish command to make it available to the public.