In the fast-evolving world of AI, it’s crucial to keep track of your API costs, especially when building LLM-based applications such as Retrieval-Augmented Generation (RAG) pipelines in production. Experimenting with different LLMs to get the best results often involves making numerous API requests to the server, each request incurring a cost. Understanding and tracking where every dollar is spent is vital to managing these expenses effectively.
In this article, we will implement LLM observability with RAG using just 10-12 lines of code. Observability helps us monitor key metrics such as latency, the number of tokens, prompts, and the cost per request.
This article was published as a part of the Data Science Blogathon.
Think of LLM Observability just like you monitor your car’s performance or track your daily expenses, LLM Observability involves watching and understanding every detail of how these AI models operate. It helps you track usage by counting number of “tokens”—units of processing that each request to the model uses. This helps you stay within budget and avoid unexpected expenses.
Additionally, it monitors performance by logging how long each request takes, ensuring that no part of the process is unnecessarily slow. It provides valuable insights by showing patterns and trends, helping you identify inefficiencies and areas where you might be overspending. LLM Observability is a best practice to follow while building applications on production, as this can automate the action pipeline to send alerts if something goes wrong.
Retrieval Augmented Generation (RAG) is a concept where relevant document chunks are returned to a Large Language Model (LLM) as in-context learning (i.e., few-shot prompting) based on a user’s query. Simply put, RAG consists of two parts: the retriever and the generator.
When a user enters a query, it is first converted into embeddings. These query embeddings are then searched in a vector database by the retriever to return the most relevant or semantically similar documents. These documents are passed as in-context learning to the generator model, allowing the LLM to generate a reasonable response. RAG reduces the likelihood of hallucinations and provides domain-specific responses based on the given knowledge base.
Building a RAG pipeline involves several key components: data source, text splitters, vector database, embedding models, and large language models. RAG is widely implemented when you need to connect a large language model to a custom data source. For example, if you want to create your own ChatGPT for your class notes, RAG would be the ideal solution. This approach ensures that the model can provide accurate and relevant responses based on your specific data, making it highly useful for personalized applications.
Building RAG application depends on different use cases. Each use case depends its own custom prompts for in-context learning. Custom prompts includes combination of both system prompt and user prompt, system prompt is the rules or instructions based on which LLM needs to behave and user prompt is the augmented prompt to the user query. Writing a good prompt is first attempt is a very rare case.
Using observability with Retrieval Augmented Generation (RAG) is crucial for ensuring efficient and cost-effective operations. Observability helps you monitor and understand every detail of your RAG pipeline, from tracking token usage to measuring latency, prompts and response times. By keeping a close watch on these metrics, you can identify and address inefficiencies, avoid unexpected expenses, and optimize your system’s performance. Essentially, observability provides the insights needed to fine-tune your RAG setup, ensuring it runs smoothly, stays within budget, and consistently delivers accurate, domain-specific responses.
Let’s take a practical example and understand why we need to use observability while using RAG. Suppose you built the app and now its on production
Let us now look into the steps of Observability with RAG Implementation.
Before we proceed with the code implementation, you need to install a few libraries. These libraries include Beyond LLM, OpenAI, Phoenix, and YouTube Transcript API. Beyond LLM is a library that helps you build advanced RAG applications efficiently, incorporating observability, fine-tuning, embeddings, and model evaluation.
pip install beyondllm
pip install openai
pip install arize-phoenix[evals]
pip install youtube_transcript_api llama-index-readers-youtube-transcript
Set up the environment variable for the OpenAI API key, which is necessary to authenticate and access OpenAI’s services such as LLM and embedding.
import os, getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass("API:")
# import required libraries
from beyondllm import source,retrieve,generator, llms, embeddings
from beyondllm.observe import Observer
Enabling observability should be the first step in your code to ensure all subsequent operations are tracked.
Observe = Observer()
Observe.run()
Since the OpenAI API key is already stored in environment variable, you can now define the LLM and embedding model to retrieve the document and generate the response accordingly.
llm=llms.ChatOpenAIModel()
embed_model = embeddings.OpenAIEmbeddings()
BeyondLLM is a native framework for Data Scientists. To ingest data, you can define the data source inside the `fit` function. Based on the data source, you can specify the `dtype` in our case, it’s YouTube. Additionally, we can chunk our data to avoid the context length issues of the model and return only the specific chunk. Chunk overlap defines the number of tokens that need to be repeated in the consecutive chunk.
The Auto retriever in BeyondLLM helps retrieve the relevant k number of documents based on the type. There are various retriever types such as Hybrid, Re-ranking, Flag embedding re-rankers, and more. In this use case, we will use a normal retriever, i.e., an in-memory retriever.
data = source.fit("https://www.youtube.com/watch?v=IhawEdplzkI",
dtype="youtube",
chunk_size=512,
chunk_overlap=50)
retriever = retrieve.auto_retriever(data,
embed_model,
type="normal",
top_k=4)
The generator model combines the user query and the relevant documents from the retriever class and passes them to the Large Language Model. To facilitate this, BeyondLLM supports a generator module that chains up this pipeline, allowing for further evaluation of the pipeline on the RAG triad.
user_query = "summarize simple task execution worflow?"
pipeline = generator.Generate(question=user_query,retriever=retriever,llm=llm)
print(pipeline.call())
Output
Evaluation of RAG pipeline can be performed using RAG triad metrics that includes Context relevancy, Answer relevancy and Groundness.
print(pipeline.get_rag_triad_evals())
#or
# run it individually
print(pipeline.get_context_relevancy()) # context relevancy
print(pipeline.get_answer_relevancy()) # answer relevancy
print(pipeline.get_groundedness()) # groundedness
Output:
Figure-1 denotes the main dashboard of the Phoenix, once you run the Observer.run(), it returns two links:
Since we are using two services from OpenAI, it will display both LLM and embeddings under the provider. It will show the number of tokens each provider utilized, along with the latency, start time, input given to the API request, and the output generated from the LLM.
Figure 2 shows the trace details of the LLM. It includes latency, which is 1.53 seconds, the number of tokens, which is 2212, and information such as the system prompt, user prompt, and response.
Figure-3 shows the trace details of the Embeddings for the user query asked, along with other metrics similar to Figure-2. Instead of prompting, you see the input query converted into embeddings.
Figure 4 shows the trace details of the embeddings for the YouTube transcript data. Here, the data is converted into chunks and then into embeddings, which is why the utilized tokens amount to 5365. This trace detail denotes the transcript video data as the information.
To summarize, you have successfully built a Retrieval Augmented Generation (RAG) pipeline along with advanced concepts such as evaluation and observability. With this approach, you can further use this learning to automate and write scripts for alerts if something goes wrong, or use the requests to trace the logging details to get better insights into how the application is performing, and, of course, maintain the cost within the budget. Additionally, incorporating observability helps you optimize model usage and ensures efficient, cost-effective performance for your specific needs.
A. When it comes to observability, it is useful to track closed-source models like GPT, Gemini, Claude, and others. Phoenix supports direct integrations with Langchain, LLamaIndex, and the DSPY framework, as well as independent LLM providers such as OpenAI, Bedrock, and others.
A. BeyondLLM supports evaluating the Retrieval Augmented Generation (RAG) pipeline using the LLMs it supports. You can easily evaluate RAG on BeyondLLM with Ollama and HuggingFace models. The evaluation metrics include context relevancy, answer relevancy, groundedness, and ground truth.
A. OpenAI API cost is spent on the number of tokens you utilise. This is where observability can help you keep monitoring and trace of Tokens per request, Overall tokens, Costs per request, latency. This metrics really help to trigger a function to alert the cost to the user.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.