The release of OpenAI’s ChatGPT has sparked a lot of interest in large language models (LLMs), and everyone is now talking about artificial intelligence. But it’s not just friendly conversations; the machine learning (ML) community has introduced a new term: LLMOps. We have all heard of MLOps, but what is LLMOps? It’s all about how we manage these powerful language models throughout their lifecycle.
LLMs are transforming the way we create and maintain AI-driven products, and this shift is creating a need for new tools and best practices. In this article, we’ll break down LLMOps and its background. We’ll also examine how building AI products with LLMs differs from doing so with traditional ML models, and how MLOps (Machine Learning Operations) differs from LLMOps as a result. Finally, we’ll discuss what exciting developments we can expect in the LLMOps space in the near future.
LLMOps stands for Large Language Model Operations, similar to MLOps but specifically designed for Large Language Models (LLMs). It involves new tools and best practices to handle everything related to LLM-powered applications, from development to deployment and ongoing maintenance.
To understand this better, let’s break down what LLMs and MLOps mean:

- LLMs (Large Language Models) are deep learning models trained on massive amounts of text that can understand and generate text in human language.
- MLOps (Machine Learning Operations) is the set of tools and best practices used to manage the lifecycle of ML-powered applications.

Now that we’ve covered the basics, let’s dive into this topic more deeply.
Firstly, LLMs like BERT and GPT-2 have been around since 2018. Yet it is only now, almost five years later, that we are seeing a meteoric rise of the idea of LLMOps. The main reason is that LLMs gained a great deal of media attention with the release of ChatGPT in December 2022.
Since then, we have seen many different types of applications exploiting the power of LLMs. This includes chatbots ranging from familiar examples like ChatGPT, to more personal writing assistants for editing or summarization (e.g., Notion AI) and specialized ones for copywriting (e.g., Jasper and copy.ai). It also includes programming assistants for writing and debugging code (e.g., GitHub Copilot), testing code (e.g., Codium AI), and identifying security vulnerabilities (e.g., Socket AI).
With many people developing and shipping LLM-powered applications to production, practitioners are sharing their experiences.
“It’s easy to make something cool with LLMs, but very hard to make something production-ready with them.” - Chip Huyen
It is clear that building production-ready LLM-powered applications comes with its own set of challenges, distinct from those of building AI products with classical ML models. To deal with these challenges, we must develop new tools and best practices to manage the LLM application lifecycle. Thus, we see a growing use of the term “LLMOps.”
The steps involved in LLMOps are in some ways similar to those of MLOps. However, the steps of building an LLM-powered application differ because of the emergence of foundation models. Instead of training LLMs from scratch, the focus lies on adapting pre-trained LLMs to downstream tasks.
Already over a year ago, Andrej Karpathy described how the process of building AI products will change in the future:
“But the most important trend is that the whole setting of training a neural network from scratch on some target task is quickly becoming outdated due to finetuning, especially with the emergence of foundation models like GPT. These foundation models are trained by only a few institutions with substantial computing resources, and most applications are achieved via lightweight finetuning of part of the network, prompt engineering, or an optional step of data or model distillation into smaller, special-purpose inference networks.” - Andrej Karpathy
This quote may be overwhelming the first time you read it, but it precisely summarizes everything that has been going on lately, so let’s walk through it step by step in the following subsections.
Foundation models or base models are LLMs pre-trained on large amounts of data that can be used for a wide range of tasks. Because training a base model from scratch is difficult, time-consuming, and extremely expensive, only a few institutions have the required training resources.
To put it into perspective, according to a study from Lambda Labs in 2020, training OpenAI’s GPT-3 (with 175 billion parameters) would require 355 years and $4.6 million using a Tesla V100 cloud instance.
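The arithmetic behind that estimate is straightforward. Taking the figures reported in the Lambda Labs analysis as assumptions (about 3.14e23 FLOPs to train GPT-3, roughly 28 TFLOPS of effective V100 throughput, and about $1.5 per V100 instance-hour), a quick sanity check reproduces the headline numbers:

# Back-of-the-envelope check of the Lambda Labs estimate (figures assumed
# from their analysis: total training FLOPs, effective V100 throughput,
# and hourly instance price)
total_flops = 3.14e23
v100_flops_per_second = 28e12

seconds = total_flops / v100_flops_per_second
gpu_years = seconds / (365.25 * 24 * 3600)
dollars = (seconds / 3600) * 1.5

print(f"{gpu_years:.0f} GPU-years, ~${dollars / 1e6:.1f}M")  # ~355 GPU-years, ~$4.7M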
AI is currently going through what some in the community call its “Linux moment.” Developers have to choose between two types of foundation models, based on a trade-off between performance, cost, ease of use, and flexibility: proprietary models or open-source models.
Proprietary models are closed-source foundation models owned by companies with large expert teams and big AI budgets. They are usually larger than open-source models and therefore perform better. They are also off-the-shelf and generally rather easy to use. The main downside of proprietary models is their expensive APIs (application programming interfaces). Additionally, closed-source foundation models offer little or no flexibility for adaptation by developers.
Examples of proprietary model providers are:

- OpenAI (GPT-3, GPT-4)
- co:here
- AI21 Labs (Jurassic-2)
- Anthropic (Claude)
Open-source models are often organized and hosted on Hugging Face as a community hub. Usually, they are smaller and less capable than proprietary models. On the upside, they are more economical than proprietary models and offer more flexibility for developers.
Examples of open-source models are:

- BLOOM by BigScience
- LLaMA by Meta AI
- Flan-T5 by Google
- GPT-J and GPT-NeoX by Eleuther AI
Code:

For example, the following code imports the required libraries and loads the pre-trained GPT-2 model and its tokenizer with the Hugging Face transformers library.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained GPT-2 model and its tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
Once you have chosen your base model, you can access the LLM through its API. If you usually work with other APIs, working with LLM APIs will initially feel a little strange because it is not always clear beforehand what input will produce what output. Given any text prompt, the API will return a text completion, attempting to continue your pattern.
Here is an example of how you would use the OpenAI API. You give the API input as a prompt, e.g., prompt = “Correct this to standard English:\n\nHe no went to the market.”
import openai

openai.api_key = ...  # set your OpenAI API key

response = openai.Completion.create(
    engine="text-davinci-003",
    prompt="Correct this to standard English:\n\nHe no went to the market.",
    max_tokens=60,  # illustrative settings: cap the completion length
    temperature=0,  # deterministic output suits a correction task
)
The API will return a response containing the completion: response['choices'][0]['text'] = "He did not go to the market."
The main challenge is that, despite being powerful, LLMs aren’t almighty, and thus the key question is: How do you get an LLM to produce the output you want?
One concern respondents mentioned in the LLM-in-production survey was model accuracy and hallucinations. That means getting output from the LLM API in your desired format might take some iteration, and LLMs can hallucinate if they lack the required specific knowledge. To deal with these concerns, you can adapt the base models to downstream tasks in the following ways:

- Prompt engineering is a technique to adjust the input so that the output matches your expectations, for example by including a few hand-picked examples of the desired input-output pairs in the prompt (few-shot learning); a minimal sketch follows this list.
- Fine-tuning a pre-trained model on task-specific data can improve the model’s performance on your specific task, as shown in the fine-tuning code further below.
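Here is a minimal sketch of few-shot prompt engineering using the same legacy OpenAI Completion API as above; the example input-output pairs are illustrative:

import openai

openai.api_key = ...  # set your OpenAI API key

# Few-shot prompt: show the model hand-picked examples of the task
# before asking it to complete a new instance (examples are illustrative)
prompt = (
    "Correct each sentence to standard English.\n\n"
    "Input: She no went to school.\nOutput: She did not go to school.\n\n"
    "Input: They was happy.\nOutput: They were happy.\n\n"
    "Input: He no went to the market.\nOutput:"
)

response = openai.Completion.create(
    engine="text-davinci-003",
    prompt=prompt,
    max_tokens=60,
    temperature=0,  # deterministic output suits a correction task
)
print(response["choices"][0]["text"].strip())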
Fine-tuning code:
from transformers import (
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the pre-trained model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Load your dataset from a plain-text file
dataset = TextDataset(tokenizer=tokenizer, file_path="your_dataset.txt", block_size=128)

# Batch examples and build language-modeling labels (no masked LM for GPT-2)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Fine-tune the model
training_args = TrainingArguments(
    output_dir="./your_fine_tuned_model",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
)
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model()
In classical MLOps, ML models are validated on a hold-out validation set with a metric that indicates the model’s performance. But how do you evaluate the performance of an LLM? How do you decide whether an output is good or bad? Currently, organizations seem to be A/B testing their models.
To help evaluate LLMs, tools like HoneyHive or HumanLoop have emerged.
Evaluation code:

from transformers import pipeline

# Create a text generation pipeline from the fine-tuned model saved above
generator = pipeline("text-generation", model="./your_fine_tuned_model")

# Generate text for a sample prompt and inspect the output
outputs = generator("Prompt text", max_length=50)
print(outputs[0]["generated_text"])
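The pipeline above inspects a single model’s output. As a rough sketch of what A/B testing two variants could look like, assuming a fine-tuned model and the base model as the two arms (the routing and vote counting here are illustrative, not a specific tool’s API):

import random
from transformers import pipeline

# Two variants to compare: the fine-tuned model vs. the base model
variants = {
    "A": pipeline("text-generation", model="./your_fine_tuned_model"),
    "B": pipeline("text-generation", model="gpt2"),
}
votes = {"A": 0, "B": 0}

for prompt in ["Prompt one", "Prompt two"]:
    # Randomly route each prompt to one variant
    name = random.choice(list(variants))
    output = variants[name](prompt, max_length=50)[0]["generated_text"]
    print(f"[variant {name}] {output}")
    # In a real A/B test, a user or labeler would rate this output;
    # here we only count which variant served each request
    votes[name] += 1

print(votes)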
The behavior of LLMs can change dramatically between releases. For example, OpenAI has updated its models to mitigate inappropriate content generation, such as hate speech. As a result, scanning for the phrase “as an AI language model” on Twitter now reveals countless bots.
Tools for monitoring LLMs are already emerging, such as Whylabs or HumanLoop; a minimal sketch of a simple homegrown check follows below.
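For illustration, here is a minimal sketch of a homegrown output check that flags completions containing refusal boilerplate, so that behavior changes between model releases show up in your logs (the flagged phrases are illustrative):

# Phrases suggesting a canned refusal or safety response (illustrative)
FLAG_PHRASES = ["as an ai language model", "i cannot help with"]

def flag_output(text: str) -> bool:
    """Return True if the completion contains a flagged phrase."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in FLAG_PHRASES)

print(flag_output("As an AI language model, I cannot do that."))  # True
print(flag_output("He did not go to the market."))                # False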
Deployment code:

# Import the necessary libraries
from flask import Flask, request, jsonify
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import logging

# Initialize the Flask app
app = Flask(__name__)

# Load the fine-tuned GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained("./your_fine_tuned_model")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Set up logging
logging.basicConfig(filename='app.log', level=logging.INFO)

# Define a route for text generation
@app.route('/generate_text', methods=['POST'])
def generate_text():
    try:
        data = request.get_json()
        prompt = data['prompt']
        # Generate text
        generated_ids = model.generate(
            tokenizer.encode(prompt, return_tensors='pt'),
            max_length=100,  # adjust the maximum length as needed
            num_return_sequences=1,
            do_sample=True,  # enable sampling so top_k and top_p take effect
            no_repeat_ngram_size=2,
            top_k=50,
            top_p=0.95,
        )[0]
        generated_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
        # Log the request and response
        logging.info(f"Generated text for prompt: {prompt}")
        logging.info(f"Generated text: {generated_text}")
        return jsonify({'generated_text': generated_text})
    except Exception as e:
        # Log any exceptions
        logging.error(f"Error: {str(e)}")
        return jsonify({'error': 'An error occurred'}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Example input:

{
    "prompt": "Once upon a time"
}

Example output:

{
    "generated_text": "Once upon a time, in a faraway land, there lived a..."
}
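To try the running service locally, you could send a test request with Python’s requests library, assuming the app is running on port 5000 as configured above:

import requests

# Send a test prompt to the locally running Flask service
response = requests.post(
    "http://localhost:5000/generate_text",
    json={"prompt": "Once upon a time"},
)
print(response.json())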
The differences between MLOps and LLMOps arise from the differences in how we build AI products with classical ML models versus LLMs. The differences mostly affect data management, experimentation, evaluation, cost, and latency.
In standard MLOps, we are used to data-hungry ML models. Training a neural network from scratch requires a lot of labeled data, and even fine-tuning a pre-trained model involves at least a few hundred samples. Data cleaning is therefore essential to the ML development process, as we know and accept that large datasets have imperfections.
In LLMOps, fine-tuning is similar to MLOps. But prompt engineering is a zero-shot or few-shot learning setting. That means we have few, but hand-picked, samples.
In MLOps, experimentation looks similar whether you train a model from scratch or fine-tune a pre-trained one. In both cases, you will track inputs, such as model architecture, hyperparameters, and data augmentations, and outputs, such as metrics.
But in LLMOps, the question is whether to engineer prompts or to fine-tune. Fine-tuning in LLMOps looks similar to fine-tuning in MLOps, while prompt engineering involves a different experimentation setup centered on the management of prompts; a minimal sketch of prompt versioning follows.
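For illustration, here is a minimal sketch of versioned prompt templates tracked across experiments; the registry below is a hypothetical setup, not a specific tool:

# Hypothetical registry of prompt template versions to compare in experiments
PROMPT_TEMPLATES = {
    "v1": "Correct this to standard English:\n\n{text}",
    "v2": "Rewrite the following sentence in grammatically correct English:\n\n{text}",
}

def build_prompt(version: str, text: str) -> str:
    """Render a given prompt version so results can be compared per version."""
    return PROMPT_TEMPLATES[version].format(text=text)

print(build_prompt("v1", "He no went to the market."))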
In classical MLOps, a hold-out validation set and an evaluation metric are used to assess a model’s performance. Because the performance of LLMs is harder to evaluate, organizations currently seem to rely on A/B testing.
While the cost of traditional MLOps usually lies in data collection and model training, the cost of LLMOps lies in inference. Although we can expect some costs from using expensive APIs during experimentation, Chip Huyen shows that the cost of long prompts lies in inference.
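To make this concrete, here is a back-of-the-envelope estimate for token-priced APIs; the per-token price below is an assumed placeholder, not any provider’s actual pricing:

# Assumed placeholder price: $0.02 per 1,000 tokens (check your provider)
PRICE_PER_1K_TOKENS = 0.02

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the dollar cost of a single API call."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

# A long few-shot prompt dominates the bill even when the completion is short
print(estimate_cost(prompt_tokens=3000, completion_tokens=200))  # 0.064
print(estimate_cost(prompt_tokens=100, completion_tokens=200))   # 0.006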
Another concern respondents mentioned in the LLM-in-production survey was latency. The completion length of an LLM significantly affects latency. Although latency concerns exist in MLOps as well, they are much more prominent in LLMOps, because latency is a big issue both for experimentation velocity during development and for the user experience in production.
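As a rough local illustration of how completion length drives latency, here is a sketch using the small open-source gpt2 model rather than an API; absolute timings will vary with hardware:

import time
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# Tokens are generated one at a time, so longer completions
# take roughly proportionally longer
for max_len in (20, 100, 200):
    start = time.perf_counter()
    generator("Once upon a time", max_length=max_len)
    print(f"max_length={max_len}: {time.perf_counter() - start:.2f}s")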
LLMOps is an emerging field. Given the speed at which this space is evolving, making any predictions is difficult. It is even uncertain whether the term “LLMOps” is here to stay. We can only be sure that we will see many new use cases of LLMs, along with tools and best practices to manage the LLM lifecycle.
The field of AI is rapidly growing, potentially making anything we write now outdated in a month. We’re still in the early stages of bringing LLM-powered applications to production. There are many questions we don’t have the answers to, and only time will tell how things will play out.
We can say with certainty that we will see many developments, new tools, and best practices soon. We are also already seeing efforts toward cost and latency reduction for foundation models. These are definitely interesting times!
Since the release of OpenAI’s ChatGPT, LLMs have become a hot topic in the field of AI. These deep learning models can generate output in human language, making them powerful tools for tasks such as conversational AI, programming assistance, and writing assistance.
However, bringing LLM-powered applications to production presents its own set of challenges, which has led to the emergence of a new term, “LLMOps.” It refers to the set of tools and best practices used to manage the lifecycle of LLM-powered applications, including development, deployment, and maintenance.
LLMOps can be seen as a subcategory of MLOps. However, the steps involved in building an LLM-powered application differ from those in building applications with classical ML models. Instead of training an LLM from scratch, the focus is on adapting pre-trained LLMs to downstream tasks. This involves selecting a foundation model, adapting it to downstream tasks, evaluating it, and deploying and monitoring the model. While LLMOps is still a relatively new field, it is sure to continue to develop and evolve as LLMs become more prevalent in the AI industry.
Key Takeaways:

- LLMOps covers the tools and best practices for managing the lifecycle of LLM-powered applications, from development to deployment and monitoring.
- Instead of training models from scratch, LLM-powered applications adapt pre-trained foundation models to downstream tasks via prompt engineering or fine-tuning.
- LLMOps differs from MLOps mainly in data management, experimentation, evaluation, cost, and latency.
Overall, the rise of LLMs and LLMOps marks a significant shift in how we build and maintain AI-powered products. I hope you liked this article. You can connect with me on LinkedIn.
Ans. Large language models (LLMs) are recent advances in deep learning models that work with human language. A large language model is a trained deep learning model that understands and generates text in a human-like fashion. Behind the scenes, a large transformer model does all the magic.
Ans. The key steps followed in LLMOps are:
1. Select a pre-trained Large Language Model as the base for your application.
2. Adapt the LLM to particular tasks using techniques like prompt engineering and fine-tuning.
3. Regularly evaluate the LLM’s performance through A/B testing and tools like HoneyHive.
4. Deploy the LLM-powered application, continuously monitor its performance, and refine it.