Fine-tuning lets you align a large language model with specific tasks, teach it new facts, and incorporate new information. Compared to prompting alone, a fine-tuned model typically performs better, and a small fine-tuned model can often match a much larger one while being faster and cheaper to run. It achieves superior task alignment because it is trained specifically on those tasks, and it can also be taught to use specialized tools or handle complicated workflows. This article explores how to fine-tune a large language model using the Mistral AI platform.
Dataset Preparation
For dataset preparation, data must be stored in JSON Lines (.jsonl) files, which allow multiple JSON objects to be stored, each on a new line. Datasets should follow an instruction-following format that represents a user-assistant conversation. Each JSON data sample should either consist of only user and assistant messages (“Default Instruct”) or include function-calling logic (“Function-calling Instruct”).
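For example, a single “Default Instruct” sample occupies one line of the .jsonl file and looks like this (a minimal, made-up illustration of the structure, not taken from any particular dataset):
{"messages": [{"role": "user", "content": "What is the capital of France?"}, {"role": "assistant", "content": "The capital of France is Paris."}]}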
Let us look at a few use cases for constructing a dataset.
Let’s say we want to extract medical information from notes. We can use the medical_knowledge_from_extracts dataset to get the desired output format: a JSON object containing conditions and interventions, where interventions are categorized into behavioral, drug, and other interventions.
Here’s an example of output:
{
  "conditions": "Proteinuria",
  "interventions": [
    "Drug: Losartan Potassium",
    "Other: Comparator: Placebo (Losartan)",
    "Drug: Comparator: amlodipine besylate",
    "Other: Comparator: Placebo (amlodipine besylate)",
    "Other: Placebo (Losartan)",
    "Drug: Enalapril Maleate"
  ]
}
The following code demonstrates how to load this data, format it accordingly, and save it as a .jsonl file. Additionally, you can randomize the order and split the data into training and validation files for further processing.
import pandas as pd
import json

# Load the source CSV with question/answer pairs
df = pd.read_csv(
    "https://huggingface.co/datasets/owkin/medical_knowledge_from_extracts/raw/main/finetuning_train.csv"
)

# Convert each row into the instruction-following chat format
df_formatted = [
    {
        "messages": [
            {"role": "user", "content": row["Question"]},
            {"role": "assistant", "content": row["Answer"]},
        ]
    }
    for index, row in df.iterrows()
]

# Write one JSON object per line (.jsonl)
with open("data.jsonl", "w") as f:
    for line in df_formatted:
        json.dump(line, f)
        f.write("\n")
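The fine-tuning job later in this article expects separate training and validation files, so you can shuffle the formatted samples from the block above and split them before writing. Here is a minimal sketch, continuing from the code above; the file names and the 95/5 split ratio are arbitrary choices:
import json
import random

# Shuffle the formatted samples so the split is random
random.seed(42)
random.shuffle(df_formatted)

# Hold out 5% of the samples for validation
split_index = int(0.95 * len(df_formatted))
train_samples = df_formatted[:split_index]
eval_samples = df_formatted[split_index:]

with open("training_file.jsonl", "w") as f:
    for sample in train_samples:
        f.write(json.dumps(sample) + "\n")

with open("validation_file.jsonl", "w") as f:
    for sample in eval_samples:
        f.write(json.dumps(sample) + "\n")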
For text-to-SQL generation, we can use a dataset containing SQL questions together with the context of the SQL table, and train the model to output the correct SQL syntax. The code below shows how to load this data and format it into the same user-assistant structure:
import pandas as pd
import json

# Load the text-to-SQL dataset (question, table context, and answer columns)
df = pd.read_json(
    "https://huggingface.co/datasets/b-mc2/sql-create-context/resolve/main/sql_create_context_v4.json"
)

# Wrap each row in an instruction prompt that provides the question and table context
df_formatted = [
    {
        "messages": [
            {
                "role": "user",
                "content": f"""
You are a powerful text-to-SQL model. Your job is to answer questions about a database.
You are given a question and context regarding one or more tables.
You must output the SQL query that answers the question.
### Input: {row["question"]}
### Context: {row["context"]}
### Response:
""",
            },
            {"role": "assistant", "content": row["answer"]},
        ]
    }
    for index, row in df.iterrows()
]

# Write one JSON object per line (.jsonl)
with open("data.jsonl", "w") as f:
    for line in df_formatted:
        json.dump(line, f)
        f.write("\n")
We can also fine-tune an LLM to improve its performance for retrieval-augmented generation (RAG). Retrieval Augmented Fine-Tuning (RAFT) is a method that fine-tunes an LLM to answer questions based on relevant documents while ignoring irrelevant ones, which yields substantial improvements in RAG performance across specialized domains.
To create a fine-tuning dataset for RAG, begin with the context, which is the document’s original text of interest. Using this context, generate questions and answers to form query-context-answer triplets. Below are two prompt templates for generating these questions and answers:
You can use the prompt template below to generate questions based on the context:
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, generate {num_questions_per_chunk} questions based on the context. The questions should be diverse in nature across the document. Restrict the questions to the context of the information provided.
You can use the prompt template below to generate answers based on the context and the question produced by the previous template:
Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, answer the query.
Query: {generated_query_str}
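Putting the first template to work, you could, for example, use a Mistral model itself to generate the questions for each context chunk. The sketch below assumes the same client as the rest of this article; the model choice, the number of questions, and the placeholder context are illustrative:
import os
from mistralai.client import MistralClient
from mistralai.models.chat_completion import ChatMessage

client = MistralClient(api_key=os.environ.get("MISTRAL_API_KEY"))

question_prompt = """Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge, generate {num_questions_per_chunk} questions based on the context. The questions should be diverse in nature across the document. Restrict the questions to the context of the information provided."""

# Placeholder: a chunk of the document you want the fine-tuned model to answer from
context_str = "..."

response = client.chat(
    model="open-mistral-7b",
    messages=[
        ChatMessage(
            role="user",
            content=question_prompt.format(
                context_str=context_str, num_questions_per_chunk=3
            ),
        )
    ],
)
print(response.choices[0].message.content)
The same pattern, with the second template, produces the answers; together they give you the query-context-answer triplets used for RAFT.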
Mistral AI’s models natively support function calling, and this capability can be enhanced by fine-tuning on function-calling data. In some cases, however, the native function-calling features may not be sufficient, especially when working with specific tools and domains. In these instances, it is worth fine-tuning on your own agent data for function calling. This can significantly improve the agent’s performance and accuracy, enabling it to select the appropriate tools and actions effectively.
Here is a simple example to train the model to call the generate_anagram() function as needed:
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant with access to the following functions to help the user. You can use the functions if needed."
    },
    {
      "role": "user",
      "content": "Can you help me generate an anagram of the word 'listen'?"
    },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "TX92Jm8Zi",
          "type": "function",
          "function": {
            "name": "generate_anagram",
            "arguments": "{\"word\": \"listen\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "{\"anagram\": \"silent\"}",
      "tool_call_id": "TX92Jm8Zi"
    },
    {
      "role": "assistant",
      "content": "The anagram of the word 'listen' is 'silent'."
    },
    {
      "role": "user",
      "content": "That's amazing! Can you generate an anagram for the word 'race'?"
    },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "3XhQnxLsT",
          "type": "function",
          "function": {
            "name": "generate_anagram",
            "arguments": "{\"word\": \"race\"}"
          }
        }
      ]
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "generate_anagram",
        "description": "Generate an anagram of a given word",
        "parameters": {
          "type": "object",
          "properties": {
            "word": {
              "type": "string",
              "description": "The word to generate an anagram of"
            }
          },
          "required": ["word"]
        }
      }
    }
  ]
}
You can validate the dataset format with Mistral’s validation script and fix formatting issues with the reformatting script:
# Download the validation script
wget https://raw.githubusercontent.com/mistralai/mistral-finetune/main/utils/validate_data.py
# Download the reformat script
wget https://raw.githubusercontent.com/mistralai/mistral-finetune/main/utils/reformat_data.py
# Reformat data
python reformat_data.py data.jsonl
# Validate data
python validate_data.py data.jsonl
Once you have data files in the right format, you can upload them to the Mistral client, making them available for use in fine-tuning jobs.
import os
from mistralai.client import MistralClient

api_key = os.environ.get("MISTRAL_API_KEY")
client = MistralClient(api_key=api_key)

# Upload the training file
with open("training_file.jsonl", "rb") as f:
    training_data = client.files.create(file=("training_file.jsonl", f))
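The fine-tuning job below also references a validation file, so upload it the same way (this assumes you saved a validation_file.jsonl during the split step earlier):
# Upload the validation file
with open("validation_file.jsonl", "rb") as f:
    validation_data = client.files.create(file=("validation_file.jsonl", f))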
Please note that fine-tuning happens on the Mistral LLM hosted on the Mistral platform. For the Mistral 7B model, each fine-tuning job costs $2 per 1M training tokens, with a minimum charge of $4; for example, a job that trains on 3M tokens would cost $6.
Once the data files are uploaded, we can create a fine-tuning job:
from mistralai.models.jobs import TrainingParameters

created_jobs = client.jobs.create(
    model="open-mistral-7b",
    training_files=[training_data.id],
    validation_files=[validation_data.id],
    hyperparameters=TrainingParameters(
        training_steps=10,
        learning_rate=0.0001,
    ),
)
created_jobs
The main hyperparameters here are training_steps and learning_rate. For LoRA fine-tuning, the recommended learning rate is 1e-4 (the default) or 1e-5.
Here, the learning rate specified is the peak rate rather than a flat rate: it warms up linearly and then decays following a cosine schedule. During the warmup phase, the learning rate increases linearly from a small initial value to the peak value over a number of training steps; after that, it decreases following a cosine curve.
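The sketch below illustrates the shape of such a schedule; the warmup length, total steps, and final value are illustrative assumptions, not Mistral’s internal settings:
import math

def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_steps=10):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        # Warmup: increase linearly from a small value to the peak
        return peak_lr * (step + 1) / warmup_steps
    # Decay: follow a cosine curve from the peak down to zero
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Print the learning rate every 10 steps of a 100-step run
for step in range(0, 100, 10):
    print(step, round(lr_at_step(step, total_steps=100), 6))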
We can also add a Weights & Biases integration to monitor and track the training metrics:
from mistralai.models.jobs import WandbIntegrationIn, TrainingParameters
import os

wandb_api_key = os.environ.get("WANDB_API_KEY")

created_jobs = client.jobs.create(
    model="open-mistral-7b",
    training_files=[training_data.id],
    validation_files=[validation_data.id],
    hyperparameters=TrainingParameters(
        training_steps=10,
        learning_rate=0.0001,
    ),
    integrations=[
        WandbIntegrationIn(
            project="test_api",
            run_name="test",
            api_key=wandb_api_key,
        ).dict()
    ],
)
created_jobs
You can also pass the dry_run=True argument to find out how many tokens the model would be trained on.
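A minimal sketch of such a dry run, assuming dry_run is accepted by the same jobs.create call:
# Dry run: report the number of training tokens instead of launching a job
dry_run_job = client.jobs.create(
    model="open-mistral-7b",
    training_files=[training_data.id],
    validation_files=[validation_data.id],
    hyperparameters=TrainingParameters(
        training_steps=10,
        learning_rate=0.0001,
    ),
    dry_run=True,
)
print(dry_run_job)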
Then, we can list jobs, retrieve a job, or cancel a job.
# List jobs
jobs = client.jobs.list()
print(jobs)
# Retrieve a job
retrieved_jobs = client.jobs.retrieve(created_jobs.id)
print(retrieved_jobs)
# Cancel a job
canceled_jobs = client.jobs.cancel(created_jobs.id)
print(canceled_jobs)
Once a fine-tuning job has completed, you can get the fine-tuned model’s name with retrieved_jobs.fine_tuned_model.
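Since the job takes some time to finish, you can poll until the model name becomes available. This is a simple sketch reusing the retrieve call shown above; in practice you may also want to check the job status and stop on failure:
import time

retrieved_jobs = client.jobs.retrieve(created_jobs.id)
while retrieved_jobs.fine_tuned_model is None:
    time.sleep(10)  # wait before checking again
    retrieved_jobs = client.jobs.retrieve(created_jobs.id)
print(retrieved_jobs.fine_tuned_model)
You can then use this model name for inference: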
from mistralai.models.chat_completion import ChatMessage

# Run a chat completion against the fine-tuned model
chat_response = client.chat(
    model=retrieved_jobs.fine_tuned_model,
    messages=[
        ChatMessage(role="user", content="What is the best French cheese?")
    ],
)
print(chat_response.choices[0].message.content)
We can also use open-source libraries from Mistral AI to fine-tune and perform inference on Large Language Models (LLMs) completely locally. Utilize the following repositories for these tasks:
Fine-Tuning: https://github.com/mistralai/mistral-finetune
Inference: https://github.com/mistralai/mistral-inference
In conclusion, fine-tuning large language models on the Mistral platform enhances their performance for specific tasks, integrates new information, and manages complex workflows. You can achieve superior task alignment and efficiency by preparing datasets correctly and using Mistral’s tools. Whether dealing with medical data, generating SQL queries, or improving retrieval-augmented generation systems, fine-tuning is essential for maximizing your models’ potential. The Mistral platform provides the flexibility and power to achieve your AI development goals.
Frequently Asked Questions
Q. What are the benefits of fine-tuning large language models?
A. Fine-tuning large language models significantly improves their alignment with specific tasks. It also allows the models to incorporate new facts and handle complex workflows more effectively than traditional prompting methods.
Q. What format should the fine-tuning dataset be in?
A. Datasets must be stored in JSON Lines (.jsonl) format, with each line containing a JSON object. The data should follow an instruction-following format that represents user-assistant conversations, and the “role” must be “user,” “assistant,” “system,” or “tool.”
Q. What does the Mistral platform provide for fine-tuning?
A. The Mistral platform offers tools for uploading and preparing datasets, configuring fine-tuning jobs with specific models and hyperparameters, and monitoring training with integrations like Weights & Biases. It also supports performing inference using fine-tuned models, providing a comprehensive environment for AI development.