In the rapidly evolving field of Natural Language Processing (NLP), one of the most intriguing challenges is converting natural language queries into SQL statements, known as Text2SQL. The ability to transform a simple English question into a complex SQL query opens up numerous possibilities in database management and data analysis. This is where TinyLlama, a variant of the large language model Llama, comes into play. In this guide, we will explore how to fine-tune TinyLlama to generate SQL statements from natural language queries.
TinyLlama is a compact 1.1B-parameter model that follows the Llama architecture and is suited to tasks like text generation and question answering. By fine-tuning it on a specific dataset, it can be adapted for specialized tasks like generating SQL queries from natural language.
The first step is preparing the Python environment by installing the necessary libraries:
Installation of Libraries
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python
!pip3 install huggingface-hub
!pip3 install accelerate peft bitsandbytes transformers trl
These libraries are necessary for training and fine-tuning large language models, which are powerful AI models that can be used for a variety of tasks, such as text generation, question answering, and summarization.
Now we are ready. The next phase is downloading the TinyLlama model and initializing it for use.
from huggingface_hub import hf_hub_download
model_name = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
# Define the name of the model file to download.
model_file = "tinyllama-1.1b-chat-v1.0.Q8_0.gguf"
# Download the model from the Hugging Face Hub and store the
# path to the downloaded file in the `model_path` variable.
model_path = hf_hub_download(model_name, filename=model_file)
# Print a message indicating that the model has been downloaded.
print(f"Model downloaded to: {model_path}")
Running this code downloads the 8-bit quantized (Q8_0) GGUF build of TinyLlama 1.1B Chat from the Hugging Face Hub and stores the local path in the model_path variable. Printing it shows the path to the downloaded file.
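On a typical Colab run, the printed path points into the Hugging Face cache and looks roughly like the line below (the exact snapshot hash will differ):
Model downloaded to: /root/.cache/huggingface/hub/models--TheBloke--TinyLlama-1.1B-Chat-v1.0-GGUF/snapshots/<snapshot-hash>/tinyllama-1.1b-chat-v1.0.Q8_0.gguf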
from llama_cpp import Llama
# Initialize a `Llama` object with the downloaded model path.
llm = Llama(
model_path=model_path,
# Set the number of context tokens.
n_ctx=512,
# Set the number of threads to use.
n_threads=8,
# Set the number of GPU layers to work with.
n_gpu_layers=40
)
# Print a message indicating that the Llama object has been initialized.
print("Llama object initialized successfully.")
Running the code initializes the Llama object from the downloaded model and prints a message confirming that the model is ready. With this initialized model, we can pass in prompts and perform tasks like text generation, classification, and summarization. Let's test the model by passing in an example prompt.
# Use the Llama object to generate an answer to the question.
output = llm(
# Prompt
"<|im_start|>user\nAre you a robot?<|im_end|>\n<|im_start|>assistant\n",
# Set the maximum number of tokens to generate.
max_tokens=512,
# Set the stop sequences to indicate the end of the generated text.
stop=["</s>"],
)
# Print the generated text.
print(output['choices'][0]['text'])
Here we pass the input prompt to the llm in the chat format that TinyLlama understands. We set max_tokens to 512 and provide a stop sequence so the model knows when to stop generating text. Running this produced the following output.
Since we will be fine-tuning the model to generate SQL statements, why not first test the model before fine-tuning? Before that, let's define a function that takes our data and formats it into a prompt the Large Language Model can understand. We will work with the below function.
def chat_template(question, context):
"""
Creates a chat template for the Llama model.
Args:
question: The question to be answered.
context: The context information to be used for generating the answer.
Returns:
A string containing the chat template.
"""
template = f"""\
<|im_start|>user
Given the context, generate an SQL query for the following question
context:{context}
question:{question}
<|im_end|>
<|im_start|>assistant
"""
# Remove any leading whitespace characters from each line in the template.
template = "\n".join([line.lstrip() for line in template.splitlines()])
return template
question = "How many heads of the departments are older than 56 ?"
context = "CREATE TABLE head (age INTEGER)"
print(chat_template(question,context))
The output generated by the function is shown below.
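For the question and context above, the function returns the following string (each line's leading whitespace is stripped by the final step of the function):
<|im_start|>user
Given the context, generate an SQL query for the following question
context:CREATE TABLE head (age INTEGER)
question:How many heads of the departments are older than 56 ?
<|im_end|>
<|im_start|>assistant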
This template instructs the model to create an SQL query for the given question based on the provided context. Now let's pass it to the model and check the generated output.
# Use the Llama object to generate an answer to the question.
output = llm(
chat_template(question, context),
# Set the maximum number of tokens to generate.
max_tokens=512,
# Set the stop sequences to indicate the end of the generated text.
stop=["</s>"],
)
# Print the generated text.
print(output['choices'][0]['text'])
When we run this code, it first creates the chat template containing the question and context information. The template is then passed to the llm object, which generates an answer based on it, and the answer is stored in the output variable. Finally, the generated answer is printed. TinyLlama produced the following response.
Here the model did produce the correct answer at the end of the generation, but it also produced a lot of gibberish. Most of the text is unnecessary and not meaningful. This can be rectified by fine-tuning TinyLlama on an SQL dataset.
Fine-tuning requires a specialized dataset that pairs natural language questions with SQL queries. The b-mc2/sql-create-context dataset on the Hugging Face Hub provides exactly this. The dataset is open source and can therefore be used for commercial purposes as well. Let's download it from Hugging Face.
from datasets import load_dataset, Dataset
# Define the dataset for fine-tuning
dataset_id = "b-mc2/sql-create-context"
data = load_dataset(dataset_id, split="train")
df = data.to_pandas()
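Before formatting, it helps to confirm what the raw data looks like. The split contains the three columns we will use below, question, context, and answer, which a quick check can verify (a minimal sketch; the exact row count may change as the dataset evolves):
# Inspect the size and columns of the dataset.
print(df.shape)
print(df.columns.tolist())  # should include 'question', 'context' and 'answer'
# Look at one raw example.
print(df.iloc[0])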
TinyLlama cannot work with these three columns (question, context, and answer) directly, because it expects its input in a specific chat format. So we will combine the three columns into a single column formatted in a way TinyLlama can understand. For that we define a chat template that handles this formatting, taking in the columns and producing the formatted training text. The function looks like the below.
def chat_template_for_training(context, answer, question):
"""
Creates a chat template for training the TinyLlama model.
Args:
question: The question to be answered.
context: The context information to be used for generating the answer.
answer: The answer to be generated by the LLM
Returns:
A string containing the chat template.
"""
template = f"""\
<|im_start|>user
Given the context, generate an SQL query for the following question
context:{context}
question:{question}
<|im_end|>
<|im_start|>assistant
{answer}
<|im_end|>
"""
# Remove any leading whitespace characters from each line in the template.
template = "\n".join([line.lstrip() for line in template.splitlines()])
return template
This function is similar to the one defined earlier. The only difference is that we also add the assistant's answer here, so TinyLlama learns what to generate when it receives this kind of input. Now we will create a new column that contains the data in this format.
# Apply the chat_template_for_training function to each row in the
# dataframe and store the result in a new "text" column.
df["text"] = df.apply(lambda x: chat_template_for_training(x["context"],
x["answer"], x["question"]), axis=1)
# Convert the dataframe back to a Dataset object.
formatted_data = Dataset.from_pandas(df)
Let's print one of the rows from the text column in the dataset and observe how the data from the three columns was combined.
print(df['text'][1])
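Using the earlier question/context pair purely as an illustration (the actual row at index 1 will hold a different example from the dataset), a formatted training row has this shape:
<|im_start|>user
Given the context, generate an SQL query for the following question
context:CREATE TABLE head (age INTEGER)
question:How many heads of the departments are older than 56 ?
<|im_end|>
<|im_start|>assistant
SELECT COUNT(*) FROM head WHERE age > 56
<|im_end|>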
Each row of the text column follows this structure. The data in this column is what will be sent to the model for training.
Training the model as is will be difficult: with the free GPU resources available in Colab, it is hard to train TinyLlama directly. Hence, before starting the fine-tuning, we will load the model in a 4-bit quantized format. This lets the model fit in the free Colab GPU and allows us to train it on the SQL data.
from transformers import AutoTokenizer
# Define the model to fine-tune
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# Load the tokenizer for the specified model.
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Set the padding token to be the same as the end of sentence token.
tokenizer.pad_token = tokenizer.eos_token
Next, we will define our quantization configuration and load the model in that specified quantized format. For that, we will work with the bitsandbytes library.
from transformers import BitsAndBytesConfig, AutoModelForCausalLM
# Define the quantization configuration for memory-efficient training.
bnb_config = BitsAndBytesConfig(
# Load the model weights in 4-bit quantized format.
load_in_4bit=True,
# Specify the quantization type to use for 4-bit quantization.
bnb_4bit_quant_type="nf4",
# Specify the data type to use for computations during training.
bnb_4bit_compute_dtype="float16",
# Specify whether to use double quantization for 4-bit quantization.
bnb_4bit_use_double_quant=True
)
# Load the model from the specified model ID and apply the quantization configuration.
model = AutoModelForCausalLM.from_pretrained(
model_id,
quantization_config=bnb_config,
device_map="auto"
)
The bnb_config variable defines the quantization configuration for the BitsAndBytesConfig class. This configuration specifies how the model will be quantized for memory-efficient training.
Finally, we use the AutoModelForCausalLM.from_pretrained function to download the model from the Hugging Face Hub, passing the model name along with the quantization configuration we just defined. Setting device_map="auto" places the model on the GPU when one is available.
The model is thus downloaded according to the quantization configuration. Along with this, we set a couple of additional options, shown below.
# Disable the KV cache; it is not needed during training.
model.config.use_cache = False
# Set the pretraining tensor parallelism degree to 1 (no tensor parallelism).
model.config.pretraining_tp = 1
With this, we are done loading the model.
Fine-tuning adapts the pre-trained model to the specific task of generating SQL queries. The fine-tuning method we will apply is one of the PEFT (Parameter-Efficient Fine-Tuning) techniques called QLoRA (Quantized Low-Rank Adaptation). With QLoRA we train only a small set of low-rank adapter matrices, which are later combined with the base model to generate the final output.
To fine-tune it with QLoRA, we first need to define the LoRA configuration. The below code helps in doing the same
from peft import LoraConfig
# Define the PEFT configuration.
peft_config = LoraConfig(
# Set the rank of the LoRA update matrices.
r=8,
# Set the LoRA scaling factor (alpha).
lora_alpha=16,
# Set the dropout applied to the LoRA layers.
lora_dropout=0.05,
# Set the bias term to "none".
bias="none",
# Set the task type to "CAUSAL_LM".
task_type="CAUSAL_LM"
)
Now we need to set our training arguments. For this, we create a TrainingArguments object and pass it the following parameters.
from transformers import TrainingArguments
# Define the training arguments.
training_args = TrainingArguments(
# Set the output directory for the training run.
output_dir="tinyllama-sqllm-v1",
# Set the per-device training batch size.
per_device_train_batch_size=6,
# Set the number of gradient accumulation steps.
gradient_accumulation_steps=2,
# Set the optimizer to use.
optim="paged_adamw_32bit",
# Set the learning rate.
learning_rate=2e-4,
# Set the learning rate scheduler type.
lr_scheduler_type="cosine",
# Set the save strategy.
save_strategy="epoch",
# Set the logging steps.
logging_steps=10,
# Set the number of training epochs.
num_train_epochs=2,
# Set the maximum number of training steps (this overrides num_train_epochs).
max_steps=500,
# Enable fp16 training.
fp16=True,
)
We are done setting up the training arguments. Now we will create the trainer that will train our model on our dataset. The code for this is
from trl import SFTTrainer
# Initialize the SFTTrainer.
trainer = SFTTrainer(
# Set the model to be trained.
model=model,
# Set the training dataset.
train_dataset=formatted_data,
# Set the PEFT configuration.
peft_config=peft_config,
# Set the name of the text field in the dataset.
dataset_text_field="text",
# Set the training arguments.
args=training_args,
# Set the tokenizer.
tokenizer=tokenizer,
# Disable packing.
packing=False,
# Set the maximum sequence length.
max_seq_length=1024
)
trainer.train()
Finally, trainer.train() starts the training, which runs for 500 steps. This should take around 8 to 9 minutes on the T4 GPU provided by the free Colab tier. After training, a checkpoint containing our trained PEFT adapter will be available.
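If you would like the adapter in a known location in addition to the automatic checkpoint directory, you can save it explicitly; the final_adapter folder name below is just an example:
# Save the trained LoRA adapter and the tokenizer to a chosen directory.
trainer.model.save_pretrained("tinyllama-sqllm-v1/final_adapter")
tokenizer.save_pretrained("tinyllama-sqllm-v1/final_adapter")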
What we have trained is a PEFT adapter, that is, a small number of additional parameters. These parameters cannot generate text on their own; we need to merge this adapter with the base model before we can run inference with the fine-tuned model.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel
# Load the pre-trained base model.
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16,
load_in_8bit=False,
device_map="auto",
trust_remote_code=True
)
# Load the PEFT model from a checkpoint.
model_path = "/content/tinyllama-sqllm-v1/checkpoint-500"
peft_model = PeftModel.from_pretrained(model, model_path, from_transformers=True, device_map="auto")
# Wrap the model with the PEFT model.
model = peft_model.merge_and_unload()
The PeftModel.from_pretrained function loads the trained adapter from the checkpoint and attaches it to the base model, and merge_and_unload() folds the adapter weights into the base model. Overall, the snippet shows how to use the peft library to load a pre-trained model and merge it with the trained adapter. We can now run inference with this merged model, which carries the adapter trained on the SQL data.
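If you want to reuse the merged model later without going through the PEFT merge again, you can persist it to disk; the directory name here is only an example:
# Save the merged model and tokenizer for later inference.
model.save_pretrained("tinyllama-sqllm-merged")
tokenizer.save_pretrained("tinyllama-sqllm-merged")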
Post training, TinyLlama should be able to convert natural language questions into SQL queries. Let’s test this with an example
# Prepare the Prompt.
question = "How many heads of the departments are older than 56 ?"
context = "CREATE TABLE head (age INTEGER)"
prompt = chat_template(question,context)
# Encode the prompt.
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
# Generate the output.
output = model.generate(**inputs, max_new_tokens=512)
# Decode the output.
text = tokenizer.decode(output[0], skip_special_tokens=True)
# Print the generated SQL query.
print(text)
The output generated from running the model can be seen below
We can see that the model has followed our prompt. Before fine-tuning, the model generated a lot of unwanted text; after fine-tuning for just 500 steps, it produces clear and concise answers. Training for more steps, or for full epochs over the dataset, should make the model more robust so that the fine-tuned TinyLlama can handle more complex queries, as in the quick check below.
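As a further sanity check, you can reuse the same chat_template helper and generation code on a schema that did not appear in this walkthrough; the employees table below is purely made up for illustration:
# A hypothetical schema and question, formatted with the same helper.
question = "What is the average salary of employees in the Sales department?"
context = "CREATE TABLE employees (name VARCHAR, department VARCHAR, salary INTEGER)"
prompt = chat_template(question, context)
inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))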
Fine-tuning TinyLlama for the Text2SQL task is a significant step towards making data querying more intuitive and accessible. By transforming natural language into SQL queries, it bridges the gap between complex database languages and user-friendly interfaces. This fine-tuning process illustrates the model’s adaptability and the potential of AI in enhancing data-driven decision-making.
The key takeaways from this guide include:
- Fine-tuning requires specific libraries such as llama-cpp-python, huggingface-hub, accelerate, peft, bitsandbytes, and transformers, installed with the pip commands shown above.
- The TinyLlama model can be downloaded from the Hugging Face Hub using the hf_hub_download function; initialization involves creating a Llama object with the downloaded model path and setting parameters like context tokens and GPU layers.
- A specialized dataset that pairs natural language questions with corresponding SQL queries is needed for fine-tuning; such datasets are available on the Hugging Face Hub.
- The dataset needs to be converted into a format TinyLlama understands, which typically involves creating a chat template that combines context, question, and SQL answer into a single formatted string.
- Fine-tuning uses this specialized dataset to adapt TinyLlama to convert natural language into SQL queries, with techniques like QLoRA and configurations set up for memory-efficient training.