The rise of Retrieval-Augmented Generation (RAG) and Knowledge Graphs has revolutionized how we interact with complex data sets by providing a structured, interconnected representation of information. Knowledge Graphs, such as those used in Neo4j, facilitate the querying and visualization of relationships within data. However, translating natural language into structured query languages like Cypher remains a challenging task. This guide aims to bridge this gap by detailing the fine-tuning of the Phi-3 Medium model to generate Cypher queries from natural language inputs. By leveraging the compact yet powerful capabilities of the Phi-3 Medium model, even small-scale developers can efficiently convert text to Cypher queries, enhancing the accessibility and usability of Knowledge Graphs.
This article was published as a part of the Data Science Blogathon.
The Phi family of Large Language Models was introduced by Microsoft to show that even small language models can perform well and may be on par with bigger models. Microsoft has trained this family of small models on different types of datasets, making them good at different tasks including entity extraction, summarization, chatbots, roleplay, and more.
Microsoft released these models with the idea that their small size lets even small developers work with them and train them on their own datasets, opening up many different applications. Recently, Microsoft announced the third generation of the Phi family, the Phi 3 series of Large Language Models.
In the Phi 3 series, the context length was brought from 4k tokens up to 128k tokens, allowing more context to fit in. The Phi 3 family comes in different sizes, starting with the smallest 3.8 billion parameter model called Phi 3 Mini, followed by Phi 3 Small, which is a 7B parameter model, and finally Phi 3 Medium, a 14 billion parameter model and the one we will train in this guide. All of these models have a long-context version extending the context length to 128k tokens.
Developed by Daniel and Michael Han, Unsloth has emerged as one of the best-optimized frameworks for fine-tuning large language models (LLMs). Known for its speed and memory efficiency, Unsloth claims to increase training speeds by up to 30 times while reducing memory usage by 60%. These capabilities make it a good fit for developers aiming to fine-tune LLMs with accuracy and speed.
Unsloth supports different hardware configurations, from NVIDIA GPUs like the Tesla T4 and H100 to AMD and Intel GPUs. It also employs techniques like intelligent weight upcasting, which minimizes the need to upcast weights during QLoRA, thereby optimizing memory use.
As an open-source tool under the Apache 2.0 license, Unsloth integrates seamlessly into the fine-tuning of prominent LLMs like Mistral 7B, Llama, and Gemma, achieving up to a 5x increase in fine-tuning speed while reducing memory usage by 60%. It is also compatible with optimizations like Flash-Attention 2, which speeds up not only inference but also fine-tuning.
We will first create our environment. For this we will download Unsloth for Google Colab.
!pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
Then we will create some default Unsloth values for training. These are:
from unsloth import FastLanguageModel
import torch
sequence_length_maximum = 2048
weights_data_type = None
quantize_to_4bit = True
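We start by importing the FastLanguageModel class from the Unsloth library. Then we define some variables that are used throughout the guide: sequence_length_maximum is the maximum sequence length the model is loaded and trained with, weights_data_type is the data type of the weights (None lets Unsloth detect a suitable type automatically), and quantize_to_4bit tells Unsloth to load the model with 4-bit quantization.
Here, we download the Phi 3 Medium model using Unsloth's FastLanguageModel class.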
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3-medium-4k-instruct",
    max_seq_length = sequence_length_maximum,
    dtype = weights_data_type,
    load_in_4bit = quantize_to_4bit,
    token = "YOUR_HF_TOKEN",
)
When we run the code, both the Phi 3 Medium model and its tokenizer are fetched from the HuggingFace repository and downloaded to the Colab environment.
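To see why 4-bit loading matters here, a back-of-the-envelope estimate of the weight memory helps. The sketch below is only arithmetic on the 14 billion parameter count stated earlier; it ignores activations, optimizer state, LoRA adapters, and the small overhead that quantization itself adds, so real usage will be higher. The function name is hypothetical.

```python
# Rough weight-memory estimate. It ignores activations, optimizer state,
# LoRA adapters, and quantization overhead (scales, zero points).
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

PHI3_MEDIUM_PARAMS = 14e9  # 14 billion parameters, as stated in this guide

fp16_gb = weight_memory_gb(PHI3_MEDIUM_PARAMS, 16)
int4_gb = weight_memory_gb(PHI3_MEDIUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gb:.0f} GB")
print(f"4-bit weights:  ~{int4_gb:.0f} GB")
```

Under these assumptions the 16-bit weights alone would not fit on a Colab T4, which has roughly 16 GB of memory, while the 4-bit weights leave room for training.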
Fully fine-tuning all the weights of Phi 3 Medium is not practical on a free Colab GPU, so we train only a small number of additional weights. For this, we work with LoRA (Low-Rank Adaptation), which freezes the original weights and trains small low-rank adapter matrices attached to selected layers. We therefore need to create a LoRA config and get the parameter-efficient fine-tuned (PEFT) model from it. The code for this will be:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # rank of the LoRA adapter matrices
    target_modules = ["q_proj", "k_proj", "down_proj", "v_proj", "o_proj",
                      "up_proj", "gate_proj"],
    lora_alpha = 16,
    bias = "none",
    lora_dropout = 0,
    random_state = 3407,
    use_gradient_checkpointing = True,  # boolean, not the string "True"
)
After running this code, the LoRA Adapters for the Phi 3 Medium will be created. Now we can work with this peft model and finetune it with a dataset of our choice.
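To make the saving from LoRA concrete, the sketch below counts the trainable weights that a rank-r adapter adds to a single linear layer. The layer size used here is a hypothetical round number chosen for illustration, not the real dimensions of Phi 3 Medium, and the function name is an assumption.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer.

    LoRA keeps the original weight frozen and learns two small matrices,
    A (r x d_in) and B (d_out x r), whose product is added to the weight.
    """
    return r * d_in + d_out * r

# Hypothetical layer size for illustration; not Phi 3 Medium's real dimensions.
d_in, d_out, r = 4096, 4096, 16

full = d_in * d_out
added = lora_params(d_in, d_out, r)

print(f"full layer:  {full:,} weights")
print(f"LoRA (r={r}): {added:,} trainable weights ({added / full:.2%} of the layer)")
```

With r = 16, as in the config above, the adapter for such a layer is well under 1% of the layer's size, which is why the approach fits on a small GPU.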
Here, we will train the Phi 3 Medium Large Language Model on a dataset that teaches it to generate Cypher queries, which are needed to query Knowledge Graph databases like Neo4j. For this, we will download a dataset provided in a GitHub repository. The command for this will be:
!wget https://raw.githubusercontent.com/neo4j-labs/text2cypher/main/datasets/synthetic_gpt4turbo_demodbs/text2cypher_gpt4turbo.csv
The above command will download a CSV file. This CSV file contains the dataset that we will be working with to train the Phi 3 Medium LLM. Before that, we need to do some preprocessing. We are only taking a certain part i.e. a subset of the dataset. The code for this will be:
import pandas as pd
df = pd.read_csv('/content/text2cypher_gpt4turbo.csv')
df = df[(df['database'] == 'recommendations') &
(df['syntax_error'] == False) & (df['timeout'] == False)]
df = df[['question','cypher']]
df.rename(columns={'question': 'input','cypher':'output'}, inplace=True)
df.reset_index(drop=True, inplace=True)
Here, we filter the data. We keep only the rows that come from the recommendations database, have no syntax error, and did not time out. This is necessary because we want Phi 3 to produce Cypher queries that are free of syntax errors.
The dataset contains many columns, but we only need the question and cypher columns. We rename them to input and output: the question column is the input, and the cypher column is the output that the Large Language Model needs to generate.
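The filter logic can be checked on a handful of rows without downloading anything. The sketch below uses only the standard library rather than pandas, and the four rows (including the "companies" database name) are hypothetical; only the column names come from the dataset. Note that the csv module reads the flags as the strings "True" and "False", whereas pandas parses them as booleans.

```python
import csv
import io

# Hypothetical rows for illustration; only the column names come from the dataset.
raw = io.StringIO(
    "question,cypher,database,syntax_error,timeout\n"
    "Q1,MATCH (m:Movie) RETURN m LIMIT 1,recommendations,False,False\n"
    "Q2,MATCH (m:Movie RETURN m,recommendations,True,False\n"
    "Q3,MATCH (c:Company) RETURN c,companies,False,False\n"
    "Q4,MATCH (u:User) RETURN u,recommendations,False,True\n"
)

# Keep rows from the recommendations database with no syntax error and no
# timeout, then keep only the two columns we need under their new names.
pairs = [
    {"input": row["question"], "output": row["cypher"]}
    for row in csv.DictReader(raw)
    if row["database"] == "recommendations"
    and row["syntax_error"] == "False"
    and row["timeout"] == "False"
]

print(pairs)
```

Only the first row survives: the second has a syntax error, the third belongs to another database, and the fourth timed out.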
After preprocessing, the dataset contains only two columns, input and output. The database we are working with for the training data has the following schema:
graph_schema = """
Node properties:
- **Movie**
- `url`: STRING Example: "https://themoviedb.org/movie/862"
- `runtime`: INTEGER Min:1, Max: 915
- `revenue`: INTEGER Min: 1, Max: 2787965087
- `budget`: INTEGER Min: 1, Max: 380000000
- `imdbRating`: FLOAT Min: 1.6, Max: 9.6
- `released`: STRING Example: "1995-11-22"
- `countries`: LIST Min Size: 1, Max Size: 16
- `languages`: LIST Min Size: 1, Max Size: 19
- `imdbVotes`: INTEGER Min: 13, Max: 1626900
- `imdbId`: STRING Example: "0114709"
- `year`: INTEGER Min: 1902, Max: 2016
- `poster`: STRING Example: "https://image.tmdb.org/t/p/w440_and_h660_face/uXDf"
- `movieId`: STRING Example: "1"
- `tmdbId`: STRING Example: "862"
- `title`: STRING Example: "Toy Story"
- **Genre**
- `name`: STRING Example: "Adventure"
- **User**
- `userId`: STRING Example: "1"
- `name`: STRING Example: "Omar Huffman"
- **Actor**
- `url`: STRING Example: "https://themoviedb.org/person/1271225"
- `bornIn`: STRING Example: "France"
- `bio`: STRING Example: "From Wikipedia, the free encyclopedia Lillian Di"
- `died`: DATE Example: "1954-01-01"
- `born`: DATE Example: "1877-02-04"
- `imdbId`: STRING Example: "2083046"
- `name`: STRING Example: "François Lallement"
- `poster`: STRING Example: "https://image.tmdb.org/t/p/w440_and_h660_face/6DCW"
- `tmdbId`: STRING Example: "1271225"
- **Director**
- `url`: STRING Example: "https://themoviedb.org/person/88953"
- `bornIn`: STRING Example: "Burchard, Nebraska, USA"
- `bio`: STRING Example: "Harold Lloyd has been called the cinema’s “first m"
- `died`: DATE Min: 1930-08-26, Max: 2976-09-29
- `born`: DATE Min: 1861-12-08, Max: 2018-05-01
- `imdbId`: STRING Example: "0516001"
- `name`: STRING Example: "Harold Lloyd"
- `poster`: STRING Example: "https://image.tmdb.org/t/p/w440_and_h660_face/er4Z"
- `tmdbId`: STRING Example: "88953"
- **Person**
- `url`: STRING Example: "https://themoviedb.org/person/1271225"
- `bornIn`: STRING Example: "France"
- `bio`: STRING Example: "From Wikipedia, the free encyclopedia Lillian Di"
- `died`: DATE Example: "1954-01-01"
- `born`: DATE Example: "1877-02-04"
- `imdbId`: STRING Example: "2083046"
- `name`: STRING Example: "François Lallement"
- `poster`: STRING Example: "https://image.tmdb.org/t/p/w440_and_h660_face/6DCW"
- `tmdbId`: STRING Example: "1271225"
Relationship properties:
- **RATED**
- `rating: FLOAT` Example: "2.0"
- `timestamp: INTEGER` Example: "1260759108"
- **ACTED_IN**
- `role: STRING` Example: "Officer of the Marines (uncredited)"
- **DIRECTED**
- `role: STRING`
The relationships:
(:Movie)-[:IN_GENRE]->(:Genre)
(:User)-[:RATED]->(:Movie)
(:Actor)-[:ACTED_IN]->(:Movie)
(:Actor)-[:DIRECTED]->(:Movie)
(:Director)-[:DIRECTED]->(:Movie)
(:Director)-[:ACTED_IN]->(:Movie)
(:Person)-[:ACTED_IN]->(:Movie)
(:Person)-[:DIRECTED]->(:Movie)
"""
The schema contains all the node properties and the relationships between the nodes that are present in the recommendations graph database. Now, we will convert the data to an instruction format, so that the model outputs a Cypher query only when it has been instructed to do so. The function for this will be:
prompt = """Given is an instruction below, along with an input \
that provides further context.
### Instruction:
{}
### Input:
{}
### Response:
{}"""
token_eos = tokenizer.eos_token  # appended so the model learns where to stop

def format_prompt(columns):
    # Build one training prompt per (input, output) pair in the batch
    instructions = (
        "Use the below text to generate a cypher query. "
        f"The schema is given below:\n{graph_schema}"
    )
    inps = columns["input"]
    outs = columns["output"]
    text_list = []
    for question, cypher in zip(inps, outs):
        text = prompt.format(instructions, question, cypher) + token_eos
        text_list.append(text)
    # Return the list we built (the original returned an undefined name)
    return {"text": text_list}
This function above will be passed to our dataset to create the final column. The code for this will be:
from datasets import Dataset
dataset = Dataset.from_pandas(df)
dataset = dataset.map(format_prompt, batched = True)
Running the code creates a new column called "text", which contains the prompts built by the format_prompt() function. The dataset has a total of 700+ rows and three columns: text, input, and output. With this, our data is ready for fine-tuning.
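To see what a single entry of the "text" column looks like, the sketch below builds one example without loading a model. The EOS string, the one-line schema, and the question/query pair are hypothetical stand-ins: in the guide, the EOS token comes from tokenizer.eos_token and the schema is the full graph_schema. The template is a shortened copy of the section headers used in the prompt.

```python
# Stand-ins so the sketch runs without a model: the real EOS token comes
# from tokenizer.eos_token, and the real schema is the full graph_schema.
EOS = "<eos>"  # hypothetical placeholder
tiny_schema = "(:Movie)-[:IN_GENRE]->(:Genre)"

template = """### Instruction:
{}
### Input:
{}
### Response:
{}"""

def build_example(question: str, cypher: str) -> str:
    """Fill the template with one question/query pair and append the EOS token."""
    instruction = (
        "Use the below text to generate a cypher query. "
        f"The schema is given below:\n{tiny_schema}"
    )
    return template.format(instruction, question, cypher) + EOS

example = build_example("List all genres.", "MATCH (g:Genre) RETURN g.name")
print(example)
```

Appending the EOS token to every example is what teaches the model to stop generating once the query is complete.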
We are now ready to fine-tune the Phi 3 Medium on the Cypher Query dataset. In this section, we start by creating our Trainer and the corresponding Training Arguments that we need to train our model on this dataset that we have prepared. The code for this will be:
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = sequence_length_maximum,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1,
        learning_rate = 2e-4,
        fp16 = True,
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.02,
        lr_scheduler_type = "linear",
        output_dir = "outputs",
    ),
)
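While training a Large Language Model or any deep learning model, we must set many hyperparameters that together determine how well the final model performs. Here, per_device_train_batch_size and gradient_accumulation_steps give an effective batch size of 8, learning_rate is warmed up over the first 5 steps and then decayed linearly, num_train_epochs is set to a single pass over the data, fp16 enables half-precision training, and optim selects the 8-bit AdamW optimizer to save memory.
We are now done defining our Trainer and the TrainingArguments for training our quantized Phi 3 Medium 14 billion parameter Large Language Model. Running trainer.train() will start the training.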
trainer_stats = trainer.train()
Running the above will start the training process. In Google Colab, working with the free T4 GPU, it takes around 1 hour and 40 minutes to go through 1 epoch on the training data. It takes around 95 steps to complete that one epoch, after which the training is finished.
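The number of optimizer steps in an epoch follows from the batch settings in the TrainingArguments. The sketch below is an approximation for a single GPU; the exact rounding can differ between Trainer versions, the function name is hypothetical, and 760 is an assumed row count picked to be in line with the 700+ rows mentioned earlier.

```python
import math

def approx_steps_per_epoch(num_rows: int, per_device_batch: int, grad_accum: int) -> int:
    """Approximate optimizer steps in one epoch on a single GPU."""
    # Each optimizer step consumes per_device_batch * grad_accum examples.
    effective_batch = per_device_batch * grad_accum
    return math.ceil(num_rows / effective_batch)

# 760 is a hypothetical row count; the guide only states "700+ rows".
print(approx_steps_per_epoch(760, per_device_batch=2, grad_accum=4))
```

With a batch size of 2 and 4 accumulation steps, each optimizer step sees 8 examples, so a dataset of this size needs on the order of 95 steps per epoch.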
We have now finished training the model. Now we will test the model to check how well it generates cypher queries given a text.
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
prompt.format(
f"Convert text to cypher query based on this schema: \n{graph_schema}",
"What are the top 5 movies with a runtime greater than 120 minutes",
"",
)
], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
The Cypher query generated by the model matches the ground-truth Cypher query. Let us test with some more examples to see the performance of the fine-tuned Phi 3 Medium for Cypher query generation.
inputs = tokenizer(
[
prompt.format(
f"Convert text to cypher query based on this schema: \n{graph_schema}",
"Which 3 directors have the longest bios in the database?",
"",
)
], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
inputs = tokenizer(
[
prompt.format(
f"Convert text to cypher query based on this schema: \n{graph_schema}",
"List the genres that have movies with an imdbRating less than 4.0.",
"",
)
], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 128)
print(tokenizer.decode(outputs[0], skip_special_tokens = True))
In both of the examples above, the fine-tuned Phi 3 Medium model generated the correct Cypher query for the provided question. In the first example, Phi 3 Medium gave the right answer but took a slightly different approach. With this, we can say that fine-tuning Phi 3 Medium on the Cypher dataset has made it slightly more accurate at generating Cypher queries from text.
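Because the decoded output of model.generate() includes the prompt as well as the completion, the printed text contains the whole schema before the query. A minimal sketch for pulling out just the query is shown below; the helper name and the shortened decoded string are hypothetical, and it assumes the "### Response:" marker from the prompt template.

```python
RESPONSE_MARKER = "### Response:"

def extract_cypher(decoded: str) -> str:
    """Return the text after the last response marker, stripped of whitespace.

    If the marker is missing, the whole input is returned stripped.
    """
    _, _, tail = decoded.rpartition(RESPONSE_MARKER)
    return tail.strip()

# Hypothetical decoded output, shortened for illustration.
decoded = (
    "### Instruction:\nConvert text to cypher query based on this schema: ...\n"
    "### Input:\nList all genres.\n"
    "### Response:\nMATCH (g:Genre) RETURN g.name"
)

print(extract_cypher(decoded))
```

In the guide's code, this helper would be applied to the string returned by tokenizer.decode(outputs[0], skip_special_tokens = True) before sending the query to Neo4j.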
This guide has detailed the fine-tuning process of the Phi 3 Medium model for generating Cypher queries from natural language inputs, aimed at enhancing accessibility to Knowledge Graphs like Neo4j. Through leveraging tools like Unsloth for efficient model training and deploying techniques such as LoRA adapters to optimize parameter usage, developers can effectively translate complex data queries into structured Cypher commands.
A. Phi-3 Medium is a compact and powerful LLM, making it suitable for developers with limited resources. Fine-tuning allows it to specialize in Cypher query generation, improving accuracy and efficiency.
A. Unsloth is a framework specifically designed to optimize the fine-tuning process for large language models. It offers significant speed and memory usage improvements compared to traditional methods.
A. The guide uses a dataset containing pairs of natural language questions and their corresponding Cypher queries. This dataset helps the model learn the relationship between text and the structured query language.
A. The guide outlines steps for setting up the training environment, downloading the pre-trained model, and preparing the dataset. It then details how to fine-tune the model using Unsloth and a specific training configuration.
A. Once trained, the model can be used to generate Cypher queries from text. The guide provides an example of how to structure the input and decode the generated query.