With the rapid advances in large language models and the field of AI more broadly, we are seeing new ways to work with and interact with the digital world around us. In this blog, we delve into the fundamentals of LlamaIndex, a framework that helps build applications on top of LLMs.
This guide covers setting up a project environment, connecting to a database, and developing NL2SQL applications using LlamaIndex modules. It provides a roadmap for the integration and explores real-world use cases that bring together large language models and database management systems.
In this section, we will learn about the components and concepts used in the LlamaIndex library to build LLM-powered applications for RAG and NL2SQL use cases.
LlamaIndex provides many components that can be used in LLM applications. Large language models need to interact with user queries, vector stores, databases, and external data sources, and LlamaIndex supplies the plumbing for those interactions.
LlamaIndex applications are built around three main stages: loading (ingesting data from files, APIs, or databases), indexing (turning that data into queryable structures such as vector embeddings), and storing (persisting the index so it can be queried later without rebuilding it).
Using these stages, we can develop RAG or NL2SQL applications, and LlamaIndex functions and tools help build such agentic applications. In the next section, we will walk through a natural language to SQL application with a step-by-step guide.
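To make these stages concrete, here is a minimal, hedged sketch of a plain RAG pipeline in LlamaIndex; it assumes a local data/ folder with a few text files and an OpenAI API key in the environment, and the NL2SQL application later in this blog builds on the same ideas.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# loading: read raw files from a local folder into Document objects
documents = SimpleDirectoryReader("data").load_data()

# indexing and storing: embed the documents into an in-memory vector index
index = VectorStoreIndex.from_documents(documents)

# querying: ask a natural language question against the indexed data
query_engine = index.as_query_engine()
print(query_engine.query("What does this dataset describe?"))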
In this section, we will look at the architecture of a natural language to SQL application to understand how the large language model, the database, and the user query interact to convert a natural language question into a SQL query and return an answer to the user.
The diagram above shows how the user query interacts with the database and the GPT model to generate output from the database tables. Let's look at the step-by-step workflow of the application.
The process begins with the user uploading a CSV file containing tabular data. The application loads the CSV into a MySQL database using "CREATE TABLE ..." and "INSERT INTO ..." statements.
MySQL database processes and stores the uploaded CSV data. This database will serve as the source of data for the user’s queries.
After uploading the CSV, the user has the option to upload example queries. These queries enhance natural language understanding and improve the robustness of prompt generation.
Stored examples from the user are later used to generate prompts for the NL2SQL query engine.
The application prompts the user to select a table from the uploaded CSV data. This table will be the target of the subsequent queries. The user can also select multiple tables, depending on the features of the application.
The user formulates a query on the selected table using natural language.
The NL2SQL query engine takes the user's query and information about the selected table to create a prompt. The prompt includes the table's "CREATE TABLE ..." definition along with a few sample rows from the table.
The system submits the prompt to the GPT models, which generate SQL queries based on the provided prompt. For a more detailed prompt, we can add few-shot examples to steer the GPT models toward better SQL queries (a sketch of such a prompt appears right after this workflow).
The GPT models generate an SQL query based on the prompt. The SQL query retrieves data from the selected table.
The application submits the generated SQL query to the MySQL database for execution. The database processes the query and returns the result set.
The SQL database executes the query and obtains the result set (DB Result).
The system submits the obtained DB result back to the GPT-3.5/4 models. These models generate a descriptive answer based on the results of the executed SQL query.
The workflow completes when the system returns the descriptive answer to the user. The user receives an understandable and informative response based on their original natural language query.
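To make the prompting step concrete, here is a hedged sketch of what an NL2SQL prompt can look like. The table name, columns, sample rows, and wording are hypothetical, and the exact template LlamaIndex builds may differ.
# hypothetical NL2SQL prompt; the schema, sample rows, and phrasing are illustrative only
example_prompt = """You are an expert that translates questions into MySQL queries.

Table schema:
CREATE TABLE employees (
    id INTEGER,
    name VARCHAR(64),
    department VARCHAR(64),
    previous_experience_years INTEGER
);

Sample rows:
1, 'Asha', 'Finance', 3
2, 'Ravi', 'Sales', 0

Question: How many people have previous work experience?
SQLQuery:"""
print(example_prompt)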
Now that we have learned the workflow of NL2SQL applications, let's look at how to build one with hands-on code examples.
In this section, we will set up the environment by installing the important libraries for our project, followed by connecting to a MySQL database for the development of the application.
!pip install pymysql
!pip install llama_index
!pip install llama-index-callbacks-aim
!pip install openai
# pin SQLAlchemy below 2.0 last so the other installs do not override the version constraint
!pip install --force-reinstall 'sqlalchemy<2.0.0'
After installing the required libraries, we connect to the database from our environment using 'pymysql' and 'sqlalchemy'.
Before we connect to the database using sqlalchemy and pymysql, let's look at how to add tables to the database for our NL2SQL application. We will use demo examples for this blog; you can add any tables available to you into the database.
# import the libraries needed to add data into a SQL database
import re
import pandas as pd
import pymysql
from sqlalchemy import (create_engine, MetaData, Table, Column, String, Integer)

# db credentials
db_user = "your_username"
db_password = "your_password"
db_host = "your_host"
db_name = "your_database_name"

# Construct the connection string and create the engine
connection_string = f"mysql+pymysql://{db_user}:{db_password}@{db_host}/{db_name}"
engine = create_engine(connection_string)
metadata_obj = MetaData()

# Function to create a sanitized column name
def sanitize_column_name(col_name):
    # Remove special characters and replace spaces with underscores
    return re.sub(r"\W+", "_", col_name)

# Function to create a table from a DataFrame using SQLAlchemy
def create_table_from_dataframe(
    df: pd.DataFrame, table_name: str, engine, metadata_obj
):
    # Sanitize column names
    sanitized_columns = {col: sanitize_column_name(col) for col in df.columns}
    df = df.rename(columns=sanitized_columns)
    # Dynamically create columns based on DataFrame columns and data types
    # (String needs an explicit length for MySQL)
    columns = [
        Column(col, String(255) if dtype == "object" else Integer)
        for col, dtype in zip(df.columns, df.dtypes)
    ]
    # Create a table with the defined columns
    table = Table(table_name, metadata_obj, *columns)
    # Create the table in the database
    metadata_obj.create_all(engine)
    # Insert data from the DataFrame into the table inside a transaction
    # (engine.begin() commits automatically when the block exits)
    with engine.begin() as conn:
        for _, row in df.iterrows():
            insert_stmt = table.insert().values(**row.to_dict())
            conn.execute(insert_stmt)

# loop over your DataFrames (dfs is a list of pandas DataFrames loaded from the CSV files)
# and add each one to the SQL database under its own table name (any unique name works)
for idx, df in enumerate(dfs):
    create_table_from_dataframe(df, f"table_{idx}", engine, metadata_obj)
The code above defines a function to add a DataFrame to a SQL database such as MySQL or PostgreSQL. The function takes a DataFrame, a table name, a database engine, and a metadata object; it sanitizes the column names, creates the table with appropriate column types, and inserts the rows through the database engine. Once the table is added to the database, we can fetch it using regular SQL queries for our application development.
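The insertion loop above assumes dfs is a list of pandas DataFrames already loaded from the uploaded CSV files; one hedged way to build it (the file names are placeholders) is:
import pandas as pd

# hypothetical CSV paths; replace with the files uploaded by the user
csv_files = ["customers.csv", "orders.csv"]
dfs = [pd.read_csv(path) for path in csv_files]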
# import logging modules to configure logging parameters
import logging
import sys
import pymysql

logging.basicConfig(stream=sys.stdout, level=logging.INFO, force=True)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

from IPython.display import Markdown, display
# text construct from sqlalchemy for running raw SQL (create_engine was imported earlier)
from sqlalchemy import text

# Create an engine instance
engine = create_engine(connection_string)

# Test the connection using raw SQL
with engine.connect() as connection:
    result = connection.execute(text("show tables"))
    for row in result:
        print(row)
---------------------------------[Output]-------------------------------------
('Customers',)
('OrderDetails',)
('Orders',)
('info',)
('table_36370395d9314572970b325bee42817d',)
('table_59af3c5a5388482193f88093fc2eaa36',)
In the code above, we first connect to the MySQL database using the available database credentials. After constructing the connection string, we create an engine that connects to the database; once the connection is established, table details can be retrieved for querying.
from llama_index.core import SQLDatabase
# select tables to connect for application
tables = ['table_36370395d9314572970b325bee42817d', 'table_59af3c5a5388482193f88093fc2eaa36']
sql_database = SQLDatabase(engine, include_tables=tables, sample_rows_in_table_info=2)
sql_database
After connecting to the database, we use the 'SQLDatabase' class from the LlamaIndex core library to load the table schema of the selected tables. With the SQL database set up, we can proceed with NL2SQL application development.
We begin by setting our OpenAI API key and importing the OpenAI library so we can call GPT models for natural language to SQL conversion. OpenAI provides a wide range of LLMs for many tasks.
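As a quick sanity check, we can print the schema and sample rows that will be injected into the LLM prompt for each selected table; this assumes the get_single_table_info helper exposed by SQLDatabase in recent llama-index versions.
# print the schema plus sample rows for each selected table
# (get_single_table_info is assumed from recent llama-index versions)
for table_name in tables:
    print(sql_database.get_single_table_info(table_name))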
import os
import openai
# read the key from an environment variable, or paste it directly for quick experiments
openai.api_key = os.environ.get("OPENAI_API_KEY", "YOUR_OPENAI_KEY")
The code above configures the OpenAI GPT models for our application environment. Next, let's set up a token-counting callback that uses the model's tokenizer to track how many tokens our queries and responses consume.
import tiktoken
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler

# token counter built on the gpt-3.5-turbo tokenizer
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)

# callback manager that routes LLM and embedding events to the token counter
callback_manager = CallbackManager([token_counter])
The code above sets up a CallbackManager with a TokenCountingHandler that uses the tiktoken tokenizer for the "gpt-3.5-turbo" model. Once attached to the LlamaIndex settings, it counts the tokens consumed by every prompt, completion, and embedding call, which is useful for monitoring cost and staying within context limits.
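Once the callback manager is attached to the LlamaIndex settings (done in the next snippet), the handler's counters can be read back after queries run. The attribute names below are those exposed by TokenCountingHandler in recent llama-index versions.
# inspect token usage accumulated by the handler; reset_counts() clears the counters
print("prompt tokens:", token_counter.prompt_llm_token_count)
print("completion tokens:", token_counter.completion_llm_token_count)
print("total LLM tokens:", token_counter.total_llm_token_count)
print("embedding tokens:", token_counter.total_embedding_token_count)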
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# LLM instance using the OpenAI class, plus an OpenAI embedding model
# (gpt-3.5-turbo is a chat model, so a dedicated embedding model is used here)
Settings.llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
# attach the callback manager so the token counter sees every call
Settings.callback_manager = callback_manager
The code above wires OpenAI's GPT-3.5 model into the global LlamaIndex Settings: it sets the LLM used for query generation, the embedding model used for indexing and retrieval, and the callback manager with the token counter we initialized in the previous code.
# creating the SQL table node mapping
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import ObjectIndex, SQLTableNodeMapping, SQLTableSchema
import pandas as pd

# list all the tables from the database and create a table schema object for each,
# which is later used to build the prompt for the LLM
tables = list(sql_database._all_tables)
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = []
for table in tables:
    table_schema_objs.append(SQLTableSchema(table_name=table))
The code above retrieves the list of tables from the SQL database, creates a mapping between the SQL tables and LlamaIndex's internal node representation, and builds a SQLTableSchema object for each table. These schema objects are what the object index uses to decide which tables are relevant to a given query.
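Optionally, a short natural language description can be attached to each table so the retriever has more context to match against; SQLTableSchema accepts a context_str for this. The description below is a hypothetical example, and in practice each table would get its own.
# attach an optional description to each table schema to aid retrieval (text is illustrative)
table_schema_objs = [
    SQLTableSchema(
        table_name=table,
        context_str="Employee survey responses, one row per respondent.",
    )
    for table in tables
]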
from llama_index.core.indices.struct_store.sql_query import SQLTableRetrieverQueryEngine

# create an object index that stores the table schemas as vectors
obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    VectorStoreIndex,
)

# initialize the query engine with a retriever over the table schemas
query_engine = SQLTableRetrieverQueryEngine(
    sql_database,
    obj_index.as_retriever(similarity_top_k=3),
)

# ask a question to the query engine and store the result in a response variable
response = query_engine.query("How many people have previous work experience?")
The code above builds an object index over the table schemas, initializes a SQL table retriever query engine, and runs a natural language question against the database. Under the hood, the engine retrieves the most relevant table schemas, asks the LLM to generate a SQL query, executes that query against the database, and synthesizes the result set into a natural language answer that is stored in the response variable and can be returned to the user.
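To inspect what came back, we can print the synthesized answer and, assuming the engine stores the generated SQL under the sql_query metadata key as recent llama-index versions do, the SQL statement itself.
# the natural language answer synthesized by the LLM
print(response.response)
# the SQL statement that was generated and executed (metadata key assumed)
print(response.metadata.get("sql_query"))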
NL2SQL applications find utility across industries such as finance, banking, healthcare, and e-commerce. Analyzing user data and deriving insights to enable data-driven decision-making is at the core of every industry today, but one of the biggest obstacles to enabling data-driven insights across teams is that it is time-consuming and requires specialized skills. With NL2SQL applications, any business stakeholder can simply ask the database a question in natural language and get a relevant answer generated by a large language model. Let's explore how these applications are used in these industries.
We have learned about LlamaIndex and its workings, such as the storage, indexing, and query engine stages. We then went in depth into the NL2SQL workflow, with an explanation of each step needed to develop the application from scratch. We saw how to integrate large language models with tools such as LlamaIndex so that users can interact with a database using natural language queries without any technical skills. Finally, we implemented an NL2SQL application with hands-on code examples and discussed its applications across various use cases. Let's look at a few key takeaways from this blog.
We set up the project environment using libraries such as llama_index, pymysql, and sqlalchemy for application development.
Q. Does LlamaIndex support text-to-SQL?
A: LlamaIndex has inbuilt text-to-SQL functions and methods, such as the SQL table retriever query engine, to retrieve data from a database and generate a response for the user.
Q. How does LlamaIndex work?
A: LlamaIndex operates on three fundamental stages, loading, indexing, and storing, which enable the development of LLM-based applications for a wide range of uses.
Q. How do NL2SQL applications help businesses?
A: Natural language to SQL applications reduce the time and effort needed to generate actionable insights, since users only have to ask questions in natural language rather than write complex SQL queries.