Building Natural Language to SQL Applications using LlamaIndex

Avikumar Talaviya · Last Updated: 11 Apr, 2024 · 9 min read

Introduction

With the rapid advent of large language models and advances in AI, we are witnessing new developments and opportunities in the way we work and interact with the digital world around us. In this blog, we delve into the fundamentals of LlamaIndex, a groundbreaking framework that helps developers build applications on top of LLMs.

This article covers setting up a project environment, connecting to a database, and developing NL2SQL applications using LlamaIndex modules. It provides a comprehensive roadmap for integration and explores real-world use cases of this technology at the intersection of large language models and database management systems.

Learning Objectives

  • Understand the components and concepts of LlamaIndex, a technology for building applications with Large Language Models.
  • Learn the step-by-step workflow of developing Natural Language to SQL applications using LlamaIndex modules.
  • Gain knowledge on setting up environments, connecting to databases like MySQL, and creating SQL table mappings for NL2SQL applications.
  • Explore how to integrate OpenAI’s GPT models for encoding input queries and generating SQL queries from natural language prompts.
  • Discover real-world applications of NL2SQL in industries such as Finance & Banking, Healthcare, and E-commerce for automated reporting, customer insights, clinical data analysis, and more.

This article was published as a part of the Data Science Blogathon.

How to Get Started with LlamaIndex?

In this section, we will learn about the components and concepts used in the LlamaIndex library to build LLM-powered applications for RAG and NL2SQL use cases.


LlamaIndex provides many components that can be used in LLM applications. Large language models need to interact with user queries, vector stores, databases, and external data sources.

LlamaIndex applications typically move through three stages:

  • Loading Stage: Data is loaded from external API output, PDFs, or a database. Once loaded, the data is split into chunks for indexing and storage. These chunks are called nodes, the atomic unit of data in LlamaIndex.
  • Indexing Stage: Once the data is loaded, it is vectorized and each vector is stored in a vector store or vector index. An embedding model generates a numerical representation of each chunk, called an embedding. LlamaIndex has many components that help embed chunks into vector representations.
  • Querying Stage: Retrievers extract the most relevant vectors from the vector store based on the context of the query. Similarity search is one of the most important concepts in LLM-based applications and is key to retrieving data efficiently and accurately.

Using these stages, we can develop RAG or NL2SQL applications; LlamaIndex's functions and tools help build such agentic applications. The short sketch below shows the three stages in their simplest form, and in the next section we will walk through a natural language to SQL application step by step.
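To make the three stages concrete, here is a minimal sketch of a plain RAG pipeline with llama_index.core; the "data" folder and the sample question are placeholders, and an OpenAI API key is assumed to be set in the environment.

# Loading stage: read documents from a local folder (placeholder path)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()

# Indexing stage: chunk the documents into nodes, embed them, and store the
# vectors in an in-memory vector index
index = VectorStoreIndex.from_documents(documents)

# Querying stage: retrieve the most similar nodes and synthesize an answer
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query("What does the report say about revenue?")
print(response)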

Architecture of NL2SQL Application

In this section, we will look at the architecture of a natural language to SQL application to understand how the large language model, the database, and the user query interact with each other to convert a natural language query into a SQL query and return an answer to the user.

Architecture of NL2SQL Application

The diagram above shows how the user query interacts with the database and the GPT model to generate output from the database tables. Let's look at the step-by-step workflow of the application.

Step-by-Step Workflow of the Application

Step 1: User Uploads CSV

The process begins with the user uploading a CSV file, which contains data in tabular form. The CSV data is loaded into a MySQL database using "CREATE TABLE…" and "INSERT INTO…" statements.

Step 2: Upload to MySQL Database

The MySQL database processes and stores the uploaded CSV data. This database serves as the source of data for the user's queries.

Step 3: User Uploads Examples

After uploading the CSV, the user has the option to upload example queries. These queries enhance natural language understanding and improve the robustness of prompt generation.

Step 4: Upload Examples

Stored examples from the user are later used to generate prompts for the NL2SQL query engine.

Step 5: Select Table

The application prompts the user to select a table from the uploaded CSV data. This table will be the target of the subsequent queries. The user can also select multiple tables, depending on the features of the application.

Step 6: Ask Query

The user formulates a query on the selected table using natural language.

Step 7: NL2SQL Query Engine

The NL2SQL query engine takes the user's query and information about the selected table to create a prompt. The prompt is built from the table's "CREATE TABLE" statement along with a few sample rows from the table.

Step 8: Generate Prompt

The system submits the prompt to GPT models, which are natural language processing models capable of generating SQL queries based on the provided prompt. For a more detailed prompt, we can use few-shot prompting to steer the GPT models toward better SQL queries, as sketched below.
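To illustrate, here is a hypothetical sketch of what such a few-shot prompt might look like once assembled; the table name, columns, and example question/SQL pairs below are placeholders rather than part of the actual application.

# Hypothetical few-shot NL2SQL prompt; table, columns, and examples are placeholders
table_context = (
    "CREATE TABLE employees (id INTEGER, name VARCHAR(50), "
    "department VARCHAR(50), prior_experience_years INTEGER);\n"
    "/* 2 sample rows:\n"
    "1, 'Asha', 'Finance', 4\n"
    "2, 'Ravi', 'Sales', 0 */"
)

few_shot_examples = (
    "Question: How many employees work in Finance?\n"
    "SQL: SELECT COUNT(*) FROM employees WHERE department = 'Finance';\n\n"
    "Question: List employees with no prior experience.\n"
    "SQL: SELECT name FROM employees WHERE prior_experience_years = 0;\n"
)

prompt = (
    "Given the table schema and the examples below, write a SQL query that "
    "answers the user's question.\n\n"
    f"{table_context}\n\n{few_shot_examples}\n"
    "Question: How many people have previous work experience?\nSQL:"
)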

Step 9: GPT Models Generate SQL

The GPT models generate an SQL query based on the prompt. The SQL query retrieves data from the selected table.

Step 10: SQL Database

The application submits the generated SQL query to the MySQL database for execution. The database processes the query and returns the result set.

Step 11: Execute Query

The SQL database executes the query and obtains the result set (DB Result).

Step 12: Generate Descriptive Answer

The system submits the obtained DB result back to the GPT 3.5/4 models. These models generate a descriptive answer based on the results of the executed SQL query.

Step 13: GPT Models Return Answer

The workflow completes when the system returns the descriptive answer to the user. The user receives an understandable and informative response based on their original natural language query.

Now that we have learned the workflow of an NL2SQL application, let's look at how to build one with hands-on code examples.

Setting up Environment and Connecting to the Database

In this section, we will set up the environment by installing the important libraries for our project, followed by connecting to a MySQL database for application development.

Installing Libraries

!pip install pymysql
!pip install llama_index
!pip install --force-reinstall 'sqlalchemy<2.0.0'
!pip install llama-index-callbacks-aim
!pip install openai
!pip install sqlalchemy

After installing the required libraries, we connect the database to our environment using ‘pymysql’ and ‘sqlalchemy’.

Database Connection

Before we connect to the database using sqlalchemy and pymysql, let's look at how to add tables to the database for our NL2SQL application. We will use demo tables for this blog; you can use any tables available to you.

# import necessary libraries to add data into the SQL database
from sqlalchemy import (create_engine, MetaData, Table, Column, String, Integer)
import re
import pymysql
import pandas as pd  # needed for the DataFrame type hint and row insertion below

# db credentials 
db_user = "your_username"
db_password = "your_password"
db_host = "your_host"
db_name = "your_database_name"

# Construct the connection string
connection_string = f"mysql+pymysql://{db_user}:{db_password}@{db_host}/{db_name}"

engine = create_engine(connection_string)
metadata_obj = MetaData()

# Function to create a sanitized column name
def sanitize_column_name(col_name):
    # Remove special characters and replace spaces with underscores
    return re.sub(r"\W+", "_", col_name)

# Function to create a table from a DataFrame using SQLAlchemy
def create_table_from_dataframe(
    df: pd.DataFrame, table_name: str, engine, metadata_obj
):
    # Sanitize column names
    sanitized_columns = {col: sanitize_column_name(col) for col in df.columns}
    df = df.rename(columns=sanitized_columns)

    # Dynamically create columns based on DataFrame columns and data types
    columns = [
        Column(col, String if dtype == "object" else Integer)
        for col, dtype in zip(df.columns, df.dtypes)
    ]

    # Create a table with the defined columns
    table = Table(table_name, metadata_obj, *columns)

    # Create the table in the database
    metadata_obj.create_all(engine)

    # Insert data from the DataFrame into the table inside a transaction
    # (engine.begin() commits automatically and works on SQLAlchemy 1.4 and 2.0)
    with engine.begin() as conn:
        for _, row in df.iterrows():
            insert_stmt = table.insert().values(**row.to_dict())
            conn.execute(insert_stmt)

# loop over your dataframes (dfs is a list of pandas DataFrames loaded from the CSVs)
# and add each one to the SQL database under a generated table name
for idx, df in enumerate(dfs):
    table_name = f"table_{idx}"
    create_table_from_dataframe(df, table_name, engine, metadata_obj)

The code above illustrates a function that adds a DataFrame to a SQL database such as MySQL or PostgreSQL. It takes a DataFrame, table name, database engine, and metadata object, sanitizes the column names, creates the table, and inserts the rows with the appropriate column names and data types. Once a table is added to the database, we can fetch it using regular SQL queries for our application development.
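As a quick sanity check, we can read one of the newly created tables back with a plain SELECT; table_name below refers to the last table created in the loop above, but any table name will do.

# Preview the first few rows of the table we just inserted
with engine.connect() as conn:
    preview = pd.read_sql(f"SELECT * FROM {table_name} LIMIT 5", conn)
print(preview)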

# import logging modules to configure logging parameters
import logging
import sys
import pymysql

logging.basicConfig(stream=sys.stdout, level=logging.INFO, force=True)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
from IPython.display import Markdown, display


# import the text construct from sqlalchemy to run raw SQL
from sqlalchemy import text


# Create an engine instance
engine = create_engine(connection_string)

# Test the connection using raw SQL
with engine.connect() as connection:
    result = connection.execute(text("show tables"))
    for row in result:
        print(row)

---------------------------------[Output]-------------------------------------
('Customers',)
('OrderDetails',)
('Orders',)
('info',)
('table_36370395d9314572970b325bee42817d',)
('table_59af3c5a5388482193f88093fc2eaa36',)

In the code above, the first step is connecting to the MySQL database using the available credentials. After constructing the connection string, we create an engine that connects to the database. Once the connection is established, table details can be retrieved for querying.

LlamaIndex SQL Database Instance

from llama_index.core import SQLDatabase

# select tables to connect for application
tables = ['table_36370395d9314572970b325bee42817d', 'table_59af3c5a5388482193f88093fc2eaa36']
sql_database = SQLDatabase(engine, include_tables=tables, sample_rows_in_table_info=2)
sql_database

After connecting to the database, we use the 'SQLDatabase' class from the LlamaIndex core library to retrieve the table schemas of the selected tables. With the SQL database set up, we can proceed with NL2SQL application development.
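Before building the query engine, it can be useful to inspect the schema and sample rows that the LLM will actually see for each table; the two helper methods below come from LlamaIndex's SQLDatabase wrapper, so verify them against your installed version.

# Print the table context (schema plus sample rows) that will go into prompts
for table_name in sql_database.get_usable_table_names():
    print(sql_database.get_single_table_info(table_name))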

Developing the NL2SQL Application Using LlamaIndex Modules

We begin by setting up the OpenAI API key and importing the OpenAI library to call GPT models for natural language to SQL conversion. OpenAI provides a wide range of LLMs for many tasks.

import os
import openai

openai.api_key = "YOUR_OPENAI_KEY"

The code above configures the OpenAI GPT models for our application environment. Next, let's set up a token-counting callback that uses the model's tokenizer to track how many tokens our queries consume.

import tiktoken
from llama_index.core.callbacks import CallbackManager, TokenCountingHandler
token_counter = TokenCountingHandler(
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)

# callback manager that runs the token counter on every LLM call
callback_manager = CallbackManager([token_counter])

The code above sets up a callback mechanism using a CallbackManager instance. One of the callbacks is a TokenCountingHandler, which uses the tiktoken tokenizer for the "gpt-3.5-turbo" model to count the tokens in prompts and completions, making it easy to monitor usage and cost.
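Once queries start flowing through the application, the handler exposes running totals that can be printed to monitor OpenAI usage; the attribute names below are the ones documented for TokenCountingHandler, so check them against your llama_index version.

# Inspect token usage accumulated by the callback
print("Embedding tokens: ", token_counter.total_embedding_token_count)
print("Prompt tokens:    ", token_counter.prompt_llm_token_count)
print("Completion tokens:", token_counter.completion_llm_token_count)
print("Total LLM tokens: ", token_counter.total_llm_token_count)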

Setting up the OpenAI LLM and Embedding Model

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# LLM and embedding model instances configured via the global Settings object
Settings.llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
Settings.callback_manager = callback_manager

The code above initializes the components necessary for using OpenAI's GPT-3.5 model with specific settings. It creates an OpenAI LLM instance and an embedding model instance, and registers them, along with the callback manager we initialized earlier, on LlamaIndex's global Settings object so that downstream modules pick them up automatically.

Creating SQL Table Node Mapping

#creating SQL table node mapping 
from llama_index.core import VectorStoreIndex
from llama_index.core.objects import ObjectIndex, SQLTableNodeMapping, SQLTableSchema
import pandas as pd

# list all the tables from the database and create a table schema object for each,
# which will later be used to build the prompt for the LLM
tables = list(sql_database._all_tables)
table_node_mapping = SQLTableNodeMapping(sql_database)
table_schema_objs = []
for table in tables:
    table_schema_objs.append(SQLTableSchema(table_name=table))

The code above retrieves the table names from the SQL database, creates a mapping between the SQL tables and LlamaIndex's internal node representation, and generates a table schema object for each table. These schema objects are what the object index (shown next) stores and retrieves over.
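If table names alone are not descriptive enough for good retrieval, SQLTableSchema also accepts a context_str holding a short natural language description of the table; the descriptions below are placeholders you would replace with ones that fit your own data.

# Optional: attach short descriptions to each table to help the retriever
table_descriptions = {
    "table_36370395d9314572970b325bee42817d": "Employee records, including prior work experience.",
    "table_59af3c5a5388482193f88093fc2eaa36": "Order-level sales transactions.",
}
table_schema_objs = [
    SQLTableSchema(table_name=t, context_str=table_descriptions.get(t, ""))
    for t in tables
]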

Initializing SQL Table Query Retrieval Engine 

from llama_index.core.indices.struct_store.sql_query import SQLTableRetrieverQueryEngine

# create a object index to store vectors 
obj_index = ObjectIndex.from_objects(
    table_schema_objs,
    table_node_mapping,
    VectorStoreIndex
)

# initializing query engine 
query_engine = SQLTableRetrieverQueryEngine(
    sql_database, obj_index.as_retriever(similarity_top_k=3), 
)

# Ask query to query engine and store it in a response variable
response = query_engine.query("How many people have previous work experience?")

The code above sets up an object index over the table schemas, initializes a retriever-backed query engine for the SQL database, and then executes a natural language query. Under the hood, the engine retrieves the most relevant table schemas, prompts the LLM to generate a SQL query, runs it against the database, and synthesizes a natural language answer, which is stored in the response variable and can be returned to the user.
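To see exactly what happened under the hood, we can print both the synthesized answer and the SQL that the model generated; the "sql_query" metadata key is the one LlamaIndex's SQL query engines typically populate, so confirm it on your installed version.

# Inspect the natural language answer and the generated SQL
print(response.response)
print(response.metadata.get("sql_query"))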

Use Cases Across Industries

NL2SQL applications find utility across industries such as finance, banking, healthcare, and e-commerce. Analyzing user data and deriving insights to enable data-driven decision making is at the core of every industry today. One of the biggest obstacles to enabling data-driven insights across teams is that it is time consuming and requires specialized skills. With NL2SQL applications, any business stakeholder can simply ask the database a question in natural language and get a relevant answer powered by large language models. Let's explore how these applications are used in these industries.

  • Finance and Banking:
    • Automated Reporting: Using a natural language to SQL application, we can extract key insights from the database and prepare automated reports, saving time and cost for the business.
    • Customer Insights: NL2SQL applications make it easier for business stakeholders to find customer insights for tactical and managerial decision making.
  • Healthcare:
    • Clinical Data Analysis: Clinics and hospitals possess large databases of patient health data. NL2SQL applications can help doctors analyze this data quickly and improve the speed of diagnosis.

Conclusion

We have learned about LlamaIndex and its workings, including the loading, indexing, and querying stages. We then took an in-depth look at the NL2SQL workflow, with an explanation of each step needed to develop the application from scratch. We saw how to integrate large language models with tools such as LlamaIndex so that users can interact with a database using natural language queries without any technical skills. Finally, we implemented an NL2SQL application with hands-on code examples and explored its applications across various use cases. Here are a few key takeaways from this blog.

Key Takeaways

  • Explored the three vital stages: Loading, Indexing, and Querying, essential for building LLM-based applications.
  • Gained insights into how to seamlessly integrate OpenAI’s GPT models for efficient natural language processing and SQL query generation.
  • Learned the practical aspects of setting up environments and connecting to databases like MySQL.
  • Installed and utilized essential libraries such as llama_index, pymysql, and sqlalchemy for application development.

Frequently Asked Questions

Q1: How does LlamaIndex help to develop Text to SQL applications?

A: LlamaIndex has built-in Text-to-SQL functions and modules, such as the SQL table retriever query engine, which retrieve data from the database and generate a response for the user.

Q2: What are the 3 main stages of LlamaIndex library?

A: LlamaIndex operates in three fundamental stages: Loading, Indexing, and Querying, enabling the development of LLM-based applications for a wide range of uses.

Q3: How can NL2SQL applications help businesses make faster decisions?

A: Natural language to SQL applications reduce the time and effort needed to generate actionable insights by letting users query in natural language rather than writing complex SQL queries.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

I specialize in data science and machine learning, with hands-on experience working on various end-to-end data science projects. I am the chapter co-lead of the Mumbai local chapter of Omdena. I am also a Kaggle Master and an educator ambassador at Streamlit, working with volunteers around the world.
