Pandasql -The Best Way to Run SQL Queries in Python

nilabh Last Updated : 14 Oct, 2024

7 min read

Introduction

Pandas have evolved remarkably in data handling, yet some still swear by SQL’s magic. Good news! With Pandassql, you can use SQL-like tricks right in Python, especially in Jupyter Notebooks. Picture querying pandas DataFrames using just SQL syntax. Ready for the adventure of blending SQL with Pandas? And guess what? No SQL servers needed! 😎

This article was published as a part of the Data Science Blogathon.

What is Pandasql?
How to Install Panadsql?
How to Use PandaSQL?
Why Should One Use SQL in Pandas?
- Basics
- Syntax
Import libraries and Data
Read Data using SQL Query
Join (or merge) DataFrames using SQL Queries
GROUP BY using SQL
UNION ALL to club multiple variables in SQL
Nested Queries of SQL
Conclusion
Frequently Asked Questions

What is Pandasql?

The saviour is python’s library, pandasql.

As the libraries’ documentation mentions:

pandasql allows you to query pandas DataFrames using SQL syntax. It works similarly to sqldf in R. pandasql seeks to provide a more familiar way of manipulating and cleaning data for people new to Python or pandas.

How to Install Panadsql?

You need to install the Python’s Library, pandasql first. It’s very simple to install. Use any of the below two methods, both use PIP installation.

Open the terminal and run
pip install -U pandasql
Open your Jupyter Notebook and in any cell run
!pip install -U pandasql

How to Use PandaSQL?

1.Install Pandas and pandasql:

pip install pandas pandasql

2. Import Libraries:

import pandas as pd
from pandasql import sqldf

3. Create DataFrame:

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 22],
                   'City': ['New York', 'SF', 'LA']})

4. Use SQL-like Queries:

pysql = sqldf("SELECT * FROM df WHERE Age > 25")

5. Print or Use Resulting DataFrame:

print(pysql)

Why Should One Use SQL in Pandas?

Using SQL in Pandas can be advantageous for various data manipulation and analysis tasks, especially when you are already familiar with SQL or when your data processing requirements align with SQL’s capabilities. Let’s explore some examples to illustrate why you might want to use SQL in Pandas:

Example 1: Filtering Data

Suppose you have a Pandas DataFrame containing sales data, and you want to filter it to select only the rows where the sales amount is greater than $1,000:

pythonCopy codeimport pandas as pd
from pandasql import sqldf

# Create a sample DataFrame
data = {'Product': ['A', 'B', 'C', 'D'],
        'Sales': [800, 1200, 950, 1400]}
df = pd.DataFrame(data)

# Using SQL in Pandas
pysqldf = sqldf("SELECT * FROM df WHERE Sales > 1000")

print(pysqldf)

In this example, using SQL-like syntax makes the filtering condition clear and concise.

Example 2: Aggregation

Suppose you have a dataset of customer orders, and you want to calculate the total sales amount for each customer using Pandas:

pythonCopy codeimport pandas as pd
from pandasql import sqldf

# Create a sample DataFrame
data = {'Customer': ['Alice', 'Bob', 'Alice', 'David'],
        'OrderAmount': [100, 150, 200, 120]}
df = pd.DataFrame(data)

# Using SQL in Pandas to calculate total sales per customer
pysqldf = sqldf("SELECT Customer, SUM(OrderAmount) AS TotalSales FROM df GROUP BY Customer")

print(pysqldf)

SQL-like syntax simplifies the aggregation task, allowing you to calculate sums, averages, and other aggregations easily.

Example 3: Joining DataFrames

Suppose you have two DataFrames, one containing customer information and the other containing order information. You want to perform an inner join to combine them based on a common column, such as customer ID:

pythonCopy codeimport pandas as pd
from pandasql import sqldf

# Create sample DataFrames
customers = pd.DataFrame({'CustomerID': [1, 2, 3],
                          'CustomerName': ['Alice', 'Bob', 'Charlie']})

orders = pd.DataFrame({'OrderID': [101, 102, 103],
                       'CustomerID': [1, 2, 1],
                       'OrderAmount': [100, 150, 200]})

# Using SQL in Pandas to perform an inner join
pysqldf = sqldf("SELECT c.CustomerName, o.OrderAmount FROM customers c INNER JOIN orders o ON c.CustomerID = o.CustomerID")

print(pysqldf)

SQL-like syntax provides a clear and familiar way to express join operations.

Example 4: Complex Filtering and Subsetting

In scenarios where you need to apply complex filtering conditions or subsetting operations, SQL can simplify the task. For example, you can use SQL’s WHERE clause with logical operators:

pythonCopy codeimport pandas as pd
from pandasql import sqldf

# Create a sample DataFrame
data = {'Product': ['A', 'B', 'C', 'D'],
        'Sales': [800, 1200, 950, 1400],
        'Region': ['North', 'South', 'North', 'West']}
df = pd.DataFrame(data)

# Using SQL in Pandas for complex filtering
pysqldf = sqldf("SELECT * FROM df WHERE Sales > 1000 AND Region = 'North'")

print(pysqldf)

This approach is concise and readable when dealing with intricate conditions.

Basics

There is this one function that is used the most from this library. Its the main function sqldf. sqldf takes two parameters.

A SQL query in string format
A set of session/environment variables (globals() or locals())
It becomes tedious to specify globals() or locals(), hence whenever you import the library, run the following helper function along with. This will make things simple going forward.

from pandasql import sqldf
mysql = lambda q: sqldf(q, globals())

Syntax

Many variants of SQL are in use, and their syntaxes vary a little. Here pandasql uses the SQLite syntax. Most of the standard SQL language SQLite understands. However, it adds a few features of its own while at the same time it does omit some features. Click Here to read the document that attempts to describe what parts of the SQL language SQLite do and does not support.

pandasql automatically detects any pandas DataFrame. You can call them or query them by their name like you would have done with a SQL table.

We are going to use any one of these two basic code samples.

from pandasql import sqldf

from pandasql import sqldf

Import libraries and Data

For this article, we will use the data from the pandasql library itself. Let us import the dependencies and the data.

import pandas as pd
from pandasql import sqldf
from pandasql import load_meat, load_births          # Importing Data
# Bring data in Python environment as pandas DataFrame
meat = load_meat()
births = load_births()
print(meat.head())
print(births.head())

Read Data using SQL Query

We will read the first 5 rows of data, for the meat and births data frames using SQL. The result shall be similar to what we get from using .head()

In [4]:

# specify globals() or locals() using the following helper function
mysql = lambda q: sqldf(q, globals())
mysql("SELECT * FROM meat LIMIT 5;")

Out[4]:

	date	beef	veal	pork	lamb_and_mutton	broilers	other_chicken	turkey
0	1944-01-01 00:00:00.000000	751.0	85.0	1280.0	89.0	None	None	None
1	1944-02-01 00:00:00.000000	713.0	77.0	1169.0	72.0	None	None	None
2	1944-03-01 00:00:00.000000	741.0	90.0	1128.0	75.0	None	None	None
3	1944-04-01 00:00:00.000000	650.0	89.0	978.0	66.0	None	None	None
4	1944-05-01 00:00:00.000000	681.0	106.0	1029.0	78.0	None	None	None

In [5]:

mysql("SELECT * FROM births LIMIT 5;")

Out[5]:

	date	births
0	1975-01-01 00:00:00.000000	265775
1	1975-02-01 00:00:00.000000	241045
2	1975-03-01 00:00:00.000000	268849
3	1975-04-01 00:00:00.000000	247455
4	1975-05-01 00:00:00.000000	254545

Join (or merge) DataFrames using SQL Queries

Joining tables is one of the most common tasks being performed by SQL. Understandably so, as the relational databases have data segregated in separate tables. Hence, SQL users are pretty used to using join() tables in SQL. We can use the power of SQL JOIN here with pandas DataFrame.

In [6]:

query = '''
SELECT m.date, m.beef, m.veal, m.pork, b.births
FROM meat AS m
INNER JOIN
births AS b
ON m.date = b.date;
'''

mysql(query)

Out[6]:

	date	beef	veal	pork	births
0	1975-01-01 00:00:00.000000	2106.0	59.0	1114.0	265775
1	1975-02-01 00:00:00.000000	1845.0	50.0	954.0	241045
2	1975-03-01 00:00:00.000000	1891.0	57.0	976.0	268849
3	1975-04-01 00:00:00.000000	1895.0	60.0	1100.0	247455
4	1975-05-01 00:00:00.000000	1849.0	59.0	934.0	254545
…	…	…	…	…	…
402	2012-07-01 00:00:00.000000	2200.8	9.5	1721.8	368450
403	2012-08-01 00:00:00.000000	2367.5	10.1	1997.9	359554
404	2012-09-01 00:00:00.000000	2016.0	8.8	1911.0	361922
405	2012-10-01 00:00:00.000000	2343.7	10.3	2210.4	347625
406	2012-11-01 00:00:00.000000	2206.6	10.1	2078.7	320195

407 rows × 5 columns

GROUP BY using SQL

The data of meat production is month-wise. We want to see the beef production per year. For that we need to groupby() and aggregate. We can do this using the SQL GROUP BY function.

In [7]:

query = '''SELECT
            strftime('%Y', date) as year
           , SUM(beef) as beef_total
           FROM
              meat
           GROUP BY
              year
              LIMIT 5;
    '''

mysql(query)

Out[7]:

	year	beef_total
0	1944	8801.0
1	1945	9936.0
2	1946	9010.0
3	1947	10096.0
4	1948	8766.0

In the above code, we used SQL query to limit the number of rows for the grouped and aggregated table to 5 rows. But the output and the input, both are not SQL tables. They are pandas DataFrames. And this gives us the liberty to use Pandas functions and methods on the same.

Let us do the same operation, and this time the output shall be the first 10 rows. But the SQL query will give a full table and we will use pandas head() function to get the final output truncated to 10 rows.

In [8]:

query = '''SELECT
            strftime('%Y', date) as year
           , SUM(beef) as beef_total
           FROM
              meat
           GROUP BY
              year;
    '''

mysql(query).head(10)

Out[8]:

	year	beef_total
0	1944	8801.0
1	1945	9936.0
2	1946	9010.0
3	1947	10096.0
4	1948	8766.0
5	1949	9142.0
6	1950	9248.0
7	1951	8549.0
8	1952	9337.0
9	1953	12055.0

UNION ALL to club multiple variables in SQL

We have beef, pork, and veal as meat types, in separate columns. Here we want all the production values in one column and the identifier in another column. We can use UNION ALL function from SQL to achieve this easily.

In [9]:

#executing union all statements
query = """
        SELECT
            date
            , 'beef' AS meat_type
            , beef AS value
        FROM meat
        UNION ALL
        SELECT
            date
            , 'veal' AS meat_type
            , veal AS value
        FROM meat
        UNION ALL
        SELECT
            date
            , 'pork' AS meat_type
            , pork AS value
        FROM meat
        UNION ALL
        SELECT
            date
            , 'lamb_and_mutton' AS meat_type
            , lamb_and_mutton AS value
        FROM meat
        ORDER BY 1
    """

mysql(query).head(10)

Out[9]:

	date	meat_type	value
0	1944-01-01 00:00:00.000000	beef	751.0
1	1944-01-01 00:00:00.000000	veal	85.0
2	1944-01-01 00:00:00.000000	pork	1280.0
3	1944-01-01 00:00:00.000000	lamb_and_mutton	89.0
4	1944-02-01 00:00:00.000000	beef	713.0
5	1944-02-01 00:00:00.000000	veal	77.0
6	1944-02-01 00:00:00.000000	pork	1169.0
7	1944-02-01 00:00:00.000000	lamb_and_mutton	72.0
8	1944-03-01 00:00:00.000000	beef	741.0
9	1944-03-01 00:00:00.000000	veal	90.0

Nested Queries of SQL

In SQL, writing queries within another query is commonplace. The same kind of nesting of queries is possible here as well. We will create one table (or say DataFrame) and without assigning it any variable (or name), we will use that to create another table.

In [10]:

# use queries within queries
query = """
    SELECT
        m1.date
        , m1.beef
    FROM
        meat m1
    WHERE m1.date IN
        (SELECT
            date
        FROM meat
        WHERE
            beef >= broilers
        ORDER BY date)
"""

mysql(query)

Out[10]:

	date	beef
0	1960-01-01 00:00:00.000000	1196.0
1	1960-02-01 00:00:00.000000	1089.0
2	1960-03-01 00:00:00.000000	1201.0
3	1960-04-01 00:00:00.000000	1066.0
4	1960-05-01 00:00:00.000000	1202.0
…	…	…
416	1995-08-01 00:00:00.000000	2316.0
417	1995-09-01 00:00:00.000000	2220.0
418	1995-11-01 00:00:00.000000	2098.0
419	1996-05-01 00:00:00.000000	2302.0
420	1996-06-01 00:00:00.000000	2186.0

421 rows × 2 columns

Conclusion

In this article, we saw that how easily we can use SQL queries to operate upon the DataFrames. This gives us a unique opportunity. This weapon can be a potent one in any Data Scientist’s arsenal, who knows SQL and Python, both.

They both are powerful languages and have their respective strengths and weaknesses. Using the method shown in this article, or in other words, using the pandasql library and sqldf function, we can use the best and most efficient method to manipulate data, well within the python environment and even Jupyter Notebook. This is music to my ears. I hope you enjoyed the song too 🤓.

In this article, you saw how to use SQL queries inside python. But if you want to connect the two most powerful workhorses of the Data Science world, SQL and Python. This is not the end, but only the first step towards getting the “Best of Both Worlds”.

Resources

Now you can start using Python to work upon your data which rests in SQL Databases. In able to connect to your SQL databases, go thru my article How to Access & Use SQL Database with pyodbc in Python. Once you brought it as DataFrame, then all the operations are usual Pandas operations or SQL queries being operated on Pandas DataFrame as you saw in this article.

Apart from the function of SQL shown in this article, many other popular SQL functions are easily implementable in Python. Read 15 Pandas functions to replicate basic SQL Queries in Python for learning how to do that.

Frequently Asked Questions

Q1. What is PandaSQL?

A. PandaSQL is a Python package that allows SQL-like queries to be performed on pandas DataFrames, enabling seamless data manipulation within the Python environment.

Q2. Is Pandasql faster than pandas?

A. Pandasql might be faster for specific operations that leverage SQL’s optimization, but the speed difference depends on the task and dataset size.

Q3. How to install PandaSQL package in Python?

A. Install Pandasql using pip: pip install pandasql. Then, import it in your Python script or Jupyter Notebook to use SQL-like queries with pandas DataFrames.

Q4. Is pandas better than SQL?

A. Pandas and SQL serve different purposes. Pandas are great for data analysis and manipulation within Python, while SQL is essential for efficiently managing databases and querying large datasets. Both have their strengths.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

nilabh

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Intoduction to Python

Variables and data types

OOPs Concepts

Conditional statement

Looping Constructs

Data Structures

String Manipulation

Functions

Modules, Packages and Standard Libraries

Python Libraries for Data Science

Reading Data Files in Python

Preprocessing, Subsetting and Modifying Pandas Dataframes

Sorting and Aggregating Data in Pandas

Visualizing Patterns and Trends in Data

Programming

Pandasql -The Best Way to Run SQL Queries in Python

Introduction

Table of contents

What is Pandasql?

How to Install Panadsql?

How to Use PandaSQL?

Why Should One Use SQL in Pandas?

Basics

Syntax

Import libraries and Data

Read Data using SQL Query

Join (or merge) DataFrames using SQL Queries

GROUP BY using SQL

UNION ALL to club multiple variables in SQL

Nested Queries of SQL

Conclusion

Resources

Frequently Asked Questions

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt