Pandas have evolved remarkably in data handling, yet some still swear by SQL’s magic. Good news! With Pandassql, you can use SQL-like tricks right in Python, especially in Jupyter Notebooks. Picture querying pandas DataFrames using just SQL syntax. Ready for the adventure of blending SQL with Pandas? And guess what? No SQL servers needed! 😎
This article was published as a part of the Data Science Blogathon.
The saviour is python’s library, pandasql
.
As the libraries’ documentation mentions:
pandasql
allows you to query pandas DataFrames
using SQL
syntax. It works similarly to sqldf
in R
. pandasql
seeks to provide a more familiar way of manipulating and cleaning data for people new to Python or pandas.
You need to install the Python’s Library, pandasql
first. It’s very simple to install. Use any of the below two methods, both use PIP
installation.
pip install -U pandasql
!pip install -U pandasql
1.Install Pandas and pandasql:
pip install pandas pandasql
2. Import Libraries:
import pandas as pd
from pandasql import sqldf
3. Create DataFrame:
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22],
'City': ['New York', 'SF', 'LA']})
4. Use SQL-like Queries:
pysql = sqldf("SELECT * FROM df WHERE Age > 25")
5. Print or Use Resulting DataFrame:
print(pysql)
Using SQL in Pandas can be advantageous for various data manipulation and analysis tasks, especially when you are already familiar with SQL or when your data processing requirements align with SQL’s capabilities. Let’s explore some examples to illustrate why you might want to use SQL in Pandas:
Example 1: Filtering Data
Suppose you have a Pandas DataFrame containing sales data, and you want to filter it to select only the rows where the sales amount is greater than $1,000:
pythonCopy codeimport pandas as pd
from pandasql import sqldf
# Create a sample DataFrame
data = {'Product': ['A', 'B', 'C', 'D'],
'Sales': [800, 1200, 950, 1400]}
df = pd.DataFrame(data)
# Using SQL in Pandas
pysqldf = sqldf("SELECT * FROM df WHERE Sales > 1000")
print(pysqldf)
In this example, using SQL-like syntax makes the filtering condition clear and concise.
Example 2: Aggregation
Suppose you have a dataset of customer orders, and you want to calculate the total sales amount for each customer using Pandas:
pythonCopy codeimport pandas as pd
from pandasql import sqldf
# Create a sample DataFrame
data = {'Customer': ['Alice', 'Bob', 'Alice', 'David'],
'OrderAmount': [100, 150, 200, 120]}
df = pd.DataFrame(data)
# Using SQL in Pandas to calculate total sales per customer
pysqldf = sqldf("SELECT Customer, SUM(OrderAmount) AS TotalSales FROM df GROUP BY Customer")
print(pysqldf)
SQL-like syntax simplifies the aggregation task, allowing you to calculate sums, averages, and other aggregations easily.
Example 3: Joining DataFrames
Suppose you have two DataFrames, one containing customer information and the other containing order information. You want to perform an inner join to combine them based on a common column, such as customer ID:
pythonCopy codeimport pandas as pd
from pandasql import sqldf
# Create sample DataFrames
customers = pd.DataFrame({'CustomerID': [1, 2, 3],
'CustomerName': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'OrderID': [101, 102, 103],
'CustomerID': [1, 2, 1],
'OrderAmount': [100, 150, 200]})
# Using SQL in Pandas to perform an inner join
pysqldf = sqldf("SELECT c.CustomerName, o.OrderAmount FROM customers c INNER JOIN orders o ON c.CustomerID = o.CustomerID")
print(pysqldf)
SQL-like syntax provides a clear and familiar way to express join operations.
Example 4: Complex Filtering and Subsetting
In scenarios where you need to apply complex filtering conditions or subsetting operations, SQL can simplify the task. For example, you can use SQL’s WHERE
clause with logical operators:
pythonCopy codeimport pandas as pd
from pandasql import sqldf
# Create a sample DataFrame
data = {'Product': ['A', 'B', 'C', 'D'],
'Sales': [800, 1200, 950, 1400],
'Region': ['North', 'South', 'North', 'West']}
df = pd.DataFrame(data)
# Using SQL in Pandas for complex filtering
pysqldf = sqldf("SELECT * FROM df WHERE Sales > 1000 AND Region = 'North'")
print(pysqldf)
This approach is concise and readable when dealing with intricate conditions.
There is this one function that is used the most from this library. Its the main function sqldf. sqldf takes two parameters.
A SQL query in string format
A set of session/environment variables (globals() or locals())
It becomes tedious to specify globals() or locals(), hence whenever you import the library, run the following helper function along with. This will make things simple going forward.
from pandasql import sqldf
mysql = lambda q: sqldf(q, globals())
Many variants of SQL are in use, and their syntaxes vary a little. Here pandasql uses the SQLite syntax. Most of the standard SQL language SQLite understands. However, it adds a few features of its own while at the same time it does omit some features. Click Here to read the document that attempts to describe what parts of the SQL language SQLite do and does not support.
pandasql
automatically detects any pandas DataFrame. You can call them or query them by their name like you would have done with a SQL table.
We are going to use any one of these two basic code samples.
from pandasql import sqldf
or
from pandasql import sqldf
For this article, we will use the data from the pandasql
library itself. Let us import the dependencies and the data.
import pandas as pd
from pandasql import sqldf
from pandasql import load_meat, load_births # Importing Data
# Bring data in Python environment as pandas DataFrame
meat = load_meat()
births = load_births()
print(meat.head())
print(births.head())
We will read the first 5 rows of data, for the meat and births data frames using SQL. The result shall be similar to what we get from using .head()
In [4]:
# specify globals() or locals() using the following helper function
mysql = lambda q: sqldf(q, globals())
mysql("SELECT * FROM meat LIMIT 5;")
Out[4]:
date | beef | veal | pork | lamb_and_mutton | broilers | other_chicken | turkey | |
---|---|---|---|---|---|---|---|---|
0 | 1944-01-01 00:00:00.000000 | 751.0 | 85.0 | 1280.0 | 89.0 | None | None | None |
1 | 1944-02-01 00:00:00.000000 | 713.0 | 77.0 | 1169.0 | 72.0 | None | None | None |
2 | 1944-03-01 00:00:00.000000 | 741.0 | 90.0 | 1128.0 | 75.0 | None | None | None |
3 | 1944-04-01 00:00:00.000000 | 650.0 | 89.0 | 978.0 | 66.0 | None | None | None |
4 | 1944-05-01 00:00:00.000000 | 681.0 | 106.0 | 1029.0 | 78.0 | None | None | None |
In [5]:
mysql("SELECT * FROM births LIMIT 5;")
Out[5]:
date | births | |
---|---|---|
0 | 1975-01-01 00:00:00.000000 | 265775 |
1 | 1975-02-01 00:00:00.000000 | 241045 |
2 | 1975-03-01 00:00:00.000000 | 268849 |
3 | 1975-04-01 00:00:00.000000 | 247455 |
4 | 1975-05-01 00:00:00.000000 | 254545 |
Joining tables is one of the most common tasks being performed by SQL. Understandably so, as the relational databases have data segregated in separate tables. Hence, SQL users are pretty used to using join()
tables in SQL. We can use the power of SQL JOIN here with pandas DataFrame.
In [6]:
query = '''
SELECT m.date, m.beef, m.veal, m.pork, b.births
FROM meat AS m
INNER JOIN
births AS b
ON m.date = b.date;
'''
mysql(query)
Out[6]:
date | beef | veal | pork | births | |
---|---|---|---|---|---|
0 | 1975-01-01 00:00:00.000000 | 2106.0 | 59.0 | 1114.0 | 265775 |
1 | 1975-02-01 00:00:00.000000 | 1845.0 | 50.0 | 954.0 | 241045 |
2 | 1975-03-01 00:00:00.000000 | 1891.0 | 57.0 | 976.0 | 268849 |
3 | 1975-04-01 00:00:00.000000 | 1895.0 | 60.0 | 1100.0 | 247455 |
4 | 1975-05-01 00:00:00.000000 | 1849.0 | 59.0 | 934.0 | 254545 |
… | … | … | … | … | … |
402 | 2012-07-01 00:00:00.000000 | 2200.8 | 9.5 | 1721.8 | 368450 |
403 | 2012-08-01 00:00:00.000000 | 2367.5 | 10.1 | 1997.9 | 359554 |
404 | 2012-09-01 00:00:00.000000 | 2016.0 | 8.8 | 1911.0 | 361922 |
405 | 2012-10-01 00:00:00.000000 | 2343.7 | 10.3 | 2210.4 | 347625 |
406 | 2012-11-01 00:00:00.000000 | 2206.6 | 10.1 | 2078.7 | 320195 |
407 rows × 5 columns
The data of meat production is month-wise. We want to see the beef production per year. For that we need to groupby()
and aggregate
. We can do this using the SQL GROUP BY
function.
In [7]:
query = '''SELECT
strftime('%Y', date) as year
, SUM(beef) as beef_total
FROM
meat
GROUP BY
year
LIMIT 5;
'''
mysql(query)
Out[7]:
year | beef_total | |
---|---|---|
0 | 1944 | 8801.0 |
1 | 1945 | 9936.0 |
2 | 1946 | 9010.0 |
3 | 1947 | 10096.0 |
4 | 1948 | 8766.0 |
In the above code, we used SQL query to limit the number of rows for the grouped and aggregated table to 5 rows. But the output and the input, both are not SQL tables. They are pandas DataFrames. And this gives us the liberty to use Pandas functions and methods on the same.
Let us do the same operation, and this time the output shall be the first 10 rows. But the SQL query will give a full table and we will use pandas head()
function to get the final output truncated to 10 rows.
In [8]:
query = '''SELECT
strftime('%Y', date) as year
, SUM(beef) as beef_total
FROM
meat
GROUP BY
year;
'''
mysql(query).head(10)
Out[8]:
year | beef_total | |
---|---|---|
0 | 1944 | 8801.0 |
1 | 1945 | 9936.0 |
2 | 1946 | 9010.0 |
3 | 1947 | 10096.0 |
4 | 1948 | 8766.0 |
5 | 1949 | 9142.0 |
6 | 1950 | 9248.0 |
7 | 1951 | 8549.0 |
8 | 1952 | 9337.0 |
9 | 1953 | 12055.0 |
We have beef, pork, and veal as meat types, in separate columns. Here we want all the production values in one column and the identifier in another column. We can use UNION ALL
function from SQL to achieve this easily.
In [9]:
#executing union all statements
query = """
SELECT
date
, 'beef' AS meat_type
, beef AS value
FROM meat
UNION ALL
SELECT
date
, 'veal' AS meat_type
, veal AS value
FROM meat
UNION ALL
SELECT
date
, 'pork' AS meat_type
, pork AS value
FROM meat
UNION ALL
SELECT
date
, 'lamb_and_mutton' AS meat_type
, lamb_and_mutton AS value
FROM meat
ORDER BY 1
"""
mysql(query).head(10)
Out[9]:
date | meat_type | value | |
---|---|---|---|
0 | 1944-01-01 00:00:00.000000 | beef | 751.0 |
1 | 1944-01-01 00:00:00.000000 | veal | 85.0 |
2 | 1944-01-01 00:00:00.000000 | pork | 1280.0 |
3 | 1944-01-01 00:00:00.000000 | lamb_and_mutton | 89.0 |
4 | 1944-02-01 00:00:00.000000 | beef | 713.0 |
5 | 1944-02-01 00:00:00.000000 | veal | 77.0 |
6 | 1944-02-01 00:00:00.000000 | pork | 1169.0 |
7 | 1944-02-01 00:00:00.000000 | lamb_and_mutton | 72.0 |
8 | 1944-03-01 00:00:00.000000 | beef | 741.0 |
9 | 1944-03-01 00:00:00.000000 | veal | 90.0 |
In SQL, writing queries within another query is commonplace. The same kind of nesting of queries is possible here as well. We will create one table (or say DataFrame) and without assigning it any variable (or name), we will use that to create another table.
In [10]:
# use queries within queries
query = """
SELECT
m1.date
, m1.beef
FROM
meat m1
WHERE m1.date IN
(SELECT
date
FROM meat
WHERE
beef >= broilers
ORDER BY date)
"""
mysql(query)
Out[10]:
date | beef | |
---|---|---|
0 | 1960-01-01 00:00:00.000000 | 1196.0 |
1 | 1960-02-01 00:00:00.000000 | 1089.0 |
2 | 1960-03-01 00:00:00.000000 | 1201.0 |
3 | 1960-04-01 00:00:00.000000 | 1066.0 |
4 | 1960-05-01 00:00:00.000000 | 1202.0 |
… | … | … |
416 | 1995-08-01 00:00:00.000000 | 2316.0 |
417 | 1995-09-01 00:00:00.000000 | 2220.0 |
418 | 1995-11-01 00:00:00.000000 | 2098.0 |
419 | 1996-05-01 00:00:00.000000 | 2302.0 |
420 | 1996-06-01 00:00:00.000000 | 2186.0 |
In this article, we saw that how easily we can use SQL queries to operate upon the DataFrames. This gives us a unique opportunity. This weapon can be a potent one in any Data Scientist’s arsenal, who knows SQL and Python, both.
They both are powerful languages and have their respective strengths and weaknesses. Using the method shown in this article, or in other words, using the pandasql
library and sqldf
function, we can use the best and most efficient method to manipulate data, well within the python environment and even Jupyter Notebook. This is music to my ears. I hope you enjoyed the song
too 🤓.
In this article, you saw how to use SQL queries inside python. But if you want to connect the two most powerful workhorses of the Data Science world, SQL and Python. This is not the end, but only the first step towards getting the “Best of Both Worlds”.
Now you can start using Python to work upon your data which rests in SQL Databases. In able to connect to your SQL databases, go thru my article How to Access & Use SQL Database with pyodbc in Python. Once you brought it as DataFrame, then all the operations are usual Pandas operations or SQL queries being operated on Pandas DataFrame as you saw in this article.
Apart from the function of SQL shown in this article, many other popular SQL functions are easily implementable in Python. Read 15 Pandas functions to replicate basic SQL Queries in Python for learning how to do that.
A. PandaSQL is a Python package that allows SQL-like queries to be performed on pandas DataFrames, enabling seamless data manipulation within the Python environment.
A. Pandasql might be faster for specific operations that leverage SQL’s optimization, but the speed difference depends on the task and dataset size.
A. Install Pandasql using pip: pip install pandasql
. Then, import it in your Python script or Jupyter Notebook to use SQL-like queries with pandas DataFrames.
A. Pandas and SQL serve different purposes. Pandas are great for data analysis and manipulation within Python, while SQL is essential for efficiently managing databases and querying large datasets. Both have their strengths.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
Explained very briefly. A very unique and easy way to approach the topic.
Pandasql current version is from 2016 and the last commit on github was in 2017. Looks like it's been abandoned.