This article was published as a part of the Data Science Blogathon
“As the different streams having their sources in different places all mingle their water in the sea, so, O Lord, the different paths which men take through different tendencies, various though they appear, crooked or straight, all lead to Thee.”
Swami Vivekanand
As wise men like Vivekanand expressed, there are many paths that people take to reach their destinations (be it the search for God or for model accuracy). Everyone who has learned data science and related streams will relate to this: these different paths start with different languages, different tools, and different expectations. During this long and arduous journey towards excellence, some start with SQL, while others learn Python. But as we all know by now, in the long run knowing only one of these will not suffice.
SQL is necessary for talking to databases, and Python is the de facto leader on the machine learning journey. This is the second article in a series meant to make people conversant in both languages. More specifically, those who are comfortable with SQL and are now learning Python will be happy to know that almost all the functions and operations performed in SQL can be replicated in Python. Some of them may even be done more efficiently in Python than in SQL.
Before we look at the equivalent SQL functions in Python, let us revise how we linked SQL databases to Python. You may go through my previous article, the first one in this series, "How-to-Access-use-SQL-Database-with-pyodbc-in-Python", for a detailed understanding. The code to start importing SQL data from MS SQL is below.
# Let's start with connecting SQL with Python and importing the SQL data as a DataFrame
import pyodbc
import pandas as pd
import numpy as np

connection_string = ("Driver={SQL Server Native Client 11.0};"
                     "Server=Your_Server_Name;"
                     "Database=My_Database_Name;"
                     "UID=Your_User_ID;"
                     "PWD=Your_Password;")
connection = pyodbc.connect(connection_string)

# Get the query output directly as a DataFrame:
# importing the top 10 rows and all the columns of the State_Population table
population = pd.read_sql('SELECT TOP(10) * FROM State_Population', connection)

# OR: write the query, assign it to a variable,
# and use the variable name in place of the query string
query = 'SELECT * FROM STATE_AREAS WHERE [area (sq. mi)] > 100000'
area = pd.read_sql(query, connection)
Once we have the data in the form of a DataFrame, we can see how to manipulate it using Pandas in Python. In this article, we are going to see how we can replicate SQL constructs in Python. There is no one "best" way, but many good paths; you choose the one you wish for.
We are going to deconstruct the most basic SQL queries and see how the same results can be achieved in Python. The constructs we will discuss in this article are SELECT, WHERE, GROUP BY, HAVING, and ORDER BY.
The methodology we are going to adopt is as follows: we will write a SQL query, and then list some possible ways in which the same result can be achieved in Python. We have three tables in the database, and we have already imported two of them as DataFrames in Python. We will use one of these DataFrames (population) to understand these concepts.
The table State_Population is already imported, and the DataFrame is named population.
In [2]:
population.head()
Out[2]:
state/region | ages | year | population | |
---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 |
1 | AL | total | 2012 | 4817528.0 |
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
4 | AL | under18 | 2011 | 1125763.0 |
Let us see how to replicate the SQL function in Python and get the same or similar results.
Note: The main headings are the broad SQL query names. Under each heading, the actual SQL query being replicated is written in bold, note format. Below it, the Python ways to replicate it are listed as numbered methods, one after the other.
population
state/region | ages | year | population | |
---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 |
1 | AL | total | 2012 | 4817528.0 |
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
4 | AL | under18 | 2011 | 1125763.0 |
… | … | … | … | … |
2539 | USA | total | 2010 | 309326295.0 |
2540 | USA | under18 | 2011 | 73902222.0 |
2541 | USA | total | 2011 | 311582564.0 |
2542 | USA | under18 | 2012 | 73708179.0 |
2543 | USA | total | 2012 | 313873685.0 |
2544 rows × 4 columns
SELECT year FROM State_Population;
This SQL query will fetch one column (year) and all the rows from the State_Population table. In Python, it can be achieved in the following ways. One thing to notice here is that when we select only one column, the result is a pandas Series rather than a DataFrame; we convert it back by wrapping it in the pd.DataFrame constructor.
# By calling dataframe.column
pd.DataFrame(population.year)
Out[4]:
year | |
---|---|
0 | 2012 |
1 | 2012 |
2 | 2010 |
3 | 2010 |
4 | 2011 |
… | … |
2539 | 2010 |
2540 | 2011 |
2541 | 2011 |
2542 | 2012 |
2543 | 2012 |
2544 rows × 1 columns
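As a side note, the Series-to-DataFrame conversion step can be skipped entirely. The sketch below, using a toy DataFrame standing in for population (the values are a hypothetical sample), shows two ways to select a single column while keeping the DataFrame type: double brackets, and the filter() method.

```python
import pandas as pd

# A tiny stand-in for the population DataFrame (hypothetical sample values)
df = pd.DataFrame({
    "state/region": ["AL", "AL", "AK"],
    "year": [2012, 2010, 2010],
    "population": [1117489.0, 1130966.0, 187902.0],
})

# Double brackets select a one-column DataFrame directly,
# avoiding the Series -> DataFrame conversion
years_df = df[["year"]]

# filter(items=...) is another way that keeps the DataFrame type
years_df2 = df.filter(items=["year"])
```

Both return the same one-column DataFrame, so the pd.DataFrame() wrapper is only needed when you start from a Series.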
SELECT population, year FROM State_Population;
This query will fetch two columns (population and year) and all the rows from the State_Population table. In Python, it can be achieved in the following ways.
Notice that the column names are passed as a list inside the indexing brackets [].
population[['population', 'year']]
Out[5]:
population | year | |
---|---|---|
0 | 1117489.0 | 2012 |
1 | 4817528.0 | 2012 |
2 | 1130966.0 | 2010 |
3 | 4785570.0 | 2010 |
4 | 1125763.0 | 2011 |
… | … | … |
2539 | 309326295.0 | 2010 |
2540 | 73902222.0 | 2011 |
2541 | 311582564.0 | 2011 |
2542 | 73708179.0 | 2012 |
2543 | 313873685.0 | 2012 |
2544 rows × 2 columns
population.loc[:,['population', 'year']]
Out[6]:
population | year | |
---|---|---|
0 | 1117489.0 | 2012 |
1 | 4817528.0 | 2012 |
2 | 1130966.0 | 2010 |
3 | 4785570.0 | 2010 |
4 | 1125763.0 | 2011 |
… | … | … |
2539 | 309326295.0 | 2010 |
2540 | 73902222.0 | 2011 |
2541 | 311582564.0 | 2011 |
2542 | 73708179.0 | 2012 |
2543 | 313873685.0 | 2012 |
2544 rows × 2 columns
The DataFrame above is the output from all the above codes. Different methods, same output.
SELECT * FROM State_Population WHERE year = 2010;
This query will fetch all the columns and only those rows from the state_population table where the year column has a value equal to 2010. In Python, it can be achieved in the following ways.
population[population.year == 2010]
Out[7]:
state/region | ages | year | population | |
---|---|---|---|---|
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
90 | AK | under18 | 2010 | 187902.0 |
91 | AK | total | 2010 | 713868.0 |
100 | AZ | under18 | 2010 | 1628563.0 |
… | … | … | … | … |
2405 | WY | total | 2010 | 564222.0 |
2490 | PR | total | 2010 | 3721208.0 |
2491 | PR | under18 | 2010 | 896945.0 |
2538 | USA | under18 | 2010 | 74119556.0 |
2539 | USA | total | 2010 | 309326295.0 |
106 rows × 4 columns
population.loc[population.year == 2010, :]
Out[8]:
state/region | ages | year | population | |
---|---|---|---|---|
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
90 | AK | under18 | 2010 | 187902.0 |
91 | AK | total | 2010 | 713868.0 |
100 | AZ | under18 | 2010 | 1628563.0 |
… | … | … | … | … |
2405 | WY | total | 2010 | 564222.0 |
2490 | PR | total | 2010 | 3721208.0 |
2491 | PR | under18 | 2010 | 896945.0 |
2538 | USA | under18 | 2010 | 74119556.0 |
2539 | USA | total | 2010 | 309326295.0 |
106 rows × 4 columns
population.query('year == 2010')
Out[9]:
state/region | ages | year | population | |
---|---|---|---|---|
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
90 | AK | under18 | 2010 | 187902.0 |
91 | AK | total | 2010 | 713868.0 |
100 | AZ | under18 | 2010 | 1628563.0 |
… | … | … | … | … |
2405 | WY | total | 2010 | 564222.0 |
2490 | PR | total | 2010 | 3721208.0 |
2491 | PR | under18 | 2010 | 896945.0 |
2538 | USA | under18 | 2010 | 74119556.0 |
2539 | USA | total | 2010 | 309326295.0 |
106 rows × 4 columns
population[population.apply(lambda x: x["year"] == 2010, axis=1)]
Out[10]:
state/region | ages | year | population | |
---|---|---|---|---|
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
90 | AK | under18 | 2010 | 187902.0 |
91 | AK | total | 2010 | 713868.0 |
100 | AZ | under18 | 2010 | 1628563.0 |
… | … | … | … | … |
2405 | WY | total | 2010 | 564222.0 |
2490 | PR | total | 2010 | 3721208.0 |
2491 | PR | under18 | 2010 | 896945.0 |
2538 | USA | under18 | 2010 | 74119556.0 |
2539 | USA | total | 2010 | 309326295.0 |
106 rows × 4 columns
The DataFrame above is the output from all the above codes. Different methods, same output.
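One more convenience worth knowing: query() can reference ordinary Python variables with the @ prefix, much like a parameterized WHERE clause. A minimal sketch, using a toy DataFrame standing in for population (hypothetical sample values):

```python
import pandas as pd

# Toy stand-in for the population DataFrame (hypothetical sample values)
df = pd.DataFrame({
    "state/region": ["AL", "AL", "AK"],
    "ages": ["under18", "total", "under18"],
    "year": [2012, 2010, 2010],
    "population": [1117489.0, 4785570.0, 187902.0],
})

# The @ prefix tells query() to look up a Python variable,
# like a parameter in a SQL WHERE clause
target_year = 2010
result = df.query("year == @target_year")
```

This keeps the filter string fixed while the value comes from surrounding code, which is handy when the same filter is reused with different values.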
SELECT [state/region], population, year FROM State_Population WHERE year IN (2010, 2012) AND ages = 'under18';
This query will fetch three columns (state/region, population, year) and only those rows from the State_Population table where the year column equals 2010 or 2012 and the ages column equals "under18". In Python, it can be achieved in the following ways.
# By using Python's indexing and slicing
population[(population.year.isin([2010, 2012])) & (population.ages == "under18")][['state/region', 'population', 'year']]
Out[11]:
state/region | population | year | |
---|---|---|---|
0 | AL | 1117489.0 | 2012 |
2 | AL | 1130966.0 | 2010 |
90 | AK | 187902.0 | 2010 |
94 | AK | 188162.0 | 2012 |
96 | AZ | 1617149.0 | 2012 |
… | … | … | … |
2404 | WY | 135351.0 | 2010 |
2491 | PR | 896945.0 | 2010 |
2494 | PR | 841740.0 | 2012 |
2538 | USA | 74119556.0 | 2010 |
2542 | USA | 73708179.0 | 2012 |
106 rows × 3 columns
population.loc[(population.year.isin([2010, 2012])) & (population.ages == "under18"),['state/region', 'population', 'year']]
Out[12]:
state/region | population | year | |
---|---|---|---|
0 | AL | 1117489.0 | 2012 |
2 | AL | 1130966.0 | 2010 |
90 | AK | 187902.0 | 2010 |
94 | AK | 188162.0 | 2012 |
96 | AZ | 1617149.0 | 2012 |
… | … | … | … |
2404 | WY | 135351.0 | 2010 |
2491 | PR | 896945.0 | 2010 |
2494 | PR | 841740.0 | 2012 |
2538 | USA | 74119556.0 | 2010 |
2542 | USA | 73708179.0 | 2012 |
106 rows × 3 columns
population.query('(year==2010 | year==2012) & ages == "under18"')[['state/region', 'population', 'year']]
Out[13]:
state/region | population | year | |
---|---|---|---|
0 | AL | 1117489.0 | 2012 |
2 | AL | 1130966.0 | 2010 |
90 | AK | 187902.0 | 2010 |
94 | AK | 188162.0 | 2012 |
96 | AZ | 1617149.0 | 2012 |
… | … | … | … |
2404 | WY | 135351.0 | 2010 |
2491 | PR | 896945.0 | 2010 |
2494 | PR | 841740.0 | 2012 |
2538 | USA | 74119556.0 | 2010 |
2542 | USA | 73708179.0 | 2012 |
106 rows × 3 columns
population[population.apply(lambda x: (x["year"] in [2010, 2012]) and (x["ages"] == "under18"), axis=1)]
Note that this method filters the rows but keeps all four columns; append [['state/region', 'population', 'year']] to subset the columns as well.
Out[14]:
state/region | ages | year | population | |
---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 |
2 | AL | under18 | 2010 | 1130966.0 |
90 | AK | under18 | 2010 | 187902.0 |
94 | AK | under18 | 2012 | 188162.0 |
96 | AZ | under18 | 2012 | 1617149.0 |
… | … | … | … | … |
2404 | WY | under18 | 2010 | 135351.0 |
2491 | PR | under18 | 2010 | 896945.0 |
2494 | PR | under18 | 2012 | 841740.0 |
2538 | USA | under18 | 2010 | 74119556.0 |
2542 | USA | under18 | 2012 | 73708179.0 |
106 rows × 4 columns
The DataFrame above is the output from all the above codes. Different methods, same output.
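The query() method also understands SQL-like "in" and "and" operators, which makes the multi-condition filter read almost exactly like the WHERE clause. A minimal sketch with a toy DataFrame standing in for population (hypothetical sample values):

```python
import pandas as pd

# Toy stand-in for the population DataFrame (hypothetical sample values)
df = pd.DataFrame({
    "state/region": ["AL", "AL", "AK", "AK"],
    "ages": ["under18", "total", "under18", "total"],
    "year": [2012, 2012, 2010, 2011],
    "population": [1117489.0, 4817528.0, 187902.0, 722713.0],
})

# query() mirrors WHERE year IN (2010, 2012) AND ages = 'under18'
result = df.query('year in [2010, 2012] and ages == "under18"')[
    ["state/region", "population", "year"]
]
```

This reads closer to the original SQL than chained boolean masks, at a small cost in flexibility.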
SELECT [state/region], AVG(population) FROM State_Population WHERE ages = 'total' GROUP BY [state/region] HAVING AVG(population) > 10000000;
The GROUP BY of SQL and the groupby of Pandas look the same on the surface, but pandas groupby is far more capable and efficient, especially for complex operations. To implement the above SQL operation in Python, let us look at the pandas groupby function up close.
It is possible to group by using one or more columns. For one column, just pass the column name, and for more than one column, pass the names as a list.
# grouped by state/region
population.groupby(by = 'state/region')
Out[15]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016982CD0408>
# grouped by state/region and year
population.groupby(by = ['state/region', 'year'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000016982D7A908>
The output is a GroupBy object. The grouping has been done, and the groupby function has done its job; but since we have not yet specified an aggregation function, the output is not in the form of a DataFrame. So let's do that now.
population.groupby(by = ['state/region', 'year']).count()
Out[17]:
ages | population | ||
---|---|---|---|
state/region | year | ||
AK | 1990 | 2 | 2 |
1991 | 2 | 2 | |
1992 | 2 | 2 | |
1993 | 2 | 2 | |
1994 | 2 | 2 | |
… | … | … | … |
WY | 2009 | 2 | 2 |
2010 | 2 | 2 | |
2011 | 2 | 2 | |
2012 | 2 | 2 | |
2013 | 2 | 2 |
1272 rows × 2 columns
We can assign this groupby object to a variable, and then use that variable for further operations.
grouped = population.groupby(by = ['state/region', 'year'])
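A grouped object can also compute several SQL aggregates in one pass, which has no direct one-line equivalent in SQL's SELECT list unless you write each aggregate out. A minimal sketch with a toy DataFrame standing in for population (hypothetical sample values):

```python
import pandas as pd

# Toy stand-in for the population DataFrame (hypothetical sample values)
df = pd.DataFrame({
    "state/region": ["AL", "AL", "AK", "AK"],
    "year": [2010, 2010, 2010, 2010],
    "population": [1130966.0, 4785570.0, 187902.0, 713868.0],
})

# One groupby computes COUNT(*), AVG(population), and SUM(population) at once
summary = df.groupby("state/region")["population"].agg(["count", "mean", "sum"])
```

The result is a DataFrame with one column per aggregate, indexed by the grouping key.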
df = pd.DataFrame(population.loc[population.ages == 'total', :].groupby(by = 'state/region')['population'].mean())
df.loc[df.population > 10000000, :]
population | |
---|---|
state/region | |
CA | 3.433414e+07 |
FL | 1.649654e+07 |
IL | 1.237080e+07 |
NY | 1.892581e+07 |
OH | 1.134238e+07 |
PA | 1.236960e+07 |
TX | 2.160626e+07 |
USA | 2.849979e+08 |
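Pandas also offers groupby().filter(), which is arguably the closest analogue of HAVING: instead of aggregating and then subsetting the aggregated table, it keeps the original rows of every group that satisfies the condition. A minimal sketch with a toy DataFrame standing in for population (hypothetical sample values):

```python
import pandas as pd

# Toy stand-in for the population DataFrame (hypothetical sample values)
df = pd.DataFrame({
    "state/region": ["CA", "CA", "WY", "WY"],
    "population": [34000000.0, 35000000.0, 560000.0, 570000.0],
})

# filter() keeps the rows of groups whose mean population exceeds the
# threshold -- the pandas analogue of HAVING AVG(population) > 10000000
big = df.groupby("state/region").filter(
    lambda g: g["population"].mean() > 10_000_000
)
```

Use filter() when you want the original, un-aggregated rows back; use aggregate-then-subset (as above) when you want the group-level summary.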
SELECT [state/region], AVG(population) FROM State_Population WHERE ages = 'total' GROUP BY [state/region] HAVING AVG(population) > 10000000 ORDER BY AVG(population);
The ORDER BY clause in SQL sorts the result in a given order. In the SQL above, the result is to be ordered in ascending order (the default). This task can be accomplished with the pandas sort_values() method.
df = pd.DataFrame(population.loc[population.ages == 'total', :].groupby(by = 'state/region')['population'].mean())
df.loc[df.population > 10000000, :].sort_values(by = 'population')
population | |
---|---|
state/region | |
OH | 1.134238e+07 |
PA | 1.236960e+07 |
IL | 1.237080e+07 |
FL | 1.649654e+07 |
NY | 1.892581e+07 |
TX | 2.160626e+07 |
CA | 3.433414e+07 |
USA | 2.849979e+08 |
The sort is done in ascending order by default. To change that, pass ascending = False.
df.loc[df.population > 10000000, :].sort_values(by = 'population', ascending = False)
Out[21]:
population | |
---|---|
state/region | |
USA | 2.849979e+08 |
CA | 3.433414e+07 |
TX | 2.160626e+07 |
NY | 1.892581e+07 |
FL | 1.649654e+07 |
IL | 1.237080e+07 |
PA | 1.236960e+07 |
OH | 1.134238e+07 |
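When you only need the top few rows of a descending sort, nlargest() is a handy shortcut for sort_values(ascending=False).head(n), much like SELECT TOP(n) ... ORDER BY population DESC. A minimal sketch with hypothetical sample values:

```python
import pandas as pd

# Toy table of state populations (hypothetical sample values)
df = pd.DataFrame(
    {"population": [2.85e8, 3.43e7, 2.16e7, 1.13e7]},
    index=["USA", "CA", "TX", "OH"],
)

# nlargest(n, column) returns the n rows with the largest values,
# like SELECT TOP(n) ... ORDER BY population DESC
top2 = df.nlargest(2, "population")
```

There is a matching nsmallest() for the ascending case.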
Pandas gives you the ability to sort by multiple columns. Not only that, you can choose which columns to sort in ascending order and which in descending order. Let us group and sort our population data set: we will group by state/region and year, and sort by year and population.
# Grouping by state/region and year, aggregating population by mean
grouped = population.groupby(by = ['state/region', 'year'])[['population']].mean()
grouped
population | ||
---|---|---|
state/region | year | |
AK | 1990 | 365396.0 |
1991 | 376186.5 | |
1992 | 386807.0 | |
1993 | 393312.0 | |
1994 | 395373.5 | |
… | … | … |
WY | 2009 | 347405.5 |
2010 | 349786.5 | |
2011 | 351368.0 | |
2012 | 356576.0 | |
2013 | 360168.5 |
1272 rows × 1 columns
# Sorting the Grouped table in
# Ascending order of Year and (Increasing Year)
# Descending order of population (decreasing population)
grouped.sort_values(by = ['year', 'population'], ascending=[True, False])
population | ||
---|---|---|
state/region | year | |
USA | 1990 | 156920663.0 |
CA | 1990 | 18970008.0 |
NY | 1990 | 11151213.5 |
TX | 1990 | 10981487.5 |
FL | 1990 | 8011057.0 |
… | … | … |
AK | 2013 | 461632.0 |
ND | 2013 | 443040.5 |
DC | 2013 | 378961.5 |
VT | 2013 | 374665.5 |
WY | 2013 | 360168.5 |
1272 rows × 1 columns
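Because the grouped result carries a MultiIndex, it can be convenient to flatten it back into ordinary columns with reset_index() before sorting, which makes the result look like a flat SQL result set. A minimal sketch with a toy DataFrame standing in for population (hypothetical sample values):

```python
import pandas as pd

# Toy stand-in for the population DataFrame (hypothetical sample values)
df = pd.DataFrame({
    "state/region": ["AK", "AK", "WY", "WY"],
    "year": [1990, 1991, 1990, 1991],
    "population": [365396.0, 376186.5, 453588.0, 459260.0],
})

grouped = df.groupby(["state/region", "year"])[["population"]].mean()

# reset_index() turns the MultiIndex levels back into ordinary columns,
# giving a flat table that sorts, joins, and exports like a SQL result set
flat = grouped.reset_index().sort_values(
    by=["year", "population"], ascending=[True, False]
)
```

The mixed ascending/descending sort works the same way on the flattened table as on the indexed one.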
You must have wondered how organizations manage their huge databases. They most certainly do not keep them in Excel or other spreadsheet formats. Real-life business databases are maintained in relational database systems, which are created and accessed most commonly using SQL. Hence SQL is a necessary tool for any data scientist. But SQL is more powerful than just a data-picking tool: it is capable of many data wrangling and data manipulation tasks. But so is Python.
No single language is sufficient to complete all tasks with operational efficiency. Hence a deep understanding of both SQL and Python will help you choose which one to use for which task.
If you want to just select, filter, and basic operations on data, you can do that efficiently in SQL. But if there’s a need for complex grouping operations, and more data manipulation, Pandas in Python would be a more apt choice.
There are many benefits in using multiple data analysis languages, as you can customize and use a hybrid approach well suited for your ever-evolving needs.
To see in detail how to connect Python with SQL or SQL server, read How-to-Access-use-SQL-Database-with-pyodbc-in-Python.
The implied learning in this article is that you can use Python to do things you thought were only possible using SQL. There may or may not be a straightforward solution to a given problem, but if you are inclined to find one, there are enough resources at your disposal to find a way out. You can mix and match these learnings with those from my book, PYTHON MADE EASY – Step by Step Guide to Programming and Data Analysis using Python for Beginners and Intermediate Level.
About the Author: I am Nilabh Nishchhal. I like making seemingly difficult topics easy and writing about them. Check out more at https://www.authornilabh.com/. My attempt to make Python easy and accessible to all is "Python Made Easy".
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.