A window function (or Analytic Function) in SQL uses values from multiple rows (or one row) to return values for each row. Do not confuse this with aggregate functions, which return a single value after aggregating multiple rows. There’s a simple way to identify the window functions. They have an OVER clause. Any function in SQL without an OVER clause is not a window function. It will be either an aggregate function or a scalar (single-row) function.
In this article, I am going to use this table, named population
to showcase the examples and explain the codes. The first 5 rows of the table are as below. This is the data set which has a population for every state of USA for years from 1990 to 2013 and also has split over the total population and population of people above 18 years of age.
To understand the concepts, we will use the USA population dataset here. You may download the data set from my GitHub page here, and practice along with it.
Note: I am doing these operations in Jupyter Notebook(in Python Environment, on SQLite Syntax), and when you do the same in SQL editor, the table layout may look a bit different, but the content and codes are same.
SELECT * FROM population limit 5;
state | ages | year | population | |
---|---|---|---|---|
0 | AL | under18 | 2012 | 1117489.0 |
1 | AL | total | 2012 | 4817528.0 |
2 | AL | under18 | 2010 | 1130966.0 |
3 | AL | total | 2010 | 4785570.0 |
4 | AL | under18 | 2011 | 1125763.0 |
Example: Here is a SQL Query which uses a windows function to compare the population of each state for every year with the average population of that state over the period of all the years available in dataset.
SELECT *, avg(population) OVER (PARTITION BY state) FROM population;
state | ages | year | population | avg(population) OVER (PARTITION BY state) | |
---|---|---|---|---|---|
0 | AK | total | 1990 | 553290.0 | 416438.895833 |
1 | AK | under18 | 1990 | 177502.0 | 416438.895833 |
2 | AK | total | 1992 | 588736.0 | 416438.895833 |
3 | AK | under18 | 1991 | 182180.0 | 416438.895833 |
4 | AK | under18 | 1992 | 184878.0 | 416438.895833 |
… | … | … | … | … | … |
2539 | WY | under18 | 1993 | 137458.0 | 321286.541667 |
2540 | WY | total | 1991 | 459260.0 | 321286.541667 |
2541 | WY | under18 | 1991 | 136720.0 | 321286.541667 |
2542 | WY | under18 | 1990 | 136078.0 | 321286.541667 |
2543 | WY | total | 1990 | 453690.0 | 321286.541667 |
2544 rows × 5 columns
The PARTITION BY
is grouping the rows into partitions (by year) and the function (average function here) is applied to each partition separately. If you omit the PARTITION BY
clause, and the OVER()
a clause is empty, then the entire data set is treated as a single partition, and instead of getting the AVG population of states over the years, you will get the average population of all the states for the full period of time.
Now let us dive in and see different Window functions and their Python Counterpart along with codes and results.
lead()
and LAG()
are used to compare the rows with their previous(preceding) or next(following) rows. For this to work best, the data shall be in order or sorted. These functions are most useful when you want to compare one period of time with a previous period of time based on the certain metrics. Some generic examples can be as below.
Example:
Let’s see the comparison of the population for next year, as compared to the population of the previous year.
We can use LEAD()
and LAG()
to create new columns where we pull values from other rows. LEAD()
pulls the value from the next row and LAG()
pulls the value from the previous row.
SELECT *, LAG(population, 1) OVER (PARTITION BY state ORDER BY year) AS lag, LEAD(population, 1) OVER (PARTITION BY state ORDER BY year) AS lead FROM population
state | ages | year | population | lag | lead | |
---|---|---|---|---|---|---|
0 | AK | total | 1990 | 553290.0 | NaN | 177502.0 |
1 | AK | under18 | 1990 | 177502.0 | 553290.0 | 182180.0 |
2 | AK | under18 | 1991 | 182180.0 | 177502.0 | 570193.0 |
3 | AK | total | 1991 | 570193.0 | 182180.0 | 588736.0 |
4 | AK | total | 1992 | 588736.0 | 570193.0 | 184878.0 |
… | … | … | … | … | … | … |
2539 | WY | under18 | 2011 | 135407.0 | 567329.0 | 576626.0 |
2540 | WY | total | 2012 | 576626.0 | 135407.0 | 136526.0 |
2541 | WY | under18 | 2012 | 136526.0 | 576626.0 | 582658.0 |
2542 | WY | total | 2013 | 582658.0 | 136526.0 | 137679.0 |
2543 | WY | under18 | 2013 | 137679.0 | 582658.0 | NaN |
2544 rows × 6 columns
These are also known as running totals. Doing the running totals is simple when you use a windows function like sum()
. This is particularly used when you want to show the accumulation or growth of any metric over time.
The following example will show you how to do the cumulative sum of the population for states over years.
SELECT *, SUM(population) OVER (PARTITION BY state ORDER BY year) as Total_Population FROM population where ages = 'total'
state | ages | year | population | Total_Population | |
---|---|---|---|---|---|
0 | AK | total | 1990 | 553290.0 | 553290.0 |
1 | AK | total | 1991 | 570193.0 | 1123483.0 |
2 | AK | total | 1992 | 588736.0 | 1712219.0 |
3 | AK | total | 1993 | 599434.0 | 2311653.0 |
4 | AK | total | 1994 | 603308.0 | 2914961.0 |
… | … | … | … | … | … |
1267 | WY | total | 2009 | 559851.0 | 9957038.0 |
1268 | WY | total | 2010 | 564222.0 | 10521260.0 |
1269 | WY | total | 2011 | 567329.0 | 11088589.0 |
1270 | WY | total | 2012 | 576626.0 | 11665215.0 |
1271 | WY | total | 2013 | 582658.0 | 12247873.0 |
Moving Averages are very effective in forecasting values in time series. If you have ever looked at share price analysis, the analysts talk in terms of DMA. This DMA is nothing but Day Moving Average. So when they say this share is now priced above its 200 DMA, it means it is now trading at a price that is higher than the average price of this share, and this Average is the calculated average of the previous 200 Days closing prices.
Our Population data is for the years 1990 to 2013. Here, I am interested in knowing, how the population for the present year is as compared to the previous 5 years average population. This Moving Average for the last 5 years can be calculated in SQL by following the window function.
SELECT *, round(AVG(population) OVER (ORDER BY state, year ROWS 5 PRECEDING),2) as Avg_Populationfor5Year from population where ages = 'total'
state | ages | year | population | Avg_Populationfor5Year | |
---|---|---|---|---|---|
0 | AK | total | 1990 | 553290.0 | 553290.00 |
1 | AK | total | 1991 | 570193.0 | 561741.50 |
2 | AK | total | 1992 | 588736.0 | 570739.67 |
3 | AK | total | 1993 | 599434.0 | 577913.25 |
4 | AK | total | 1994 | 603308.0 | 582992.20 |
… | … | … | … | … | … |
1267 | WY | total | 2009 | 559851.0 | 531116.67 |
1268 | WY | total | 2010 | 564222.0 | 540302.67 |
1269 | WY | total | 2011 | 567329.0 | 549164.67 |
1270 | WY | total | 2012 | 576626.0 | 558157.83 |
1271 | WY | total | 2013 | 582658.0 | 566121.50 |
1272 rows × 5 columns
When you are working on data that has an element of time in it, or more specifically time series data, you can conclusively say that the events are happening in a particular order. Now you need to work upon groups of data and need to use some particular events, say the latest ones, or the oldest ones. These are still simpler, provided some of the time series functions (not the scope of this article).
But what if you want the 2nd last event or the 3rd from the latest? How nice it would be if you could number the events like they are separate rows, for each group, and then use the row numbers to access them when needed. This ROW_NUMBER()
the function helps you do exactly that, and much more.
Let us see an example where we will number the rows, and then use one particular row number for all groups in order to use that data.
SELECT *, ROW_NUMBER() OVER (PARTITION BY state ORDER BY year) as row_num FROM population WHERE ages = 'total'
state | ages | year | population | row_num | |
---|---|---|---|---|---|
0 | AK | total | 1990 | 553290.0 | 1 |
1 | AK | total | 1991 | 570193.0 | 2 |
2 | AK | total | 1992 | 588736.0 | 3 |
3 | AK | total | 1993 | 599434.0 | 4 |
4 | AK | total | 1994 | 603308.0 | 5 |
… | … | … | … | … | … |
1267 | WY | total | 2009 | 559851.0 | 20 |
1268 | WY | total | 2010 | 564222.0 | 21 |
1269 | WY | total | 2011 | 567329.0 | 22 |
1270 | WY | total | 2012 | 576626.0 | 23 |
1271 | WY | total | 2013 | 582658.0 | 24 |
1272 rows × 5 columns
Now let us use the row numbers to meaningfully take out the third year for each state. This we can now do easily as we have numbered the rows, and 3rd row for each group is what we seek.
SELECT * FROM population WHERE row_num = 3
If you notice, in the previous SQL code, we did not create a table to save the row_num column. You can do so if you want to, but here to show you how the code works, I am going to combine the two SQL queries, and they work like charm.
The output table here may look long and overwhelming, but notice that its population of all States and Regions of USA for the year 1992, the third year in the list of years from 1990 to 2013.
SELECT * FROM( SELECT *, ROW_NUMBER() OVER (PARTITION BY state ORDER BY year) as row_num FROM population WHERE ages = 'total' ) WHERE row_num = 3
state | ages | year | population | row_num | |
---|---|---|---|---|---|
0 | AK | total | 1992 | 588736.0 | 3 |
1 | AL | total | 1992 | 4154014.0 | 3 |
2 | AR | total | 1992 | 2415984.0 | 3 |
3 | AZ | total | 1992 | 3915740.0 | 3 |
4 | CA | total | 1992 | 30974659.0 | 3 |
5 | CO | total | 1992 | 3495939.0 | 3 |
6 | CT | total | 1992 | 3300712.0 | 3 |
7 | DC | total | 1992 | 597567.0 | 3 |
8 | DE | total | 1992 | 694927.0 | 3 |
9 | FL | total | 1992 | 13650553.0 | 3 |
10 | GA | total | 1992 | 6817203.0 | 3 |
11 | HI | total | 1992 | 1158613.0 | 3 |
12 | IA | total | 1992 | 2818401.0 | 3 |
13 | ID | total | 1992 | 1071685.0 | 3 |
14 | IL | total | 1992 | 11694184.0 | 3 |
15 | IN | total | 1992 | 5674547.0 | 3 |
16 | KS | total | 1992 | 2532395.0 | 3 |
17 | KY | total | 1992 | 3765469.0 | 3 |
18 | LA | total | 1992 | 4293003.0 | 3 |
19 | MA | total | 1992 | 6028709.0 | 3 |
20 | MD | total | 1992 | 4923369.0 | 3 |
21 | ME | total | 1992 | 1238508.0 | 3 |
22 | MI | total | 1992 | 9479065.0 | 3 |
23 | MN | total | 1992 | 4495572.0 | 3 |
24 | MO | total | 1992 | 5217101.0 | 3 |
25 | MS | total | 1992 | 2623734.0 | 3 |
26 | MT | total | 1992 | 825770.0 | 3 |
27 | NC | total | 1992 | 6897214.0 | 3 |
28 | ND | total | 1992 | 638223.0 | 3 |
29 | NE | total | 1992 | 1611687.0 | 3 |
30 | NH | total | 1992 | 1117785.0 | 3 |
31 | NJ | total | 1992 | 7880508.0 | 3 |
32 | NM | total | 1992 | 1595442.0 | 3 |
33 | NV | total | 1992 | 1351367.0 | 3 |
34 | NY | total | 1992 | 18246653.0 | 3 |
35 | OH | total | 1992 | 11029431.0 | 3 |
36 | OK | total | 1992 | 3220517.0 | 3 |
37 | OR | total | 1992 | 2991755.0 | 3 |
38 | PA | total | 1992 | 12049450.0 | 3 |
39 | PR | total | 1992 | NaN | 3 |
40 | RI | total | 1992 | 1012581.0 | 3 |
41 | SC | total | 1992 | 3620464.0 | 3 |
42 | SD | total | 1992 | 712801.0 | 3 |
43 | TN | total | 1992 | 5049742.0 | 3 |
44 | TX | total | 1992 | 17759738.0 | 3 |
45 | USA | total | 1992 | 256514231.0 | 3 |
46 | UT | total | 1992 | 1836799.0 | 3 |
47 | VA | total | 1992 | 6414307.0 | 3 |
48 | VT | total | 1992 | 572751.0 | 3 |
49 | WA | total | 1992 | 5160757.0 | 3 |
50 | WI | total | 1992 | 5025398.0 | 3 |
51 | WV | total | 1992 | 1806451.0 | 3 |
52 | WY | total | 1992 | 466251.0 | 3 |
We have the following rank functions in SQL.
To give rank as per the value of records in your table, you may use RANK()
function in SQL. The working of the same is similar to the ROW_NUMBER()
in SQL, with a slight difference. The RANK()
function gives same rank to equal values. The need for a ranking function is quite intuitive. You might want to
SELECT *, RANK() OVER (PARTITION BY state ORDER BY population) as rank FROM population
state | ages | year | population | rank | |
---|---|---|---|---|---|
0 | AK | under18 | 1990 | 177502.0 | 1 |
1 | AK | under18 | 1991 | 182180.0 | 2 |
2 | AK | under18 | 2008 | 183124.0 | 3 |
3 | AK | under18 | 2007 | 184344.0 | 4 |
4 | AK | under18 | 1992 | 184878.0 | 5 |
… | … | … | … | … | … |
2539 | WY | total | 2009 | 559851.0 | 44 |
2540 | WY | total | 2010 | 564222.0 | 45 |
2541 | WY | total | 2011 | 567329.0 | 46 |
2542 | WY | total | 2012 | 576626.0 | 47 |
2543 | WY | total | 2013 | 582658.0 | 48 |
2544 rows × 5 columns
Use the DESC
option when you want the ranking to be done in Descending order. In this case you will observe that the rank here is in the reverse order of the rank in table above, and row_num in the table before that.
SELECT *, RANK() OVER (PARTITION BY state ORDER BY population DESC) as rank FROM population
state | ages | year | population | rank | |
---|---|---|---|---|---|
0 | AK | total | 2013 | 735132.0 | 1 |
1 | AK | total | 2012 | 730307.0 | 2 |
2 | AK | total | 2011 | 723375.0 | 3 |
3 | AK | total | 2010 | 713868.0 | 4 |
4 | AK | total | 2009 | 698895.0 | 5 |
… | … | … | … | … | … |
2539 | WY | under18 | 2006 | 125525.0 | 44 |
2540 | WY | under18 | 2002 | 125495.0 | 45 |
2541 | WY | under18 | 2003 | 124182.0 | 46 |
2542 | WY | under18 | 2005 | 124022.0 | 47 |
2543 | WY | under18 | 2004 | 123974.0 | 48 |
2544 rows × 5 columns
In this article, you got introduced to the windows functions in SQL. These are the backbone of any data preparation operation using SQL. They will come in handy when you prepare data to feed into your machine learning model, Feature engineering or maybe just analyzing some pattern in the data.
In this article, you saw how to use SQL window functions work. But if you want to connect the two most powerful workhorses of the Data Science world, SQL and Python. This is not the end, but only the first step towards getting the “Best of Both Worlds”.
Now you can start using Python to work upon your data which rests in SQL Databases. In able to connect to your SQL databases, go thru my article How to Access & Use SQL Database with pyodbc in Python. Once you brought it as DataFrame, then all the operations are usual Pandas operations or SQL queries being operated on Pandas DataFrame as you saw in this article.
Apart from the function of SQL shown in this article, many other popular SQL functions are easily implementable in Python. Read 15 Pandas functions to replicate basic SQL Queries in Python for learning how to do that.
The implied learning in this article was, that you can use Python to do things that you thought were only possible using SQL. There may or may not be straight forward solution to things, but if you are inclined to find it, there are enough resources at your disposal to find a way out. You can look at the mix and match the learning from my book, PYTHON MADE EASY – Step by Step Guide to Programming and Data Analysis using Python for Beginners and Intermediate Level.
About the Author: I am Nilabh Nishchhal. I like making seemingly difficult topics easy and write about them. Check out more at https://www.authornilabh.com/. My attempt to make Python easy and Accessible to all is Python Made Easy.
Cover Photo Credit: Photo by Mari Madriz on Unsplash