The realm of data offers vast capabilities and numerous challenges. Whether you are a data analyst, data scientist, or data engineer, summarizing and aggregating data is essential. This skill helps distill complex information into meaningful insights, driving informed decisions across various industries like finance, healthcare, retail, and technology.
As a data engineer working on a project for one of the largest banks in my country, I realized the importance of managing data with precision. Misinterpreting a single rule can lead to significant errors, affecting business decisions and regulatory compliance. This experience highlighted the importance of conditional aggregation, where data is aggregated based on specific conditions or criteria. Mastering this skill is crucial for delivering accurate and meaningful insights.
In this article, we will explore conditional aggregation in SQL. We’ll discuss its key components, benefits, and how and when to use it. Whether you’re a seasoned SQL developer or a newcomer, understanding these concepts will significantly enhance your data querying skills.
At first glance, aggregation and conditional aggregation might seem similar, but there is a significant difference between the two. As the name suggests, conditional aggregation is based on specific conditions. Before diving deeper, let’s explore these concepts in detail.
Known as standard aggregations these are represented by using functions such as SUM() , AVG() , MIN() , COUNT() , MAX() . These are used in the SELECT statements and are applied over the entire dataset or for specific groups of data, where you need to specify the needed column in the GROUP BY or PARTITION BY clause.
SELECT
SUM(column) AS column_sum
FROM
table_name
In this example:
SELECT column1,
SUM(column) AS column_sum
FROM
table_name
GROUP BY column1
In this example:
SELECT column1,
column2,
column3,
SUM(column) OVER (PARTITION BY column2) AS column_sum
FROM
table_name
In this example:
Now, let’s focus on conditional aggregation. Look a bit at the title above… it says something about conditional, right? Well, yes because in the case of this scenario, this type of aggregation applies functions only to a specific “portion” of the dataset; more specifically only on a slight piece of the whole dataset that met a condition. This type of aggregation is used with CASE statements.
SELECT
SUM(CASE WHEN condition THEN column ELSE 0 END) AS conditional_sum
FROM
table_name;
In this example:
Now take a moment and see if you can identify the difference yourself.
As you can see, although the aggregation functions used in both scenarios are the same, the key difference lies in their application. Standard aggregations affect the entire dataset, while conditional aggregations operate on the subset of data that meets the specified condition (a condition that is specified using the CASE statement). This nuanced approach enables more precise data analysis and insight extraction.
Let’s continue with the banking scenario and have the following table:
CREATE TABLE Transactions (
TransactionID INT PRIMARY KEY,
AccountID INT,
Date DATE,
TransactionType VARCHAR(50),
Amount DECIMAL(15, 2)
);
To see how conditional aggregation works, we need some data, right?
INSERT INTO Transactions (TransactionID, AccountID, Date, TransactionType, Amount) VALUES
(1001, 101, '2023-06-15', 'Deposit', 5000.00),
(1002, 102, '2023-06-17', 'Withdrawal', 1200.00),
(1003, 103, '2023-06-19', 'Deposit', 15000.00),
(1004, 104, '2023-06-21', 'Withdrawal', 3000.00),
(1005, 105, '2023-06-25', 'Deposit', 7000.00),
(1016, 101, '2023-08-02', 'Deposit', 6000.00),
(1017, 102, '2023-08-05', 'Withdrawal', 1800.00),
(1018, 103, '2023-08-08', 'Deposit', 9000.00),
(1019, 104, '2023-08-12', 'Withdrawal', 2500.00),
(1020, 105, '2023-08-15', 'Deposit', 4000.00),
(1021, 101, '2023-08-18', 'Withdrawal', 1200.00),
(1022, 102, '2023-08-21', 'Deposit', 3000.00),
(1023, 103, '2023-08-25', 'Withdrawal', 3500.00),
(1024, 104, '2023-08-28', 'Deposit', 6000.00),
(1025, 105, '2023-08-30', 'Withdrawal', 5000.00),
(1006, 101, '2023-06-28', 'Deposit', 3000.00),
(1007, 102, '2023-07-02', 'Withdrawal', 1500.00),
(1008, 103, '2023-07-05', 'Deposit', 8000.00),
(1009, 104, '2023-07-10', 'Withdrawal', 2000.00),
(1010, 105, '2023-07-12', 'Deposit', 5000.00),
(1011, 101, '2023-07-15', 'Withdrawal', 1000.00),
(1012, 102, '2023-07-18', 'Deposit', 2000.00),
(1013, 103, '2023-07-21', 'Withdrawal', 3000.00),
(1014, 104, '2023-07-25', 'Deposit', 7000.00),
(1015, 105, '2023-07-28', 'Withdrawal', 4000.00),
(1026,102, '2023-08-30', 'Withdrawal', 500.00);
Our manager approached us with a request for a detailed analysis of the monthly deposits and withdrawals associated with each account. This analytical task holds significant importance as it is instrumental in comprehensively understanding customer behavior and ensuring the precision of our financial tracking mechanisms.
We’ll use our Transactionstable to solve this urgent request, but more is needed. We still have to find a way to retrieve all the details the manager requested: the account, the month, and the amounts for each type of transaction.
Somehow conditional aggregation seems to check all our needs, so with that in mind let’s start and write our code…but before that let’s identify which columns we need.
First of all, we need to identify the account, so we’ll take AccountID ; and because we need to do a monthly analysis, we also take the Date column.
We also need the TransactionType column since the request was to analyze both deposits and withdrawals and the mandatory Amount column which will be used to calculate the total sum for each type of transaction.
SELECT
AccountID,
FORMAT(Date, 'yyyy-MM') AS Month,
SUM(CASE WHEN TransactionType = 'Deposit' THEN Amount ELSE 0 END) AS TotalDeposits,
SUM(CASE WHEN TransactionType = 'Withdrawal' THEN Amount ELSE 0 END) AS TotalWithdrawals
FROM
Transactions
As we can see, we used the SUM() function with CASE statement to calculate the total amount for each type of transaction (Deposit or Withdrawal).
However, if we run this query we will receive the error below because we haven’t specified how to group the non-aggregated data ( AccountID and Date ).
In SQL, when using aggregate functions, like the one we used, we need to include a GROUP BY clause for the non-aggregated columns.
SELECT
AccountID,
FORMAT(Date, 'yyyy-MM') AS Month,
SUM(CASE WHEN TransactionType = 'Deposit' THEN Amount ELSE 0 END) AS TotalDeposits,
SUM(CASE WHEN TransactionType = 'Withdrawal' THEN Amount ELSE 0 END) AS TotalWithdrawals
FROM
Transactions
GROUP BY
AccountID,
FORMAT(Date, 'yyyy-MM')
SELECT statement:
This query will provide a monthly summary of deposits and withdrawals for each account, fulfilling the manager’s request for a detailed financial analysis.
To be easier to see if this worked, let’s take the AccountID = 102(we do this by adding a WHEREclause in our script).
SELECT
AccountID,
FORMAT(Date, 'yyyy-MM') AS Month,
SUM(CASE WHEN TransactionType = 'Deposit' THEN Amount ELSE 0 END) AS TotalDeposits,
SUM(CASE WHEN TransactionType = 'Withdrawal' THEN Amount ELSE 0 END) AS TotalWithdrawals
FROM
Transactions
WHERE AccountID = 102
GROUP BY
AccountID,
FORMAT(Date, 'yyyy-MM')
Here are the contents of the main table along with the results of our script:
Yes, it’s more than possible. You can also use multiple conditions within the CASE statement for more complex conditional aggregation. For example, let’s take our manager’s request a bit further and categorize the transactions into small, medium, and large.
Having the previous query is a good start but we need an additional condition
SELECT
AccountID,
FORMAT(Date, 'yyyy-MM') AS Month,
SUM(CASE WHEN TransactionType = 'Deposit' AND Amount < 5000 THEN Amount ELSE 0 END) AS SmallDeposits,
SUM(CASE WHEN TransactionType = 'Deposit' AND Amount BETWEEN 5000 AND 10000 THEN Amount ELSE 0 END) AS MediumDeposits,
SUM(CASE WHEN TransactionType = 'Deposit' AND Amount > 10000 THEN Amount ELSE 0 END) AS LargeDeposits,
SUM(CASE WHEN TransactionType = 'Withdrawal' AND Amount < 2000 THEN Amount ELSE 0 END) AS SmallWithdrawals,
SUM(CASE WHEN TransactionType = 'Withdrawal' AND Amount BETWEEN 2000 AND 5000 THEN Amount ELSE 0 END) AS MediumWithdrawals,
SUM(CASE WHEN TransactionType = 'Withdrawal' AND Amount > 5000 THEN Amount ELSE 0 END) AS LargeWithdrawals
FROM
Transactions
GROUP BY
AccountID,
FORMAT(Date, 'yyyy-MM');
Aside from checking if the type of the transaction is ‘Deposit’ or ‘Withdrawal’, we also add different conditions over Amount column, like below:
The same logic, but with different conditions, is applied for the ‘Withdrawal’ type of transaction.
At the end this query will return a monthly summary of deposits and withdrawals for each account, categorized into small, medium, and large transactions, providing a deeper insight into transaction behaviors.
Some scenarios when using a conditional aggregation is the best approach are:
Like any other SQL function, conditional aggregations have their benefits. The most significant are:
Harnessing the power of conditional aggregation is vital for data analysis and reporting, especially in the financial sector. It allows you to aggregate data based on specific conditions, which is often necessary. In this article, we highlighted the differences between standard and conditional aggregation and explored practical examples to deepen our understanding of this powerful tool. Keep in mind that mastering conditional aggregation will enhance your SQL skills. It will also bring you closer to becoming pro in SQL.
A. Conditional aggregation in SQL involves applying aggregate functions like SUM(), COUNT(), or AVG() based on specific conditions within the dataset using CASE statements.
A. Standard aggregation applies functions to the entire dataset or specified groups, while conditional aggregation applies functions only to subsets of data that meet certain conditions.
A. Common use cases include calculating specific aggregates, filtering data within aggregation functions, and creating custom metrics or reports.
A. Yes, you can use multiple conditions within a CASE statement to perform more complex conditional aggregation.