This article was published as a part of the Data Science Blogathon.
SQL — A structured query language is a must-know tool for everyone working with datasets. As its name suggests, it is primarily used to query, i.e., fetch the data from the relational database where data is stored in the form of tables.
SQL helps you to get the data from these tables using different SQL commands. Therefore, if you are beginning with the data science domain, you must learn SQL.
This article will give you an idea about what you can do using SQL and what are the commonly used SQL commands. I have listed all the basic commands and their usage, making you understand their purpose more easily.
📍 Note: I’m using SQLite DB Browser and a self created Dummy_Sales_Data created using Faker which you can get on my Github repo for Free!
Let’s get started..
This is the most basic and almost every time used command in SQL. As its name suggests, it is used to select some data from the database. You will not select the data randomly; you need to provide specific column names in the SELECT statement.
For example, suppose you want to select only the columns OrderID
, Shipping_Address
, UnitPrice
, and Sales_Manager
from the dataset, all you need to do is —
SELECT OrderID, Sales_Manager, Shipping_Address, UnitPrice
However, when you need to select all the columns from the dataset, you don’t need to type all the column names one by one; you can simply use an asterisk or star-shaped symbol(*
) instead of column names as shown below —
SELECT *
This is how you can select one or more or all the columns of the dataset. But, going ahead, you also need to tell the computer from which dataset you want these columns. And that’s when the next important SQL command comes into the picture.
In SQL, FROM tells your program from which dataset you want to extract the data. Ideally, the database contains multiple datasets stored as a table. So FROM is the only way you can tell your program which dataset to look for.
To get the data from the table of your choice, all you need to do is specify the table name next to the FROM
keyword.
Continuing the example from SELECT
, suppose you want these columns from the dummy sales dataset. The name of this dataset in your database is Dummy_Sales_Data_v1. So, to get the required data, you need to write the below piece of SQL code.
SELECT OrderID, Sales_Manager, Shipping_Address, UnitPrice FROM Dummy_Sales_Data_v1
The output of SELECT FROM statements | Image by Author
Well, this can give you a lot of return data that might not be needed for your data analysis task. However, you need only some part of the data matching some conditions. The next important statement is useful for this purpose.
In SQL, the WHERE clause filters the records, i.e., rows of the dataset. You need to specify a condition in the WHERE clause, and only the rows which satisfy this condition will be retrieved.
For instance, suppose in the previous example, you want to select only those rows where the sales manager is Pablo. You can get this with —
SELECT OrderID, Sales_Manager, Shipping_Address, UnitPrice FROM Dummy_Sales_Data_v1 WHERE Sales_Manager = 'Pablo'
The last line of the above code tells you how to write a condition in the WHERE
clause. Similarly, you can add multiple conditions in the WHERE clause using Logical AND and OR operators.
AND
: This operator retrieves the records if all the conditions separated by AND are satisfied.OR
: This operator retrieves the records if any conditions separated by OR are satisfied.Continuing with the previous example, suppose you want to select all the rows when the sales manager is Pablo or Stella. You can write these conditions in WHERE clause using OR operator as below —
SELECT OrderID, Sales_Manager, Shipping_Address, UnitPrice FROM Dummy_Sales_Data_v1 WHERE Sales_Manager = 'Pablo' OR Sales_Manager = 'Stella'
Similarly, you can put conditions on multiple columns and separate those conditions using AND
and OR
operators as per the requirement.
As you can see in the above picture, all the records are arranged randomly, with no specific order for UnitPrice
columns. You can arrange the retrieved dataset in any order using the next command.
In SQL, this clause is used to sort the retrieved records in ascending or descending order. The numerical columns are arranged in order based on their value, whereas for non-numerical columns, ascending corresponds to alphabetical order, i.e., from a to z.
To arrange the rows in order, all you need to do is write a keyword ORDER BY followed by a column name by which you want to order the result set.
Continuing the example, suppose you want to order the result set in ascending order UnitPrice
, i.e. lowest unit price appears on top. You can do it with —
SELECT OrderID, Sales_Manager, Shipping_Address, UnitPrice FROM Dummy_Sales_Data_v1 WHERE Sales_Manager = 'Pablo' ORDER BY UnitPrice
By default, as you can see, the dataset is arranged in the ascending order of the unit price. However, to arrange it in descending order, you need to mention another keyword DESC
next to the column name as below.
SELECT OrderID, Sales_Manager, Shipping_Address, UnitPrice FROM Dummy_Sales_Data_v1 WHERE Sales_Manager = 'Pablo' ORDER BY UnitPrice DESC
As you can see, the highest unit price appears on the top.
As of now, you see you are extracting all the rows based on some conditions, but there is actually no limit on how many rows you want to select. The next keyword is made for that.
In SQL, the LIMIT clause specifies how many records you want to retrieve. All you need to do is simply mention the number next to the keyword LIMIT, as shown below.
SELECT OrderID, Sales_Manager, Shipping_Address, UnitPrice FROM Dummy_Sales_Data_v1 WHERE Sales_Manager = 'Pablo' LIMIT 5
As you can see, only 5 records are now retrieved from the result set. This keyword is useful when dealing with large datasets, where retrieving all the rows from the result set is either unnecessary or affects the performance.
Remember that the LIMIT keyword comes at the end of your piece of code.
Last but not least.
Until now, whatever you did is simply extracted the data. However, data analysis is not just about data extraction. There are many more commands, clauses, and functions in SQL which help you with that. At this beginner stage, you can focus on one simplest of them.
In SQL, GROUP BY
groups the rows or records with the same value in the column specified in the GROUP BY clause. It is used when you want to group the rows and perform aggregate mathematical operations, e.g., addition, counting, or taking the average.
Within GROUP BY, you can use aggregate functions such as SUM()
, MAX()
, MIN()
, AVG()
and COUNT()
.
For example, suppose you want to see the total number of orders processed by each sales manager. This means you are essentially counting OrderID
from the dataset belonging to each sales manager, right?
That means you need to group all the rows belonging to the same sales manager and then count how many order Ids belong to each. This is easily done using GROUP BY
in SQL as below.
SELECT Sales_Manager, COUNT(OrderID) FROM Dummy_Sales_Data_v1 GROUP BY Sales_Manager
As you can see, it returns the number of orders processed by each sales manager. Similarly, you can use other aggregate functions.
That’s all!
This was a simple overview of what you can do using SQL as a data analyst or anyone working on data. Many more functions and commands enable you to do entire data analysis tasks using SQL. You can find them in Advanced SQL concepts and other SQL-related articles on my Medium page.
Below are some of the learnings and takeaways from this quick read —
I hope you found this content useful.
Feel free to connect with me on Medium, LinkedIn, and be sure to Sign-up to my Email list to never miss another article on data science guides, tricks and tips, SQL, and Python.
Thank you for reading!
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.