SQL Commands for Data Science

Suraj Last Updated : 29 Jul, 2022
7 min read

This article was published as a part of the Data Science Blogathon.

Introduction

SQL — A structured query language is a must-know tool for everyone working with datasets. As its name suggests, it is primarily used to query, i.e., fetch the data from the relational database where data is stored in the form of tables.

SQL helps you to get the data from these tables using different SQL commands. Therefore, if you are beginning with the data science domain, you must learn SQL. 

This article will give you an idea about what you can do using SQL and what are the commonly used SQL commands. I have listed all the basic commands and their usage, making you understand their purpose more easily.

📍 Note: I’m using SQLite DB Browser and a self created Dummy_Sales_Data created using Faker which you can get on my Github repo for Free!

Let’s get started..

SQL Vector
Sql vector created by freepik – www.freepik.com

 

SQL Command: SELECT

This is the most basic and almost every time used command in SQL. As its name suggests, it is used to select some data from the database. You will not select the data randomly; you need to provide specific column names in the SELECT statement.

For example, suppose you want to select only the columns OrderID, Shipping_Address, UnitPrice, and Sales_Manager from the dataset, all you need to do is —

SELECT OrderID,
       Sales_Manager,
       Shipping_Address,
       UnitPrice

However, when you need to select all the columns from the dataset, you don’t need to type all the column names one by one; you can simply use an asterisk or star-shaped symbol(*) instead of column names as shown below —

SELECT *

This is how you can select one or more or all the columns of the dataset. But, going ahead, you also need to tell the computer from which dataset you want these columns. And that’s when the next important SQL command comes into the picture.

SQL Command: FROM

In SQL, FROM tells your program from which dataset you want to extract the data. Ideally, the database contains multiple datasets stored as a table. So FROM is the only way you can tell your program which dataset to look for.

To get the data from the table of your choice, all you need to do is specify the table name next to the FROM keyword.

Continuing the example from SELECT, suppose you want these columns from the dummy sales dataset. The name of this dataset in your database is Dummy_Sales_Data_v1. So, to get the required data, you need to write the below piece of SQL code.

SELECT OrderID,
       Sales_Manager,
       Shipping_Address,
       UnitPrice
FROM Dummy_Sales_Data_v1
SELECT FROM

The output of SELECT FROM statements | Image by Author

Well, this can give you a lot of return data that might not be needed for your data analysis task. However, you need only some part of the data matching some conditions. The next important statement is useful for this purpose.

SQL Command: WHERE

In SQL, the WHERE clause filters the records, i.e., rows of the dataset. You need to specify a condition in the WHERE clause, and only the rows which satisfy this condition will be retrieved.

For instance, suppose in the previous example, you want to select only those rows where the sales manager is Pablo. You can get this with —

SELECT OrderID,
       Sales_Manager,
       Shipping_Address,
       UnitPrice
FROM Dummy_Sales_Data_v1
WHERE Sales_Manager = 'Pablo'
dataset
All records when Pablo was sales manager | Image by Author

The last line of the above code tells you how to write a condition in the WHERE clause. Similarly, you can add multiple conditions in the WHERE clause using Logical AND and OR operators.

  • AND: This operator retrieves the records if all the conditions separated by AND are satisfied.
  • OR: This operator retrieves the records if any conditions separated by OR are satisfied.

Continuing with the previous example, suppose you want to select all the rows when the sales manager is Pablo or Stella. You can write these conditions in WHERE clause using OR operator as below —

SELECT OrderID,
       Sales_Manager,
       Shipping_Address,
       UnitPrice
FROM Dummy_Sales_Data_v1
WHERE Sales_Manager = 'Pablo'
OR Sales_Manager = 'Stella'
WHERE clause
Output of two conditions in WHERE clause | Image by Author

Similarly, you can put conditions on multiple columns and separate those conditions using AND and OR operators as per the requirement.

As you can see in the above picture, all the records are arranged randomly, with no specific order for UnitPrice columns. You can arrange the retrieved dataset in any order using the next command.

SQL Command: ORDER BY

In SQL, this clause is used to sort the retrieved records in ascending or descending order. The numerical columns are arranged in order based on their value, whereas for non-numerical columns, ascending corresponds to alphabetical order, i.e., from a to z.

To arrange the rows in order, all you need to do is write a keyword ORDER BY followed by a column name by which you want to order the result set.

Continuing the example, suppose you want to order the result set in ascending order  UnitPrice , i.e. lowest unit price appears on top. You can do it with —

SELECT OrderID,
       Sales_Manager,
       Shipping_Address,
       UnitPrice
FROM Dummy_Sales_Data_v1
WHERE Sales_Manager = 'Pablo'
ORDER BY UnitPrice
WHERE clause
SQL arranges records by Ascending order | Image by Author

By default, as you can see, the dataset is arranged in the ascending order of the unit price. However, to arrange it in descending order, you need to mention another keyword DESC next to the column name as below.

SELECT OrderID,
       Sales_Manager,
       Shipping_Address,
       UnitPrice
FROM Dummy_Sales_Data_v1
WHERE Sales_Manager = 'Pablo'
ORDER BY UnitPrice DESC
SQL Arrange records
SQL Arrange records in descending order | Image by Author

As you can see, the highest unit price appears on the top.

As of now, you see you are extracting all the rows based on some conditions, but there is actually no limit on how many rows you want to select. The next keyword is made for that.

SQL Command: LIMIT

In SQL, the LIMIT clause specifies how many records you want to retrieve. All you need to do is simply mention the number next to the keyword LIMIT, as shown below.

SELECT OrderID,
       Sales_Manager,
       Shipping_Address,
       UnitPrice
FROM Dummy_Sales_Data_v1
WHERE Sales_Manager = 'Pablo'
LIMIT 5
SQL LIMIT Clause
SQL LIMIT Clause | Image by Author

As you can see, only 5 records are now retrieved from the result set. This keyword is useful when dealing with large datasets, where retrieving all the rows from the result set is either unnecessary or affects the performance.

Remember that the LIMIT keyword comes at the end of your piece of code.

Last but not least.

GROUP BY and Data Aggregation

Until now, whatever you did is simply extracted the data. However, data analysis is not just about data extraction. There are many more commands, clauses, and functions in SQL which help you with that. At this beginner stage, you can focus on one simplest of them.

In SQL, GROUP BY groups the rows or records with the same value in the column specified in the GROUP BY clause. It is used when you want to group the rows and perform aggregate mathematical operations, e.g., addition, counting, or taking the average.

Within GROUP BY, you can use aggregate functions such as SUM(), MAX(), MIN(), AVG() and COUNT().

For example, suppose you want to see the total number of orders processed by each sales manager. This means you are essentially counting OrderID from the dataset belonging to each sales manager, right?

That means you need to group all the rows belonging to the same sales manager and then count how many order Ids belong to each. This is easily done using GROUP BY in SQL as below.

SELECT Sales_Manager,
       COUNT(OrderID)
FROM Dummy_Sales_Data_v1
GROUP BY Sales_Manager
SQL GROUP
SQL GROUP BY and COUNT | Image by Author

As you can see, it returns the number of orders processed by each sales manager. Similarly, you can use other aggregate functions.

That’s all!

Conclusion

This was a simple overview of what you can do using SQL as a data analyst or anyone working on data. Many more functions and commands enable you to do entire data analysis tasks using SQL. You can find them in Advanced SQL concepts and other SQL-related articles on my Medium page.

Below are some of the learnings and takeaways from this quick read —

  • ways to extract data from database using SQL
  • ways to use multiple conditions to select rows
  • ways to use aggregate functions on dataset columns
  • The additional resources to learn more about SQL

I hope you found this content useful.

Feel free to connect with me on Medium, LinkedIn, and be sure to Sign-up to my Email list to never miss another article on data science guides, tricks and tips, SQL, and Python.

Thank you for reading!

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Analytics Professional | I write about Data Science, Python and SQL

Follow me on Medium to get amazing tips on learning data science, changing job domains and winning job interviews. Join Medium Today with My referral link. https://medium.com/@17.rsuraj/membership

Also, Join my email list with 3K+ other readers https://medium.com/subscribe/@17.rsuraj

Responses From Readers

Clear

We use cookies essential for this site to function well. Please click to help us improve its usefulness with additional cookies. Learn about our use of cookies in our Privacy Policy & Cookies Policy.

Show details