This article was published as a part of the Data Science Blogathon.
In this article, we will see how to use SQL statements for data analysis. Data analysis can be done on only single tables or on multiple tables. In this particular article, we will mainly focus on only a single table analysis. Nowadays SQL is very much popular and is a very important topic. In this article, we will do querying and analysis of the database. In this overall practical section, we use the MYSQL workbench to run our queries. Make sure to download the workbench and community server.
Let’s get started.
First, let us understand SQL.
SQL stands for Structured Query Language. It is a language that is used for relational databases to query and get data from the database. SQL lets you access and manipulates databases. Learning SQL is always fun and very easy. There are many flavors of SQL like MYSQL, SQLite, SQL Server, Postgre SQL, and Oracle.
Out of these, we use MYSQL for now. All these flavors are very much similar to each other with just little differences. And we use a workbench editor for MYSQL.
Now for practice, we use the mavenmovies database. I uploaded the mavenmovies database SQL file to my GitHub repository. Please head into it and download it. Run the SQL file in the workbench to get a database with tables.
In this database, there are 16 related databases. And mainly it contains the details about customers, Business, and inventory. Each table has rows and columns. Rows refer to records and columns refer to the fields. Tables contain primary keys and Foreign keys. Primary keys are always unique. The primary key identifies a unique record in the database. Whereas Foreign keys are non-unique.
The elements of the SQL statement are called “BIG 6” elements. These elements form the structure of an SQL statement. These are,
SELECT – It identifies the column or columns that you need to query
FROM – Identifies the table(s) that you are querying from
WHERE – To filter the results based on some conditions. It is optional.
GROUP BY – To group the data in the results. It is optional
HAVING – To filter results by group filtering. It is optional
ORDER BY – To order the results either in ascending or descending order. It is optional
An SQL statement looks like this,
SELECT column-name FROM table-name WHERE logical-condition GROUP BY column-name HAVING logical-condition ORDER BY column-name
This like operator is very useful when you need to fetch records but you know only little about the words in it. It allows you to use a pattern like how it actually was.
Examples :
WHERE name LIKE ‘Denise%’ -- records where the name starts with ‘Denise’, with any number of characters after WHERE description LIKE ‘%fancy%’ -- records that contain ‘fancy’, with any characters before OR after WHERE name LIKE ‘%Johnson’ -- records that end with ‘Johnson’, with any number of characters before WHERE first_name LIKE ‘_erry’ -- records that end with ‘erry’, with exactly one character before (i.e. Terry, Jerry)
COUNT() – count the records
COUNT DISTINCT() – count the distinct values
MIN() – to find the smallest value
MAX() – to find the largest value
AVG() – to find the average value
SUM() – to find the sum of values
Fetch the list of the first
name, last name, and email of
each of our customers
SELECT first_name,last_name ,email from customer;
Fetch the records of films and see if there are
any other rental durations
SELECT DISTINCT rental_duration from film;
Fetch records of customers whose first name is Mary.
SELECT * from customer where first_name="mary";
Fetch payment records of the first 100 customers where the amount is greater than 5 and payments that are done after 6th January 2006.
SELECT * from payment where amount>5 AND payment_date>'2006-01-06' AND customer_id<=100;
Fetch the list of films that
include a Behind the Scenes
special feature
select title,special_features from film where special_features like "%Behind the Scenes%";
Fetch the count of
titles sliced by rental duration
select rental_duration , count(film_id) from film GROUP BY rental_duration;
Fetch the count of
films, along with the average,
min, and max rental rate,
grouped by replacement cost
select replacement_cost, count(film_id) as number_of_films, avg(rental_rate) as avg_rental_rate,min(rental_rate) as cheapest_rental,max(rental_rate) as most_expensive_rental from film group by replacement_cost;
Fetch the list of
customer_ids with less than 15
rentals all-time
select customer_id, count(rental_id) from rental group by customer_id having count(rental_id)<15;
Fetch the list of all film
titles along with their lengths
and rental rates, and sort them
from longest to shortest
select title,length,rental_rate from film order by length DESC;
CASE statements are very helpful when working with databases. These CASE statements allow us to make some conditional logic to specify how our output should be. CASE statements are defined using WHEN and THEN. The top WHEN/THEN pair executes
first. If true, the CASE is complete. If
not, it continues to test each
condition until the END statement is reached. If multiple conditions
are true, then the output will be based on the first condition, since it’s the
first to be evaluated (top to bottom). If none of the given conditions are true then the NULL value is returned as output.
CASE statements can also work with the logical operators that we used with WHERE statements. These logical operators include:
= | Equals |
Does NOT Equal | |
> | Greater Than |
< | Less Than |
>= | Greater Than Or Equal To |
<= | Less Than Or Equal To |
BETWEEN | A Range Between Two Values
LIKE Matching a Pattern Like This |
LIKE | Matching a Pattern Like This |
IN() | Equals One of These Values |
Fetch the list of first and
last names of all customers, and
label them as either ‘store 1
active’, ‘store 1 inactive’, ‘store 2
active’, or ‘store 2 inactive’
select distinct first_name, last_name, case when store_id =1 and active=0 then "store 1 inactive" when store_id =1 and active=1 then "store 1 active" when store_id =2 and active=0 then "store 2 inactive" when store_id =2 and active=1 then "store 2 active" else "no sense" end as Active_status from customer;
Create a table dividing the length of the film providing the length.
SELECT distinct length, CASE WHEN length<60 THEN 'UNDER 1 HR' WHEN length BETWEEN 60 AND 90 THEN '1 - 1.5' WHEN length>90 THEN 'OVER 1.5 HRS' ELSE ' UH NO... CHECK LOGIC' END AS length_bucket FROM film;
Create a table that contains the film title and specify whether the movie is suitable for the recommendation. If not please specify the reason for it.
select distinct title, case when rental_duration<=4 then 'rental too short' when rental_rate>=3.99 then'too expensive' when rating in('tNC-17', 'R') then 'too adult' when length not between 60 and 90 then ' too short or too long' when description like '%Shark%' then 'nope has sharks' else 'great recommendation' END as fit_for_recommendation from film;
Create a table to
count the number of customers
broken down by store_id (in
rows), and active status (in
columns)
select store_id, count(case when active=0 then active else null end) as inactive_count, count(case when active=1 then active else null end) as active_count from customer group by store_id;
SQL is very important to learn. SQL is a very useful and popular technology that many companies are using nowadays. It is a very easy and understandable language. You just need to understand the logic in the query. If you understand the language then it will be very easy for you to write the query to the given question. Overall in this article, we have seen,
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.