SQL is a language every analyst and data scientist should know. There’s no escaping from this. You will be peppered with SQL questions in your analytics or data science interview rounds, especially if you’re a fresher in this field.
If you’ve been putting off learning SQL recently, it’s time to get into action and start getting your hands dirty. You would have to learn about databases to work with data so why not start your SQL journey today?
I’ve personally been working with SQL for a while and can attest to how useful it is, especially in these golden data-driven times. SQL is a simple yet powerful language that helps us manage and query data directly from a database, without having to copy it first.
It is also very easy to understand because of the various clauses that are similar to those used in the English language. So writing SQL commands will be a piece of cake for you!
And given the proliferation of data all over the world, every business is looking for professionals who are proficient in SQL. So once you add SQL skill to your resume, you will be a hotshot commodity out in the market. Great, but where to begin?
There are many different database systems out there, but the simplest and easiest to work with is SQLite. It is fast, compact, and stores data in an easy to share file format. It is used inside countless mobile phones, computers, and various other applications used by people every day. And the most amazing part, it comes bundled with Python! Heck, there is a reason why giants like Facebook, Google, Dropbox, and others use SQLite!
In this article, we will explore how to work with databases in Python using SQLite and look into the most commonly used SQL commands. So let’s start by asking the very basic question – what on earth is a database?
I assume you are already familiar with the Python language. If not, head over here to learn it for free! And if you’re a beginner in analytics, you should check out the free ‘Introduction to Business Analytics‘ course as well.
A database is an organized collection of interrelated data stored in an electronic format.
It is structured so as to provide easy access and manipulation of the stored data. Organizations use it to store any information that may be necessary for the decision-making process. The major advantages of databases over normal file storage systems are that it reduces data redundancy to a large extent, facilitates sharing of data among various users, and ensures the security of data which may be of immense importance to an organization.
While there are various types of databases and their choice of usage varies from organization to organization, the most basic and widely used is the Relational Database model. It organizes the data into tables where each row holds a record and is called a tuple. And each column represents an attribute for which each record usually holds a value.
A Relational database breaks down different aspects of a problem into different tables so that storing them and manipulating them becomes an easy task. For example, an e-commerce website maintaining a separate table for products and customers will find it more useful for doing analytics than saving all of the information in the same table.
Database Management System (DBMS) is a software that facilitates users and different applications to store, retrieve, and manipulate data in a database. Relational Database Management System or RDBMS is a DBMS for relational databases. There are many RDBMS like MYSQL, Postgres, SQL Server, etc. which use SQL for accessing the database.
But wait – we’ve been hearing the word ‘SQL’ since the beginner. What in the world is SQL?
SQL stands for Structured Query Language. It is a querying language designed for accessing and manipulating information from RDBMS.
SQL lets us write queries or sets of instructions to either create a new table, manipulate data or query on the stored data. Being a data scientist, it becomes imperative for you to know the basics of SQL to work your way around databases because you can only perform analysis if you can retrieve data from your organization’s database!
SQLite is a relational database management system based on SQL. It is designed for embedded devices that require fast and reliable data. It is serverless, lightweight, and requires zero-configuration. Also, it reads and writes directly to a disk file that can be easily copied and is platform-independent.
SQLite stores data in variable-length records which requires less memory and makes it run faster. It is designed for improved performance, reduced cost, and optimized for concurrency.
The sqlite3 module facilitates the use of SQLite databases with Python. In this article, I will show you how to work with an SQLite database in Python. You don’t need to download SQLite as it is shipped by default along with Python version 2.5 onwards!
The first step to working with your database is to create a connection with it. We can do this by using the connect() method that returns a Connection object. It accepts a path to the existing database. If no database exists, it will create a new database on the given path.
The next step is to generate a Cursor object using the cursor() method which allows you to execute queries against a database:
You are now ready to execute queries against the database and manipulate the data. But after we have done that, it is very important to do two things:
Now that we have created a database, it is time to create a table to store values.
Let’s create a table that stores values for a customer of an e-commerce website. It stores values like customer name, the id of the product bought, name, gender, age, and the city the customer is from.
A table in SQL is created using the CREATE TABLE command. Here I am going to create a table called Customer with the following attributes:
The Primary key is an attribute or set of attributes that can determine individual records in a table.
Defining an attribute Not Null will make sure there is a value given to the attribute (otherwise it will give an error).
SQL keywords are case-insensitive so you can write the commands in UPPERCASE IF YOU WANT!
Any SQL command can be executed using the execute() method of the Cursor object. You just need to write your query inside quotes and you may choose to include a ; which is a requirement in some databases but not in SQLite. But it is always good practice so I will include it with my commands.
So, using the execute() method, we can create our table as shown here:
Perfect! Now that we have our table, let’s add some values to it.
A database table is of no use without values. So, we can use the INSERT INTO SQL command to add values to the table. The syntax for the command is as follows:
INSERT INTO table_name (column1, column2, column3, …)
VALUES (value1, value2, value3, …);
But if we are adding values for all the columns in the table, we can just simplify things and get rid of the column names in the SQL statement:
INSERT INTO table_name
VALUES (value1, value2, value3, …);
Like I said before, we can execute SQL statements using the execute() method. So let’s do that!
What if we want to write multiple Insert commands in a single go? We could use the executescript() method instead:
Or just simply use the executemany() method without having to repeatedly write the Insert Into command every time! executemany() actually executes an SQL command using an iterator to yield the values:
These methods are not limited to the Insert Into command and can be used to execute any SQL statement.
Now that we have a few values in our table, let’s try to fetch those values from the database.
For fetching values from the database, we use the SELECT command and the attribute values we want to retrieve:
SELECT column1, column2, … FROM table_name;
If you instead wanted to fetch values for all the attributes in the table, use the * character instead of the column names:
SELECT * FROM table_name;
To fetch only a single record from the database, we can use the fetchone() method:
To fetch multiple rows, you can execute a SELECT statement and iterate over it directly using only a single call on the Cursor object:
But a better way of retrieving multiple records would be to use the fetchall() method which returns all the records in a list format:
Awesome! We now know how to insert values into a table and fetch those values. But since data scientists love working with Pandas dataframes, wouldn’t it be great to somehow load the values from the database directly into a dataframe?
Yes, there is and I am going to show you how to do that. But first, I am going to show you how to store your Pandas dataframe into a database, which is obviously a better way to store your data!
Pandas let us quickly write our data from a dataframe into a database using the to_sql() method. The method takes the table name and Connection object as its arguments.
I will use the dataframes from the Food Demand Forecasting hackathon on the DataHack platform which has three dataframes: order information, meal information, and center fulfillment information.
We now have three tables in the database for each dataframe. It is easy to check them using the read_sql_query() method which we will explore in the next section where we will see how to load a database into a Pandas dataframe.
The read_sql_query() method of the Pandas library returns a DataFrame corresponding to the result of an SQL query. It takes as an argument the query and the Connection object to the database.
We can check the values in the tables using the real_sql_query() method:
Perfect! Now let’s try to run some queries on these tables and understand a few important SQL commands that will come in handy when we try to analyze data from the database.
The first important clause is the WHERE clause. It is used to filter the records based on a condition:
SELECT column1, column2, … FROM table_name
WHERE condition;
We can always use the * character if we want to retrieve values for all the columns in the table.
We can use it to query and retrieve only the Indian cuisine meals from the meal table:
Here, we have retrieved all the 12 records that matched our given condition. But what if we only wanted to retrieve the top 5 records that satisfy our condition? Well, we could use the LIMIT clause in that case.
LIMIT clause returns only the specified number of records and is useful when there are a large number of records in the table.
Here, we returned only the top 5 records from those that matched our given condition.
GROUP BY statement separates rows into different groups based on an attribute and can be used to apply an aggregate function (COUNT, MIN, MAX, SUM) on the resultant groups:
SELECT column1, column2, … FROM table_name
GROUP BY column_name;
We can use the GROUP BY statement to compare the number of orders for meals that received email promotions to those that did not. We will group the records on the emailer_for_promotion column and apply the COUNT aggregate function on the id column since it contains unique values. This will return the total number of rows belonging to each group:
Here we can see that there were more orders for meals that did not have an email promotion. But if we want to order our result, we can use the ORDER BY statement.
ORDER BY clause is used to sort the result into ascending or descending order using the keywords ASC or DESC respectively. By default, it sorts the records in ascending order:
SELECT column1, column2, … FROM table_name
ORDER BY column1, column2, … ASC|DESC;
Here I have combined two clauses: Group By and Order By. The Group By clause groups the values based on the email_for_promotion attribute and the Order By attribute orders output based on the count of the rows in each group. We can combine a bunch of clauses to extract more precise information from the database.
To sort the result in descending order, just type in the keyword DESC:
The HAVING clause is used to query on the results of another query run on the database. It applies a filter on the groups returned by a previous query. It should not be confused with the WHERE clause that applies the filter condition before grouping.
HAVING is used to filter records after grouping. Hence, the HAVING clause is always used after the GROUP BY statement:
SELECT column1, column2, … FROM table_name
GROUP BY column_name
HAVING condition;
Here, we returned only those groups that had a count of more than 15.
Join clause is a very interesting and important SQL clause. It retrieves and combines data from multiple tables on the same query based on a common attribute:
SELECT column1, column2, … FROM table1
INNER JOIN table2
ON table1.column_name= table2.column_name;
In our database, we can retrieve data from the centers and train tables since they share the common attribute center_id:
The INNER JOIN clause combines the two tables, train and centers, on the common attribute center_id specified by the statement train.center_id = centers.center_id. This means records having the same center_id in both the columns will concatenate horizontally.
This way we were able to retrieve the center_type, from centers, and the corresponding total number of orders from the train table. The . operator is very important here as it lets the database know which table the column belongs to.
If you want to know more about joins, I suggest going through this excellent article.
Now, let’s say there was a glitch in the system and the base price for all the orders was saved as 10 more than the actual amount. We want to make that update in the database as soon as we found the mistake.
In such a situation, we will use the UPDATE SQL command.
The UPDATE command is used to modify existing records in a table. However, always make you sure you provide which records need to be updated in the WHERE clause otherwise all the records will be updated!
UPDATE table_name
SET column1 = value1, column2 = value2, …
WHERE condition;
Let’s have a look at the table before the update:
Decrease all the base prices by 10 for orders containing meals that had an email promotion:
Finally, here’s a look at the updated table:
All the records have been correctly updated!
Now, suppose center number 11 no longer wants to continue to do business with the company. We have to delete their records from the database. We use the DELETE statement. However, make sure to use the WHERE clause otherwise all the records will be deleted from the table!
DELETE FROM table_name
WHERE condition;
Perfect! We no longer have any records corresponding to center 11.
Finally, if we had to drop an entire table from the database and not just its records, we use the DROP TABLE statement. But be extra careful before you run this command because all the records in the table along with the table structure will be lost after this!
Drop table table_name;
SQL is a super important language to learn as an analyst or a data science professional. And it’s not a difficult language to pick up, as we’ve seen in this article.
We saw how to create a table in SQL and how to add values to it. We covered some of the most basic and heavily used querying commands like where, having, group by, order by, and joins to retrieve records in the database. Finally, we covered some manipulation commands like update and delete to make changes to tables in the database.
This is by no means an exhaustive guide on SQL and I suggest going through the below great resources to build upon the knowledge that you gathered in this article:
Fantastic
Thanks!
Hi Anirud, Very nicely written tutorial. I had one query i.e after using the conn.close() command how do I open the DB connection? Is there any command for reopening the closed DB connection. Thanks Raji
Hi Raji Thanks, I am glad you liked it. To reconnect to the database, just use the sqlite3.connect() command which you used for creating the database.