Data is the new oil in this century. The database is the major element of a data science project. To generate actionable insights, the database must be centralized and organized efficiently. If a corrupted, unorganized, or redundant database is used, the results of the analysis may become inconsistent and highly misleading. So, we are going to learn how to normalize an RDBMS with the help of SQL programming. This article attempts to explain how to remove inconsistencies in relational databases with SQL using normalization and a dummy database, for example. Before delving deeper, one must know a few related terms.
Learning Objectives
This article was published as a part of the Data Science Blogathon.
Imagine a retail store wanting to store and manipulate data on their sales. Ideally, this data will contain the Item Code, Item Name, Price of each unit, Number of units sold, Category, and Date of Purchase. So, all these attributes are the names of the columns, and a particular row contains data on the purchase of a particular product on that day. Hence, here interrelated data has been collected and stored in rows and columns. This is what we call a “Relational Database.” An RDBMS (Relational Database Management System) is used to store this data.
If we go for a formal definition of a database, it can be said that it stores interrelated data from which users can efficiently retrieve information. Relational databases are structured, self-describing, and storing data according to predefined relationships.
But real-time data stored in this model can be messy most of the time, making it inconsistent and hard to work with. So, to easily manipulate and work with these data tables, we normalize the database.
Now, let’s delve deeper into the practical aspects of normalizing a database. But first, we must know a few terms to help us on the journey.
For more information on terms related to the type of database used in this article, click here.
This article mainly focuses on how to normalize a database. But first,
We normalize a database because:
We normalize the data if:
So, let’s use a practical example to understand how normalization works. Consider a hospital that maintains its database in the form:
|
One-to-one:
One-to-many:
Many-to-one:
Many-to-many:
Step 2: Once you know the relationships between the attributes, you need to normalize your data. To do this, you must convert your data to the best Normal Form (NF) that suits your business purpose. This article will look at normalizing a database using SQL (Structured Query Language). Click here to learn the basics and more about the SQL syntaxes used.
We can create the table in PostgreSQL using the following SQL code:
create table Patient_Information(
Patient_id varchar(10) NOT NULL,
P_Name varchar(100),
Age integer,
Marital_Status varchar(10),
Gender varchar(10),
Disease_Id varchar(10),
Disease varchar(100),
Category varchar(100),
Doctor_Name varchar(20),
Test_Id varchar(10),
Test varchar(40),
Price float,
PRIMARY KEY(Patient_Id)
);
Note: All SQL codes mentioned further in the blog are extended from those mentioned earlier.
To convert the first table to 1NF, we must ensure that every column consists only of a single data, and the attribute domain must not change. So, the relationship will become:
Patient_Information:
|
To convert into 2NF, the relation must be in 1NF, and no partial dependency should be present.So, let’s consider the table as follows:
Patient_Id | Disease_Id | Test_Id | Test | Doctor_Name |
P001 | HYTSH | T4 | Thyroxine | Dr. Rohit Shaw |
P001 | HYTSH | TSH | Thyroid Stimulating Hormone | Dr. Rohit Shaw |
P002 | HRTBLCK | ECG | Echocardiography | Dr. Rima Dhara |
P002 | HYPTSN | BP | Blood Pressure | Dr. Rima Dhara |
P003 | PCOD | USG | Ultrasonography | Dr. Shalini Thakur |
So, this part of the table stores the disease name with the disease’s id and the doctor’s id who specializes in treating that specific disease.
But we see that the doctor’s name only depends on the disease’s id, not the patient’s. Hence a partial dependency will exist. To remove this partial dependency, we divide the table into two:Patient_Info:
Patient_Id | Disease_Id | Test_Id | Test |
P001 | HYTSH | T4 | Thyroxine |
P001 | HYTSH | TSH | Thyroid Stimulating Hormone |
P002 | HRTBLCK | ECG | Echocardiography |
P002 | HYPTSN | BP | Blood Pressure |
P003 | PCOD | USG | Ultrasonography |
Disease_Info:
Disease_Id | Disease | Doctor_Name |
HYTSH | Hypothyroidism | Dr. Rohit Shaw |
HRTBLCK | Heart Blockage | Dr. Rima Dhara |
HYPTSN | Hypertension | Dr. Rima Dhara |
PCOD | Poly Cystic Ovarian Syndrome | Dr. Shalini Thakur |
We can do this in SQL using the code:
CREATE TABLE Patient_Info AS
SELECT Patient_Id,
Disease_Id,Test_Id,
TestFROM Patient_Information;
CREATE TABLE Disease_Info AS
SELECT Disease_Id, Disease,
Doctor_Name
FROM Patient_Information;
ALTER TABLE Patient_Info
ADD PRIMARY KEY (Patient_Id, Disease_Id, Test_Id);
ALTER TABLE Disease_Info
ADD PRIMARY KEY (Disease_Id);
The database or relation is in third normal form if it is in 2NF and no transitive dependency is present.
Consider this part of the main table:
Patient_Id | Disease_Id | Disease | Test_Id | Price |
P001 | HYTSH | Hypothyroidism | T4 | 400 |
P001 | HYTSH | Hypothyroidism | TSH | 350 |
P002 | HRTBLCK | Heart Blockage | ECG | 1100 |
P002 | HYPTSN | Hypertension | BP | 50 |
P003 | PCOD | Poly Cystic Ovarian Syndrome | USG | 1200 |
Now, the table stores more information:
To remove it and convert the table to 3NF, we follow the approach:Patient_Info:
Patient_Id | Disease_Id | Disease |
P001 | HYTSH | Hypothyroidism |
P001 | HYTSH | Hypothyroidism |
P002 | HRTBLCK | Heart Blockage |
P002 | HYPTSN | Hypertension |
P003 | PCOD | Poly Cystic Ovarian Syndrome |
Test_Price:
Test_Id | Test | Price |
T4 | Thyroxine | 400 |
TSH | Thyroid Stimulating Hormone | 350 |
ECG | Echocardiography | 1100 |
BP | Blood Pressure | 50 |
USG | Ultrasonography | 1200 |
We can achieve the following using the SQL code:
ALTER TABLE Patient_Info
DROP COLUMN Test_Id,
Test;ALTER TABLE Patient_Info
ADD COLUMN Disease;
INSERT INTO Patient_Info(Disease)
SELECT Disease,
FROM Disease_Info;CREATE TABLE Test_Price AS
SELECT Test_Id,
Test,
PriceFROM Patient_Information;
ALTER TABLE Test_Price
ADD PRIMARY KEY(Test_Id);
For the database to be in Boyce-Codd Normal Form, it must be in 3NF. Also, in BCNF, A must be a super key for a functional dependency from A to B. Consider the part from our main table:
Patient_Id | Category | Doctor_Name |
P001 | Endocrinology | Dr. Rohit Shaw |
P001 | Endocrinology | Dr. Rohit Shaw |
P002 | Cardiology | Dr. Rima Dhara |
P002 | Cardiology | Dr. Rima Dhara |
P003 | Gynecology | Dr. Shalini Thakur |
Here, we see that:
So, to convert the relation to BCNF, we divide the table into two:Patient_Doctor:
Patient_Id | Doctor_Id |
P001 | D110 |
P002 | D200 |
P003 | D201 |
Doctor_Info:
Doctor_Id | Doctor_Name | Category |
D110 | Dr. Rohit Shaw | Endocrinology |
D200 | Dr. Rima Dhara | Cardiology |
D201 | Dr. Shalini Thakur | Gynecology |
SQL Code:
ALTER TABLE Patient_Information
ADD COLUMN Doctor_Id DEFAULT Dr;
UPDATE Patient_Information
SET Doctor_Id='D110',
WHERE Doctor_Name='Dr. Rohit Shaw';
UPDATE Patient_Information
SET Doctor_Id='D200',
WHERE Doctor_Name='Dr. Rima Dhara';
UPDATE Patient_Information
SET Doctor_Id='D201',
WHERE Doctor_Name='Dr.Shalini Thakur';
CREATE TABLE Patient_Doctor AS
SELECT DISTINCT Patient_Id, Doctor_Id,
FROM Patient_Information;
ALTER TABLE Patient_Doctor
ADD PRIMARY KEY(Patient_Id, Doctor_Id);
CREATE TABLE Doctor_Info AS
SELECT DISTINCT Doctor_Id,Doctor_Name,Category,
FROM Patient_Information;
ALTER TABLE Doctor_Info
ADD PRIMARY KEY(Doctor_Id);
For a database to be in 4NF, it must be in BCNF, and no multivalued dependency should be present.
Consider another part of the main table:
|
Here, there is no relation between Marital_Status and Disease_Id.Both are independent of each other. Hence, a multivalued dependency exists. So, we divide the table as follows:Patient_MaritalInfo:
|
Patient_Disease:
|
SQL Code:
CREATE TABLE Patient_MaritalInfo AS
SELECT DISTINCT Patient_Id, Marital_Status,
FROM Patient_Information;
ALTER TABLE Patient_MaritalInfo
ADD PRIMARY KEY (Patient_Id),
CREATE TABLE Patient_Disease AS
SELECT Patient_Id, Disease_Id,
FROM Patient_Information;
ALTER TABLE Patient_Disease
ADD PRIMARY KEY(Patient_Id,Disease_Id);
Benefits we get from Normalizing this Data
The benefits we get from normalizing the dummy database are:
Hence, now we can see how a messy database can be reduced to these normal forms. Normalization helps reduce space and makes updating, deleting, and working with the data easier. Well, there are other normal forms viz:
But most databases can be organized with forms up to 4NF. Based on the business purpose, it is up to you to decide which normal form suits you best.
That’s it! You are all done with organizing your database. Now your database is efficient enough for further analysis. You can further transfer this database to a DBMS or from the SQL server; you can interact with your data using queries to meet the business requirements.
Here are some key takeaways from the blog:
I hope this blog helps in your learning journey. Meanwhile, you can check more articles related to databases here.
Have a great learning ahead!
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.