How to Normalize Relational Databases With SQL Code?

Sanjana Bhakat Last Updated : 31 Oct, 2023

Introduction

Data is often called the new oil of this century, and the database sits at the center of most data science projects. To generate actionable insights, the data must be centralized and organized efficiently. If a corrupted, disorganized, or redundant database is used, the results of the analysis can be inconsistent and highly misleading. In this article, we will learn how to normalize a relational database with the help of SQL, using a dummy database as a running example to show how normalization removes inconsistencies. Before delving deeper, one must know a few related terms.

Learning Objectives

Know why normalizing a database is important
Understand the normal forms used in the industry
Learn how to convert a given database into the desired Normal Form using SQL.

This article was published as a part of the Data Science Blogathon.

What is a Relational Database?

Imagine a retail store wanting to store and manipulate data on their sales. Ideally, this data will contain the Item Code, Item Name, Price of each unit, Number of units sold, Category, and Date of Purchase. So, all these attributes are the names of the columns, and a particular row contains data on the purchase of a particular product on that day. Hence, here interrelated data has been collected and stored in rows and columns. This is what we call a “Relational Database.” An RDBMS (Relational Database Management System) is used to store this data.

Formally, a database stores interrelated data from which users can efficiently retrieve information. Relational databases are structured and self-describing, and they store data according to predefined relationships.

But real-world data stored in this model is often messy, which makes it inconsistent and hard to work with. So, to make these data tables easier to manipulate and work with, we normalize the database.

Terms that are Commonly used in an RDBMS

Now, let’s delve deeper into the practical aspects of normalizing a database. But first, we must know a few terms to help us on the journey.

Data Dictionary: Contains all information about the objects in a database.
Metadata: Contains information about what the data stores, sources of data, etc.
Schema: The schema is the design of the database, i.e., how the data is structured and how it appears to users at different levels.
Primary Key: The primary key is the attribute used to uniquely identify a tuple (row) in a table. If it consists of more than one attribute, it is called a composite key. It cannot contain duplicate or null values.
Candidate Key: A candidate key is a minimal attribute or combination of attributes that can uniquely identify a tuple. The primary key is chosen from among the candidate keys.
Super Key: A super key is any set of attributes that uniquely identifies a tuple in a relation. Unlike a candidate key, it may include extra attributes that are not needed for uniqueness, and those individual attributes may contain repeated values.
Foreign Key: A foreign key is an attribute (or set of attributes) in one table that refers to the primary key of another table.
Functional Dependency: A functional dependency is a constraint between two attributes or two sets of attributes in a database. A → B means that A determines B, i.e., B is functionally dependent on A.
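The key-related terms above map directly onto SQL constraints. The following is a minimal PostgreSQL sketch; the tables doctor_demo and appointment_demo, their columns, and the idea of a license number are hypothetical and only serve to illustrate the terms.

```sql
-- Hypothetical demo tables; all names here are assumptions for illustration.
CREATE TABLE doctor_demo (
    doctor_id   varchar(10)  PRIMARY KEY,       -- primary key: unique and NOT NULL
    license_no  varchar(20)  NOT NULL UNIQUE,   -- a second candidate key
    doctor_name varchar(100) NOT NULL
);

CREATE TABLE appointment_demo (
    patient_id varchar(10) NOT NULL,
    doctor_id  varchar(10) NOT NULL
        REFERENCES doctor_demo (doctor_id),     -- foreign key into doctor_demo
    visit_date date NOT NULL,
    PRIMARY KEY (patient_id, doctor_id, visit_date)  -- composite primary key
);
```

In this sketch, doctor_id → doctor_name is a functional dependency, and (doctor_id, doctor_name) is a super key but not a candidate key, because doctor_name is not needed for uniqueness.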

This article mainly focuses on how to normalize a database. But first,

Why and When Should we Normalize a Relational Database?

We normalize a database because:

It makes it easier to organize the database,
It sequentially decomposes tables and hence, reduces data duplication,
It reduces the storage space taken by the database,
It removes specific anomalies and makes updating, deleting, and inserting data easy.

We normalize the data if:

The database is messy and takes up much storage space.
Updating or deleting a record from the database leads to inconsistency.
The database has multiple values in a single cell.
No attribute or column in the database can identify every tuple uniquely.
Duplicate or repetitive values are present in the database.
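Some of these symptoms can be checked with a query before deciding to normalize. The following is a rough PostgreSQL sketch that looks for fully duplicated rows; the scratch table visits_demo and its sample rows are assumptions made up for the example.

```sql
-- Hypothetical scratch table containing one repeated row.
CREATE TEMP TABLE visits_demo (
    patient_id varchar(10),
    disease_id varchar(10),
    test_id    varchar(10)
);

INSERT INTO visits_demo VALUES
    ('P001', 'HYTSH',  'T4'),
    ('P001', 'HYTSH',  'T4'),   -- exact duplicate of the row above
    ('P002', 'HYPTSN', 'BP');

-- Rows that occur more than once: a sign that no column set is unique yet.
SELECT patient_id, disease_id, test_id, COUNT(*) AS copies
FROM visits_demo
GROUP BY patient_id, disease_id, test_id
HAVING COUNT(*) > 1;
```

For the sample rows, the query reports the repeated (P001, HYTSH, T4) row with a count of 2.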

How to Convert a Relational Database to Different Normal Forms?

So, let’s use a practical example to understand how normalization works. Consider a hospital that maintains its database in the form:

Step 1: First, you need to identify the relationships that exist between each attribute of the database. Here the relationships are:

One-to-one:

Patient_Id and Patient_Name
Disease_Id and Disease
Test_Id and Test

One-to-many:

Category and Disease_Id/Disease
Category and Doctor_Name

Many-to-one:

Doctor_Name and Category

Many-to-many:

Doctor_Name and Patient_Id/ Patient_Name
Test and Patient_Name/ Patient_Id
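Relationships like these can be checked against the data instead of being read off by eye. The sketch below is one possible way to do it in PostgreSQL; the scratch table fd_demo is hypothetical, and its rows reuse disease and doctor values from this article's dummy data.

```sql
-- Hypothetical scratch table with disease and doctor columns.
CREATE TEMP TABLE fd_demo (
    disease_id  varchar(10),
    disease     varchar(100),
    doctor_name varchar(100)
);

INSERT INTO fd_demo VALUES
    ('HYTSH',   'Hypothyroidism', 'Dr. Rohit Shaw'),
    ('HYTSH',   'Hypothyroidism', 'Dr. Rohit Shaw'),
    ('HRTBLCK', 'Heart Blockage', 'Dr. Rima Dhara'),
    ('HYPTSN',  'Hypertension',   'Dr. Rima Dhara');

-- disease_id -> disease holds if no id maps to more than one name.
-- Any row returned here is a violation.
SELECT disease_id
FROM fd_demo
GROUP BY disease_id
HAVING COUNT(DISTINCT disease) > 1;

-- The "many" side of a relationship: doctors linked to several diseases.
SELECT doctor_name, COUNT(DISTINCT disease_id) AS diseases
FROM fd_demo
GROUP BY doctor_name
HAVING COUNT(DISTINCT disease_id) > 1;
```

A check like this only shows that a dependency holds in the current rows; whether it is a real business rule still has to be confirmed with the people who own the data.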

Step 2: Once you know the relationships between the attributes, you need to normalize your data. To do this, convert your data to the Normal Form (NF) that best suits your business purpose. This article uses SQL (Structured Query Language) to normalize the database.

We can create the table in PostgreSQL using the following SQL code:

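CREATE TABLE Patient_Information (
    Patient_Id     varchar(10) NOT NULL,
    P_Name         varchar(100),
    Age            integer,
    Marital_Status varchar(10),
    Gender         varchar(10),
    Disease_Id     varchar(10) NOT NULL,
    Disease        varchar(100),
    Category       varchar(100),
    Doctor_Name    varchar(20),
    Test_Id        varchar(10) NOT NULL,
    Test           varchar(40),
    Price          float,
    -- A patient can have several diseases and tests, so Patient_Id alone
    -- is not unique; the key matches the one used for Patient_Info later.
    PRIMARY KEY (Patient_Id, Disease_Id, Test_Id)
);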

Note: Each SQL snippet later in this blog builds on the ones before it.

First Normal Form or 1NF

To convert the first table to 1NF, we must ensure that every cell holds only a single (atomic) value and that all values in a column come from the same domain. So, the relation will become:

Patient_Information: (table of the 1NF relation, with one row per patient, disease, and test)

If the database is not in 1NF, the design is considered a bad database design. So, one must apply transformations to bring a database to 1NF before applying further normalization techniques.
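As an illustration of such a transformation, the sketch below splits a cell that holds several values into one row per value. It is a minimal PostgreSQL example, not the article's own method: the table raw_tests_demo and its comma-separated test_ids column are hypothetical stand-ins for a table that violates 1NF.

```sql
-- Hypothetical un-normalized layout: several test ids packed into one cell.
CREATE TEMP TABLE raw_tests_demo (
    patient_id varchar(10),
    test_ids   varchar(100)
);

INSERT INTO raw_tests_demo VALUES
    ('P001', 'T4, TSH'),
    ('P003', 'USG');

-- One row per (patient, test): every cell now holds a single value.
SELECT r.patient_id, trim(t.test_id) AS test_id
FROM raw_tests_demo AS r
CROSS JOIN LATERAL unnest(string_to_array(r.test_ids, ',')) AS t(test_id)
ORDER BY r.patient_id, test_id;
```

Here string_to_array breaks the cell at each comma, unnest turns the resulting array into rows, and trim removes the stray spaces around each id.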

Now we insert data into the table using the code:

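INSERT INTO Patient_Information
    (Patient_Id, P_Name, Age, Marital_Status, Gender, Disease_Id,
     Disease, Category, Doctor_Name, Test_Id, Test, Price)
VALUES
    ('P001', 'Anna Smith', 18, 'U', 'F', 'HYTSH',
     'Hypothyroidism', 'Endocrinology', 'Dr. Rohit Shaw', 'T4', 'Thyroxine', 400);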

Second Normal Form or 2NF

To convert into 2NF, the relation must be in 1NF, and no partial dependency should be present, i.e., no non-prime attribute may depend on only a part of a composite key. So, let’s consider the table as follows:

Patient_Id	Disease_Id	Test_Id	Test	Doctor_Name
P001	HYTSH	T4	Thyroxine	Dr. Rohit Shaw
P001	HYTSH	TSH	Thyroid Stimulating Hormone	Dr. Rohit Shaw
P002	HRTBLCK	ECG	Echocardiography	Dr. Rima Dhara
P002	HYPTSN	BP	Blood Pressure	Dr. Rima Dhara
P003	PCOD	USG	Ultrasonography	Dr. Shalini Thakur

So, this part of the table stores, for each patient, the disease’s id, the tests taken, and the name of the doctor who specializes in treating that specific disease.

The first advantage we get here is that we need not store long names of diseases in the patient’s information table.
Secondly, by knowing the disease’s id, we can easily find which doctor is or can be allocated to the patient.
Primary key: Patient_Id + Disease_Id + Test_Id (a patient can take several tests for the same disease, so Patient_Id + Disease_Id alone is not unique).

But we see that the doctor’s name depends only on the disease’s id, not on the patient’s id or the test’s id. Hence, a partial dependency exists. (The test name similarly depends only on Test_Id; it is moved to its own table in the next section.) To remove this partial dependency, we divide the table into two:

Patient_Info:

Patient_Id	Disease_Id	Test_Id	Test
P001	HYTSH	T4	Thyroxine
P001	HYTSH	TSH	Thyroid Stimulating Hormone
P002	HRTBLCK	ECG	Echocardiography
P002	HYPTSN	BP	Blood Pressure
P003	PCOD	USG	Ultrasonography

Disease_Info:

Disease_Id	Disease	Doctor_Name
HYTSH	Hypothyroidism	Dr. Rohit Shaw
HRTBLCK	Heart Blockage	Dr. Rima Dhara
HYPTSN	Hypertension	Dr. Rima Dhara
PCOD	Poly Cystic Ovarian Syndrome	Dr. Shalini Thakur

We can do this in SQL using the code:

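-- Patient-level rows: which patient took which test for which disease.
CREATE TABLE Patient_Info AS
SELECT Patient_Id, Disease_Id, Test_Id, Test
FROM Patient_Information;

-- One row per disease; DISTINCT removes repeats so Disease_Id can be a key.
CREATE TABLE Disease_Info AS
SELECT DISTINCT Disease_Id, Disease, Doctor_Name
FROM Patient_Information;

ALTER TABLE Patient_Info
    ADD PRIMARY KEY (Patient_Id, Disease_Id, Test_Id);

ALTER TABLE Disease_Info
    ADD PRIMARY KEY (Disease_Id);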
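A split like this is only safe if joining the two parts gives back exactly the original rows. The following PostgreSQL sketch shows one way to check that on scratch copies; the three _demo tables are hypothetical, and their rows reuse values from the tables above.

```sql
-- Scratch copy of the rows before the split.
CREATE TEMP TABLE before_split_demo (
    patient_id  varchar(10),
    disease_id  varchar(10),
    test_id     varchar(10),
    doctor_name varchar(100)
);

INSERT INTO before_split_demo VALUES
    ('P001', 'HYTSH',   'T4',  'Dr. Rohit Shaw'),
    ('P001', 'HYTSH',   'TSH', 'Dr. Rohit Shaw'),
    ('P002', 'HRTBLCK', 'ECG', 'Dr. Rima Dhara'),
    ('P002', 'HYPTSN',  'BP',  'Dr. Rima Dhara');

-- The two projections produced by the split.
CREATE TEMP TABLE patient_part_demo AS
SELECT DISTINCT patient_id, disease_id, test_id FROM before_split_demo;

CREATE TEMP TABLE disease_part_demo AS
SELECT DISTINCT disease_id, doctor_name FROM before_split_demo;

-- Original rows that the join fails to rebuild (expect no rows).
SELECT * FROM before_split_demo
EXCEPT
SELECT p.patient_id, p.disease_id, p.test_id, d.doctor_name
FROM patient_part_demo AS p
JOIN disease_part_demo AS d USING (disease_id);

-- Rows the join invents that were not in the original (expect no rows).
SELECT p.patient_id, p.disease_id, p.test_id, d.doctor_name
FROM patient_part_demo AS p
JOIN disease_part_demo AS d USING (disease_id)
EXCEPT
SELECT * FROM before_split_demo;
```

Both queries come back empty here because disease_id determines doctor_name in the sample rows, which is the condition that makes the decomposition lossless.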

Third Normal Form or 3NF

The database or relation is in third normal form if it is in 2NF and no transitive dependency is present.

Consider this part of the main table:

Patient_Id	Disease_Id	Disease	Test_Id	Price
P001	HYTSH	Hypothyroidism	T4	400
P001	HYTSH	Hypothyroidism	TSH	350
P002	HRTBLCK	Heart Blockage	ECG	1100
P002	HYPTSN	Hypertension	BP	50
P003	PCOD	Poly Cystic Ovarian Syndrome	USG	1200

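Here:

Primary key: Patient_Id + Disease_Id + Test_Id (Patient_Id + Disease_Id alone repeats for P001, so it cannot identify a row).
But the attribute Price depends only on the attribute Test_Id, not on the patient or the disease. If Test_Id is treated as a non-prime attribute, this is a transitive dependency (key → Test_Id → Price); with Test_Id in the key, it is strictly a partial dependency. In both cases, the test details belong in a table of their own.

To remove it and convert the table to 3NF, we follow the approach:

Patient_Info: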

Patient_Id	Disease_Id	Disease
P001	HYTSH	Hypothyroidism
P001	HYTSH	Hypothyroidism
P002	HRTBLCK	Heart Blockage
P002	HYPTSN	Hypertension
P003	PCOD	Poly Cystic Ovarian Syndrome

Test_Price:

Test_Id	Test	Price
T4	Thyroxine	400
TSH	Thyroid Stimulating Hormone	350
ECG	Echocardiography	1100
BP	Blood Pressure	50
USG	Ultrasonography	1200

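We can achieve this using the following SQL code:

-- Test details move to their own table, so drop them from Patient_Info.
-- (PostgreSQL also drops the primary key that included Test_Id.)
ALTER TABLE Patient_Info
    DROP COLUMN Test_Id,
    DROP COLUMN Test;

-- Add the disease name and fill it in from Disease_Info.
ALTER TABLE Patient_Info
    ADD COLUMN Disease varchar(100);

UPDATE Patient_Info AS p
SET Disease = d.Disease
FROM Disease_Info AS d
WHERE p.Disease_Id = d.Disease_Id;

-- One row per test, with its name and price.
CREATE TABLE Test_Price AS
SELECT DISTINCT Test_Id, Test, Price
FROM Patient_Information;

ALTER TABLE Test_Price
    ADD PRIMARY KEY (Test_Id);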

Boyce-Codd Normal Form or 3.5NF or BCNF

For the database to be in Boyce-Codd Normal Form, it must be in 3NF. Also, in BCNF, for every non-trivial functional dependency A → B, A must be a super key. Consider the part from our main table:

Patient_Id	Category	Doctor_Name
P001	Endocrinology	Dr. Rohit Shaw
P001	Endocrinology	Dr. Rohit Shaw
P002	Cardiology	Dr. Rima Dhara
P002	Cardiology	Dr. Rima Dhara
P003	Gynecology	Dr. Shalini Thakur

Here, we see that:

Primary Key: Patient_Id + Category (the combination that identifies each distinct tuple in the table). That means the prime attributes here are Patient_Id and Category.
But a dependency Doctor_Name → Category exists, meaning a prime attribute depends on a non-prime attribute. Since Doctor_Name is not a super key, this type of dependency is not allowed in BCNF.

So, to convert the relation to BCNF, we divide the table into two:

Patient_Doctor:

Patient_Id	Doctor_Id
P001	D110
P002	D200
P003	D201

Doctor_Info:

Doctor_Id	Doctor_Name	Category
D110	Dr. Rohit Shaw	Endocrinology
D200	Dr. Rima Dhara	Cardiology
D201	Dr. Shalini Thakur	Gynecology

SQL Code:

-- New id column; 'Dr' is only a placeholder until a real id is assigned.
ALTER TABLE Patient_Information
    ADD COLUMN Doctor_Id varchar(10) DEFAULT 'Dr';

-- Assign one id per doctor.
UPDATE Patient_Information
SET Doctor_Id = 'D110'
WHERE Doctor_Name = 'Dr. Rohit Shaw';

UPDATE Patient_Information
SET Doctor_Id = 'D200'
WHERE Doctor_Name = 'Dr. Rima Dhara';

UPDATE Patient_Information
SET Doctor_Id = 'D201'
WHERE Doctor_Name = 'Dr. Shalini Thakur';

-- Which doctor treats which patient.
CREATE TABLE Patient_Doctor AS
SELECT DISTINCT Patient_Id, Doctor_Id
FROM Patient_Information;

ALTER TABLE Patient_Doctor
    ADD PRIMARY KEY (Patient_Id, Doctor_Id);

-- One row per doctor, with name and category.
CREATE TABLE Doctor_Info AS
SELECT DISTINCT Doctor_Id, Doctor_Name, Category
FROM Patient_Information;

ALTER TABLE Doctor_Info
    ADD PRIMARY KEY (Doctor_Id);
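Writing one UPDATE per doctor works for three doctors but becomes tedious for a longer list. As an alternative, the sketch below generates the ids with a window function. It is only an illustration under assumptions: the scratch tables doctor_rows_demo and doctor_info_demo are hypothetical, and the generated ids (D001, D002, ...) are not the D110/D200/D201 values used in this article.

```sql
-- Scratch rows with repeated doctors (values taken from the table above).
CREATE TEMP TABLE doctor_rows_demo (
    doctor_name varchar(100),
    category    varchar(100)
);

INSERT INTO doctor_rows_demo VALUES
    ('Dr. Rohit Shaw',     'Endocrinology'),
    ('Dr. Rohit Shaw',     'Endocrinology'),
    ('Dr. Rima Dhara',     'Cardiology'),
    ('Dr. Shalini Thakur', 'Gynecology');

-- Number the distinct doctors and build ids such as D001, D002, ...
CREATE TEMP TABLE doctor_info_demo AS
SELECT 'D' || lpad((row_number() OVER (ORDER BY doctor_name))::text, 3, '0')
           AS doctor_id,
       doctor_name,
       category
FROM (SELECT DISTINCT doctor_name, category FROM doctor_rows_demo) AS d;

ALTER TABLE doctor_info_demo
    ADD PRIMARY KEY (doctor_id);

SELECT * FROM doctor_info_demo ORDER BY doctor_id;
```

The inner SELECT DISTINCT collapses the repeated rows first, so each doctor receives exactly one id.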

Fourth Normal Form or 4NF

For a database to be in 4NF, it must be in BCNF, and no multivalued dependency should be present.

Consider another part of the main table:

Patient_Id	Marital_Status	Disease_Id
P001	U	HYTSH
P001	U	HYTSH
P002	M	HRTBLCK
P002	M	HYPTSN
P003	U	PCOD

Here, there is no relation between Marital_Status and Disease_Id. Both depend on Patient_Id but are independent of each other. Hence, a multivalued dependency exists. So, we divide the table as follows:

Patient_MaritalInfo:

Patient_Id	Marital_Status
P001	U
P002	M
P003	U

Patient_Disease:

Patient_Id	Disease_Id
P001	HYTSH
P002	HRTBLCK
P002	HYPTSN
P003	PCOD

SQL Code:

-- One row per patient with the marital status.
CREATE TABLE Patient_MaritalInfo AS
SELECT DISTINCT Patient_Id, Marital_Status
FROM Patient_Information;

ALTER TABLE Patient_MaritalInfo
    ADD PRIMARY KEY (Patient_Id);

-- One row per patient and disease; DISTINCT keeps the key unique.
CREATE TABLE Patient_Disease AS
SELECT DISTINCT Patient_Id, Disease_Id
FROM Patient_Information;

ALTER TABLE Patient_Disease
    ADD PRIMARY KEY (Patient_Id, Disease_Id);

Benefits we get from Normalizing this Data

The benefits we get from normalizing the dummy database are:

We now have separate tables for patient attributes, doctor attributes, disease attributes, and test attributes.
Also, the relationships between these tables are expressed using foreign keys.
Hence, our database is more organized, and the relationships between the tables are well-defined. This makes the data more efficient to analyze, update, and retrieve. For example: If the hospital has to update the age of a particular patient, then that record needs to be updated only in the table containing the patient’s attributes. Because the other tables refer to that patient through foreign keys rather than storing their own copy of the age, every query that joins them sees the updated value (these constraints can be declared in SQL).
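The age example can be sketched as follows in PostgreSQL. The tables patient_details_demo and patient_disease_demo are hypothetical (this article does not create a separate patient-attributes table in its SQL), and the new age value is made up for the illustration.

```sql
-- Hypothetical table holding each patient's attributes exactly once.
CREATE TEMP TABLE patient_details_demo (
    patient_id varchar(10) PRIMARY KEY,
    p_name     varchar(100),
    age        integer
);

-- Hypothetical related table that points at the patient by foreign key.
CREATE TEMP TABLE patient_disease_demo (
    patient_id varchar(10) REFERENCES patient_details_demo (patient_id),
    disease_id varchar(10),
    PRIMARY KEY (patient_id, disease_id)
);

INSERT INTO patient_details_demo VALUES ('P001', 'Anna Smith', 18);
INSERT INTO patient_disease_demo VALUES ('P001', 'HYTSH');

-- The age lives in exactly one row, so one UPDATE is enough.
UPDATE patient_details_demo
SET age = 19
WHERE patient_id = 'P001';

-- Any query that joins through the foreign key sees the new value.
SELECT d.patient_id, d.p_name, d.age, pd.disease_id
FROM patient_details_demo AS d
JOIN patient_disease_demo AS pd USING (patient_id);
```

The foreign key also stops patient_disease_demo from referring to a patient that does not exist in patient_details_demo.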

Hence, now we can see how a messy database can be reduced to these normal forms. Normalization helps reduce storage space and makes updating, deleting, and working with the data easier. There are other normal forms as well, viz.:

5NF: Here, the relation must be in 4NF, and it must not contain any join dependency that is not implied by its candidate keys.
6NF: Here, the relation must be in 5NF, and all join dependencies must be trivial, i.e., the relation becomes irreducible.

But most databases can be organized with forms up to 4NF. Based on the business purpose, it is up to you to decide which normal form suits you best.

Conclusion

That’s it! You are all done with organizing your database, and it is now efficient enough for further analysis. From here, you can move this database to another DBMS or keep it on the SQL server and interact with your data using queries to meet the business requirements.

Here are some key takeaways from the blog:

Update anomalies, insertion anomalies, deletion anomalies, and data redundancy make a database inefficient for analysis and querying. Hence, we use normalization to organize related data efficiently and eliminate anomalies.
Five major types of normal forms are commonly used in the industry: 1NF, 2NF, 3NF, BCNF, and 4NF. Normal forms like 5NF and 6NF are not commonly used in the industry. The normal forms must be used depending on the context.
Normalization helps organize interrelated data tables by decomposing them into sub-components or small tables based on constraints. This eliminates certain dependencies.
One of the best ways to achieve a normalized database is by using SQL (Structured Query Language). In this blog, you can see how one can normalize and decompose the dummy database using queries in SQL.

I hope this blog helps in your learning journey.

Happy learning!

Sanjana Bhakat

I am a data-science aspirant with a passion for writing and sharing knowledge with the community. I am currently pursuing BSc(H) in Data Science from Techno India University, Kolkata. Some of the topics I am enthusiastic about include: Mathematics, Statistics, Databases, Data Engineering, Machine learning and Data Analytics. I look forward to opportunities where I can learn and enhance my skills.
