Database Normalization: A Step-by-Step Guide with Examples

SATHISH Last Updated : 17 Jul, 2024

10 min read

Introduction

As an SQL Developer, you regularly work with enormous amounts of data stored in different tables that are present inside databases. It often becomes difficult to extract the information if the data is unorganized. We can solve this problem using Normalization by structuring the database in different forms or stages. This article will help you understand the concept of normalization in DBMS with step-by-step instructions and examples of tables.

We’ll discuss the functional dependencies that may exist in a table and the anomalies that can arise from them. We will then convert tables into successive normal forms to eliminate those anomalies, using example tables at each step.

Learning Objectives:

Understand the meaning of normalization and the need for it.
Learn about the various functional dependencies.
Familiarize yourself with the different stages of normalization.

This article was published as a part of the Data Science Blogathon.

What Is Normalization in DBMS?
Functional Dependency
What Is Normalization?
1 NF: First Normal Form
2 NF: Second Normal Form
3 NF: Third Normal Form
Boyce-Codd Normal Form
Guidelines for Using Normalization in DBMS
Advantages of Normalization
Frequently Asked Questions

What Is Normalization in DBMS?

Normalization is a technique for organizing the data into multiple related tables to minimize Data Redundancy and Data Inconsistency. It aims to eliminate anomalies in data.

Why Do We Need Normalization?

Data inconsistency arises when the same fact is stored in more than one place and the copies disagree: the data may be correct in one place and wrong in another, which makes the information unreliable. It commonly occurs between tables when similar data is stored, possibly in different formats, in two different tables.

For example, consider the following tables:

LibraryVisitors (StudentID, Student_Name, Student_Address, InTime, OutTime);
Students (StudentID, Student_Name, Student_Address, Department, RollNo, CourseRegistered);

In the above tables, Student_Address is stored in both tables. For each StudentID, the address must be the same in both, so both relations have to be consulted to retrieve the correct address and both have to be changed to update it. Issues like this arise from poorly designed databases.

We can eliminate data inconsistency in databases by using constraints on the relations.

Data Redundancy is the condition where the same data is stored at different locations leading to the wastage of storage space.

Examples:

Student Id	Student Name	Course ID	Course Name
111	John	C08	English
112	Alice	C08	English
111	John	C02	French

In the above table, the student name John is stored twice because he registered for two different courses, and the course name English is stored twice because two students registered for it. This is called data redundancy. Besides wasting space, it opens the door to the inconsistencies and anomalies described in this article.

We can eliminate data redundancy in the databases by the normalization of relations.
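As a rough sketch of what this looks like in practice, the student-course table above can be split into three related tables using Python's standard-library sqlite3 module. The table and column names (students, courses, enrollments) are assumptions chosen for illustration; they do not come from the original example.

```python
import sqlite3

# In-memory database; table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (
        student_id   INTEGER PRIMARY KEY,
        student_name TEXT NOT NULL
    );
    CREATE TABLE courses (
        course_id   TEXT PRIMARY KEY,
        course_name TEXT NOT NULL
    );
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL REFERENCES students(student_id),
        course_id  TEXT    NOT NULL REFERENCES courses(course_id),
        PRIMARY KEY (student_id, course_id)
    );
""")

# Each student name and each course name is now stored exactly once.
conn.executemany("INSERT INTO students VALUES (?, ?)",
                 [(111, "John"), (112, "Alice")])
conn.executemany("INSERT INTO courses VALUES (?, ?)",
                 [("C08", "English"), ("C02", "French")])
conn.executemany("INSERT INTO enrollments VALUES (?, ?)",
                 [(111, "C08"), (112, "C08"), (111, "C02")])

# A join rebuilds the original three-row view whenever it is needed.
rows = conn.execute("""
    SELECT s.student_id, s.student_name, c.course_id, c.course_name
    FROM enrollments e
    JOIN students s ON s.student_id = e.student_id
    JOIN courses  c ON c.course_id  = e.course_id
    ORDER BY s.student_id, c.course_id
""").fetchall()
print(rows)
```

The names John and English now appear once each in storage, while the join still returns one row per enrollment.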

Functional Dependency

Before diving into normalization, we need to know clearly about functional dependencies.

An attribute B is functionally dependent on an attribute A if each value of A is associated with exactly one value of B; in other words, A uniquely identifies B.

It is denoted by A –> B, meaning A determines B, and B depends upon A.

Example: Student_ID –> Student_Name, because each Student_ID identifies exactly one student name.
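To make the definition concrete, here is a minimal Python sketch that tests whether a dependency holds in a set of sample rows. The helper name fd_holds and the sample rows are assumptions made for illustration. Note that sample data can only disprove a functional dependency; whether it holds in general is a statement about the business rules, not about any one snapshot.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if, in `rows`, each value of `lhs` maps to one value of `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        value = tuple(row[attr] for attr in rhs)
        # setdefault stores the first value seen for this key and returns it;
        # any later row with a different value violates the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Sample rows based on the student-course table shown earlier.
enrollment_rows = [
    {"Student_ID": 111, "Student_Name": "John", "Course_ID": "C08"},
    {"Student_ID": 112, "Student_Name": "Alice", "Course_ID": "C08"},
    {"Student_ID": 111, "Student_Name": "John", "Course_ID": "C02"},
]

id_determines_name = fd_holds(enrollment_rows, ["Student_ID"], ["Student_Name"])
id_determines_course = fd_holds(enrollment_rows, ["Student_ID"], ["Course_ID"])
print(id_determines_name, id_determines_course)
```

Student_ID determines Student_Name in these rows, but it does not determine Course_ID, because student 111 appears with two different courses.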

What Is an Anomaly?

An anomaly is an unexpected side effect of trying to insert, update, or delete a row. Essentially, the operation requires more data, or affects more data, than expected.

Consider the following relation:

Retail_Outlet_ID	Outlet_Location	Item_Code	Description	Qty_Available	Retail_Unit_Price
R1001	King Street, Hyderabad, 540001	I1001	Britannia Marie Gold	25	1600
R1002	Rajaji Nagar, Bangalore, 600341	I1106	Cookies	58	1289
R1003	MVP Colony, Visakhapatnam, 500021	I1200	Best Rice	22	2000
R1001	King street, Hyderabad	I1309	Dal	20	1500

Types of Anomalies

Here are some of the most common anomalies that happen in database management.

Insertion anomalies

These occur when we cannot insert a new tuple into the table due to a lack of data.

What happens if we try to insert(add) the details of a new retail outlet with no items in its stock?

NULL values would be inserted into the item details columns, which is not preferable.

Deletion anomalies

These occur when deleting some data also deletes other data that is still required (unintended data loss).

What happens if we try to delete the item of item code I1106?

The details of the retail outlet R1002 will also be deleted from the database.

Update anomalies

These happen when an update of a single fact requires changes in multiple records.

How many rows will be updated if the location of retail outlet R1001 is changed from King Street to Victoria Street?

Two rows will be updated, because the location of R1001 is repeated in every row for that outlet.

Data redundancy

This happens when new items are supplied to a retail outlet.

What details do we need to insert?

Apart from the necessary item details, Outlet_Location must also be inserted again, which is redundant.

We have seen insert, delete, update anomalies, and data redundancy in the above-given example. Functional dependencies may lead to anomalies. To minimize anomalies, there is a need to refine functional dependencies using normalization.
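The update and deletion anomalies can be reproduced with a short sketch using Python's standard-library sqlite3 module. The table name retail and the lowercase column names are assumptions, as is the new address string "Victoria Street, Hyderabad"; the rows are copied from the relation above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE retail (
        retail_outlet_id TEXT, outlet_location TEXT, item_code TEXT,
        description TEXT, qty_available INTEGER, retail_unit_price INTEGER
    )
""")
conn.executemany("INSERT INTO retail VALUES (?, ?, ?, ?, ?, ?)", [
    ("R1001", "King Street, Hyderabad, 540001", "I1001", "Britannia Marie Gold", 25, 1600),
    ("R1002", "Rajaji Nagar, Bangalore, 600341", "I1106", "Cookies", 58, 1289),
    ("R1003", "MVP Colony, Visakhapatnam, 500021", "I1200", "Best Rice", 22, 2000),
    ("R1001", "King street, Hyderabad", "I1309", "Dal", 20, 1500),
])

# Update anomaly: one logical change (R1001 moves) touches every R1001 row.
cursor = conn.execute(
    "UPDATE retail SET outlet_location = ? WHERE retail_outlet_id = ?",
    ("Victoria Street, Hyderabad", "R1001"),  # new address is an assumed value
)
rows_updated = cursor.rowcount

# Deletion anomaly: removing item I1106 also removes every trace of outlet R1002.
conn.execute("DELETE FROM retail WHERE item_code = ?", ("I1106",))
outlets_left = [row[0] for row in conn.execute(
    "SELECT DISTINCT retail_outlet_id FROM retail ORDER BY retail_outlet_id"
)]
print(rows_updated, outlets_left)
```

After the delete, R1002 no longer appears anywhere in the database, even though only an item was meant to be removed.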

What Is Normalization?

Database normalization is the process of organizing a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity. It was first proposed by Edgar F. Codd.

“Normal Forms” (NF) are the different stages of Normalization in DBMS:

1 NF (First Normal Form)
2 NF (Second Normal Form)
3 NF (Third Normal Form)
BCNF (Boyce-Codd Normal Form)
4 NF (Fourth Normal Form)
5 NF (Fifth Normal Form)
6 NF (Sixth Normal Form)

4 NF to 6 NF apply to multivalued dependencies and other complex table scenarios. In this article, we discuss the normal forms up to BCNF.

Different forms or stages of normalization in DBMS

1 NF: First Normal Form

A relation R is said to be in 1 NF (First Normal Form) if and only if:

All the attributes of R are atomic.
It does not contain any multi-valued attributes.

In the Retail_Outlets example above, the address field stores multiple values: street name, city name, and pin code.

What if we want to know about all retail outlets in a given city? We may need to perform some string operations on the address field, which is not preferable. So we need to store all these atomic values in separate fields.

A multi-valued attribute is an attribute that can hold multiple values, such as contact numbers. These should also be separated into fields such as ContactNo1, ContactNo2, ... to achieve 1st Normal Form.
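As a sketch of the atomic-address idea, the following Python snippet stores street, city, and pin code in separate columns so that a city lookup needs no string parsing. The column names street, city, and pin_code are assumptions, and the pin code of the fourth row, which is missing in the original relation, is assumed to match the other R1001 row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE retail_outlets (
        retail_outlet_id TEXT, street TEXT, city TEXT, pin_code TEXT,
        item_code TEXT, description TEXT,
        qty_available INTEGER, retail_unit_price INTEGER
    )
""")
conn.executemany("INSERT INTO retail_outlets VALUES (?, ?, ?, ?, ?, ?, ?, ?)", [
    ("R1001", "King Street", "Hyderabad", "540001", "I1001", "Britannia Marie Gold", 25, 1600),
    ("R1002", "Rajaji Nagar", "Bangalore", "600341", "I1106", "Cookies", 58, 1289),
    ("R1003", "MVP Colony", "Visakhapatnam", "500021", "I1200", "Best Rice", 22, 2000),
    # Pin code assumed to be the same as the first R1001 row.
    ("R1001", "King Street", "Hyderabad", "540001", "I1309", "Dal", 20, 1500),
])

# With an atomic city column, the lookup is a plain equality test.
hyderabad_outlets = [row[0] for row in conn.execute(
    "SELECT DISTINCT retail_outlet_id FROM retail_outlets WHERE city = ?",
    ("Hyderabad",),
)]
print(hyderabad_outlets)
```

DISTINCT is still needed because the outlet's address repeats on every item row, which is the redundancy that the later normal forms address.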

Figure: 1st Normal Form (1 NF)

Advantage: 1 NF lets users query the database effectively. Removing non-atomic and multi-valued attributes removes ambiguity that would otherwise cause major issues when updating and extracting data.

Limitation: Data redundancy still exists even after 1st Normal form, so we need further normalization in DBMS.

2 NF: Second Normal Form

A relation R is said to be in 2 NF (Second Normal Form) if and only if:

R is already in 1 NF
There is no partial dependency in R between non-key attributes and key attributes.

Suppose we have a composite primary or candidate key in our table. A partial dependency occurs when only a part of that key (a key attribute) determines a non-key attribute.

In the Retail Outlets table, Item_Code and Retail_Outlet_ID are the key attributes. Description depends on Item_Code alone, and Outlet_Location depends on Retail_Outlet_ID alone. These are partial dependencies.

To achieve normalization in DBMS, we need to eliminate these dependencies by decomposing the relations.

Figure: 2nd Normal Form (2 NF)

From the above decomposition, we eliminated the partial dependency.
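One possible decomposition can be sketched with Python's standard-library sqlite3 module. The table names (retail_outlet, item, stock) are assumptions, as is the choice to keep Retail_Unit_Price in the stock table (that is, assuming the price depends on both the outlet and the item). The outlet R1004 is a hypothetical row added only to show the effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Attributes that depend only on Retail_Outlet_ID
    CREATE TABLE retail_outlet (
        retail_outlet_id TEXT PRIMARY KEY,
        street   TEXT NOT NULL,
        city     TEXT NOT NULL,
        pin_code TEXT NOT NULL
    );
    -- Attributes that depend only on Item_Code
    CREATE TABLE item (
        item_code   TEXT PRIMARY KEY,
        description TEXT NOT NULL
    );
    -- Attributes that depend on the whole composite key
    CREATE TABLE stock (
        retail_outlet_id  TEXT NOT NULL REFERENCES retail_outlet(retail_outlet_id),
        item_code         TEXT NOT NULL REFERENCES item(item_code),
        qty_available     INTEGER NOT NULL,
        retail_unit_price INTEGER NOT NULL,
        PRIMARY KEY (retail_outlet_id, item_code)
    );
""")
conn.execute("INSERT INTO retail_outlet VALUES ('R1002', 'Rajaji Nagar', 'Bangalore', '600341')")
conn.execute("INSERT INTO item VALUES ('I1106', 'Cookies')")
conn.execute("INSERT INTO stock VALUES ('R1002', 'I1106', 58, 1289)")

# Insertion anomaly gone: a new outlet with no stock needs no NULL item columns.
# R1004 and its address are hypothetical values.
conn.execute("INSERT INTO retail_outlet VALUES ('R1004', 'Example Road', 'Hyderabad', '540001')")

# Deletion anomaly gone: removing the stock row keeps outlet R1002 on record.
conn.execute("DELETE FROM stock WHERE item_code = 'I1106'")
outlet_ids = [row[0] for row in conn.execute(
    "SELECT retail_outlet_id FROM retail_outlet ORDER BY retail_outlet_id"
)]
print(outlet_ids)
```

Each non-key column now sits in a table whose key determines it completely, so outlet details and item descriptions are stored once.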

Advantage: 2 NF attempts to reduce the amount of redundant data in a table by extracting it, placing it in a new table(s), and creating relationships between those tables.

Limitation: There are still some anomalies, as there might be some indirect dependencies between Non-Key attributes, leading to redundant data.

3 NF: Third Normal Form

A relation R is said to be in 3 NF (Third Normal Form) if and only if:

R is already in 2 NF
There is no transitive dependency that exists between key attributes and non-key attributes through other non-key attributes.

A transitive dependency exists when a non-key attribute is determined by another non-key attribute rather than directly by the key. In other words, if A determines B and B determines C, then A determines C transitively.

Some examples of one attribute determining another:

The Year of birth determines the Age of the person
The price of an Item determines the class of the Item
The ZIP code of a city determines the City’s Name
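Taking the ZIP code example, a transitive dependency of the form outlet –> pin code –> city can be removed by moving the pin code to city mapping into its own table. The sketch below uses Python's standard-library sqlite3 module; the table names and the outlet R1005 are assumptions for illustration and are not part of the original example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- pin_code determines city, so the pair is stored once here.
    CREATE TABLE pin_code_city (
        pin_code TEXT PRIMARY KEY,
        city     TEXT NOT NULL
    );
    -- The outlet keeps only the pin code; the city is reached through it.
    CREATE TABLE retail_outlet (
        retail_outlet_id TEXT PRIMARY KEY,
        street   TEXT NOT NULL,
        pin_code TEXT NOT NULL REFERENCES pin_code_city(pin_code)
    );
""")
conn.executemany("INSERT INTO pin_code_city VALUES (?, ?)",
                 [("540001", "Hyderabad"), ("600341", "Bangalore")])
conn.executemany("INSERT INTO retail_outlet VALUES (?, ?, ?)", [
    ("R1001", "King Street", "540001"),
    ("R1002", "Rajaji Nagar", "600341"),
    ("R1005", "Queen Street", "540001"),  # hypothetical second Hyderabad outlet
])

# The city of each outlet is recovered with a join on pin_code.
city_by_outlet = dict(conn.execute("""
    SELECT o.retail_outlet_id, p.city
    FROM retail_outlet o
    JOIN pin_code_city p ON p.pin_code = o.pin_code
""").fetchall())

# However many outlets share a pin code, the city name is stored only once.
hyderabad_rows = conn.execute(
    "SELECT COUNT(*) FROM pin_code_city WHERE city = 'Hyderabad'"
).fetchone()[0]
print(city_by_outlet, hyderabad_rows)
```

Because the city lives in one row per pin code, a correction to a city name is a single-row change.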

Figure: 3rd Normal Form (3 NF)

Advantage: 3 NF ensures data integrity. It also reduces the amount of data duplication.

Boyce-Codd Normal Form

BCNF is a stricter version of the 3rd Normal Form. It is also called 3.5 Normal Form.

A relation R is said to be in BCNF (Boyce-Codd Normal Form) if and only if:

R is already in 3 NF
For every non-trivial dependency A –> B, A is a super key.

In simple words, if A –> B and B is a prime attribute, then A cannot be a non-prime attribute; that is, a non-prime attribute cannot determine a prime attribute.

You may wonder how this is possible. There can be cases in which a non-prime attribute determines a prime attribute even though the relation is in the 3rd Normal Form. BCNF does not allow this kind of dependency.

Sample Table

Let us understand this better with an example. Consider the Student Enrollments relation below.

Student_ID	Course_Name	Professor
101	JAVA	Prof. Java
102	C++	Prof. CPP
101	Python	Prof. Python
103	JAVA	Prof. Java_2
104	Python	Prof. Python_2

In the above relation:

One student can enroll in multiple courses.
Multiple professors can teach one course.
One professor can be assigned only one course.

So (Student_ID, Course_Name) forms the primary key. Together, these two attributes determine all other attributes in the relation; in our case, that is only Professor.

The relation is clearly in 1st Normal Form, as there are no multi-valued attributes and all attributes have atomic values.
The relation is in 2nd Normal Form, as there are no partial dependencies:
Student_ID cannot determine Course_Name, as one student can enroll in multiple courses.
Course_Name cannot determine Professor, as multiple professors may teach the same course.
The relation is in 3rd Normal Form, as there are no transitive dependencies.

If we observe here, the “Professor” attribute, a non-prime attribute, can determine the Course_Name as each professor teaches only one course. But Course_Name is a prime attribute, and Professor is not a Super Key. That means a non-prime attribute determines the prime attribute.

This is not allowed in BCNF. So, how do we decompose this relation?

Figure: decomposition into Boyce-Codd Normal Form
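A common way to decompose this kind of relation is to split it on the offending dependency Professor –> Course_Name. The Python sketch below is one such decomposition, offered as an illustration rather than as the exact result in the figure; the names professor_course and student_professor are assumptions.

```python
# Original relation: (Student_ID, Course_Name, Professor), rows from the sample table.
enrollments = [
    (101, "JAVA", "Prof. Java"),
    (102, "C++", "Prof. CPP"),
    (101, "Python", "Prof. Python"),
    (103, "JAVA", "Prof. Java_2"),
    (104, "Python", "Prof. Python_2"),
]

# R1(Professor, Course_Name): Professor is the key here,
# so Professor -> Course_Name no longer violates BCNF.
professor_course = {professor: course for _, course, professor in enrollments}

# R2(Student_ID, Professor): which professor teaches which student.
student_professor = {(student_id, professor) for student_id, _, professor in enrollments}

# Joining R2 with R1 on Professor reproduces the original rows (a lossless join).
rejoined = {
    (student_id, professor_course[professor], professor)
    for student_id, professor in student_professor
}
print(rejoined == set(enrollments))
```

Both resulting relations are in BCNF. One known trade-off of this decomposition is that the original key dependency (Student_ID, Course_Name) –> Professor is no longer enforced by either table on its own, so it would have to be checked separately if the application relies on it.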

So far, we have covered the normal forms up to BCNF. Here are some guidelines to follow while normalizing a database.

Guidelines for Using Normalization in DBMS

Depending on the business requirements, we can normalize the tables up to the 2nd normal form or the 3rd normal form.
Prefer tables in 3 NF in applications with extensive data modifications.
Prefer tables in 2 NF in applications with extensive data retrieval.
Reason: retrieving data from multiple tables is a costly operation.
Converting the tables from higher normal form to lower normal form is called “Denormalization”.

The picture below summarizes how to reach the third normal form from an unnormalized form:

Figure: from unnormalized form to the third stage of normalization

A relational database that is not normalized may suffer from large tables, difficult maintenance (since changes involve searching many records), poor disk space utilization, and inconsistencies. Left unaddressed, these lead to data integrity and redundancy problems. Normalization helps to solve them by applying a series of transformations, each defined by a normal form. Each normal form eliminates a particular type of dependency and so improves data integrity.

Advantages of Normalization

Normalization helps a lot with organizing data. Here are some of its advantages:

It reduces data redundancy: Normalization removes redundant data from tables, which uses less storage space and makes the database more efficient.
It improves data consistency: Normalization keeps the data organized and consistent, lowering the possibility of data errors and inconsistencies.
It makes database design simple: Normalization offers rules for arranging tables and the relationships between them. This facilitates database design and maintenance.
It can make queries faster: smaller, well-structured tables are generally easier to search, although queries that join many tables can be costly, as noted in the guidelines above.
It simplifies database maintenance: By dividing a database’s complexity into smaller, more manageable tables, normalization makes it simpler to add, change, and delete data.

Conclusion

This article aimed to help you understand the normalization process and how to apply it when you design a database system. Beyond BCNF, 4NF and 5NF deal with further kinds of dependency, such as multi-valued dependencies. Try to explore those as well.

I hope this article helped you understand the concept of normalization better through the examples we discussed. If you have any questions, please let me know in the comments. I wish you great learning ahead.

Key Takeaways:

Normalization is a technique for organizing the data into multiple related tables to minimize Data Redundancy and Data Inconsistency.
Insertion, deletion, update, and data redundancy are the various possible anomalies that may occur when building a database.
There are seven different stages of normalization known as Normal Forms.

Frequently Asked Questions

Q1. What is normalization with an example?

A. Database normalization is the process of organizing data in a database efficiently. It involves reducing redundancy and dependency by dividing large tables into smaller tables and defining relationships between them.

Q2. What are the four types of database normalization?

A. The four types of database normalization are:
a) First Normal Form (1NF)
b) Second Normal Form (2NF)
c) Third Normal Form (3NF)
d) Boyce-Codd Normal Form (BCNF)

Q3. What are the 5 rules of database normalization?

A. The rules of database normalization are usually stated as normal forms. The four covered in this article are:
a) First Normal Form (1NF): Each column in a table must contain atomic values, and there should be no repeating groups or arrays.
b) Second Normal Form (2NF): The table must be in 1NF, and all non-key attributes must be fully dependent on the primary key.
c) Third Normal Form (3NF): The table must be in 2NF, and there should be no transitive dependencies, meaning no non-key attribute should depend on another non-key attribute.
d) Boyce-Codd Normal Form (BCNF): A stronger version of 3NF where every determinant is a candidate key.

Q4. What is 1NF vs 2NF vs 3NF?

A. 1NF, 2NF, and 3NF refer to different stages of normalization:
a) 1NF (First Normal Form): Ensures that the data is organized without repeating groups and each attribute contains atomic values.
b) 2NF (Second Normal Form): Builds upon 1NF and ensures that non-key attributes are fully dependent on the entire primary key.
c) 3NF (Third Normal Form): Builds upon 2NF and eliminates transitive dependencies by ensuring that non-key attributes are not dependent on other non-key attributes.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

SATHISH

Hi there, my name is Sathish and I am currently pursuing my final year of B. Tech in the Department of IT from India. I am extremely passionate about the field of data science and machine learning and enjoy studying the latest advancements in deep learning technologies. In addition to my technical skills, I also have a solid foundation in software engineering. I am driven by a desire to learn and grow and am always eager to take on new challenges and expand my knowledge. I am excited about the opportunities that lie ahead in my career and am committed to achieving my goals through hard work and dedication.
