This article was published as a part of the Data Science Blogathon.
In a large dataset, duplicated data leads to resource and disk space waste, and the chance of errors and inconsistencies increases. Therefore, we must decompose redundant data relationships into manageable and more well-structured ones. Through normalization, we can restructure a relational database to reduce redundancy among the data and improve the data integrity, eliminating unwanted characteristics like Insertions, Updates, and Deletions. In this article, we will explain the different types of normalization in detail and explore the advantages and disadvantages of using normalization.
A relation violates the first normal form if it contains composite or multi-valued attributes. Therefore, in order to remove the repetition of values, we convert it into the first normal form. In the first normal form, the attribute values are single-valued and there are no repeating groups. By using the primary key and foreign key, the first normal form creates a functional dependency between the two tables.
Example:
The Employee table is not in its first normal form because the column Phone No. contains multiple values.
To convert the Employee table into 1NF we decompose it as follows:
Example:
The non-prime attribute AGE in the provided table is reliant on TEACHER ID, a suitable subset of a candidate key. Because of this, it breaks the 2NF rule.
The provided table is divided into two tables to create the second normal form:
The relation must first be in the first and second normal form before it can be in the third normal form. The third normal form removes the transitive dependency of the non-primary key attributes on the primary key by moving these attributes along with the copy of the attribute they depend on to a new relation. Removing transitive dependencies of attributes reduces data duplication and improves the integrity of data. In the third normal form, most of the tables are free of anomalies with insertions, updates, and deletions.
Example:
Currently, the following table is not in the third normal form:
To decompose the given table into the third normal form we can do the following:
The Boyce-Codd Normal Form(BCNF) is the fourth form of normalization and is an updated version of the third normal form. The relation first must be in the third normal form to be in BCNF. As per BCNF, if Q is determined by P, then P should be a super key or candidate key for any functional dependency. When we use the third normal form, we can achieve lossless decomposition, but with BCNF, it is very difficult. BCNF is a more restrictive form of normalization, so there are no anomalous results in the database.
Example:
Think about a relation R that has attributes (student, subject, teacher). Since the teacher is not a candidate key, the subject “teacher” breaches the BCNF.
Divide R into R1 (X, Y) and R2 if X->Y violates the BCNF (R-Y). Consequently, R is split into R1 (Teacher, subject) and R2 (Relation) (student, Teacher).
The dependency preservation property is not always met by BCNF decomposition. If the dependency is not retained following BCNF decomposition, we must choose between staying in BCNF or going back to 3NF.
A relation in a DBMS is in the Fifth Normal Form if it is in the Fourth Normal Form but does not contain any join dependencies. The joining must be lossless. The Fifth Normal Form is satisfied when all the tables present in the DBMS are broken down into as many more tables as possible to reduce redundancy. The Fifth Normal Form is also known as the Project Join Normal Form.
Example:
Mr. Rick takes both C++ and Python classes for Lecture Hall 1, but he does not take Python for Semester 2. For this case, the fields must be combined to identify a valid data set. If we add a new lecture hall as Lecture Hall 3 but do not know what the subject will be, we leave Professor and Subject blank. As all three columns are primary keys we cannot leave the other two columns ( Professor, Subject) blank. As a result, we can decompose the above table into three relations to make 5NF.
In the sixth normal form, the relation variables are decomposed into irreducible components. For a temporal variable or other interval data, this is relatively unimportant for non-temporal relation variables. In many data warehouses, the benefits of the sixth normal form outweigh the downsides.
Example:
For the above, the possible join dependencies are –
(Student ID, First Name)
(Student ID, Last name)
(Student ID, Grade)
So we can decompose it into the sixth normal form as follows:
Normalization provides the following advantages:
The disadvantages of normalization are as follows:
In this article, we talked about how normalization helps eliminate anomalies, which can result in data duplication, affecting the integrity of the data and its performance. This article should leave you with the following points:
As a result of reading this article, hopefully, you now know what normal forms are and how useful they can be when organizing data in a database. Read more articles here!
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.