Hadoop Distributed File System (HDFS) Architecture – A Guide to HDFS for Every Data Engineer

Aniruddha Bhandari Last Updated : 14 Dec, 2020

8 min read

Overview

Get familiar with Hadoop Distributed File System (HDFS)
Understand the Components of HDFS

Introduction

In contemporary times, it is commonplace to deal with massive amounts of data. From your next WhatsApp message to your next Tweet, you are creating data at every step when you interact with technology. Now multiply that by 4.5 billion people on the internet – the math is simply mind-boggling!

But ever wondered how to handle such data? Is it stored on a single machine? What if the machine fails? Will you lose your lovely 3 AM tweets *cough*?

The answer is No. I am pretty sure you are already thinking about Hadoop. Hadoop is an amazing framework. With Hadoop by your side, you can leverage the amazing powers of Hadoop Distributed File System (HDFS)-the storage component of Hadoop. It is probably the most important component of Hadoop and demands a detailed explanation.

So, in this article, we will learn what Hadoop Distributed File System (HDFS) really is and about its various components. Also, we will see what makes HDFS tick – that is what makes it so special. Let’s find out!

What is Hadoop Distributed File System (HDFS)?
What are the components of HDFS?
- Blocks in HDFS?
- Namenode in HDFS
- Datanodes in HDFS
- Secondary Node in HDFS
Replication Management
- Replication of Blocks
- What is a Rack in Hadoop?
- Rack Awareness

What is Hadoop Distributed File System(HDFS)?

It is difficult to maintain huge volumes of data in a single machine. Therefore, it becomes necessary to break down the data into smaller chunks and store it on multiple machines.

Filesystems that manage the storage across a network of machines are called distributed file systems.

Hadoop Distributed File System (HDFS) is the storage component of Hadoop. All data stored on Hadoop is stored in a distributed manner across a cluster of machines. But it has a few properties that define its existence.

Huge volumes – Being a distributed file system, it is highly capable of storing petabytes of data without any glitches.
Data access – It is based on the philosophy that “the most effective data processing pattern is write-once, the read-many-times pattern”.
Cost-effective – HDFS runs on a cluster of commodity hardware. These are inexpensive machines that can be bought from any vendor.

What are the components of the Hadoop Distributed File System(HDFS)?

HDFS has two main components, broadly speaking, – data blocks and nodes storing those data blocks. But there is more to it than meets the eye. So, let’s look at this one by one to get a better understanding.

HDFS Blocks

HDFS breaks down a file into smaller units. Each of these units is stored on different machines in the cluster. This, however, is transparent to the user working on HDFS. To them, it seems like storing all the data onto a single machine.

These smaller units are the blocks in HDFS. The size of each of these blocks is 128MB by default, you can easily change it according to requirement. So, if you had a file of size 512MB, it would be divided into 4 blocks storing 128MB each.

hadoop hdfs blocks

If, however, you had a file of size 524MB, then, it would be divided into 5 blocks. 4 of these would store 128MB each, amounting to 512MB. And the 5th would store the remaining 12MB. That’s right! This last block won’t take up the complete 128MB on the disk.

hadoop hdfs blocks split

But, you must be wondering, why such a huge amount in a single block? Why not multiple blocks of 10KB each? Well, the amount of data with which we generally deal with in Hadoop is usually in the order of petra bytes or higher.

Therefore, if we create blocks of small size, we would end up with a colossal number of blocks. This would mean we would have to deal with equally large metadata regarding the location of the blocks which would just create a lot of overhead. And we don’t really want that!

There are several perks to storing data in blocks rather than saving the complete file.

The file itself would be too large to store on any single disk alone. Therefore, it is prudent to spread it across different machines on the cluster.
It would also enable a proper spread of the workload and prevent the choke of a single machine by taking advantage of parallelism.

Now, you must be wondering, what about the machines in the cluster? How do they store the blocks and where is the metadata stored? Let’s find out.

Namenode in HDFS

HDFS operates in a master-worker architecture, this means that there are one master node and several worker nodes in the cluster. The master node is the Namenode.

Namenode is the master node that runs on a separate node in the cluster.

Manages the filesystem namespace which is the filesystem tree or hierarchy of the files and directories.
Stores information like owners of files, file permissions, etc for all the files.
It is also aware of the locations of all the blocks of a file and their size.

All this information is maintained persistently over the local disk in the form of two files: Fsimage and Edit Log.

Fsimage stores the information about the files and directories in the filesystem. For files, it stores the replication level, modification and access times, access permissions, blocks the file is made up of, and their sizes. For directories, it stores the modification time and permissions.
Edit log on the other hand keeps track of all the write operations that the client performs. This is regularly updated to the in-memory metadata to serve the read requests.

Whenever a client wants to write information to HDFS or read information from HDFS, it connects with the Namenode. The Namenode returns the location of the blocks to the client and the operation is carried out.

Yes, that’s right, the Namenode does not store the blocks. For that, we have separate nodes.

Datanodes in HDFS

Datanodes are the worker nodes. They are inexpensive commodity hardware that can be easily added to the cluster.

Datanodes are responsible for storing, retrieving, replicating, deletion, etc. of blocks when asked by the Namenode.

They periodically send heartbeats to the Namenode so that it is aware of their health. With that, a DataNode also sends a list of blocks that are stored on it so that the Namenode can maintain the mapping of blocks to Datanodes in its memory.

But in addition to these two types of nodes in the cluster, there is also another node called the Secondary Namenode. Let’s look at what that is.

Secondary Namenode in HDFS

Suppose we need to restart the Namenode, which can happen in case of a failure. This would mean that we have to copy the Fsimage from disk to memory. Also, we would also have to copy the latest copy of Edit Log to Fsimage to keep track of all the transactions. But if we restart the node after a long time, then the Edit log could have grown in size. This would mean that it would take a lot of time to apply the transactions from the Edit log. And during this time, the filesystem would be offline. Therefore, to solve this problem, we bring in the Secondary Namenode.

Secondary Namenode is another node present in the cluster whose main task is to regularly merge the Edit log with the Fsimage and produce check‐points of the primary’s in-memory file system metadata. This is also referred to as Checkpointing.

hadop hdfs checkpointing

But the checkpointing procedure is computationally very expensive and requires a lot of memory, which is why the Secondary namenode runs on a separate node on the cluster.

However, despite its name, the Secondary Namenode does not act as a Namenode. It is merely there for Checkpointing and keeping a copy of the latest Fsimage.

Replication Management in HDFS

Now, one of the best features of HDFS is the replication of blocks which makes it very reliable. But how does it replicate the blocks and where does it store them? Let’s answer those questions now.

Replication of blocks

HDFS is a reliable storage component of Hadoop. This is because every block stored in the filesystem is replicated on different Data Nodes in the cluster. This makes HDFS fault-tolerant.

The default replication factor in HDFS is 3. This means that every block will have two more copies of it, each stored on separate DataNodes in the cluster. However, this number is configurable.

hadoop hdfs replication

But you must be wondering doesn’t that mean that we are taking up too much storage. For instance, if we have 5 blocks of 128MB each, that amounts to 5*128*3 = 1920 MB. True. But then these nodes are commodity hardware. We can easily scale the cluster to add more of these machines. The cost of buying machines is much lower than the cost of losing the data!

Now, you must be wondering, how does Namenode decide which Datanode to store the replicas on? Well, before answering that question, we need to have a look at what is a Rack in Hadoop.

What is a Rack in Hadoop?

A Rack is a collection of machines (30-40 in Hadoop) that are stored in the same physical location. There are multiple racks in a Hadoop cluster, all connected through switches.

rack

Rack awareness

Replica storage is a tradeoff between reliability and read/write bandwidth. To increase reliability, we need to store block replicas on different racks and Datanodes to increase fault tolerance. While the write bandwidth is lowest when replicas are stored on the same node. Therefore, Hadoop has a default strategy to deal with this conundrum, also known as the Rack Awareness algorithm.

For example, if the replication factor for a block is 3, then the first replica is stored on the same Datanode on which the client writes. The second replica is stored on a different Datanode but on a different rack, chosen randomly. While the third replica is stored on the same rack as the second but on a different Datanode, again chosen randomly. If, however, the replication factor was higher, then the subsequent replicas would be stored on random Data Nodes in the cluster.

rack awareness

Endnotes

I hope by now you have got a solid understanding of what Hadoop Distributed File System(HDFS) is, what are its important components, and how it stores the data. There are however still a few more concepts that we need to cover with respect to Hadoop Distributed File System(HDFS), but that is a story for another article.

For now, I recommend you go through the following articles to get a better understanding of Hadoop and this Big Data world!

Last but not the least, I recommend reading Hadoop: The Definitive Guide by Tom White. This article was highly inspired by it.

Aniruddha Bhandari

I am on a journey to becoming a data scientist. I love to unravel trends in data, visualize it and predict the future with ML algorithms! But the most satisfying part of this journey is sharing my learnings, from the challenges that I face, with the community to make the world a better place!

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Intoduction to Python

Variables and data types

OOPs Concepts

Conditional statement

Looping Constructs

Data Structures

String Manipulation

Functions

Modules, Packages and Standard Libraries

Python Libraries for Data Science

Reading Data Files in Python

Preprocessing, Subsetting and Modifying Pandas Dataframes

Sorting and Aggregating Data in Pandas

Visualizing Patterns and Trends in Data

Programming

Hadoop Distributed File System (HDFS) Architecture – A Guide to HDFS for Every Data Engineer

Overview

Introduction

Table of Contents

What is Hadoop Distributed File System(HDFS)?

What are the components of the Hadoop Distributed File System(HDFS)?

HDFS Blocks

Namenode in HDFS

Datanodes in HDFS

Secondary Namenode in HDFS

Replication Management in HDFS

Replication of blocks

What is a Rack in Hadoop?

Rack awareness

Endnotes

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg