This article was published as a part of the Data Science Blogathon.
Modern applications and products deal with enormous amounts of data. So the question arises: how do we manage such large files and data?
As data velocity increases, the data soon outgrows the storage capacity of a single machine. One solution is to store it across a network of machines; filesystems that do this are called distributed file systems. Because the data now lives on a network, however, all of the network's complications come with it.
This is where Hadoop enters the picture. It ships with one of the most stable file systems available: HDFS (Hadoop Distributed File System), a purpose-built file system that stores very large files and provides streaming data access on commodity hardware.
Image: techvidvan.com
The Hadoop Distributed File System is a fault-tolerant file system that runs on standard hardware. It was created to solve problems that conventional databases couldn't, so its full potential is only realised when dealing with large amounts of data.
HDFS can store a large amount of data and make it easy to retrieve. To accommodate such volumes, files are spread over numerous machines, and each file is replicated to protect the system from data loss in the event of a failure. HDFS also enables applications to process the data in parallel.
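To make the store-and-retrieve idea concrete, here is a minimal sketch using the Hadoop Java client (FileSystem API). It assumes a Hadoop client configuration on the classpath, and the path /tmp/hello.txt is purely hypothetical.

```java
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from the client's core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path used only for illustration
        Path path = new Path("/tmp/hello.txt");

        // Write: the client streams bytes to DataNodes; the NameNode tracks the blocks
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS!".getBytes(StandardCharsets.UTF_8));
        }

        // Read: the client asks the NameNode for block locations, then reads from DataNodes
        try (FSDataInputStream in = fs.open(path)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```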
HDFS splits files into blocks, which are stored on DataNodes. The NameNode, the cluster's master node, manages several DataNodes and coordinates the replication of these blocks across the cluster. It also tells clients where to find the data they are looking for. Before the NameNode can manage the data, however, each file must first be partitioned into smaller, more manageable blocks, a step known as data block splitting.
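The mapping from a file to its blocks and the DataNodes that hold them can be inspected from the client side. Below is a small sketch using the FileSystem API; /data/input.txt is a hypothetical path that you would replace with an existing file on your cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file; replace with a real HDFS path
        Path file = new Path("/data/input.txt");
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which DataNodes hold each block of the file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```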
A block is the smallest unit of data HDFS can read or write. The default HDFS block size is 128 MB, although this can be changed. Files are divided into block-sized chunks and stored as independent units. Unlike a traditional filesystem, a file smaller than the block size does not occupy a full block: a 5 MB file stored in HDFS with a 128 MB block size takes up just 5 MB of space. The block size is large primarily to reduce seek costs.
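Both the cluster-wide default block size and the block size recorded for an individual file can be queried through the client API. The sketch below assumes the same hypothetical /data/input.txt path as before.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical path; use any existing HDFS file
        Path file = new Path("/data/input.txt");

        // Cluster-wide default block size (typically 128 MB on Hadoop 2.x and above)
        System.out.println("Default block size: " + fs.getDefaultBlockSize(file));

        // Block size and length recorded for this particular file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("File block size:    " + status.getBlockSize());
        System.out.println("File length:        " + status.getLen());
        fs.close();
    }
}
```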
Image: nitendratech.com
The block size and replication factor in HDFS may be customised per file. An application can set the number of replicas programmatically; it can be supplied at the time of file creation and altered later if necessary. The block size is likewise chosen when the file is created, and the NameNode records both as part of the file's metadata. Using its rack-aware replica placement strategy, HDFS keeps two copies of each block in one rack and the third copy in a separate rack.
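As a rough illustration of per-file customisation, the sketch below creates a file with a 64 MB block size and two replicas, then raises the replication factor afterwards. The path /data/custom-file.bin and the chosen values are assumptions for the example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CustomReplicationAndBlockSize {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical output path used for illustration
        Path path = new Path("/data/custom-file.bin");

        // Create the file with a 64 MB block size and 2 replicas instead of the defaults
        short replication = 2;
        long blockSize = 64L * 1024 * 1024;
        int bufferSize = 4096;
        try (FSDataOutputStream out =
                 fs.create(path, true, bufferSize, replication, blockSize)) {
            out.writeUTF("data written with a custom block size and replication factor");
        }

        // The replication factor can also be changed after the file exists
        fs.setReplication(path, (short) 3);
        fs.close();
    }
}
```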
The default block size in HDFS is 64 MB for Hadoop 1.x and 128 MB for Hadoop 2.x and above, and it can be changed to suit the cluster and its workload. HDFS blocks are larger than disk blocks, primarily to reduce seek costs. The default replication factor is three, which means each block is replicated three times and stored on different nodes.
The NameNode can be regarded as the system's master. It keeps track of the file system tree and the metadata for all of the system's files and folders. This metadata is stored in two files: the namespace image and the edit log. The NameNode knows which DataNodes hold the blocks of a given file, but it does not persist block locations; this information is rebuilt from the DataNodes each time the system starts.
The NameNode is the HDFS controller and manager, since it holds the state and metadata of all HDFS files, including file permissions, names, and block locations. Because the metadata is small, it is kept in the NameNode's memory, allowing for faster access. And because the HDFS cluster is accessed by several clients at the same time, all of this metadata is handled by a single machine. The NameNode performs file system namespace operations such as opening, closing, and renaming files and directories.
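The namespace operations mentioned above map directly onto the FileSystem API. Here is a minimal sketch; the /user/demo paths are hypothetical and used only to show the calls.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespaceOperations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths used for illustration
        Path dir = new Path("/user/demo/reports");
        Path oldName = new Path("/user/demo/reports/draft.csv");
        Path newName = new Path("/user/demo/reports/final.csv");

        // Each call below is a metadata operation handled by the NameNode
        fs.mkdirs(dir);                       // create a directory
        fs.create(oldName, true).close();     // create an empty file
        fs.rename(oldName, newName);          // rename it
        System.out.println("Exists? " + fs.exists(newName));
        fs.delete(dir, true);                 // recursive delete
        fs.close();
    }
}
```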
A DataNode is a commodity computer running the GNU/Linux operating system with the DataNode software installed. Each worker node in the cluster runs a DataNode, and these nodes are in charge of the system's data storage.
DataNodes serve client requests by performing read and write operations on the file system. They also carry out block creation, deletion, and replication according to the NameNode's instructions.
Each HDFS cluster has a single NameNode that stores metadata such as file names and the locations of their contents. The number of files a cluster can hold is therefore limited by this one node, which becomes a significant restriction when dealing with millions of small files, as is common in machine learning workloads.
HDFS accepts data in any format, regardless of schema, is optimised for high-bandwidth streaming, and scales to proven installations of 100 PB and beyond. It was designed primarily for large-scale data-processing workloads where scalability, flexibility, and performance are crucial.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.