Data researchers no longer depend only on interviews, surveys, and observational studies to collect data. Instead, they have switched to faster ways of data collection, which include leveraging the internet, cameras, smartphones, drones, bots and many more.
Later, the collected data is used by organizations and governments to make business decisions. But before that, they require a system which can store and secure such big data sets. One such system is the Hadoop Distributed File System, commonly known as HDFS.
If you are new to HDFS architecture, let me provide a simple picture of what it does:
In simple terms, HDFS stores raw data in such a manner that the data is easily extracted at a later stage. Data is spread across the machines of a cluster to enable extraction in parallel. After the data is stored in HDFS, we use SQL-like query languages, available through tools that run on top of Hadoop, to extract the data and use it for analysis / modeling purposes.
In this tutorial, we will look into basic concepts of Hadoop Distributed File System and the various components that make up HDFS. In the end, we’ll learn about a data integration tool DMX-h for practical understanding.
The ‘D’ in HDFS stands for ‘Distributed’: files are broken into smaller blocks, and the blocks are stored on different machines in a cluster. This provides the much needed pace and scalability for big data processing.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications which have large data sets.
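To make the idea of distribution concrete, here is a minimal Python sketch of splitting a file into fixed-size blocks and spreading copies of each block over several machines. It is only an illustration: the function names, the node names, the tiny block size and the round-robin placement rule are all assumptions made for this example. Real HDFS uses much larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, nodes: list, replication: int) -> dict:
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is a toy policy assumed for this sketch,
    not the policy HDFS actually uses.
    """
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


data = b"abcdefghijklmnopqrstuvwxyz"          # a 26-byte "file"
blocks = split_into_blocks(data, block_size=8)
placement = place_blocks(len(blocks), ["node1", "node2", "node3"], replication=2)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
```

Because every block lives on more than one machine, losing a single machine does not lose the file, and different blocks can be read from different machines at the same time.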
It has a well-structured architecture and works together with the MapReduce technique, which Hadoop uses for processing large data sets in a distributed manner. Let’s start with the architecture.
There are 3 major components which support the HDFS architecture, namely Client machines, Master nodes, and Slave nodes. Let’s understand each of them:
Master Node: It oversees the two key functional pieces that make up Hadoop: storing lots of data (HDFS), and running parallel computations on all that data (MapReduce).
Slave Node: Slave nodes make up the vast majority of machines and do all the dirty work of storing the data and running the computations, hence the name ‘Slave’.
Client Machines: Client machines are those which have Hadoop installed with the default cluster settings. They assist in loading data and steer the complete cycle of data processing.
Let’s dive deep into nodes. A Hadoop cluster is made up of two kinds of nodes: a master node and multiple worker nodes. In a small cluster, the master node consists of a Job Tracker, Task Tracker, NameNode and DataNode.
The NameNode manages the file system metadata, and the DataNodes store the actual data in the Hadoop file system. A functional filesystem has more than one DataNode, with data replicated across them. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes.
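The split between metadata and data can be sketched in a few lines of Python. The toy below is an assumption-laden illustration, not Hadoop's API: the class names simply mirror the terms above, the block size and replication factor are made up, and for simplicity the read goes through the NameNode object, whereas in real HDFS the client asks the NameNode for block locations and then reads from the DataNodes directly.

```python
class DataNode:
    """Stores block contents; knows nothing about file names."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}      # block_id -> bytes
        self.alive = True


class NameNode:
    """Stores only metadata: which blocks form a file and where replicas live."""

    def __init__(self, datanodes, replication=2):
        self.datanodes = datanodes
        self.replication = replication
        self.files = {}       # filename -> list of block ids
        self.locations = {}   # block_id -> list of DataNode objects
        self._next_id = 0

    def write(self, filename, data, block_size=4):
        block_ids = []
        for offset in range(0, len(data), block_size):
            block_id = self._next_id
            self._next_id += 1
            # Toy placement: rotate the starting node for every block.
            start = block_id % len(self.datanodes)
            targets = [
                self.datanodes[(start + r) % len(self.datanodes)]
                for r in range(self.replication)
            ]
            for node in targets:
                node.blocks[block_id] = data[offset:offset + block_size]
            self.locations[block_id] = targets
            block_ids.append(block_id)
        self.files[filename] = block_ids

    def read(self, filename):
        parts = []
        for block_id in self.files[filename]:
            # Use the first replica whose DataNode is still alive.
            replica = next((n for n in self.locations[block_id] if n.alive), None)
            if replica is None:
                raise IOError(f"no live replica for block {block_id}")
            parts.append(replica.blocks[block_id])
        return b"".join(parts)


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
namenode = NameNode(nodes, replication=2)
namenode.write("report.txt", b"hello hadoop")

nodes[0].alive = False                 # simulate one DataNode failing
recovered = namenode.read("report.txt")
print(recovered)
```

Even with one DataNode down, the file is still readable because every block has a second replica on another node. This is the essence of the fault tolerance mentioned earlier.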
Below are some important points which every big data / database analyst must know about HDFS:
MapReduce is a processing technique and a programming model for distributed computing based on Java. The MapReduce algorithm contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Reduce takes the output from a map as an input and combines those data tuples into a smaller set of tuples.
As the sequence of the name MapReduce implies, the reduce task is always performed after the map job.
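The classic way to see Map and Reduce in action is a word count. The sketch below runs on a single machine in plain Python, so it only imitates the flow. The function names and the sample lines are assumptions for illustration and are not part of Hadoop's Java API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, which happens between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single tuple."""
    return key, sum(values)


lines = ["big data needs big storage", "hadoop stores big data"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

Note that the reduce step only starts once the mapped pairs have been grouped by key, which matches the ordering described above.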
The major advantage of using MapReduce is that it is easy to scale data processing over multiple computing nodes.
MapReduce can be used to perform intensive operations such as Change Data Capture (CDC). Change Data Capture is a processing-intensive methodology used to make current data available to users. Because it is so processing intensive, it often makes sense to perform the processing on Hadoop as opposed to Teradata or other platforms.
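To give a feel for what Change Data Capture involves, here is a small Python sketch that compares two snapshots of a table and classifies the differences. Snapshot comparison is just one common way to do CDC and is an assumption here, as are the function name and the sample records. On real data volumes, this comparison is the expensive part that one would push to Hadoop.

```python
def capture_changes(previous, current):
    """Compare two snapshots keyed by primary key and classify the differences."""
    inserts = {k: current[k] for k in current.keys() - previous.keys()}
    deletes = {k: previous[k] for k in previous.keys() - current.keys()}
    updates = {
        k: current[k]
        for k in current.keys() & previous.keys()
        if current[k] != previous[k]
    }
    return inserts, updates, deletes


# Hypothetical snapshots: primary key -> (name, state)
previous = {101: ("Alice", "NY"), 102: ("Bob", "TX"), 103: ("Carol", "CA")}
current = {101: ("Alice", "NY"), 102: ("Bob", "FL"), 104: ("Dave", "WA")}

inserts, updates, deletes = capture_changes(previous, current)
print("inserted:", inserts)
print("updated: ", updates)
print("deleted: ", deletes)
```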
Below is the complete process flow of MapReduce technique:
MapReduce – It steers the complete processing of data.
So far, we have discussed the theoretical aspects. Let’s now practice data integration using DMX-h in order to develop an even better understanding.
One of the key functions involved in working with Hadoop / big data is ETL (Extract, Transform, Load). In addition, we still move data from HDFS / MapReduce outputs to, let’s say, other traditional data warehouses; in other words, we perform data integration. One of the tools available in the market today for this is DMX-h from Syncsort.
It’s not an open source tool. Hence, you need to create a one-time account before you can install it.
There are 2 major components of DMX-h: the DMX-h Task Editor and the DMX-h Job Editor.
The DMX-h Task Editor provides the feature to create tasks that deal with movement of data. If you are an experienced professional, the interface in certain ways might remind you of SQL Server Integration Services (SSIS).
The following screenshots explain how data can be copied from source to target.
1. Right-click on Source and Target to add the source and destination locations of the files.
2. Now move to target file. It simply means assigning a file to the target location.
3. Once the source and target file locations have been assigned, the task is saved in the DMX-h Task Editor.
Now, we move to the DMX-h Job Editor, where we can load all the tasks created in the DMX-h Task Editor. Then, we connect them according to the data transformation requirements.
4. In the screenshot below, we load the Sample_File_Copy task created in the Task Editor into the Job Editor.
5. Now, we’ll run the job in the editor and look at the results of its execution (for privacy purposes, the username is masked).
Once the job has been completed, we should verify that the source file has been successfully copied to the target location. Here I have demonstrated a simple example; it can be expanded to perform CDC operations and MapReduce functionality.
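If you want to try the same copy-and-verify idea without DMX-h, the steps can be imitated in plain Python. This is not how DMX-h works internally. It is only a sketch that copies a file and compares checksums of source and target, and the file names and helper functions are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, target):
    """Copy source to target and confirm that both files have the same checksum."""
    shutil.copyfile(source, target)
    return sha256_of(source) == sha256_of(target)


with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "sample_source.txt"   # hypothetical file names
    target = Path(workdir) / "sample_target.txt"
    source.write_text("id,name\n1,alpha\n2,beta\n")

    copied_ok = copy_and_verify(source, target)
    target_text = target.read_text()

print("copy verified:", copied_ok)
```

Comparing checksums rather than only checking that the target file exists also catches partial or corrupted copies.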
6. The screenshot below shows how we can have different tasks (sort, split) in a job editor and also how we use the MapReduce component in the Job Editor.
I hope this article provided you with the basic construct of HDFS and how tools like DMX-h can be used to perform different file operations on a Hadoop cluster. In the end, I would like to thank Debra Aaron (from Bank of America) for providing crucial inputs to the article.
In this article, we discussed the basics of HDFS and MapReduce and their practical implementation using the DMX-h tool. While writing this article, I was keen to understand the role of open source tools in Big Data. If any of you have experience with them, I would love to interact in the comments.
About Author
Ramdas Narayanan is currently working as a Data Architect at Bank of America with focus on Business Intelligence and Data integration for Mortgage Technology. He has more than 22 years of experience in Information Technology, ERP and Relational Database Systems.
Did you like reading this article? Do share your experience / suggestions in the comments section below.