We have over 4 billion users on the Internet today. In pure data terms, here’s how the picture looks:
That’s the amount of data we are dealing with right now – incredible! It is estimated that by the end of 2020 we will have produced 44 zettabytes of data. That’s 44*10^21!
This massive amount of data generated at a ferocious pace and in all kinds of formats is what we call today as Big data. But it is not feasible storing this data on the traditional systems that we have been using for over 40 years. To handle this massive data we need a much more complex framework consisting of not just one, but multiple components handling different operations.
We refer to this framework as Hadoop and together with all its components, we call it the Hadoop Ecosystem. But because there are so many components within this Hadoop ecosystem, it can become really challenging at times to really understand and remember what each component does and where does it fit in in this big world.
So, in this article, we will try to understand this ecosystem and break down its components.
By traditional systems, I mean systems like Relational Databases and Data Warehouses. Organizations have been using them for the last 40 years to store and analyze their data. But the data being generated today can’t be handled by these databases for the following reasons:
So, how do we handle Big Data? This is where Hadoop comes in!
People at Google also faced the above-mentioned challenges when they wanted to rank pages on the Internet. They found the Relational Databases to be very expensive and inflexible. So, they came up with their own novel solution. They created the Google File System (GFS).
GFS is a distributed file system that overcomes the drawbacks of the traditional systems. It runs on inexpensive hardware and provides parallelization, scalability, and reliability. This laid the stepping stone for the evolution of Apache Hadoop.
Apache Hadoop is an open-source framework based on Google’s file system that can deal with big data in a distributed environment. This distributed environment is built up of a cluster of machines that work closely together to give an impression of a single working machine.
Here are some of the important properties of Hadoop you should know:
Now, let’s look at the components of the Hadoop ecosystem.
In this section, we’ll discuss the different components of the Hadoop ecosystem.
It is the storage component of Hadoop that stores data in the form of files.
Each file is divided into blocks of 128MB (configurable) and stores them on different machines in the cluster.
It has a master-slave architecture with two main components: Name Node and Data Node.
To handle Big Data, Hadoop relies on the MapReduce algorithm introduced by Google and makes it easy to distribute a job and run it in parallel in a cluster. It essentially divides a single task into multiple tasks and processes them on different machines.
In layman terms, it works in a divide-and-conquer manner and runs the processes on the machines to reduce traffic on the network.
It has two important phases: Map and Reduce.
Map phase filters, groups, and sorts the data. Input data is divided into multiple splits. Each map task works on a split of data in parallel on different machines and outputs a key-value pair. The output of this phase is acted upon by the reduce task and is known as the Reduce phase. It aggregates the data, summarises the result, and stores it on HDFS.
YARN or Yet Another Resource Negotiator manages resources in the cluster and manages the applications over Hadoop. It allows data stored in HDFS to be processed and run by various data processing engines such as batch processing, stream processing, interactive processing, graph processing, and many more. This increases efficiency with the use of YARN.
HBase is a Column-based NoSQL database. It runs on top of HDFS and can handle any type of data. It allows for real-time processing and random read/write operations to be performed in the data.
Pig was developed for analyzing large datasets and overcomes the difficulty to write map and reduce functions. It consists of two components: Pig Latin and Pig Engine.
Pig Latin is the Scripting Language that is similar to SQL. Pig Engine is the execution engine on which Pig Latin runs. Internally, the code written in Pig is converted to MapReduce functions and makes it very easy for programmers who aren’t proficient in Java.
Hive is a distributed data warehouse system developed by Facebook. It allows for easy reading, writing, and managing files on HDFS. It has its own querying language for the purpose known as Hive Querying Language (HQL) which is very similar to SQL. This makes it very easy for programmers to write MapReduce functions using simple HQL queries.
A lot of applications still store data in relational databases, thus making them a very important source of data. Therefore, Sqoop plays an important part in bringing data from Relational Databases into HDFS.
The commands written in Sqoop internally converts into MapReduce tasks that are executed over HDFS. It works with almost all relational databases like MySQL, Postgres, SQLite, etc. It can also be used to export data from HDFS to RDBMS.
Flume is an open-source, reliable, and available service used to efficiently collect, aggregate, and move large amounts of data from multiple data sources into HDFS. It can collect data in real-time as well as in batch mode. It has a flexible architecture and is fault-tolerant with multiple recovery mechanisms.
There are a lot of applications generating data and a commensurate number of applications consuming that data. But connecting them individually is a tough task. That’s where Kafka comes in. It sits between the applications generating data (Producers) and the applications consuming data (Consumers).
Kafka is distributed and has in-built partitioning, replication, and fault-tolerance. It can handle streaming data and also allows businesses to analyze data in real-time.
Oozie is a workflow scheduler system that allows users to link jobs written on various platforms like MapReduce, Hive, Pig, etc. Using Oozie you can schedule a job in advance and can create a pipeline of individual jobs to be executed sequentially or in parallel to achieve a bigger task. For example, you can use Oozie to perform ETL operations on data and then save the output in HDFS.
In a Hadoop cluster, coordinating and synchronizing nodes can be a challenging task. Therefore, Zookeeper is the perfect tool for the problem.
It is an open-source, distributed, and centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services across the cluster.
Spark is an alternative framework to Hadoop built on Scala but supports varied applications written in Java, Python, etc. Compared to MapReduce it provides in-memory processing which accounts for faster processing. In addition to batch processing offered by Hadoop, it can also handle real-time processing.
Further, Spark has its own ecosystem:
With so many components within the Hadoop ecosystem, it can become pretty intimidating and difficult to understand what each component is doing. Therefore, it is easier to group some of the components together based on where they lie in the stage of Big Data processing.
I hope this article was useful in understanding Big Data, why traditional systems can’t handle it, and what are the important components of the Hadoop Ecosystem.
I encourage you to check out some more articles on Big Data which you might find useful:
Thanx Aniruddha for a thoughtful comprehensive summary of Big data Hadoop systems.
Hello Sir)Mam, I hope you are well. I am Vishnu kant shukla 3rd year student from computer science department in maharana pratap engineering college kanpur. I am interested in internship opportunity, If you would give me this opportunity, my teachers and my academics taught me of identifying the purpose of the task collection of data and tools that use preparing of method and implementing it and then finally applying it. This what i know apart from that i have done few projects in my college and i am interested in this area. Thank you so much, Vishnu kant shukla
This is awesome! Detailed analysis with enough background.