The Tale of Apache Hadoop YARN!

Shikha Gupta Last Updated : 03 Jun, 2022

This article was published as a part of the Data Science Blogathon.

Introduction

YARN stands for Yet Another Resource Negotiator. It is the resource management layer of Hadoop and is often described as a large-scale distributed operating system for Big Data analytics. Initially, it was described as a “Redesigned Resource Manager” because it separates the resource management function of MapReduce from the processing engine. Apart from resource management, YARN also handles job scheduling, workload management, the high availability features of Hadoop, the implementation of security controls, and the maintenance of a multi-tenant environment. Furthermore, to make the system more efficient, YARN allows various data processing engines, such as graph processing, interactive processing, stream processing, and batch processing, to run and process data stored in HDFS (Hadoop Distributed File System).

Why YARN?

Hadoop version 1.0, also known as MRv1 (MapReduce Version 1), combines data processing and resource management in a single framework. MRv1 has a single master, the Job Tracker, which performs job scheduling, resource allocation, and job monitoring. It assigns map and reduce tasks to several subordinate processes known as Task Trackers, and it receives periodic progress reports from them. Because a single Job Tracker carries all of this work, the design resulted in a scalability bottleneck. As a result, Hadoop 1.x had limitations such as delays in batch processing, inefficient utilization of computational resources, and scalability issues.

Moreover, Hadoop 1.x supported only MapReduce for processing big datasets. In 2012, Yahoo and Hortonworks introduced YARN in Hadoop version 2.0 to overcome these shortcomings. The intention behind YARN is to reduce the overhead on MapReduce by taking over resource management and job scheduling. With YARN, Hadoop can run non-MapReduce jobs within the cluster: alongside MapReduce batch tasks, it can run stream data processing and interactive querying.

Features of YARN

YARN is a popular tool due to the following features:

High Scalability: The architecture of the YARN Resource Manager allows Hadoop to scale to clusters of thousands of nodes, according to user requirements.
High-degree Compatibility: YARN runs applications written for the MapReduce framework without disruption, which makes it compatible with Hadoop 1.0.
Better Cluster Utilization: YARN allocates cluster resources to applications dynamically rather than in fixed slots, which improves the overall utilization of the Hadoop cluster.
Multi-tenancy: YARN allows multiple processing engines to access the same cluster at once, which gives the benefit of multi-tenancy.

Architecture and components of Hadoop YARN

Apache Hadoop 3.3.3 – Apache Hadoop YARN

source: https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html

Resource Manager

The Resource Manager, the master daemon of YARN, is responsible for the global assignment of resources such as CPU and memory among all the applications in the cluster. With the goal of maximum cluster utilization, it keeps resources in use while honoring constraints such as capacity guarantees, fairness, and SLAs. On receiving a processing request, it forwards parts of the request to the corresponding Node Managers, where the actual processing takes place, and allocates resources for the completion of the request accordingly. It acts as the arbitrator of cluster resources, deciding how the available resources are allocated to competing jobs. It consists of two parts:

Scheduler: The Scheduler allocates resources to the running applications based on their resource requirements, subject to the familiar constraints of capacities, queues, etc. It does not monitor or track the status of applications, hence it is known as a pure scheduler. It also offers no guarantee of restarting tasks that fail due to hardware or application failures. To partition the cluster resources among various queues and applications, it has a pluggable policy. Examples of such plug-ins are the Capacity Scheduler and the Fair Scheduler.
Application Manager: It maintains a list of applications that have been submitted, are currently running, or have finished. The Application Manager accepts job submissions and negotiates the first container for executing the application-specific Application Master. In addition, it starts the Application Master, monitors it, and restarts the Application Master container in case of failure.
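To make the idea of a "pure scheduler" concrete, here is a minimal Python sketch. It is a toy model, not YARN's Capacity Scheduler: the class names, the integer percentage shares, and the memory-only accounting are all assumptions made for illustration. The point it shows is that the scheduler only grants or refuses resources against a queue's share and keeps no record of what the application does afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class Queue:
    name: str
    capacity_pct: int        # hypothetical guaranteed share of the cluster, in percent
    used_mb: int = 0         # memory currently handed out from this queue


@dataclass
class ToyScheduler:
    """Allocates memory only; it never tracks or restarts applications."""
    cluster_mb: int
    queues: dict = field(default_factory=dict)

    def add_queue(self, name, capacity_pct):
        self.queues[name] = Queue(name, capacity_pct)

    def allocate(self, queue_name, request_mb):
        """Grant the request only if it fits inside the queue's share."""
        queue = self.queues[queue_name]
        limit_mb = self.cluster_mb * queue.capacity_pct // 100
        if queue.used_mb + request_mb > limit_mb:
            return False     # the caller has to wait or retry; nothing is tracked here
        queue.used_mb += request_mb
        return True

    def release(self, queue_name, mb):
        queue = self.queues[queue_name]
        queue.used_mb = max(0, queue.used_mb - mb)


scheduler = ToyScheduler(cluster_mb=10_000)
scheduler.add_queue("prod", 70)
scheduler.add_queue("dev", 30)
print(scheduler.allocate("dev", 2_000))   # True: fits in dev's 3,000 MB share
print(scheduler.allocate("dev", 2_000))   # False: would exceed the share
```

The real schedulers are more flexible than this (for example, queues can usually borrow idle capacity), so treat the hard limit above as a simplification.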

Node Manager

It is the per-node worker daemon of YARN, and its primary goal is to keep up-to-date with the Resource Manager. The Node Manager manages the application containers assigned to it by the Resource Manager. It monitors the containers’ resource usage (memory, CPU) and reports it to the Resource Manager. In addition, the Node Manager registers with the Resource Manager and then sends heartbeats that carry the health status of the node on which it is running. The Node Manager is also responsible for log management and for killing containers as directed by the Resource Manager.
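The heartbeat relationship described above can be sketched in a few lines of Python. This is an illustrative model only: `Heartbeat`, `NodeTracker`, and the 600-second expiry are hypothetical names and values chosen for the example, and timestamps are passed in by hand so the result is deterministic.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    node_id: str
    healthy: bool      # health status reported by the node
    used_mb: int       # memory used by the node's containers
    timestamp: float   # seconds; supplied by the caller in this sketch


class NodeTracker:
    """Resource Manager side: remembers the latest heartbeat per node."""

    def __init__(self, expiry_s):
        self.expiry_s = expiry_s
        self.latest = {}

    def receive(self, hb):
        self.latest[hb.node_id] = hb

    def usable_nodes(self, now):
        """Nodes that reported healthy and have not gone silent."""
        return sorted(
            node_id
            for node_id, hb in self.latest.items()
            if hb.healthy and now - hb.timestamp <= self.expiry_s
        )


tracker = NodeTracker(expiry_s=600)
tracker.receive(Heartbeat("node-1", healthy=True, used_mb=4096, timestamp=0))
tracker.receive(Heartbeat("node-2", healthy=False, used_mb=1024, timestamp=0))
tracker.receive(Heartbeat("node-3", healthy=True, used_mb=2048, timestamp=0))
tracker.receive(Heartbeat("node-3", healthy=True, used_mb=2048, timestamp=500))
print(tracker.usable_nodes(now=700))   # ['node-3']
```

Here node-1 is dropped because its last heartbeat is too old, and node-2 is dropped because it reported itself unhealthy.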

Containers

A container is a collection of physical resources such as memory (RAM), CPU cores, and disks on a single node. Containers are scheduled by the Resource Manager and monitored by the Node Manager. A container grants an application the right to use a definite amount of resources (memory, disk, CPU, etc.) on a particular host. YARN containers are managed through a Container Launch Context (CLC). The CLC is a record that carries information such as a map of environment variables, security tokens, dependencies stored in remotely accessible storage, the command required to create the process, and the payload for Node Manager services.
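As a rough picture of what such a record holds, the fields listed above can be written as a small Python data class. The class and field names here are hypothetical stand-ins, not the YARN API, and the HDFS path is a made-up example.

```python
from dataclasses import dataclass, field


@dataclass
class LaunchContext:
    """Illustrative stand-in for a Container Launch Context record."""
    commands: list                                        # command that creates the process
    environment: dict = field(default_factory=dict)       # map of environment variables
    local_resources: dict = field(default_factory=dict)   # name -> remote location of a dependency
    tokens: bytes = b""                                   # security tokens
    service_data: dict = field(default_factory=dict)      # payload for Node Manager services


@dataclass
class ContainerSpec:
    """A definite amount of resources on one particular host."""
    host: str
    memory_mb: int
    vcores: int
    launch_context: LaunchContext


def describe(spec):
    cmd = " ".join(spec.launch_context.commands)
    return f"{spec.host}: {spec.memory_mb} MB, {spec.vcores} vcores -> {cmd}"


clc = LaunchContext(
    commands=["python", "task.py"],
    environment={"TASK_ID": "7"},
    local_resources={"task.py": "hdfs:///apps/demo/task.py"},  # made-up path
)
spec = ContainerSpec(host="node-3", memory_mb=2048, vcores=2, launch_context=clc)
print(describe(spec))   # node-3: 2048 MB, 2 vcores -> python task.py
```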

Application Master

An application is a single job submitted to a framework, and each application has its own Application Master, which is a framework-specific entity. The chief responsibilities of the Application Master are negotiating resources with the Resource Manager, tracking status, and monitoring the progress of its application. It manages faults and works with the Node Managers to execute and monitor the component tasks. To start a container, the Application Master sends the Node Manager a Container Launch Context (CLC), which includes everything the application needs to run. Once the application has started, the Application Master periodically sends heartbeats to the Resource Manager to affirm its health and to update the record of its resource demands.
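The fault-handling and progress-tracking role can be illustrated with a toy Application Master in Python. Everything here is an assumption made for the sketch: the class, the state names, and the retry limit of two attempts are not taken from YARN or from any particular framework.

```python
class ToyApplicationMaster:
    """Tracks the tasks of one application and retries failed ones."""

    def __init__(self, task_ids, max_attempts=2):
        self.max_attempts = max_attempts
        self.attempts = {t: 0 for t in task_ids}
        self.state = {t: "PENDING" for t in task_ids}

    def tasks_needing_containers(self):
        """Tasks for which a container would be requested from the Resource Manager."""
        return [t for t, s in self.state.items() if s == "PENDING"]

    def on_container_started(self, task_id):
        self.attempts[task_id] += 1
        self.state[task_id] = "RUNNING"

    def on_container_finished(self, task_id, succeeded):
        if succeeded:
            self.state[task_id] = "DONE"
        elif self.attempts[task_id] < self.max_attempts:
            self.state[task_id] = "PENDING"     # ask for a fresh container
        else:
            self.state[task_id] = "FAILED"      # attempts exhausted

    def progress(self):
        done = sum(1 for s in self.state.values() if s == "DONE")
        return done / len(self.state)


am = ToyApplicationMaster(["map-0", "map-1"])
for task in am.tasks_needing_containers():
    am.on_container_started(task)
am.on_container_finished("map-0", succeeded=True)
am.on_container_finished("map-1", succeeded=False)
print(am.tasks_needing_containers(), am.progress())   # ['map-1'] 0.5
```

The failed task goes back to the pending list, so the next round of negotiation would request a new container for it.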

Application Workflow in Hadoop YARN

An application runs through Hadoop YARN in the following steps.

Apache Hadoop YARN | Sequence of Execution

source: https://www.softwaretestinghelp.com/what-is-hadoop-yarn/

Step 1:- Submission:

The client connects with the Resource Manager to submit the YARN application.

Step 2:- Container allocation:

To launch the Application Master, the Resource Manager finds a Node Manager and allocates a container on it.

Step 3:- Registration:

In this step, the Application Master registers itself with the Resource Manager.

Step 4:- Negotiation:

The Application Master negotiates containers from the Resource Manager.

Step 5:- Notification:

The Application Master notifies the Node Managers to launch the containers.

Step 6:- Execution:

The application code executes in the containers. While it runs, the Application Master can request more containers from the Resource Manager.

Step 7:- Status Monitoring:

To monitor the application’s status, the client contacts the Resource Manager or communicates directly with the Application Master.

Step 8:- Disconnection:

Once the processing is complete, the Application Master unregisters from the Resource Manager.
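The eight steps above can be read as a small state machine. The Python sketch below models them that way; the state names are invented for this example and are not YARN's internal application states.

```python
STEPS = [
    "SUBMITTED",        # Step 1: client submits the application
    "AM_ALLOCATED",     # Step 2: container for the Application Master is allocated
    "AM_REGISTERED",    # Step 3: Application Master registers with the Resource Manager
    "NEGOTIATING",      # Step 4: Application Master negotiates containers
    "LAUNCHING",        # Step 5: Node Managers are told to launch containers
    "RUNNING",          # Step 6: application code executes in the containers
    "FINISHED",         # Step 8: Application Master unregisters
]


class ApplicationLifecycle:
    def __init__(self):
        self.index = 0
        self.history = [STEPS[0]]

    @property
    def status(self):
        # Step 7: a client can poll the status at any time
        return STEPS[self.index]

    def advance(self):
        if self.index == len(STEPS) - 1:
            raise RuntimeError("application already finished")
        self.index += 1
        self.history.append(STEPS[self.index])
        return self.status

    def request_more_containers(self):
        """Step 6 lets a running application go back to negotiation."""
        if self.status != "RUNNING":
            raise RuntimeError("only a running application can request more")
        self.index = STEPS.index("NEGOTIATING")
        self.history.append(self.status)


app = ApplicationLifecycle()
while app.status != "RUNNING":
    app.advance()
app.request_more_containers()
print(app.history[-2:])   # ['RUNNING', 'NEGOTIATING']
```

Status monitoring (Step 7) is modeled as a read-only property because it does not change the application's state; it can happen at any point in the sequence.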

YARN Command Line Interface

Most YARN commands are intended for administrators, but developers can also run a few of them. These are:

Help

Prints the usage message, which lists the commands that the yarn CLI supports.

Syntax:

yarn -help

Version

Prints the version of Hadoop YARN you are working with.

Syntax:

yarn version

Application ID

Prints the logs of the application with the given application ID.

Syntax:

yarn logs -applicationId <application ID>
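When the logs command is called from a script, it helps to validate the application ID before invoking the CLI. The Python sketch below is one way to do that. The helper names are hypothetical, the ID in the example is made up, and the regular expression assumes the usual `application_<cluster timestamp>_<sequence number>` form. Note that `shutil.which("yarn")` only checks that some executable named `yarn` is on the PATH, not that it is the Hadoop one.

```python
import re
import shutil
import subprocess

# Assumed shape of an application ID: application_<timestamp>_<sequence>
APP_ID_PATTERN = re.compile(r"^application_\d+_\d+$")


def logs_command(application_id):
    """Build the argument list for `yarn logs -applicationId <id>`."""
    if not APP_ID_PATTERN.match(application_id):
        raise ValueError(f"not a YARN application id: {application_id!r}")
    return ["yarn", "logs", "-applicationId", application_id]


def fetch_logs(application_id):
    """Run the command if a yarn executable is found; otherwise return None."""
    if shutil.which("yarn") is None:
        return None
    result = subprocess.run(
        logs_command(application_id), capture_output=True, text=True, check=False
    )
    return result.stdout


# Only the command is built here; nothing is executed.
print(logs_command("application_1654236000000_0001"))
```

Passing the arguments as a list avoids shell quoting problems, and the validation step stops a malformed ID from reaching the cluster at all.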

Conclusion

YARN is one of the most powerful concepts of Hadoop 2.x. This article has covered the important concepts of YARN and how they work together to run applications. Key learnings are:

We discussed YARN and its features.
Then, we learned about the architecture and components of YARN.
Finally, we learned how an application flows through Hadoop YARN.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Shikha Gupta

My full name is Shikha Gupta, and I am pursuing a B.Tech in computer science at Banasthali Vidhyapeeth, Rajasthan.
I am from East Champaran in Bihar.
My areas of interest include deep learning, NLP, Java, data structures, DBMS, and many more.
