Big data processing is crucial today. Big data analytics and machine learning help corporations anticipate customer demands, deliver useful recommendations, and more. Hadoop, the open-source software framework for scalable, distributed computation over massive data sets, makes this practical. While MapReduce, Hive, Pig, and Cascading are all useful tools, it is seldom possible to complete all the necessary processing or computation in a single job. Many MapReduce jobs are therefore chained together, producing and consuming intermediate data and managing the processing flow. To address this, Yahoo developed Oozie, a framework for managing multi-step workflows built from MapReduce, Pig, and other jobs.
Apache Oozie is a workflow scheduler system for running and managing Hadoop jobs in a distributed environment. It allows multiple complex jobs to be run in sequence to accomplish a larger task, and two or more tasks in a job sequence can also be programmed to run in parallel. It is essentially an open-source Java web application licensed under the Apache 2.0 license. Oozie is in charge of triggering workflow actions, which are then executed by the Hadoop processing engine. As a result, Oozie can leverage the existing Hadoop infrastructure for load balancing, fail-over, and so on.
It can be used to quickly schedule MapReduce, Sqoop, Pig, or Hive jobs. Many different types of jobs can be combined with Apache Oozie, and a job pipeline of one’s choice can be assembled quickly.
It is also quite flexible: jobs can be started, suspended, killed, and restarted with ease, and rerunning failed workflows is straightforward.
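For reference, these operations map onto subcommands of the oozie command-line tool used later in this article; the job ID below is a placeholder, and rerunning typically also needs the job’s configuration (for example, to specify which failed nodes to rerun):

% oozie job -suspend <job-id>
% oozie job -resume <job-id>
% oozie job -kill <job-id>
% oozie job -rerun <job-id> -config job.properties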
Oozie runs as a service in the Hadoop cluster, and clients on client machines submit workflow definitions to it for immediate or later processing. A workflow is made up of action nodes and control-flow nodes.
An action node represents a workflow task, such as moving files into HDFS, running a MapReduce, Pig, or Hive job, importing data with Sqoop, or running a shell script or a Java program.
A control-flow node manages the execution of the workflow between actions, for example through conditional logic that follows different branches depending on the result of an earlier action node.
This group of nodes includes the Start Node, End Node, and Error Node.
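To make these node types concrete, here is a minimal sketch of what a workflow.xml might look like; the workflow name, node names, schema version, and input/output paths are illustrative placeholders rather than values taken from this article:

<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.4">
    <!-- Start node: points at the first action to run -->
    <start to="mapreduce-node"/>

    <!-- Action node: runs a MapReduce job, then transitions on success or failure -->
    <action name="mapreduce-node">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>/user/example/input</value>   <!-- placeholder input path -->
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>/user/example/output</value>  <!-- placeholder output path -->
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <!-- Control-flow nodes: kill (error) and end -->
    <kill name="fail">
        <message>MapReduce action failed</message>
    </kill>
    <end name="end"/>
</workflow-app>

The ${nameNode} and ${jobTracker} parameters are resolved from the job properties file shown later in this article.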
At the end of a workflow, Oozie notifies the client of the workflow status through an HTTP callback. Callbacks can also be triggered when the workflow enters or exits an action node.
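The callback URL itself is supplied through the job configuration. As a rough sketch (the property names below come from standard Oozie configuration and should be checked against your Oozie version; the host and port are placeholders), the job properties can include lines such as:

oozie.wf.workflow.notification.url=http://example-host:8080/callback?jobId=$jobId&status=$status
oozie.wf.action.notification.url=http://example-host:8080/callback?jobId=$jobId&node=$nodeName&status=$status

Oozie substitutes the $jobId, $nodeName, and $status tokens before invoking the URL; if callbacks are missed, Oozie can still determine job completion by polling.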
A workflow application consists of the workflow definition plus all of its associated resources, such as Pig scripts and MapReduce JAR files. The workflow application must follow a simple directory structure and is deployed to HDFS so that Oozie can access it.
Directory Structure
/
├── lib/
│   └── hadoop-application-examples.jar
└── workflow.xml
workflow.xml (the workflow definition file) must be kept in the top-level (parent) directory, while JAR files containing the MapReduce classes go in the lib directory. A workflow application that follows this layout can be built with any build tool, such as Ant or Maven.
The command for copying the workflow application to HDFS:
% hadoop fs -put hadoop-examples/target/ <name-of-workflow>
To run the jobs, we use the Oozie command-line tool, a client program that communicates with the Oozie server.
Step 1: Export the OOZIE_URL environment variable to tell the oozie command which Oozie server to use:
% export OOZIE_URL="http://localhost:11000/oozie"
Step 2: Run the Oozie workflow job with the -config option, which points to a local Java properties file. The file contains definitions for the parameters used in the workflow XML file.
% oozie job -config ch05/src/main/resources/max-temp-workflow.properties -run
Step 3: Set the oozie.wf.application.path property, which tells Oozie where the workflow application lives in HDFS:
nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
oozie.wf.application.path=${nameNode}/user/${user.name}/
Step 4: The status of a workflow job can be checked with the ‘job’ subcommand and the ‘-info’ option, passing the job ID after ‘-info’, as shown below. Depending on the job’s state, RUNNING, KILLED, or SUCCEEDED will be shown.
% oozie job -info <job-id>
Step 5: To see the result of a successful workflow run, we run the following Hadoop command:
% hadoop fs -cat <output-path>
In this article, we learned how to package, deploy, and run a workflow application. Oozie initiates workflow actions and relies on the Hadoop processing engine to execute the individual tasks, which lets it use the existing Hadoop infrastructure for load balancing, fail-over, and so on. Oozie determines the completion of tasks through callbacks and polling.
We hope you liked this post; please share your thoughts in the comments below.