Today we have an abundance of Hadoop jobs running continuously, and scheduling them by hand is not practical; we need a scheduler to handle this flow. Apache Oozie is one such job scheduler: it allows users to run, schedule, and manage Hadoop jobs in a distributed environment.
Source: informationit27.medium.com
Oozie is a scalable, extensible, and reliable system that lets users run multiple jobs in parallel, so several jobs can be combined to accomplish a larger task. Oozie is known for its smooth integration with the Hadoop stack, which allows it to run various Hadoop-related jobs such as Pig, Hive, and Sqoop.
In this blog, I discuss five interview-winning questions that will help you get up to speed with Apache Oozie and ace your upcoming interview!
Learning Objectives
Below is what we’ll learn after reading this blog thoroughly:
Overall, by reading this guide, we will gain a comprehensive understanding of how to use Oozie to schedule jobs.
This article was published as a part of the Data Science Blogathon.
We can schedule jobs by cascading them one after another, but when a job fails for any reason, we cannot restart it from the point of failure. Instead, we have to restart the entire process, which is inefficient and time-consuming. We also lack flexibility such as starting, stopping, suspending, or re-running a job.
The purpose of Apache Oozie is to efficiently manage the multiple types of jobs processed in a Hadoop system.
Oozie is a Java web application that runs in a Java servlet container. It allows us to execute multiple independent jobs simultaneously, run jobs back to back in a specific sequence, run jobs at a defined time, and control jobs from anywhere.
Users define their jobs as a Directed Acyclic Graph (DAG) with dependencies between them, and Oozie uses this information to perform the assigned tasks in the order given in the workflow. That's how Oozie saves us time and energy: it manages the entire workflow, which plain job cascading cannot do.
Special Features of Oozie
1. Email Notification: Oozie provides email notifications that can be sent upon the completion of jobs.
2. Web Services API: Oozie supports a web services API, enabling us to control jobs from anywhere.
3. Client API: Oozie provides a command-line interface and a Java client API to launch, control, and monitor jobs, including from a Java application.
4. Periodic Run: Oozie allows us to execute the scheduled jobs periodically.
An Apache Oozie workflow is a collection of actions arranged in a control dependency DAG (Directed Acyclic Graph). The DAG controls how and when each action runs. Oozie workflow definitions are written in hPDL (an XML Process Definition Language).
Major components of Apache Oozie Workflow
The two key components of an Apache Oozie workflow are:
Control Flow Nodes: Control flow nodes define the start and end of the workflow (the start, end, and kill nodes; kill is the node that ends a workflow in failure). They also provide a mechanism to control the execution path of the workflow (decision, fork, and join).
Action Nodes: Action nodes trigger the execution of a computation or processing task. Through them, Oozie supports different types of Hadoop actions, including Hadoop MapReduce, Hadoop file system, Pig, etc. Oozie also supports system-defined jobs like SSH, HTTP, email, etc.
Source: aws.amazon.com
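To make the two kinds of node concrete, here is a minimal sketch of a workflow definition, written to disk by a small shell script. The directory name `oozie-demo-app`, the node names (`split`, `make-dir-a`, `make-dir-b`, `merge`, `fail`), the HDFS paths, and the `nameNode` parameter are all assumptions chosen for illustration; only the element names (`start`, `fork`, `join`, `action`, `kill`, `end`) come from hPDL itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical local directory for the workflow application files.
app_dir="${APP_DIR:-./oozie-demo-app}"
mkdir -p "${app_dir}"

# The quoted 'EOF' keeps the shell from expanding ${...}; those
# expressions are meant to be resolved by Oozie, not by bash.
cat > "${app_dir}/workflow.xml" <<'EOF'
<workflow-app xmlns="uri:oozie:workflow:0.5" name="demo-wf">
    <!-- Control flow node: entry point of the workflow -->
    <start to="split"/>

    <!-- Control flow node: run the two actions in parallel -->
    <fork name="split">
        <path start="make-dir-a"/>
        <path start="make-dir-b"/>
    </fork>

    <!-- Action nodes: each one performs a Hadoop file system task -->
    <action name="make-dir-a">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie-demo/a"/>
        </fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>
    <action name="make-dir-b">
        <fs>
            <mkdir path="${nameNode}/tmp/oozie-demo/b"/>
        </fs>
        <ok to="merge"/>
        <error to="fail"/>
    </action>

    <!-- Control flow node: wait for both parallel paths to finish -->
    <join name="merge" to="end"/>

    <!-- Control flow node: end the workflow in failure -->
    <kill name="fail">
        <message>Workflow failed at [${wf:lastErrorNode()}]</message>
    </kill>

    <!-- Control flow node: successful end of the workflow -->
    <end name="end"/>
</workflow-app>
EOF

echo "Wrote ${app_dir}/workflow.xml"
```

Each action names where to go on success (`ok`) and on failure (`error`), which is how the control dependency DAG is expressed. The file would still need to be uploaded to HDFS before Oozie can run it.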
Apache Oozie Workflow Job States
Below are the various states defined for an Oozie workflow job:
1. PREP: It is the initial state of the workflow job, where the job has been created (defined) but has not started running yet.
2. RUNNING: It is the main execution state where the job begins to run and stays there until it reaches the end state, an error occurs, or the job is suspended due to some conditions.
3. SUSPENDED: A job reaches the suspended state if there is any issue occurring in the running time or someone explicitly suspends the job. A job can move from the suspended state to the running or killed state.
4. SUCCEEDED: As soon as the job hits the end node, the workflow job becomes successful.
5. KILLED: As soon as the administrator kills a workflow job in the prep, running, or suspended state, it moves to the killed state.
6. FAILED: When any workflow job fails due to an unexpected error in the running state, it reaches the failed state.
Source: www.cloudduggu.com
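The state descriptions above can be summarized as a small transition table. The sketch below encodes it as two shell functions; `allowed_next` and `can_transition` are hypothetical helper names for illustration and are not part of Oozie.

```shell
#!/usr/bin/env bash

# allowed_next STATE: print the states a workflow job may move to from
# STATE, following the state descriptions in this article.
allowed_next() {
    case "$1" in
        PREP)      echo "RUNNING KILLED" ;;
        RUNNING)   echo "SUSPENDED SUCCEEDED KILLED FAILED" ;;
        SUSPENDED) echo "RUNNING KILLED" ;;
        SUCCEEDED|KILLED|FAILED) echo "" ;;  # end states: no way out
        *) echo "unknown state: $1" >&2; return 1 ;;
    esac
}

# can_transition FROM TO: exit 0 only if FROM -> TO is an allowed move.
can_transition() {
    local next
    next="$(allowed_next "$1")" || return 1
    case " ${next} " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

can_transition PREP RUNNING && echo "PREP -> RUNNING: allowed"
can_transition SUCCEEDED RUNNING || echo "SUCCEEDED -> RUNNING: not allowed"
```

Note that SUCCEEDED, KILLED, and FAILED are end states: once a job reaches one of them, it does not move to any other state.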
# Submit and start a workflow job (172.20.95.107 is the Oozie server node)
$ oozie job -oozie http://172.20.95.107:11000/oozie \
    -config job.properties -run
# Check the status of a job (172.20.95.107 is the Oozie server node)
$ oozie job -oozie http://172.20.95.107:11000/oozie \
    -info <job id>
<start to="[START-NODE-NAME]" />
<end name="[END-NODE-NAME]"/>
<kill name="[KILL-NODE-NAME]">
    <message>[Any custom message]</message>
</kill>
# Start a job that was submitted earlier (172.20.95.107 is the Oozie server node)
$ oozie job -oozie http://172.20.95.107:11000/oozie \
    -start <job-id>
# Submit a job without starting it; it stays in the PREP state until started
$ oozie job -oozie http://172.20.95.107:11000/oozie \
    -config job.properties -submit
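The `-run` and `-submit` commands above read their settings from `job.properties`. As a rough sketch of what such a file might contain, the script below writes a minimal one. The host names, ports, the `oozie-demo` directory, and the HDFS path are placeholders, and `nameNode` and `jobTracker` are just conventional parameter names; `oozie.wf.application.path` is the property that tells Oozie where the workflow application lives in HDFS.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical local location for the properties file.
props_dir="${PROPS_DIR:-./oozie-demo}"
mkdir -p "${props_dir}"

# The quoted 'EOF' keeps ${...} literal so Oozie resolves it, not bash.
cat > "${props_dir}/job.properties" <<'EOF'
# Placeholder cluster endpoints: replace with the values for your cluster.
nameNode=hdfs://namenode.example.com:8020
jobTracker=resourcemanager.example.com:8032

# HDFS directory that contains workflow.xml (placeholder path).
oozie.wf.application.path=${nameNode}/user/${user.name}/oozie-demo-app
EOF

echo "Wrote ${props_dir}/job.properties"
```

With this file in place, the `-config` option in the commands above would point at it, for example `-config ./oozie-demo/job.properties`.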
This blog covers some of the frequently asked Apache Oozie interview questions that could come up in data science and big data developer interviews. Using these questions as a reference, you can better understand the concepts behind Apache Oozie and start formulating effective answers for upcoming interviews. The key takeaways from this Oozie blog are:
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.