Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. It is the most actively developed open-source engine for this task, making it a standard tool for any developer or data scientist interested in big data.
Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and to scale up to very large data processing workloads.
With more than 500 contributors from across 200 organizations and a user base of 225,000+ members, Apache Spark has become one of the most in-demand big data frameworks across all major industries. E-commerce companies like Alibaba, social networking companies like Tencent, and the Chinese search engine Baidu all run Apache Spark operations at scale.
This article is a single-stop resource that gives an overview of the Spark architecture with the help of a Spark architecture diagram.
Below are the high-level components of the architecture of an Apache Spark application:
The driver is the process “in the driver seat” of your Spark Application. It is the controller of the execution of a Spark Application and maintains all of the state of the Spark cluster (the state and tasks of the executors). It must interface with the cluster manager in order to actually get physical resources and launch executors.
At the end of the day, this is just a process on a physical machine that is responsible for maintaining the state of the application running on the cluster.
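For illustration only, here is a minimal sketch of where the driver comes from in application code: building a SparkSession starts the driver process, and stopping it ends the application. The object and application names are placeholders.

```scala
import org.apache.spark.sql.SparkSession

object DriverDemo {
  def main(args: Array[String]): Unit = {
    // Building a SparkSession starts the driver process for this application.
    // From here on, the driver tracks the application's state and schedules work.
    val spark = SparkSession.builder()
      .appName("architecture-demo") // placeholder application name
      .getOrCreate()

    println(s"Driver is running Spark ${spark.version}")

    // Stopping the session shuts down the driver and releases cluster resources.
    spark.stop()
  }
}
```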
Spark executors are the processes that perform the tasks assigned by the Spark driver. Executors have one core responsibility: take the tasks assigned by the driver, run them, and report back their state (success or failure) and results. Each Spark Application has its own separate executor processes.
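How many executors an application gets, and how big they are, is typically set through configuration when the application is submitted. A minimal sketch, with placeholder values:

```scala
import org.apache.spark.sql.SparkSession

// A sketch of requesting executor resources through configuration.
// The numbers (2 cores, 4 GB, 3 executors) are illustrative placeholders.
val spark = SparkSession.builder()
  .appName("executor-config-demo")
  .config("spark.executor.cores", "2")     // cores per executor process
  .config("spark.executor.memory", "4g")   // heap memory per executor
  .config("spark.executor.instances", "3") // how many executors to request
  .getOrCreate()
```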
The Spark Driver and Executors do not exist in a void, and this is where the cluster manager comes in. The cluster manager is responsible for maintaining a cluster of machines that will run your Spark Application(s). Somewhat confusingly, a cluster manager will have its own “driver” (sometimes called master) and “worker” abstractions.
The core difference is that these are tied to physical machines rather than processes (as they are in Spark). The machine on the left of the illustration is the Cluster Manager Driver Node. The circles represent daemon processes running on and managing each of the individual worker nodes. There is no Spark Application running as of yet—these are just the processes from the cluster manager.
When the time comes to actually run a Spark Application, we request resources from the cluster manager to run it. Depending on how our application is configured, this can include a place to run the Spark driver or might be just resources for the executors for our Spark Application. Over the course of Spark Application execution, the cluster manager will be responsible for managing the underlying machines that our application is running on.
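One way to see this ongoing role is dynamic allocation, where the cluster manager is asked to add and remove executors as the workload changes. The sketch below assumes a cluster manager and Spark version that support it; the bounds are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Dynamic allocation: the cluster manager grows and shrinks the executor pool
// over the life of the application. The bounds below are placeholders.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  // Dynamic allocation also needs either an external shuffle service or
  // shuffle tracking, depending on your cluster manager and Spark version.
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .getOrCreate()
```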
The system currently supports several cluster managers: Spark's standalone cluster manager, Apache Mesos, Hadoop YARN, and Kubernetes.
A third-party project (not supported by the Spark project) exists to add support for Nomad as a cluster manager.
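The cluster manager is usually selected through the master URL passed to spark-submit's --master flag, but it can also be set in code, as sketched below. The host names and ports are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// Common master URL forms (host names and ports below are placeholders):
//   "local[*]"               -> run locally, one thread per core
//   "spark://host:7077"      -> Spark's standalone cluster manager
//   "yarn"                   -> Hadoop YARN (cluster details come from the Hadoop config)
//   "k8s://https://host:443" -> Kubernetes
val spark = SparkSession.builder()
  .appName("cluster-manager-demo")
  .master("spark://host:7077") // pick the manager you actually run
  .getOrCreate()
```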
An execution mode gives you the power to determine where the aforementioned resources are physically located when you run your application. You have three modes to choose from:
Cluster mode is probably the most common way of running Spark Applications. In cluster mode, a user submits a pre-compiled JAR, Python script, or R script to a cluster manager. The cluster manager then launches the driver process on a worker node inside the cluster, in addition to the executor processes. This means that the cluster manager is responsible for maintaining all Spark Application-related processes.
Client mode is nearly the same as cluster mode except that the Spark driver remains on the client machine that submitted the application. This means that the client machine is responsible for maintaining the Spark driver process, and the cluster manager maintains the executor processes. These machines are commonly referred to as gateway machines or edge nodes.
Local mode is a significant departure from the previous two modes: it runs the entire Spark Application on a single machine, achieving parallelism through threads on that machine. This is a common way to learn Spark, to test your applications, or to experiment iteratively with local development.
However, we do not recommend using local mode for running production applications.
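As a concrete example, local mode can be selected directly in code with a local master URL (for cluster and client mode you would instead submit the application with spark-submit and choose --deploy-mode cluster or --deploy-mode client). A minimal sketch, with a placeholder application name:

```scala
import org.apache.spark.sql.SparkSession

// Local mode: the driver and the executor threads all run inside this one JVM.
val spark = SparkSession.builder()
  .appName("local-mode-demo")
  .master("local[*]") // one worker thread per core; local[2] would use two threads
  .getOrCreate()

// A quick job to confirm work is spread across the local threads.
val evens = spark.range(0, 1000000).filter("id % 2 = 0").count()
println(s"Even numbers: $evens")

spark.stop()
```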
To sum up, Spark breaks intensive, computation-heavy jobs down into smaller tasks, which are then executed by the worker nodes. Its architecture also allows it to process both real-time and archived data.
Spark Streaming lets Spark handle data in real time. It breaks the incoming stream into small micro-batches, so you can analyze data almost as soon as it arrives.
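A minimal sketch of the idea, using the newer Structured Streaming API rather than the original DStream-based Spark Streaming API; the host and port are placeholders, and something must be writing text to that socket for any output to appear:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("streaming-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Read a stream of text lines from a socket (host and port are placeholders).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Count words as data arrives; Spark processes it in small micro-batches.
val wordCounts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```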
Apache Spark revolves around RDDs (think of them as distributed datasets) and two main kinds of operations: transformations, which change the data, and actions, which compute something from it.
An RDD is an intelligent way to organize data in Spark: it is fault tolerant and is distributed across a cluster of machines. You work with RDDs by changing the data (transformations) or by computing results from it (actions).
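A small sketch of both kinds of operations on an RDD, using arbitrary sample numbers: transformations such as filter and map only describe the work, while actions such as collect and reduce actually run it.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-demo")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

// Distribute a small local collection as an RDD (arbitrary sample data).
val numbers = sc.parallelize(1 to 10)

// Transformations are lazy: they only describe the computation.
val squaresOfEvens = numbers
  .filter(_ % 2 == 0) // keep even numbers
  .map(n => n * n)    // square them

// Actions trigger execution and return results to the driver.
println(squaresOfEvens.collect().mkString(", ")) // 4, 16, 36, 64, 100
println(squaresOfEvens.reduce(_ + _))            // 220

spark.stop()
```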
Spark itself is written mostly in Scala, but it also works with Java, Python, and R. Scala is often considered Spark's favorite language because the two fit together naturally, which can make things run faster.
I hope you liked the article. If you have any questions related to this article, let me know in the comments section below.