While working on a Spark application tuning problem, I spent a considerable amount of time trying to make sense of the visualizations in the Spark Web UI. The Spark Web UI is a very handy tool for this task, but for beginners it is difficult to build intuition about a problem from these visualizations alone. Although there are very good resources on Spark performance, the information is scattered, so I felt the need to document and share my learnings.
This post assumes that readers have a basic understanding of Spark concepts. It will help beginners identify probable performance problems in their application runs from the Spark Web UI. The focus is only on information that is not obvious from the UI and the inferences to draw from it. Please note that this is not an exhaustive list of everything that can be interpreted from the Spark Web UI, only the points that I found relevant to my project and yet general enough to be useful to a wider audience.
The Spark Web UI is available only while the application is running. To analyze past runs, the history server needs to be enabled so that event logs are stored; these logs can then be used to populate the Web UI.
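As a minimal sketch (the log directory path is only an assumption), event logging can be turned on when building the session so that a history server pointed at the same directory can replay the Web UI later:

```python
from pyspark.sql import SparkSession

# Enable event logging so the history server can rebuild this application's
# Web UI after it finishes. The HDFS path below is a placeholder; use the
# directory your history server is configured to read from.
spark = (
    SparkSession.builder
    .appName("event-log-example")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "hdfs:///spark-event-logs")  # assumed path
    .getOrCreate()
)
```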
The Spark Web UI displays useful information about your application in several tabs, namely Executors, Environment, Jobs, Stages, and Storage.
The rest of this post describes the intuitions to draw from each of these tabs, in the order mentioned above.
The Executors tab gives information on the tasks run by every executor.
From Fig 1, one can see that there is one driver and 5 executors, each running with 2 cores and 3 GB of memory.
The box marked in red shows an uneven distribution of tasks: one node of the cluster is overloaded with tasks while the others are comparatively idle.
The box marked in blue shows that the input data size was 487.3 MB, even though this application was run on a dataset of only 83 MB. The input data size comprises the original dataset read plus the shuffle data transferred across nodes, so roughly 400+ MB of data was shuffled in this application.
There are many Spark properties for controlling and fine-tuning an application. These properties can be set either while submitting the job or while creating the context object. A property takes effect only if it is explicitly set; we learned this the hard way, having assumed that properties we had not set would still be applied with their default values. All applied properties can be viewed in the Environment tab. If a property is not listed there, it has not been applied at all.
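For illustration, here is a small sketch of setting properties on the session object (the values are assumptions, not recommendations) and then cross-checking them programmatically; whatever getAll() returns is what the Environment tab will list:

```python
from pyspark.sql import SparkSession

# Explicitly set a few tuning properties while creating the session.
# The same keys can be passed to spark-submit via --conf.
spark = (
    SparkSession.builder
    .appName("tuning-example")
    .config("spark.executor.memory", "3g")
    .config("spark.executor.cores", "2")
    .config("spark.sql.shuffle.partitions", "64")
    .getOrCreate()
)

# Only properties that were actually applied show up here
# (and in the Environment tab of the Web UI).
for key, value in spark.sparkContext.getConf().getAll():
    print(key, "=", value)
```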
A job is associated with a chain of Resilient Distributed Dataset (RDD) dependencies organized in a directed acyclic graph (DAG) that looks like Fig 2. From the DAG visualization, one can see which stages were executed and how many were skipped. By default, Spark does not reuse the steps computed within stages unless they are explicitly persisted/cached. Skipped stages, marked in grey, are cached stages whose computed values are served from memory rather than being recomputed from HDFS. A glance at the DAG visualization is enough to tell whether RDD computations are being repeated or cached stages are being reused.
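A minimal sketch of how caching produces skipped stages, assuming a hypothetical text file on HDFS: the second action reuses the persisted partitions, and the DAG for its job shows the upstream stages in grey.

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()
sc = spark.sparkContext

# Without persist(), every action would recompute the full lineage from HDFS.
events = sc.textFile("hdfs:///data/events.csv")   # assumed path
parsed = events.map(lambda line: line.split(","))
parsed.persist(StorageLevel.MEMORY_ONLY)

print(parsed.count())   # first action: computes and caches the partitions
print(parsed.take(5))   # second action: upstream stages appear as "skipped"
```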
The Stages tab gives a deeper view of the application at the task level. A stage represents a segment of work done in parallel by individual tasks, and there is a one-to-one mapping between tasks and data partitions, i.e., one task per data partition. From the Spark Web UI, one can drill down from a job into specific stages and down to every task in a stage.
A stage's detail page gives a good overview of its execution: the DAG visualization, the event timeline, and the summary and aggregated metrics of its tasks.
I prefer looking at the event timeline to analyze tasks. It gives a pictorial breakdown of where time was spent during the stage's execution, so with a single glance we can draw quick inferences about how well the stage performed and how we could further improve its execution time.
For example, inferences drawn from Fig 3 could be:
Looking at Fig 4, we can infer that the data is not well distributed and is unnecessarily partitioned. From the metrics, one can confirm that task scheduling took more time than the actual execution. The greater the percentage of green (executor computing time) in the timeline, the more efficient the stage computation.
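As a rough sketch of the fix (the partition counts are illustrative assumptions), an over-partitioned RDD can be coalesced so that each task does enough work to outweigh its scheduling overhead:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-example").getOrCreate()
sc = spark.sparkContext

# 400 tiny partitions of a small dataset: scheduling each task can cost
# more than running it. coalesce() lowers the count without a full shuffle.
rdd = sc.parallelize(range(100_000), numSlices=400)
print(rdd.getNumPartitions())    # 400
fewer = rdd.coalesce(8)
print(fewer.getNumPartitions())  # 8
```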
It is desirable to have fewer stages in a job. A new stage is created whenever the data gets shuffled, and shuffling is expensive, so try to reduce the number of stages your program needs.
Another important insight comes from the input size of the data being shuffled. Reducing the size of this shuffle data is another key goal.
Fig 5 above shows stages where data is moved in MBs. This hints that the code can be improved to reduce the amount of data being juggled between stages. For example, if a filter was applied to keep only rows for a given event 'x', then in the resultant RDD the column "event" becomes redundant, since all rows are now of event 'x'. This column could be dropped from the RDDs built on the filtered data, saving the extra bytes that would otherwise be transferred during shuffle operations.
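A toy sketch of that idea, using made-up (user_id, event, payload) records: once the rows are filtered to event 'x', dropping the now-constant event field before the shuffle (a groupByKey here) shrinks the bytes moved between stages.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-size-example").getOrCreate()
sc = spark.sparkContext

# Hypothetical (user_id, event, payload) records.
records = sc.parallelize([
    ("u1", "x", "p1"), ("u2", "y", "p2"), ("u1", "x", "p3"),
])

filtered = records.filter(lambda r: r[1] == "x")

# After the filter, every row has event == "x", so keep only what is
# still needed downstream before shuffling.
slim = filtered.map(lambda r: (r[0], r[2]))

grouped = slim.groupByKey()   # the shuffle now moves less data per record
print(grouped.mapValues(list).collect())
```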
The Storage tab shows only the RDDs that have been persisted, i.e., via persist() or cache(). To make it more legible, you can assign a name to an RDD while storing it using setName(). Only the RDDs that you intended to persist should appear in the Storage tab, and the custom names make them easy to recognize.
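A minimal sketch (the file path and filter are assumptions): naming the RDD before caching makes the Storage tab show a readable entry such as "filtered-events" instead of an anonymous lineage string.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-tab-example").getOrCreate()
sc = spark.sparkContext

events = sc.textFile("hdfs:///data/events.csv")       # assumed path
filtered = events.filter(lambda line: ",x," in line)  # assumed filter

# setName() returns the RDD, so it chains with cache().
filtered.setName("filtered-events").cache()
filtered.count()   # materializes the cache; the entry now appears in Storage
```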
This article provides insights into identifying problems from the Spark Web UI, such as the size of the data being shuffled, the execution time of stages, and the re-computation of RDDs due to a lack of caching. If you understand your data and application, the ideal data distribution and desired number of partitions can be gauged from the execution UI. Overloading of one node versus the others in the cluster is another area of improvement visible in this UI. The resolution of some of these problems is discussed further in the Apache Spark Performance Tuning article.