Hello readers!
In this article, I am going to discuss one of the most essential parts of Apache Spark called RDD.
Before getting into Spark RDDs, I strongly recommend you read the article Understand the internal working of Apache Spark to get an overview of how Spark works.
An RDD (Resilient Distributed Dataset) is a core data structure in Apache Spark, forming its backbone since its inception. It represents an immutable, fault-tolerant collection of elements that can be processed in parallel across a cluster of machines. RDDs serve as the fundamental building blocks in Spark, upon which newer data structures like datasets and data frames are constructed.
RDDs are designed for distributed computing, dividing the dataset into logical partitions. This logical partitioning enables efficient and scalable processing by distributing different data segments across different nodes within the cluster. RDDs can be created from various data sources, such as Hadoop Distributed File System (HDFS) or local file systems, and can also be derived from existing RDDs through transformations.
Being the core abstraction in Spark, RDDs encompass a wide range of operations, including transformations (such as map, filter, and reduce) and actions (like count and collect). These operations allow users to perform complex data manipulations and computations on RDDs. RDDs provide fault tolerance by keeping track of the lineage information necessary to reconstruct lost partitions.
In summary, RDDs serve as the foundational data structure in Spark, enabling distributed processing and fault tolerance. They are integral to achieving efficient and scalable data processing in Apache Spark.
Spark RDD possesses the following features.
The most important fact about RDDs is that they are immutable. You cannot change the state of an RDD. If you want to modify an RDD, you create a new RDD from the existing one and apply the required operations to it. Hence, the original RDD can be retrieved at any time.
Data stored on disk takes a long time to load and process. Spark supports in-memory computation, which keeps data in RAM instead of on disk and greatly increases processing speed.
Transformations on RDDs are evaluated lazily. With lazy evaluation, the results are not computed immediately; they are generated only when an action is triggered. This improves the performance of the program.
As I said earlier, once you perform an operation on an existing RDD, a new RDD is created and the operation is applied to the newly created one; the original never changes. In addition, Spark tracks the lineage of each RDD, so any lost data can be recomputed and recovered easily. This makes Spark RDDs fault-tolerant.
The data handled by RDDs is usually huge, so it is partitioned and sent across different nodes of the cluster for distributed computing.
Intermediate results generated from an RDD can be persisted so that computations that reuse them do not have to be repeated. This optimizes the overall process.
Spark RDDs offer two granularities of operations, namely coarse-grained and fine-grained. A coarse-grained operation transforms the whole dataset, while a fine-grained operation transforms individual elements in the dataset.
In Apache Spark, RDDs can be created in three ways: by parallelizing an existing collection in the driver program, by loading an external dataset (for example, a file in HDFS or on the local file system), or by applying transformations to an existing RDD.
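Here is a minimal sketch of all three ways, assuming a spark-shell session where sc is already available (the file path below is only a placeholder):
// 1. Parallelizing an existing collection in the driver program
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))
// 2. Loading an external dataset, e.g. a text file on HDFS or the local file system
val fromFile = sc.textFile("hdfs:///path/to/input.txt")
// 3. Transforming an existing RDD into a new one
val fromExisting = fromCollection.map(x => x * 10)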
Two kinds of operations can be applied to an RDD. One is transformation, and the other is action.
Transformations are the operations you perform on an RDD to get a result that is also an RDD. Examples include functions such as filter(), union(), map(), flatMap(), distinct(), reduceByKey(), mapPartitions(), and sortBy(), each of which creates another resultant RDD. Transformations are evaluated lazily.
Actions return results to the driver program or write them to storage, and they kick off the actual computation. Some examples are count(), first(), collect(), take(), countByKey(), collectAsMap(), and reduce().
Transformations always return an RDD, whereas actions return some other data type.
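As a small illustration (again assuming a spark-shell session where sc is available), a transformation only records what to do and returns another RDD, while an action triggers the computation and returns a plain value:
val numbers = sc.parallelize(1 to 10)
val doubled = numbers.map(x => x * 2)   // transformation: lazy, returns an RDD, nothing runs yet
val total = doubled.reduce(_ + _)       // action: runs the job and returns an Int (110)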
Let’s take a practical look at some of the RDD operations. To practice Apache Spark, you need to install the Cloudera virtual environment. You can find a detailed guide to installing the Cloudera VM here.
First, let’s create an RDD using the parallelize() method, which is the simplest way to do it.
val rdd1 = sc.parallelize(List(23, 45, 67, 86, 78, 27, 82, 45, 67, 86))
Here, sc denotes the SparkContext, and each element of the list is copied to form the RDD.
We can read the contents of the RDD by using the collect operation.
rdd1.collect
The results are shown below.
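In the spark-shell, the output should look roughly like this (the res variable numbering depends on your session):
res0: Array[Int] = Array(23, 45, 67, 86, 78, 27, 82, 45, 67, 86)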
The count action is used to get the total number of elements present in the particular RDD.
rdd1.count
There are 10 elements in rdd1.
Distinct is a type of transformation that is used to get the unique elements in the RDD.
rdd1.distinct.collect
The distinct elements are displayed.
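You should see something like the following; the ordering may differ from run to run because distinct involves a shuffle:
res1: Array[Int] = Array(82, 23, 86, 27, 45, 67, 78)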
The filter transformation creates a new dataset by selecting the elements that satisfy the given condition.
rdd1.filter(x => x < 50).collect
Here, the elements which are less than 50 are displayed.
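For this particular list, the expected result is:
res2: Array[Int] = Array(23, 45, 27, 45)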
The sortBy transformation arranges the elements in ascending order when its second argument is true and in descending order when it is false.
rdd1.sortBy(x => x, true).collect
rdd1.sortBy(x => x, false).collect
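For this list, the two calls should return the elements in ascending and descending order respectively:
res3: Array[Int] = Array(23, 27, 45, 45, 67, 67, 78, 82, 86, 86)
res4: Array[Int] = Array(86, 86, 82, 78, 67, 67, 45, 45, 27, 23)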
The reduce action aggregates the elements of the RDD using the given function.
rdd1.reduce((x, y) => x + y)
Here, each element is added and the total sum is printed.
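For the values in rdd1, the sum works out to:
res5: Int = 606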
The map transformation applies the given function to each element of the RDD and creates a new RDD.
rdd1.map(x => x + 1).collect
Here, each element is incremented by one.
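The expected result:
res6: Array[Int] = Array(24, 46, 68, 87, 79, 28, 83, 46, 68, 87)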
Let’s create another RDD.
val rdd2 = sc.parallelize(List(25, 73, 97, 78, 27, 82))
The union operation combines all the elements of the two given RDDs.
The intersection operation forms a new RDD from the elements common to both RDDs.
The cartesian operation creates the Cartesian product of the two RDDs.
rdd1.union(rdd2).collect
rdd1.intersection(rdd2).collect
rdd1.cartesian(rdd2).collect
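You should see results roughly like the following; the ordering of the intersection and cartesian results may vary, and since the Cartesian product of a 10-element and a 6-element RDD contains 60 pairs, only the beginning is shown here:
res7: Array[Int] = Array(23, 45, 67, 86, 78, 27, 82, 45, 67, 86, 25, 73, 97, 78, 27, 82)
res8: Array[Int] = Array(78, 27, 82)
res9: Array[(Int, Int)] = Array((23,25), (23,73), (23,97), (45,25), ...)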
First is a type of action that always returns the first element of the RDD.
rdd1.first()
Here, the first element in rdd1 is 23.
Take action returns the first n elements in the RDD.
rdd1.take(5)
Here, the first 5 elements are displayed.
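For rdd1, this returns:
res10: Array[Int] = Array(23, 45, 67, 86, 78)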
Now, you may have noticed that whenever you apply a transformation, a new RDD is created and the originally created RDD does not change. This is because RDDs are immutable. This property makes RDDs fault-tolerant, and lost data can be recovered easily.
RDDs are preferred when you want to apply low-level transformations and actions, since they give you finer control over your data. They are a good fit when the data is highly unstructured, such as media or text streams, when you want to use functional programming constructs rather than domain-specific expressions, and when no schema is applied to the data.
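To give a feel for this functional, low-level style, here is a minimal word-count sketch over unstructured text (the file path is only a placeholder, and sc is assumed to be available as in the spark-shell):
val lines = sc.textFile("hdfs:///path/to/logs.txt")
val counts = lines
  .flatMap(line => line.split("\\s+"))   // split each line into words
  .map(word => (word, 1))                // pair every word with a count of 1
  .reduceByKey(_ + _)                    // sum the counts per word
counts.take(10).foreach(println)         // action: triggers the computation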
Q. What is the purpose of RDD in Apache Spark?
A. The purpose of RDD (Resilient Distributed Dataset) in Apache Spark is to provide a fault-tolerant and parallelized data structure for distributed computing. RDDs allow for the efficient processing of large-scale data across a cluster of machines by dividing the dataset into logical partitions and enabling transformations and actions on those partitions in parallel, achieving high-performance and scalable data processing.
Q. What is Spark Context?
A. SparkContext, often referred to as sc, is the entry point and the main interface between a Spark application and the underlying Spark cluster. It represents the connection to a Spark cluster and serves as the driver program’s control and communication hub. SparkContext provides access to various Spark functionalities and resources, such as distributed datasets (RDDs), distributed variables, and cluster managers. It coordinates the execution of Spark tasks, manages the cluster resources, and handles the distribution of data and computations across the nodes in the cluster.
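For illustration, in a standalone Scala application you would create the SparkContext yourself, roughly like this (the app name and master URL below are only placeholders):
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("RddExample").setMaster("local[*]")
val sc = new SparkContext(conf)

val rdd = sc.parallelize(List(1, 2, 3))
println(rdd.count())   // prints 3

sc.stop()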
I hope you now have a basic idea of RDDs and their role in Apache Spark.
Thanks for reading, cheers!
Please take a look at my other articles on dhanya_thailappan, Author at Analytics Vidhya.