Introduction to Apache Spark and its Datasets


This article was published as a part of the Data Science Blogathon.

Introduction

In this article, we will introduce you to the big data ecosystem and the role of Apache Spark in big data. We will also cover distributed systems, the backbone of big data.

In today's world, data is the fuel. Almost every electronic device collects data that is used for business purposes. Now imagine working on a huge volume of data: of course we need powerful computers, but relying on a single machine quickly becomes impractical.

Processing petabytes of data is impractical for a single computer; hence big data technologies come into the picture.

Big data is the domain where we deal with such large volumes of data using dedicated big data tools and cloud systems.

Processing Big Data

Big data processing requires parallel computation, since loading petabytes of data onto a single machine, even a very high-end one, is impossible. This technique of spreading parallel computation across many machines is known as distributed computing.

A single computer in a distributed system is known as a node, and each node uses its own computing resources.

A master node is responsible for dividing the workload among the worker nodes, and if a worker node fails, the master stops sending work to it.

A cluster is a collection of nodes, including the master node, that work in synchronization.
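To make the workload-splitting idea concrete, here is a toy sketch in plain Python (not Spark): a "master" process divides the work among a pool of worker processes, which is the same pattern Spark automates across many machines.

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    data = range(10)                        # the full workload
    with Pool(processes=4) as pool:         # four local "worker" processes
        results = pool.map(square, data)    # the "master" divides the work
    print(results)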

Big Data Ecosystem:

Many open-source tools make up the big data ecosystem. Open-source tools are generally used in big data because they are transparent and free to use, and their source code can be inspected, which eases worries about data leaking.

Popular big data open source tools are Apache Spark, Hadoop, Map-Reduce, Hive, Impala, etc.

Tools Category:

  • Programming Tools
  • Business Intelligence tools
  • Analytics and visualization
  • Databases (NoSQL and SQL)
  • Cloud Technologies
  • Data Technologies

Hadoop Ecosystem:

The Hadoop ecosystem consists of various open-source tools that fall under the Apache project. These tools are built for big data workloads, and the components are designed to work together.

  1. Ingest Data (Flume, Sqoop)
  2. Access Data (Impala, Hue)
  3. Process and Analyze Data (Pig, Hive, Spark)
  4. Store Data (HDFS, HBase)

What is Apache Spark?

Spark is a distributed, in-memory data processing tool and a powerful replacement for Hadoop MapReduce.

Spark is faster than MapReduce because of in-memory computation, which makes it highly capable of processing large volumes of data.

In-memory computation uses each node's RAM for computation instead of the disk, which is what makes Spark so fast.
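As a small illustration of why in-memory computation helps (a hedged sketch: sc is the SparkContext created later in this article, and data/logs.txt is a hypothetical file):

# cache() keeps the filtered RDD in memory after the first action computes it
logs = sc.textFile("data/logs.txt")        # hypothetical input file
errors = logs.filter(lambda line: "ERROR" in line).cache()
print(errors.count())   # first action: reads the file and fills the cache
print(errors.count())   # second action: served from memory, no re-read from disk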

Top Features of Apache Spark:

  1. Fast Processing
    We have already discussed the impact of in-memory computation on big data processing. Thanks to this computation power, Spark has become one of the top choices for big data processing.
  2. Supports Various APIs
    Spark supports the Java, Python, and Scala programming languages. Spark itself is written in Scala, and it exposes APIs for the other languages.
    Spark Core executes jobs on the JVM using Scala.
    Spark Core is fault-tolerant, so if any node goes down, processing does not stop.
  3. Powerful Libraries
    Spark supports various third-party libraries and ships with a wide range of built-in libraries for specific tasks.
    MLlib is Spark's native machine learning library, and it also supports streaming machine learning pipelines (see the MLlib sketch after this list).
  4. Compatibility and Deployment
    A big advantage of Spark is that it does not require huge, tedious dependencies. Spark can run on any cluster and can be scaled easily: it runs on Kubernetes, Mesos, Hadoop YARN, standalone clusters, and cloud notebooks.
  5. Real-Time Processing
    Spark integrates with the conventional Hadoop stack, which lets it process data stored in HDFS; Spark can easily work on an HDFS cluster without any dependency conflicts, and its streaming libraries add near-real-time processing.
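As a small taste of MLlib, here is a hedged sketch, assuming the Spark session created later in this article and a tiny, made-up training set:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# A made-up two-row training DataFrame with a features column and a label column
train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.0]), 0.0),
     (Vectors.dense([2.0, 3.0]), 1.0)],
    ["features", "label"])

# Fit a simple logistic regression model with MLlib
model = LogisticRegression(maxIter=10).fit(train)
print(model.coefficients)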

Apache Spark Architecture

The Apache Spark core engine consists of three components:

  • Spark Driver
  • Executors
  • Cluster Manager

The Spark Driver runs the Spark context (the code we write). It translates the application into tasks and sends that information to the cluster manager, which allocates resources on the worker nodes; the executors then run on the worker nodes and carry out the assigned tasks.
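As a hedged sketch of how these pieces map to code: the master setting in the session builder tells the driver which cluster manager to talk to (the values and the memory setting below are only illustrative):

from pyspark.sql import SparkSession

# "local[4]" runs four executor threads on this machine for testing;
# "yarn" or "k8s://https://<host>:<port>" would hand the job to a real cluster manager
spark = SparkSession.builder \
    .appName("architecture-demo") \
    .master("local[4]") \
    .config("spark.executor.memory", "1g") \
    .getOrCreate()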

Getting Started with Spark with Python

Spark applications can be written in Python. In the backend, PySpark uses Py4J to communicate with the Java side of Spark.

Objectives

  • Setting Up Pyspark
  • Creating Context and Session
  • Spark RDD
  • Transformations and Actions

Setting Up Pyspark

PySpark is the Spark API built for Python. It lets us create Spark applications in Python and uses Py4J in the backend.

Spark can run locally in any Python environment, and we can also build Spark clusters on cloud notebooks. A popular environment for running Spark is Databricks, which also provides some sample datasets to work with.

Here is a guide on running Spark clusters on Databricks for free.

Installing required packages.

For running Spark in Python, we need the pyspark and findspark modules.

!pip install pyspark
!pip install findspark

findspark: it locates the Spark installation and adds it to the Python path (it can also write startup files to the current Python profile), so the environment is ready for Spark.

import findspark
findspark.init()  # locate the Spark installation and add it to sys.path

Spark Session and Context

SparkSession: the Spark session keeps track of our application. A Spark session must be created before loading data and working with Spark.

SparkContext: the Spark context is the entry point to the Spark application, and it also exposes RDD functions such as parallelize().

from pyspark import SparkContext
from pyspark.sql import SparkSession

# Initialize the Spark context
sc = SparkContext()

# Create a Spark session
spark = SparkSession \
    .builder \
    .appName("Application name") \
    .config("spark.some.config.option", "somevalue") \
    .getOrCreate()

getOrCreate() returns the existing Spark session if one is already running; otherwise it creates a new one.
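A quick way to see this behaviour, reusing the session created above:

# Calling getOrCreate() again returns the already running session
spark2 = SparkSession.builder.appName("Another name").getOrCreate()
print(spark2 is spark)   # True: the existing session is reused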

spark

Evaluating the spark object in a notebook cell displays the session details, such as the Spark version, the application name, and the session status.

Spark RDDs

Spark RDDs (Resilient Distributed Datasets) are the fundamental data structure of Spark: an immutable, distributed collection of objects. RDDs are very fast, and they are supported natively by Spark's core engine.

RDDs in Spark can only be created by parallelizing an existing collection or by referencing a dataset in external storage.

RDDs work in a distributed fashion: the dataset is divided into logical partitions, which are computed on different nodes of the cluster. Spark RDDs are fault-tolerant, and the other Spark datasets (DataFrames and Datasets) are built on top of RDDs.

RDDs can be created from the following formats and sources —

  • Parquet, Text, Hadoop inputs, Avro, Sequence Files, etc.
  • Amazon S3, Cassandra, HBase, HDFS, etc.

In an RDD, the data is distributed across multiple nodes, which is what makes it work in a distributed manner.

RDDs support lazy evaluation, which means nothing is computed until a value is actually required.

  • sc.parallelize(): it distributes a local Python collection (such as a range or a list) into an RDD.

Transforming a Python range into an RDD:

data = range(1, 30)
# print the first element of the collection
print(data[0])
len(data)
# distribute the data into an RDD with 5 partitions
xrangeRDD = sc.parallelize(data, 5)
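To see the logical partitions mentioned earlier, here is a small check on the RDD we just created:

# the RDD was split into 5 logical partitions
print(xrangeRDD.getNumPartitions())   # 5
print(xrangeRDD.glom().collect())     # the elements grouped by partition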

RDD Transformations

Transformations are the rules that describe how the computation should be performed. Because RDDs are lazily evaluated, no calculation is performed until an action is called.

The transformations are stored as a set of rules (the RDD lineage) and are only executed at the action stage.

# subtract 2 from each number
sRDD = xrangeRDD.map(lambda x: x - 2)
# keep only the numbers less than 20
filteredRDD = sRDD.filter(lambda x: x < 20)
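A few other common transformations, shown as a short sketch on the same RDD (these also just add to the rule set and do not compute anything yet):

# square every element
squaredRDD = xrangeRDD.map(lambda x: x * x)
# keep only the even squares
evenRDD = squaredRDD.filter(lambda x: x % 2 == 0)
# one-to-many mapping: emit each value and its negative
pairedRDD = evenRDD.flatMap(lambda x: [x, -x])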

RDD Actions

Actions trigger the actual computation. After applying the transformations, we call an action whenever we need the resulting values, so the data is only read and processed when it is actually needed.

# collect() returns all remaining elements to the driver
print(filteredRDD.collect())
# count() returns the number of elements in the RDD
filteredRDD.count()

Output: the values -1 through 19 (each input reduced by 2, then filtered to numbers below 20) and a count of 21.
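Besides collect() and count(), a few other frequently used actions, sketched on the same RDD:

print(filteredRDD.take(3))                      # the first three elements
print(filteredRDD.first())                      # the very first element
print(filteredRDD.reduce(lambda a, b: a + b))   # the sum of all elements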

Conclusion

In this article, we talked about the big data ecosystem and the various types of tools it is made up of.

We talked about the role of distributed systems and how Spark works in Big data.

Spark's architecture contains a driver, a cluster manager, and executors. Spark works in a distributed manner like Hadoop, but unlike Hadoop, it uses in-memory computation instead of the disk.

We also discussed RDDs, transformations, and actions.

  • Spark RDDs cannot be modified; transformations always produce new RDDs instead.
  • Spark RDDs are lazily evaluated, and their lineage lets Spark recompute lost partitions, which protects against data loss.
  • Spark supports distributed SQL, built on top of the same core engine (see the short sketch after this list).
  • Spark supports various machine learning models through MLlib, and it can be integrated with deep learning (e.g., CNNs) and NLP libraries.
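As a short hedged sketch of that distributed SQL support, using the session created earlier and a tiny made-up DataFrame:

# Register a small DataFrame as a temporary view and query it with SQL
df = spark.createDataFrame([(1, "spark"), (2, "hadoop")], ["id", "name"])
df.createOrReplaceTempView("tools")
spark.sql("SELECT name FROM tools WHERE id = 1").show()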

Thanks for reading this article

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
