According to IBM, 60% of all sensory information loses value in a few milliseconds if it is not acted on. Bearing in mind that the Big Data and analytics market has reached $125 billion and a large chunk of this will be attributed to IoT in the future, the inability to tap real-time information will result in a loss of billions of dollars.
Examples of such applications include a telco working out how many of its users have used WhatsApp in the last 30 minutes, a retailer tracking the number of people who have said positive things about its products on social media today, or a law enforcement agency looking for a suspect using data from traffic CCTV cameras.
This is the primary reason stream-processing systems like Spark Streaming will define the future of real-time analytics. There is also a growing need to analyze both data at rest and data in motion to drive applications, which makes systems like Spark—which can do both—all the more attractive and powerful. It’s a system for all Big Data seasons.
You will learn how Spark Streaming not only keeps the familiar Spark API intact but also, under the hood, uses RDDs for storage as well as fault tolerance. This enables Spark practitioners to jump into the streaming world right from the outset. With that in mind, let’s get right to it.
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. As of this writing, Spark is the most actively developed open-source engine for this task, making it a standard tool for any developer or data scientist interested in big data.
Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and to scale up to big data processing at an incredibly large scale. Below are a few of the key features of Spark:
• A fast and general-purpose engine for large-scale data processing
• Efficient support for more types of computations than batch processing alone, including interactive queries and stream processing
The following are the components of the Apache Spark ecosystem: Spark Core, Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).
Spark Streaming has a micro-batch architecture: it receives live input data streams and divides the data into small batches, which are then processed by the Spark engine to generate the final stream of results, also in batches.
Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from the source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, Spark’s abstraction of an immutable, distributed dataset (see the Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval.
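To make this RDD-per-interval model concrete, here is a minimal sketch (assuming an existing JavaStreamingContext named ssc and a socket source on a placeholder host and port) that uses foreachRDD to look at the RDD backing each batch:
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
// Assumes an existing JavaStreamingContext 'ssc'; host and port are placeholders.
JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);
// Each batch interval is backed by exactly one RDD; 'time' identifies that interval.
lines.foreachRDD((rdd, time) ->
    System.out.println("Batch at " + time + " contains " + rdd.count() + " records"));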
The StreamingContext is the main entry point for Spark Streaming functionality. It provides the methods used to create DStreams from various input sources. A StreamingContext can be created by providing a Spark master URL and an appName, from an org.apache.spark.SparkConf configuration, or from an existing org.apache.spark.SparkContext. The associated SparkContext can be accessed using context.sparkContext.
After creating and transforming DStreams, the streaming computation can be started and stopped using context.start() and context.stop(), respectively. context.awaitTermination() allows the current thread to wait for the termination of the context, either by stop() or by an exception.
To execute a Spark Streaming application, we need to define the StreamingContext, which specializes SparkContext for streaming applications. In Java, a streaming context can be defined as follows:
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, batchInterval);
where sparkConf is the Spark configuration object and batchInterval is the interval at which incoming data is divided into batches (for example, Durations.seconds(1)).
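Putting the pieces together, a minimal lifecycle sketch might look like the following (the local master, application name, and one-second batch interval are illustrative choices, not values from the article):
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
// Illustrative configuration: two local threads and a one-second batch interval.
SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamingLifecycle");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
// Define input DStreams, transformations, and output operations here, then start.
ssc.start();
// Block the current thread until the computation is stopped or fails with an exception.
ssc.awaitTermination();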
Transformations yield a new DStream from a previous one. For example, one common transformation is filtering data. Examples include map(), filter(), and reduceByKey().
Note that a streaming context can be started only once, and must be started after we set up all the DStreams and output operations.
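As a small illustration, a filter() and a map() could be chained on a DStream as follows (a minimal sketch assuming an existing JavaStreamingContext ssc and a placeholder socket source):
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
// Placeholder socket source; each record is one line of text.
JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);
// filter: keep only non-empty lines.
JavaDStream<String> nonEmpty = lines.filter(line -> !line.isEmpty());
// map: transform each line into its length.
JavaDStream<Integer> lineLengths = nonEmpty.map(line -> line.length());
lineLengths.print();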
Below are the basic built-in data sources of Spark Streaming:
• File streams, for reading data from files written to any HDFS-compatible directory:
... = streamingContext.fileStream<...>(directory);
• Queues of RDDs, mainly useful for testing a streaming application:
... = streamingContext.queueStream(queueOfRDDs)
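For instance, a text file stream and a test queue stream could be wired up like this (a sketch; the directory path and sample data are placeholders, and textFileStream is used here because it returns lines of text directly):
import java.util.Arrays;
import java.util.LinkedList;
import java.util.Queue;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;
// Assumes an existing JavaStreamingContext 'ssc'; the directory is a placeholder.
JavaDStream<String> fileLines = ssc.textFileStream("/tmp/streaming-input");
fileLines.print();
// For testing, a queue of pre-built RDDs can be used as an input stream.
Queue<JavaRDD<String>> queueOfRDDs = new LinkedList<>();
queueOfRDDs.add(ssc.sparkContext().parallelize(Arrays.asList("a", "b", "c")));
JavaDStream<String> testStream = ssc.queueStream(queueOfRDDs);
testStream.print();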
Transformation | Meaning
map(func) | Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items.
filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true.
union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
join(otherStream) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
For example, the classic word count over a socket text stream can be written as follows:
SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("WordCount");
JavaStreamingContext ssc = ...
JavaReceiverInputDStream<String> lines = ssc.socketTextStream( ... );
JavaDStream<String> words = lines.flatMap(...);
JavaPairDStream<String, Integer> wordCounts = words
    .mapToPair(s -> new Tuple2<>(s, 1))
    .reduceByKey((i1, i2) -> i1 + i2);
wordCounts.print();
The simplest windowing function is window(), which lets you create a new DStream computed by applying the windowing parameters to the old DStream. You can use any of the DStream operations on the new stream, so you get all the flexibility you want.
Windowed computations allow you to apply transformations over a sliding window of data. Any window operation needs to specify two parameters: the window length (the duration of the window) and the sliding interval (the interval at which the window operation is performed). Both of these must be multiples of the batch interval of the source DStream.
window(windowLength, slideInterval)
It returns a new DStream which is computed based on windowed batches.
...
JavaStreamingContext ssc = ...
JavaReceiverInputDStream<String> lines = ...
JavaDStream<String> linesInWindow =
lines.window(WINDOW_SIZE, SLIDING_INTERVAL);
JavaPairDStream<String, Integer> wordCounts = linesInWindow.flatMap(SPLIT_LINE)
.mapToPair(s -> new Tuple2<>(s, 1))
.reduceByKey((i1, i2) -> i1 + i2);
To perform these window-based transformations, we need to define a checkpoint directory.
...
JavaPairDStream<String, Integer> wordCountPairs = ssc.socketTextStream(...)
.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator())
.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairDStream<String, Integer> wordCounts = wordCountPairs
.reduceByKeyAndWindow((i1, i2) -> i1 + i2, WINDOW_SIZE, SLIDING_INTERVAL);
wordCounts.print();
wordCounts.foreachRDD(new SaveAsLocalFile());
In a more efficient version, the reduce value of each window is calculated incrementally: values entering the window are added using the reduce function, while values leaving the window are removed using an inverse reduce function (here, subtraction). Note that checkpointing must be enabled to use this operation.
...
ssc.checkpoint(LOCAL_CHECKPOINT_DIR);
...
JavaPairDStream<String, Integer> wordCounts = wordCountPairs.reduceByKeyAndWindow(
    (i1, i2) -> i1 + i2,
    (i1, i2) -> i1 - i2,
    WINDOW_SIZE,
    SLIDING_INTERVAL);
Output operations allow a DStream’s data to be pushed out to external systems such as a database or a file system.
Output Operation | Meaning
print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the application.
saveAsTextFiles(prefix, [suffix]) | Save this DStream’s contents as text files. The file name at each batch interval is generated based on prefix.
saveAsHadoopFiles(prefix, [suffix]) | Save this DStream’s contents as Hadoop files.
saveAsObjectFiles(prefix, [suffix]) | Save this DStream’s contents as SequenceFiles of serialized Java objects.
foreachRDD(func) | Generic output operator that applies a function, func, to each RDD generated from the stream.
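To show what foreachRDD typically looks like in practice, here is a hedged sketch: the sendToExternalSystem helper is hypothetical, standing in for whatever database or service client you would actually use, and connections would usually be created once per partition rather than per record.
import java.util.Iterator;
import scala.Tuple2;
// 'wordCounts' is a JavaPairDStream<String, Integer>, as in the earlier examples.
wordCounts.foreachRDD(rdd ->
    rdd.foreachPartition((Iterator<Tuple2<String, Integer>> partition) -> {
        // In a real job, open a connection to the external system here (once per partition).
        while (partition.hasNext()) {
            Tuple2<String, Integer> record = partition.next();
            sendToExternalSystem(record._1(), record._2()); // hypothetical helper
        }
    }));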
Online References:
• Spark Documentation
It should be clear that Spark Streaming presents a powerful way to write streaming applications. Taking a batch job you already run and turning it into a streaming job with almost no code changes is both simple and extremely helpful from an engineering standpoint if you need to have this job interact closely with the rest of your data processing application.
Frequently asked questions:
Q: How does Spark Streaming process data? Spark Streaming processes data in small, configurable micro-batches, providing low-latency processing compared to traditional batch processing.
Q: Which data sources does Spark Streaming support? Spark Streaming supports various data sources, including HDFS, Kafka, Flume, and others, allowing seamless integration with diverse streaming platforms.
Q: Can Spark Streaming analyze both data at rest and data in motion? Yes, Spark Streaming can analyze both data at rest (static data) and data in motion (live data streams), making it versatile for different use cases.
If you liked the article then please drop a comment in the comment section below.