According to IBM, 60% of all sensory information loses value in a few milliseconds if it is not acted on. Bearing in mind that the Big Data and analytics market has reached $125 billion and a large chunk of this will be attributed to IoT in the future, the inability to tap real-time information will result in a loss of billions of dollars.
Examples of such applications include a telco working out how many of its users have used WhatsApp in the last 30 minutes, a retailer tracking the number of people who have said positive things about its products on social media today, or a law enforcement agency looking for a suspect using data from traffic CCTV cameras.
This is the primary reason stream-processing systems like Spark Streaming will define the future of real-time analytics. There is also a growing need to analyze both data at rest and data in motion to drive applications, which makes systems like Spark—which can do both—all the more attractive and powerful. It’s a system for all Big Data seasons.
You will learn how Spark Streaming not only keeps the familiar Spark API intact but also, under the hood, uses RDDs for storage as well as fault tolerance. This enables Spark practitioners to jump into the streaming world right from the outset. With that in mind, let’s get right to it.
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters. As of this writing, Spark is the most actively developed open-source engine for this task, making it a standard tool for any developer or data scientist interested in big data.
Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers. This makes it an easy system to start with and to scale up to big data processing at an incredibly large scale. Below are a few of the key features of Spark:
• A fast and general-purpose engine for large-scale data processing
• Efficient support for more types of computations than batch processing alone, including interactive queries and stream processing
The following are the components of the Apache Spark ecosystem: Spark Core, Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing).
Spark Streaming has a micro-batch architecture: it receives live input data streams and divides the data into small batches, which are then processed by the Spark engine to generate the final stream of results, also in batches.
Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from the source or the processed data stream generated by transforming the input stream. Internally, a DStream is represented by a continuous series of RDDs, Spark’s abstraction of an immutable, distributed dataset (see the Spark Programming Guide for more details). Each RDD in a DStream contains data from a certain interval.
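To make this RDD-per-interval model concrete, here is a minimal sketch (assuming an existing JavaStreamingContext named ssc and a socket source on a placeholder host and port) that uses foreachRDD to look at the RDD backing each batch:
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
// Assumes an existing JavaStreamingContext 'ssc'; host and port are placeholders.
JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);
// Each batch interval is backed by exactly one RDD; 'time' identifies that interval.
lines.foreachRDD((rdd, time) ->
    System.out.println("Batch at " + time + " contains " + rdd.count() + " records"));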
The StreamingContext is the main entry point for Spark Streaming functionality. It provides the methods used to create DStreams from various input sources. A StreamingContext can be created by providing a Spark master URL and an appName, from an org.apache.spark.SparkConf configuration, or from an existing org.apache.spark.SparkContext. The associated SparkContext can be accessed using context.sparkContext.
After creating and transforming DStreams, the streaming computation can be started and stopped using context.start() and context.stop(), respectively. context.awaitTermination() allows the current thread to wait for the termination of the context, either by stop() or by an exception.
To execute a Spark Streaming application, we need to define the StreamingContext, which specializes SparkContext for streaming applications. In Java, a streaming context can be defined as follows:
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, batchInterval);
where sparkConf is the Spark configuration object and batchInterval is the interval at which incoming data is divided into batches (for example, Durations.seconds(1)).
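Putting the pieces together, a minimal lifecycle sketch might look like the following (the local master, application name, and one-second batch interval are illustrative choices, not values from the article):
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
// Illustrative configuration: two local threads and a one-second batch interval.
SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamingLifecycle");
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(1));
// Define input DStreams, transformations, and output operations here, then start.
ssc.start();
// Block the current thread until the computation is stopped or fails with an exception.
ssc.awaitTermination();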
Transformations yield a new DStream from a previous one. For example, one common transformation is filtering data. Examples include map(), filter(), and reduceByKey().
Note that a streaming context can be started only once, and must be started after we set up all the DStreams and output operations.
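As a small illustration, a filter() and a map() could be chained on a DStream as follows (a minimal sketch assuming an existing JavaStreamingContext ssc and a placeholder socket source):
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
// Placeholder socket source; each record is one line of text.
JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);
// filter: keep only non-empty lines.
JavaDStream<String> nonEmpty = lines.filter(line -> !line.isEmpty());
// map: transform each line into its length.
JavaDStream<Integer> lineLengths = nonEmpty.map(line -> line.length());
lineLengths.print();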
Below are the basic built-in data sources of Spark Streaming:
• File streams, for reading data from files written to any HDFS-compatible directory:
... = streamingContext.fileStream<...>(directory);
• Queues of RDDs, mainly useful for testing a streaming application:
... = streamingContext.queueStream(queueOfRDDs)
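For instance, a text file stream and a test queue stream could be wired up like this (a sketch; the directory path and sample data are placeholders, and textFileStream is used here because it returns lines of text directly):
import java.util.Arrays;
import java.util.LinkedList;
import java.util.Queue;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.streaming.api.java.JavaDStream;
// Assumes an existing JavaStreamingContext 'ssc'; the directory is a placeholder.
JavaDStream<String> fileLines = ssc.textFileStream("/tmp/streaming-input");
fileLines.print();
// For testing, a queue of pre-built RDDs can be used as an input stream.
Queue<JavaRDD<String>> queueOfRDDs = new LinkedList<>();
queueOfRDDs.add(ssc.sparkContext().parallelize(Arrays.asList("a", "b", "c")));
JavaDStream<String> testStream = ssc.queueStream(queueOfRDDs);
testStream.print();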
Transformation | Meaning
map(func) | Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func) | Similar to map, but each input item can be mapped to 0 or more output items.
filter(func) | Return a new DStream by selecting only the records of the source DStream on which func returns true.
union(otherStream) | Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
join(otherStream) | When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
For example, the classic word count over a socket text stream can be written as follows:
SparkConf sparkConf = new SparkConf().setMaster("local[2]").setAppName("WordCount");
JavaStreamingContext ssc = ...
JavaReceiverInputDStream<String> lines = ssc.socketTextStream( ... );
JavaDStream<String> words = lines.flatMap(...);
JavaPairDStream<String, Integer> wordCounts = words
    .mapToPair(s -> new Tuple2<>(s, 1))
    .reduceByKey((i1, i2) -> i1 + i2);
wordCounts.print();
The simplest windowing function is window(), which lets you create a new DStream computed by applying the windowing parameters to the old DStream. You can use any of the DStream operations on the new stream, so you get all the flexibility you want.
Windowed computations allow you to apply transformations over a sliding window of data. Any window operation needs to specify two parameters: the window length (the duration of the window) and the sliding interval (the interval at which the window operation is performed). Both of these must be multiples of the batch interval of the source DStream.
window(windowLength, slideInterval)
It returns a new DStream which is computed based on windowed batches.
...
JavaStreamingContext ssc = ...
JavaReceiverInputDStream<String> lines = ...
JavaDStream<String> linesInWindow =
lines.window(WINDOW_SIZE, SLIDING_INTERVAL);
JavaPairDStream<String, Integer> wordCounts = linesInWindow.flatMap(SPLIT_LINE)
.mapToPair(s -> new Tuple2<>(s, 1))
.reduceByKey((i1, i2) -> i1 + i2);
To perform these window-based transformations, we need to define a checkpoint directory.
...
JavaPairDStream<String, Integer> wordCountPairs = ssc.socketTextStream(...)
.flatMap(x -> Arrays.asList(SPACE.split(x)).iterator())
.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairDStream<String, Integer> wordCounts = wordCountPairs
.reduceByKeyAndWindow((i1, i2) -> i1 + i2, WINDOW_SIZE, SLIDING_INTERVAL);
wordCounts.print();
wordCounts.foreachRDD(new SaveAsLocalFile());
In a more efficient version, the reduce value of each window is calculated incrementally: values entering the window are added using the reduce function, while values leaving the window are removed using an inverse reduce function (here, subtraction). Note that checkpointing must be enabled to use this operation.
...
ssc.checkpoint(LOCAL_CHECKPOINT_DIR);
...
JavaPairDStream<String, Integer> wordCounts = wordCountPairs.reduceByKeyAndWindow(
    (i1, i2) -> i1 + i2,
    (i1, i2) -> i1 - i2,
    WINDOW_SIZE,
    SLIDING_INTERVAL);
Output operations allow a DStream’s data to be pushed out to external systems such as a database or a file system.
Output Operation | Meaning
print() | Prints the first ten elements of every batch of data in a DStream on the driver node running the application.
saveAsTextFiles(prefix, [suffix]) | Save this DStream’s contents as text files. The file name at each batch interval is generated based on prefix.
saveAsHadoopFiles(prefix, [suffix]) | Save this DStream’s contents as Hadoop files.
saveAsObjectFiles(prefix, [suffix]) | Save this DStream’s contents as SequenceFiles of serialized Java objects.
foreachRDD(func) | Generic output operator that applies a function, func, to each RDD generated from the stream.
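To show what foreachRDD typically looks like in practice, here is a hedged sketch: the sendToExternalSystem helper is hypothetical, standing in for whatever database or service client you would actually use, and connections would usually be created once per partition rather than per record.
import java.util.Iterator;
import scala.Tuple2;
// 'wordCounts' is a JavaPairDStream<String, Integer>, as in the earlier examples.
wordCounts.foreachRDD(rdd ->
    rdd.foreachPartition((Iterator<Tuple2<String, Integer>> partition) -> {
        // In a real job, open a connection to the external system here (once per partition).
        while (partition.hasNext()) {
            Tuple2<String, Integer> record = partition.next();
            sendToExternalSystem(record._1(), record._2()); // hypothetical helper
        }
    }));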
Online References:
• Spark Documentation
It should be clear that Spark Streaming presents a powerful way to write streaming applications. Taking a batch job you already run and turning it into a streaming job with almost no code changes is both simple and extremely helpful from an engineering standpoint if you need to have this job interact closely with the rest of your data processing application.
Frequently asked questions:
Q: How does Spark Streaming process data? Spark Streaming processes data in small, configurable micro-batches, providing low-latency processing compared to traditional batch processing.
Q: Which data sources does Spark Streaming support? Spark Streaming supports various data sources, including HDFS, Kafka, Flume, and others, allowing seamless integration with diverse streaming platforms.
Q: Can Spark Streaming analyze both data at rest and data in motion? Yes, Spark Streaming can analyze both data at rest (static data) and data in motion (live data streams), making it versatile for different use cases.
If you liked the article then please drop a comment in the comment section below.