Spark MLlib for Big Data and Machine Learning

MankayarKarasi Last Updated : 27 Feb, 2025

With the growing demand for big data and machine learning, this article introduces Spark MLlib, its components, and its functionality, focusing on the use of machine learning algorithms within Apache Spark. Apache Spark is a fast and versatile data processing framework that handles large datasets efficiently by distributing tasks across multiple computers, making it a powerful tool for big data analytics and machine learning. To enhance its accessibility, the Apache Spark community developed PySpark, which lets Python programmers work with Resilient Distributed Datasets (RDDs) and leverage Spark's capabilities within the Python ecosystem. In this article, you will learn about Spark MLlib, its components, and its major tools.

This article was published as a part of the Data Science Blogathon.

What is Spark MLlib?

Spark MLlib is Apache Spark’s scalable machine learning library, offering efficient, distributed algorithms for tasks like classification, regression, clustering, and collaborative filtering. It includes tools for feature extraction, transformation, and pipeline building. Built on Spark’s in-memory processing, MLlib handles large-scale data seamlessly and integrates with Spark SQL and DataFrames for streamlined workflows. Ideal for big data applications, it combines high performance with ease of use, making it a powerful choice for scalable machine learning in distributed environments.

Components of Spark

Spark Core

All of Apache Spark's functionality is built on top of Spark Core. It manages essential I/O operations and handles task dispatching and fault recovery. Spark Core exposes a special collection called the RDD (Resilient Distributed Dataset), one of Spark's core abstractions. An RDD partitions data across all the nodes in a cluster and holds it in the cluster's memory pool as a single logical unit. Two kinds of operations are performed on RDDs:

  • Transformation: a function that produces a new RDD from an existing one. Transformations are lazy; they only define the new dataset.
  • Action: an operation that triggers computation on the actual dataset and returns a result to the driver (or writes it out), as illustrated in the sketch below.
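
A minimal PySpark sketch of this distinction (the tiny dataset and app name are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('rdd-demo').getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize([1, 2, 3, 4, 5])        # create an RDD
squares = numbers.map(lambda x: x * x)           # transformation: lazily defines a new RDD
evens = squares.filter(lambda x: x % 2 == 0)     # another transformation, still nothing computed

print(evens.collect())   # action: triggers execution and returns [4, 16]
print(squares.count())   # action: returns 5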

Spark SQL

The Spark SQL component is a distributed framework for structured data processing. Spark SQL provides access to structured and semi-structured data and enables powerful, interactive, analytical applications over both streaming and historical data. DataFrames and SQL provide a common way to access a variety of data sources. Its main features are a cost-based optimizer and mid-query fault tolerance.
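
A brief sketch (the small in-memory dataset and view name are illustrative) showing the same query expressed through both the DataFrame API and plain SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('sql-demo').getOrCreate()

# Build a small DataFrame and expose it as a temporary SQL view
people = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)], ["name", "age"]
)
people.createOrReplaceTempView("people")

# Equivalent queries through the DataFrame API and through SQL
people.filter(people.age > 30).show()
spark.sql("SELECT name, age FROM people WHERE age > 30").show()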

Spark Streaming

Spark Streaming is an add-on to the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It groups the live data into small batches (micro-batches) and delivers them to the batch engine for processing, while also providing fault-tolerance guarantees, as in the minimal sketch below.
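
A minimal DStream word-count sketch, assuming a text source publishing to localhost:9999 (for example via nc -lk 9999):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName='streaming-demo')
ssc = StreamingContext(sc, batchDuration=5)   # group live data into 5-second micro-batches

# Read lines from a TCP socket and count words in each micro-batch
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()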

Spark GraphX

GraphX in Spark is an API for graphs and graph-parallel computation. It combines a network graph analytics engine with a data store, and supports clustering, classification, traversal, searching, and pathfinding on graphs.

SparkR

SparkR provides a distributed data frame implementation for R users. It supports operations such as selection, filtering, and aggregation, but on large, distributed datasets.

Spark MLlib

Spark MLlib is used to perform machine learning in Apache Spark. MLlib consists of popular algorithms and utilities. It is a scalable machine learning library that delivers both high-quality algorithms and high speed. It covers machine learning algorithms such as regression, classification, clustering, pattern mining, and collaborative filtering, and it also includes lower-level machine learning primitives such as a generic gradient descent optimization algorithm.

  • spark.ml is the primary machine learning API for Spark. It offers a higher-level API built on top of DataFrames for constructing ML pipelines.

Spark MLlib tools are given below:

  1. ML Algorithms
  2. Featurization
  3. Pipelines
  4. Persistence
  5. Utilities

ML Algorithms

ML Algorithms form the core of MLlib. These include common learning algorithms such as classification, regression, clustering, and collaborative filtering.

MLlib standardizes APIs to make it easier to combine multiple algorithms into a single pipeline, or workflow. The key concept is the Pipelines API, whose pipeline abstraction is inspired by the scikit-learn project.

Transformer

A Transformer is an algorithm that can transform one DataFrame into another DataFrame. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns. For example:

  • A feature transformer might take a DataFrame, read a column (e.g., text), map it into a new column (e.g., feature vectors), and output a new DataFrame with the mapped column appended.
  • A learning model might take a DataFrame, read the column containing feature vectors, predict the label for each feature vector, and output a new DataFrame with predicted labels appended as a column.

Estimator

An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is a Transformer.

  • For example, a learning algorithm such as LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Model and hence a Transformer.
  • Transformer.transform() and Estimator.fit() are both stateless. In the future, stateful algorithms may be supported via alternative concepts.

Each instance of a Transformer or Estimator has a unique ID, which is useful when specifying parameters.
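
To make the distinction concrete, here is a small sketch (the in-memory DataFrame is illustrative) in which StringIndexer is the Estimator and the model returned by fit() is the Transformer:

from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName('estimator-demo').getOrCreate()
df = spark.createDataFrame(
    [("yes",), ("no",), ("yes",), ("maybe",)], ["deposit"]
)

indexer = StringIndexer(inputCol="deposit", outputCol="label")  # Estimator
indexer_model = indexer.fit(df)       # fit() returns a StringIndexerModel, which is a Transformer
indexer_model.transform(df).show()    # transform() appends the new 'label' column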


Featurization

Featurization includes feature extraction, transformation, dimensionality reduction, and selection (see the sketch after this list).

  1. Feature extraction derives features from raw data.
  2. Feature transformation includes scaling, converting, or modifying features.
  3. Feature selection involves selecting a subset of relevant features from a larger set of features.
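
A short sketch of extraction and transformation (the two example documents are illustrative):

from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, StandardScaler

spark = SparkSession.builder.appName('featurization-demo').getOrCreate()
docs = spark.createDataFrame(
    [(0, "spark makes big data simple"), (1, "mllib brings machine learning to spark")],
    ["id", "text"]
)

# Feature extraction: raw text -> tokens -> term-frequency vectors
tokens = Tokenizer(inputCol="text", outputCol="words").transform(docs)
tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32).transform(tokens)

# Feature transformation: scale the extracted vectors
scaler = StandardScaler(inputCol="rawFeatures", outputCol="features", withMean=False)
scaler.fit(tf).transform(tf).select("id", "features").show(truncate=False)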

Pipelines

A Pipeline chains multiple Transformers and Estimators together so that the stages execute in the specified order. The code below builds the feature-engineering stages for a bank-marketing dataset and then runs them as a single Pipeline.

from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler  # OneHotEncoder was named OneHotEncoderEstimator in Spark 2.x

# Index and one-hot encode each categorical column, collecting the stages in order
categoricalColumns = ['job', 'marital', 'education', 'default', 'housing', 'loan']
stages = []
for categoricalCol in categoricalColumns:
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Indexer')
    encoder = OneHotEncoder(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "Vec"])
    stages += [stringIndexer, encoder]

# Index the target column 'deposit' as the numeric label
label_stringIdx = StringIndexer(inputCol='deposit', outputCol='label')
stages += [label_stringIdx]

# Assemble the encoded categorical columns and the numeric columns into one feature vector
numericColumns = ['age', 'balance', 'duration']
assemblerInputs = [c + "Vec" for c in categoricalColumns] + numericColumns
Vassembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [Vassembler]

from pyspark.ml import Pipeline

# df is the bank-marketing DataFrame loaded in the DataFrames section below
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)
cols = df.columns                      # keep the original column names alongside the new ones
df = pipelineModel.transform(df)
selectedCols = ['label', 'features'] + cols
df = df.select(selectedCols)
DataFrames

DataFrames provide a more user-friendly API than RDDs. The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages. DataFrames facilitate practical ML Pipelines, particularly feature transformations. The snippet below loads the bank-marketing dataset (loan_bank.csv) that the pipeline above operates on.

from pyspark.sql import SparkSession

# Start a Spark session and load the bank-marketing dataset used in the pipeline example above
spark = SparkSession.builder.appName('mlearnsample').getOrCreate()
df = spark.read.csv('loan_bank.csv', header=True, inferSchema=True)
df.printSchema()

Persistence

Persistence helps in saving and loading algorithms, models, and Pipelines. This reduces time and effort, since a persisted model can be loaded and reused whenever needed. The snippet below first trains and evaluates a model; the sketch after it shows how to save and reload it.

from pyspark.ml.classification import LogisticRegression

# Train a logistic regression model on the prepared features
# (train and test are assumed to come from a split of df, e.g. df.randomSplit([0.7, 0.3]))
lr = LogisticRegression(featuresCol='features', labelCol='label')
lrModel = lr.fit(train)

# Score the held-out data and evaluate the model
predictions = lrModel.transform(test)
predictions.select('age', 'label', 'rawPrediction', 'prediction').show()

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator()
print('Test Area Under ROC', evaluator.evaluate(predictions))
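
Having trained and evaluated the model, a minimal sketch of persistence itself (the output paths are illustrative) could look like this:

from pyspark.ml.classification import LogisticRegressionModel
from pyspark.ml import PipelineModel

# Persist the fitted model to disk and reload it later as a Transformer
lrModel.save('models/loan_lr_model')
reloaded = LogisticRegressionModel.load('models/loan_lr_model')
reloaded.transform(test).select('label', 'prediction').show(5)

# Fitted pipelines persist the same way
pipelineModel.save('models/loan_feature_pipeline')
samePipeline = PipelineModel.load('models/loan_feature_pipeline')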

Utilities

MLlib also ships utilities for linear algebra, statistics, and data handling. For example, the linear algebra utilities live in pyspark.ml.linalg for the DataFrame-based API (and in the older mllib.linalg for the RDD-based API).
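
As a small sketch (assuming an active SparkSession), the snippet below builds dense and sparse vectors and computes a Pearson correlation matrix with the statistics utilities:

from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import Correlation

spark = SparkSession.builder.appName('utils-demo').getOrCreate()

# Dense and sparse vectors from the linear-algebra utilities
dense = Vectors.dense([1.0, 0.0, 3.0])
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])

# Pearson correlation matrix from the statistics utilities
data = spark.createDataFrame(
    [(dense,), (sparse,), (Vectors.dense([2.0, 1.0, 1.0]),)], ["features"]
)
print(Correlation.corr(data, "features").head()[0])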

Conclusion

Spark MLlib is essential when you are dealing with big data and machine learning. In this article, you learned the details of Spark MLlib, DataFrames, and Pipelines. In a future article, we will work through hands-on code for implementing Pipelines and building a data model using MLlib.

Frequently Asked Questions

Q1. What is Spark MLlib used for?

Spark MLlib is used for machine learning tasks like classification, regression, clustering, and recommendation. It helps process large datasets efficiently using distributed computing.

Q2. What is the difference between Spark and TensorFlow?

Spark is a big data processing framework for distributed data, while TensorFlow is a deep learning library focused on building and training neural networks. Spark handles large-scale data, and TensorFlow specializes in AI models.

Q3. What is MLlib vs ml?

MLlib is the older Spark library for machine learning with RDD-based APIs. ML is the newer library with DataFrame-based APIs, offering a more user-friendly and flexible approach for machine learning tasks.
