With the growing demand for big data and machine learning, this article introduces Spark MLlib, its components, and its major tools, focusing on the use of machine learning algorithms within Apache Spark. Apache Spark is a fast and versatile data processing framework that handles large datasets efficiently by distributing tasks across multiple computers, making it a powerful tool for big data analytics and machine learning. To make it more accessible, the Apache Spark community developed PySpark, which lets Python programmers work with Resilient Distributed Datasets (RDDs) and leverage Spark's capabilities from within the Python ecosystem.
This article was published as a part of the Data Science Blogathon.
Spark MLlib is Apache Spark’s scalable machine learning library, offering efficient, distributed algorithms for tasks like classification, regression, clustering, and collaborative filtering. It includes tools for feature extraction, transformation, and pipeline building. Built on Spark’s in-memory processing, MLlib handles large-scale data seamlessly and integrates with Spark SQL and DataFrames for streamlined workflows. Ideal for big data applications, it combines high performance with ease of use, making it a powerful choice for scalable machine learning in distributed environments.
All the functionality provided by Apache Spark is built on top of Spark Core. It manages all essential I/O operations and is responsible for task dispatching and fault recovery. Spark Core is built around a special collection called the RDD (Resilient Distributed Dataset), one of Spark's core abstractions. An RDD partitions data across all the nodes in a cluster and holds it in the cluster's memory pool as a single unit. There are two types of operations performed on RDDs: transformations, which lazily define a new RDD from an existing one, and actions, which compute a result and return it to the driver.
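To make the two kinds of operations concrete, here is a minimal PySpark sketch (assuming a SparkSession named spark is already available, as created later in this article): map() is a transformation that lazily defines a new RDD, and reduce() is an action that triggers computation and returns a value.

# minimal sketch of RDD operations, assuming an existing SparkSession `spark`
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
squared = rdd.map(lambda x: x * x)           # transformation: lazily defines a new RDD
total = squared.reduce(lambda a, b: a + b)   # action: triggers computation, returns a value
print(total)                                 # 55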
The Spark SQL component is a distributed framework for structured data processing. Spark SQL provides access to structured and semi-structured data and enables powerful, interactive, analytical applications across both streaming and historical data. DataFrames and SQL provide a common way to access a variety of data sources. Its main features are a cost-based optimizer and mid-query fault tolerance.
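As a quick illustration (the table and column names here are purely illustrative), a DataFrame can be registered as a temporary view and queried with plain SQL:

# illustrative Spark SQL sketch, assuming an existing SparkSession `spark`
people = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
people.createOrReplaceTempView("people")                 # expose the DataFrame to SQL
spark.sql("SELECT name FROM people WHERE age > 40").show()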
Spark Streaming is an add-on to the core Spark API that allows scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming groups the live data into small batches and delivers them to the batch engine for processing, while also providing fault-tolerance guarantees.
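A minimal sketch of the classic streaming word count, assuming text data arrives on a local socket (the host and port are placeholders):

# word-count sketch over a socket stream; host and port are placeholders
from pyspark.streaming import StreamingContext
ssc = StreamingContext(spark.sparkContext, 5)            # 5-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()                                          # print each batch's counts
ssc.start()
ssc.awaitTermination()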
GraphX in Spark is an API for graphs and graph-parallel computation. It serves as a network graph analytics engine and data store. Clustering, classification, traversal, searching, and pathfinding are also possible on graphs.
SparkR provides a distributed data frame implementation for R. It supports operations such as selection, filtering, and aggregation, but on large, distributed datasets.
Spark MLlib is used to perform machine learning in Apache Spark. MLlib consists of popular algorithms and utilities. It is a scalable machine learning library that offers both high-quality algorithms and high speed. It includes machine learning algorithms for regression, classification, clustering, pattern mining, and collaborative filtering. Lower-level machine learning primitives, such as a generic gradient descent optimization algorithm, are also present in MLlib.
ML Algorithms form the core of MLlib. These include common learning algorithms such as classification, regression, clustering, and collaborative filtering.
MLlib standardizes APIs to make it easier to combine multiple algorithms into a single pipeline, or workflow. The key concept is the Pipelines API, where the pipeline idea is inspired by the scikit-learn project.
A Transformer is an algorithm that can transform one DataFrame into another DataFrame. Technically, a Transformer implements a method transform(), which converts one DataFrame into another, generally by appending one or more columns. For example, a feature transformer might read one column of a DataFrame and append a new column with the transformed values, while a learned model might read the features column and append a column of predictions.
An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. Technically, an Estimator implements a method fit(), which accepts a DataFrame and produces a Model, which is itself a Transformer. For example, LogisticRegression is an Estimator, and calling fit() trains a LogisticRegressionModel, which is a Transformer.
Each instance of a Transformer or Estimator has a unique ID, which is useful in specifying parameters (discussed below).
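A tiny sketch makes the distinction concrete (the toy DataFrame here is illustrative): StringIndexer is an Estimator, and the model returned by fit() is a Transformer that appends a new column.

# StringIndexer is an Estimator; fit() returns a Transformer (the fitted model)
from pyspark.ml.feature import StringIndexer
df_demo = spark.createDataFrame([("yes",), ("no",), ("yes",)], ["deposit"])
indexer = StringIndexer(inputCol="deposit", outputCol="label")   # Estimator
indexerModel = indexer.fit(df_demo)                              # produces a Transformer
indexerModel.transform(df_demo).show()                           # appends the 'label' column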
Featurization includes feature extraction, transformation, dimensionality reduction, and selection.
The pipeline workflow executes the data modelling stages in the order in which they are added: each categorical column is indexed and one-hot encoded, the label column is indexed, and all features are assembled into a single vector, as in the example below.
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
# Note: in Spark 3.x, OneHotEncoderEstimator has been renamed to OneHotEncoder.
categoricalColumns = ['job', 'marital', 'education', 'default', 'housing', 'loan']
stages = []
for categoricalCol in categoricalColumns:
    # index each categorical column, then one-hot encode the indexed values
    stringIndexer = StringIndexer(inputCol=categoricalCol, outputCol=categoricalCol + 'Indexer')
    encoder = OneHotEncoderEstimator(inputCols=[stringIndexer.getOutputCol()], outputCols=[categoricalCol + "Vec"])
    stages += [stringIndexer, encoder]
# index the target column 'deposit' to create the numeric label
label_stringIdx = StringIndexer(inputCol='deposit', outputCol='label')
stages += [label_stringIdx]
# assemble the encoded categorical vectors and the numeric columns into one feature vector
numericColumns = ['age', 'balance', 'duration']
assemblerInputs = [c + "Vec" for c in categoricalColumns] + numericColumns
Vassembler = VectorAssembler(inputCols=assemblerInputs, outputCol="features")
stages += [Vassembler]
from pyspark.ml import Pipeline
# df is the bank DataFrame loaded from loan_bank.csv (see the DataFrames section below);
# cols keeps the original column names so they are retained alongside 'label' and 'features'
cols = df.columns
pipeline = Pipeline(stages=stages)
pipelineModel = pipeline.fit(df)
df = pipelineModel.transform(df)
selectedCols = ['label', 'features'] + cols
df = df.select(selectedCols)
DataFrames provide a more user-friendly API than RDDs. The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages. DataFrames also facilitate practical ML pipelines, particularly feature transformations.
from pyspark.sql import SparkSession
# create (or reuse) a Spark session and load the bank marketing data
spark = SparkSession.builder.appName('mlearnsample').getOrCreate()
df = spark.read.csv('loan_bank.csv', header=True, inferSchema=True)
df.printSchema()
Persistence helps in saving and loading algorithms, models, and Pipelines. This reduces time and effort, because a persisted model can be loaded and reused at any time when needed.
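For example, the fitted pipeline model from the earlier example can be saved to disk and loaded back later (the path here is illustrative):

# persistence sketch; the save path is illustrative
from pyspark.ml import PipelineModel
pipelineModel.write().overwrite().save('models/bank_pipeline')   # save the fitted pipeline
sameModel = PipelineModel.load('models/bank_pipeline')           # reload it later for reuse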
from pyspark.ml.classification import LogisticRegression
# split the prepared data into training and test sets (the 70/30 ratio and seed are illustrative)
train, test = df.randomSplit([0.7, 0.3], seed=2018)
lr = LogisticRegression(featuresCol='features', labelCol='label')
lrModel = lr.fit(train)
from pyspark.ml.evaluation import BinaryClassificationEvaluator
# score the held-out test set, then evaluate the area under the ROC curve
predictions = lrModel.transform(test)
predictions.select('age', 'label', 'rawPrediction', 'prediction').show()
evaluator = BinaryClassificationEvaluator()
print('Test Area Under ROC', evaluator.evaluate(predictions))
MLlib also ships utilities for linear algebra, statistics, and data handling. For example, mllib.linalg contains MLlib's linear algebra utilities.
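A short sketch of these low-level utilities (the values are illustrative): dense and sparse vectors from mllib.linalg, and column statistics from mllib.stat.

# illustrative use of the RDD-based linear algebra and statistics utilities
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics
dense = Vectors.dense([1.0, 0.0, 3.0])
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])    # same vector, sparse representation
observations = spark.sparkContext.parallelize([dense, Vectors.dense([2.0, 1.0, 4.0])])
summary = Statistics.colStats(observations)       # column-wise summary statistics
print(summary.mean(), summary.variance())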
Spark MLlib is essential if you are dealing with big data and machine learning. In this article, you learned about the details of Spark MLlib, DataFrames, and Pipelines. In a future article, we will work through hands-on code for implementing Pipelines and building data models using MLlib.
Spark MLlib is used for machine learning tasks like classification, regression, clustering, and recommendation. It helps process large datasets efficiently using distributed computing.
Spark is a big data processing framework for distributed data, while TensorFlow is a deep learning library focused on building and training neural networks. Spark handles large-scale data, and TensorFlow specializes in AI models.
MLlib is the older Spark library for machine learning with RDD-based APIs. ML is the newer library with DataFrame-based APIs, offering a more user-friendly and flexible approach for machine learning tasks.