Want to Build Machine Learning Pipelines? A Quick Introduction using PySpark

Lakshay arora Last Updated : 22 Apr, 2020


Overview

Here’s a quick introduction to building machine learning pipelines using PySpark.
The ability to build these machine learning pipelines is a must-have skill for any aspiring data scientist.
This is a hands-on article with a structured PySpark code approach – so get your favorite Python IDE ready!

Introduction

Take a moment to ponder this – what are the skills an aspiring data scientist needs to possess to land an industry role?

A machine learning project has a lot of moving components that need to be tied together before we can successfully execute it. Knowing how to build an end-to-end machine learning pipeline is a prized asset. As a data scientist (aspiring or established), you should know how these machine learning pipelines work.

This is, to put it simply, the amalgamation of two disciplines – data science and software engineering. These two go hand-in-hand for a data scientist. It isn’t just about building models – we need to have the software skills to build enterprise-level systems.

So in this article, we will focus on the basic idea behind building these machine learning pipelines using PySpark. This is a hands-on article so fire up your favorite Python IDE and let’s get going!

Note: This is part 2 of my PySpark for beginners series. You can check out the introductory article below:

PySpark for Beginners – Take your First Steps into Big Data Analytics (with code)

Table of Contents

Perform Basic Operations on a Spark Dataframe
1. Reading a CSV file
2. Defining the Schema
3. Drop columns from the data
Data Exploration using PySpark
1. Check the Data Dimensions
2. Describe the Data
3. Missing Values Count
4. Value Counts of a Column
Encode Categorical Variables using PySpark
1. String Indexing
2. One-Hot Encoding
Vector Assembler
Building Machine Learning Pipelines using PySpark
1. Transformers and Estimators
2. Examples of Pipelines

Perform Basic Operations on a Spark Dataframe

An essential (and first) step in any data science project is to understand the data before building any Machine Learning model. Most data science aspirants stumble here – they just don’t spend enough time understanding what they’re working with. There’s a tendency to rush in and build models – a fallacy you must avoid.

We will follow this principle in this article. I’ll follow a structured approach throughout to ensure we don’t miss out on any critical step.

So first, let’s take a moment and understand each variable we’ll be working with here. We are going to use a dataset from a recently concluded India vs Bangladesh cricket match. Let’s see the different variables we have in the dataset:

Batsman: Unique batsman id (Integer)
Batsman_Name: Name of the batsman (String)
Bowler: Unique bowler id (Integer)
Bowler_Name: Name of the bowler (String)
Commentary: Description of the event as broadcast (String)
Detail: Another string describing the events such as wickets and extra deliveries (String)
Dismissed: Unique Id of the batsman if dismissed (String)
Id: Unique row id (String)
Isball: Whether the delivery was legal or not (Boolean)
Isboundary: Whether the batsman hit a boundary or not (Binary)
Iswicket: Whether the batsman was dismissed or not (Binary)
Over: Over number (Double)
Runs: Runs on that particular delivery (Integer)
Timestamp: Time at which the data was recorded (Timestamp)

So let’s begin, shall we?

Reading a CSV file

When we start an interactive PySpark session (such as the pyspark shell), a SparkSession is already available under the name ‘spark’. We can use it to read multiple types of files, such as CSV, JSON, TEXT, etc., and load the data into a Spark dataframe.

By default, the CSV reader treats every column as a string. You can check the data types by using the printSchema function on the dataframe:

	# read a csv file
	my_data = spark.read.csv('ind-ban-comment.csv',header=True)

	# see the default schema of the dataframe
	my_data.printSchema()


Defining the Schema

Now, we do not want all the columns in our dataset to be treated as strings. So what can we do about that?

We can define the custom schema for our dataframe in Spark. For this, we need to create an object of StructType which takes a list of StructField. And of course, we should define StructField with a column name, the data type of the column and whether null values are allowed for the particular column or not.

Refer to the below code snippet to understand how to create this custom schema:

	import pyspark.sql.types as tp

	# define the schema
	my_schema = tp.StructType([
	tp.StructField(name= 'Batsman', dataType= tp.IntegerType(), nullable= True),
	tp.StructField(name= 'Batsman_Name', dataType= tp.StringType(), nullable= True),
	tp.StructField(name= 'Bowler', dataType= tp.IntegerType(), nullable= True),
	tp.StructField(name= 'Bowler_Name', dataType= tp.StringType(), nullable= True),
	tp.StructField(name= 'Commentary', dataType= tp.StringType(), nullable= True),
	tp.StructField(name= 'Detail', dataType= tp.StringType(), nullable= True),
	tp.StructField(name= 'Dismissed', dataType= tp.IntegerType(), nullable= True),
	tp.StructField(name= 'Id', dataType= tp.IntegerType(), nullable= True),
	tp.StructField(name= 'Isball', dataType= tp.BooleanType(), nullable= True),
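	# the 0/1 flags are read as integers: BinaryType holds raw bytes and is not a type VectorAssembler accepts
	tp.StructField(name= 'Isboundary', dataType= tp.IntegerType(), nullable= True),
	tp.StructField(name= 'Iswicket', dataType= tp.IntegerType(), nullable= True),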
	tp.StructField(name= 'Over', dataType= tp.DoubleType(), nullable= True),
	tp.StructField(name= 'Runs', dataType= tp.IntegerType(), nullable= True),
	tp.StructField(name= 'Timestamp', dataType= tp.TimestampType(), nullable= True)
	])

	# read the data again with the defined schema
	my_data = spark.read.csv('ind-ban-comment.csv',schema= my_schema,header= True)

	# print the schema
	my_data.printSchema()


Drop columns from the data

In any machine learning project, we always have a few columns that are not required for solving the problem. I’m sure you’ve come across this dilemma before as well, whether that’s in the industry or in an online hackathon.

In our case, we can use the drop function to remove columns from the data. Use the asterisk (*) sign before the list to unpack it and drop multiple columns from the dataset:

	# drop the columns that are not required
	my_data = my_data.drop(*['Batsman', 'Bowler', 'Id'])
	my_data.columns


Data Exploration using PySpark

Check the Data Dimensions

Unlike Pandas, Spark dataframes do not have a shape attribute to check the dimensions of the data. We can instead use the code below, which counts the rows and the columns separately:

	# get the dimensions of the data
	(my_data.count() , len(my_data.columns))
	# >> (605, 11)


Describe the Data

Spark’s describe function gives us most of the statistical results like mean, count, min, max, and standard deviation. You can use the summary function to get the quartiles of the numeric variables as well:

	# get the summary of the numerical columns
	my_data.select('Isball', 'Isboundary', 'Runs').describe().show()
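
To make the reported statistics concrete without a Spark session, here is a minimal plain-Python sketch that computes the same kind of summary for a single numeric column. The runs list is made-up illustration data, not the cricket dataset, and the use of the sample standard deviation reflects my understanding of what describe reports rather than something stated in this article.

```python
import statistics

# Hypothetical per-delivery runs, used only for illustration
runs = [0, 1, 4, 0, 6, 2, 1]

# The same five statistics that a describe-style summary lists for a numeric column
summary = {
    "count": len(runs),
    "mean": statistics.mean(runs),
    "stddev": statistics.stdev(runs),  # sample standard deviation (n - 1 in the denominator)
    "min": min(runs),
    "max": max(runs),
}

for name, value in summary.items():
    print(name, value)
```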


Missing Values Count

It’s rare to get a dataset without any missing values. Can you remember the last time that happened?

It is important to check the number of missing values present in all the columns. Knowing the count helps us treat the missing values before building any machine learning model using that data.

So, you can use the code below to find the null value count in your dataset:

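	# import the PySpark SQL functions module
	import pyspark.sql.functions as f

	# null values in each column: when() yields 1 for a null entry and null otherwise,
	# and count() only counts the non-null results
	data_agg = my_data.agg(*[f.count(f.when(f.isnull(c), 1)).alias(c) for c in my_data.columns])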
	data_agg.show()
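
The counting logic can also be sketched in plain Python, which may help when checking the Spark result on a small sample. The rows and the helper name null_counts below are hypothetical and exist only for this illustration; Spark performs the equivalent aggregation across the whole distributed dataframe.

```python
# Hypothetical rows standing in for a small slice of a dataframe
rows = [
    {"Batsman_Name": "Batsman A", "Runs": 4, "Dismissed": None},
    {"Batsman_Name": None, "Runs": 0, "Dismissed": None},
    {"Batsman_Name": "Batsman B", "Runs": None, "Dismissed": 7},
]

def null_counts(rows):
    """Count the None values in each column of a list of row dictionaries."""
    columns = rows[0].keys()
    return {c: sum(1 for row in rows if row[c] is None) for c in columns}

print(null_counts(rows))
```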


Value Counts of a Column

Unlike Pandas, we do not have the value_counts() function in Spark dataframes. You can use the groupBy function to calculate the unique value counts of categorical variables:

	# value counts of Batsman_Name column
	my_data.groupBy('Batsman_Name').count().show()
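
For comparison, the same idea in plain Python is a frequency count per distinct value. This is only a sketch with made-up names; it is not how Spark computes the result.

```python
from collections import Counter

# Hypothetical column values, one entry per delivery
names = ["Batsman A", "Batsman B", "Batsman A", "Batsman C", "Batsman A", "Batsman B"]

# Counter groups equal values and counts them, like groupBy(...).count()
value_counts = Counter(names)

# most_common() lists the values from most to least frequent
for name, count in value_counts.most_common():
    print(name, count)
```

Note that the Spark version does not guarantee any particular row order; if you want the most frequent values first, you can add .orderBy('count', ascending=False) before .show().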


Encode Categorical Variables using PySpark

Most machine learning algorithms accept the data only in numerical form. So, it is essential to convert any categorical variables present in our dataset into numbers.

Remember that we cannot simply drop them from our dataset as they might contain useful information. It would be a nightmare to lose that just because we don’t want to figure out how to use them!

Let’s see some of the methods to encode categorical variables using PySpark.

String Indexing

String Indexing is similar to Label Encoding. It assigns a unique integer value to each category. 0 is assigned to the most frequent category, 1 to the next most frequent value, and so on. We have to define the input column name that we want to index and the output column name in which we want the results:

	# Note: in Spark 3.0 and later, OneHotEncoderEstimator was renamed to OneHotEncoder
	from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator

	# create object of StringIndexer class and specify input and output column
	SI_batsman = StringIndexer(inputCol='Batsman_Name',outputCol='Batsman_Index')
	SI_bowler = StringIndexer(inputCol='Bowler_Name',outputCol='Bowler_Index')

	# transform the data
	my_data = SI_batsman.fit(my_data).transform(my_data)
	my_data = SI_bowler.fit(my_data).transform(my_data)

	# view the transformed data
	my_data.select('Batsman_Name', 'Batsman_Index', 'Bowler_Name', 'Bowler_Index').show(10)
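
The frequency-based assignment described above can be sketched in a few lines of plain Python. The function name fit_string_index and the bowler names are hypothetical, and breaking ties alphabetically is an assumption of this sketch rather than a documented guarantee for every Spark version.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index: 0.0 for the most frequent, 1.0 for the next, and so on."""
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumption of this sketch)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

bowlers = ["Bowler X", "Bowler Y", "Bowler X", "Bowler Z", "Bowler X", "Bowler Y"]

mapping = fit_string_index(bowlers)        # the "fit" step learns the mapping
indexed = [mapping[b] for b in bowlers]    # the "transform" step applies it

print(mapping)
print(indexed)
```

The split into a learning step and an applying step is the reason the Spark code calls fit() before transform().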


One-Hot Encoding

One-hot encoding is a concept every data scientist should know. I’ve relied on it multiple times when dealing with missing values. It’s a lifesaver!

Here’s the caveat – Spark’s one-hot encoder does not work on string columns directly; it expects numeric category indices.

First, we need to use the StringIndexer to convert the variable into numerical form, and then use OneHotEncoderEstimator, which can encode multiple columns of the dataset at once.

It creates a Sparse Vector for each row:

	# create object and specify input and output column
	OHE = OneHotEncoderEstimator(inputCols=['Batsman_Index', 'Bowler_Index'],outputCols=['Batsman_OHE', 'Bowler_OHE'])

	# transform the data
	my_data = OHE.fit(my_data).transform(my_data)

	# view the transformed data
	my_data.select('Batsman_Name', 'Batsman_Index', 'Batsman_OHE', 'Bowler_Name', 'Bowler_Index', 'Bowler_OHE').show(10)
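
When reading the output, it helps to know that Spark's encoder drops the last category by default (its dropLast parameter is True), so a column with n categories becomes a vector of size n - 1 and the last category is encoded as all zeros. The plain-Python sketch below illustrates that layout with a dense list; the function name one_hot is hypothetical and this is not Spark's implementation.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    With drop_last=True the vector has num_categories - 1 slots and the
    last category is represented by a vector of zeros.
    """
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[int(index)] = 1.0
    return vector

# Three categories with indices 0, 1 and 2
for idx in (0, 1, 2):
    print(idx, one_hot(idx, num_categories=3))
```

Spark stores the same information as a sparse vector, which is displayed as (size, [positions of non-zero entries], [values]).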


Vector Assembler

A vector assembler combines a given list of columns into a single vector column.

This is typically used at the end of the data exploration and pre-processing steps. At this stage, we usually work with a few raw or transformed features that can be used to train our model.

The Vector Assembler converts them into a single feature column in order to train the machine learning model (such as Logistic Regression). It accepts numeric, boolean and vector type columns:

	from pyspark.ml.feature import VectorAssembler

	# specify the input and output columns of the vector assembler
	assembler = VectorAssembler(inputCols=['Isboundary',
	'Iswicket',
	'Over',
	'Runs',
	'Batsman_Index',
	'Bowler_Index',
	'Batsman_OHE',
	'Bowler_OHE'],
	outputCol='vector')

	# fill the null values
	my_data = my_data.fillna(0)

	# transform the data
	final_data = assembler.transform(my_data)

	# view the transformed vector
	final_data.select('vector').show()
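
Conceptually, the assembler walks through the input columns in order and concatenates their values into one flat vector. The following plain-Python sketch shows that idea for a single row; the function name assemble and the row values are hypothetical.

```python
def assemble(row, input_cols):
    """Concatenate scalar and list-valued columns into one flat feature list."""
    features = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (list, tuple)):
            # vector column: append every element
            features.extend(float(v) for v in value)
        else:
            # numeric or boolean column: append a single element
            features.append(float(value))
    return features

row = {"Over": 0.1, "Runs": 4, "Isboundary": True, "Batsman_OHE": [0.0, 1.0]}
print(assemble(row, ["Isboundary", "Over", "Runs", "Batsman_OHE"]))
```

The order of inputCols therefore determines the position of each feature in the output vector.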


Building Machine Learning Pipelines using PySpark

A machine learning project typically involves steps like data preprocessing, feature extraction, model fitting and evaluating results. We need to perform a lot of transformations on the data in sequence. As you can imagine, keeping track of them can potentially become a tedious task.

This is where machine learning pipelines come in.

A pipeline allows us to maintain the data flow of all the relevant transformations that are required to reach the end result.

We need to define the stages of the pipeline which act as a chain of command for Spark to run. Here, each stage is either a Transformer or an Estimator.

Transformers and Estimators

As the name suggests, Transformers convert one dataframe into another through the transform() method, either by updating the values of a particular column (like converting categorical columns to numeric) or by mapping them to other values using a defined logic. In practice, a Transformer usually appends its result as a new column.

An Estimator implements the fit() method on a dataframe and produces a model. For example, LogisticRegression is an Estimator that trains a classification model when we call the fit() method.
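
The contract between the two can be sketched in plain Python. The classes MeanCenter and MeanCenterModel below are hypothetical and are not part of PySpark; they only show that fit() learns something from the data and returns an object whose transform() applies it.

```python
class MeanCenterModel:
    """Transformer: holds a learned mean and adds a centered column."""

    def __init__(self, input_col, output_col, mean):
        self.input_col = input_col
        self.output_col = output_col
        self.mean = mean

    def transform(self, rows):
        # Return new rows with one extra column; the input rows are left untouched
        return [dict(row, **{self.output_col: row[self.input_col] - self.mean}) for row in rows]


class MeanCenter:
    """Estimator: fit() learns the mean and returns a MeanCenterModel."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        mean = sum(row[self.input_col] for row in rows) / len(rows)
        return MeanCenterModel(self.input_col, self.output_col, mean)


rows = [{"Runs": 0}, {"Runs": 4}, {"Runs": 2}]
model = MeanCenter("Runs", "Runs_centered").fit(rows)   # Estimator -> fitted model
centered = model.transform(rows)                         # the fitted model is a Transformer
print(centered)
```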

Let’s understand this with the help of some examples.

Examples of Pipelines

Let’s create a sample dataframe with three columns as shown below. Here, we will define some of the stages in which we want to transform the data and see how to set up the pipeline:

	from pyspark.ml import Pipeline

	# create a sample dataframe
	sample_df = spark.createDataFrame([
	(1, 'L101', 'R'),
	(2, 'L201', 'C'),
	(3, 'D111', 'R'),
	(4, 'F210', 'R'),
	(5, 'D110', 'C')
	], ['id', 'category_1', 'category_2'])

	sample_df.show()


We have created the dataframe. Suppose we have to transform the data in the below order:

stage_1: Label Encode or String Index the column category_1
stage_2: Label Encode or String Index the column category_2
stage_3: One-Hot Encode the indexed column category_2

At each stage, we will pass the input and output column names, and set up the pipeline by passing the defined stages as a list to the Pipeline object.

The pipeline model then runs these stages one by one in sequence and gives us the end result. Let’s see how to implement the pipeline:

	# define stage 1 : transform the column category_1 to numeric
	stage_1 = StringIndexer(inputCol= 'category_1', outputCol= 'category_1_index')
	# define stage 2 : transform the column category_2 to numeric
	stage_2 = StringIndexer(inputCol= 'category_2', outputCol= 'category_2_index')
	# define stage 3 : one hot encode the numeric category_2 column
	stage_3 = OneHotEncoderEstimator(inputCols=['category_2_index'], outputCols=['category_2_OHE'])

	# setup the pipeline
	pipeline = Pipeline(stages=[stage_1, stage_2, stage_3])

	# fit the pipeline model and transform the data as defined
	pipeline_model = pipeline.fit(sample_df)
	sample_df_updated = pipeline_model.transform(sample_df)

	# view the transformed data
	sample_df_updated.show()
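
To see what fitting a pipeline involves, here is a plain-Python sketch of the control flow: each stage is fitted on the data produced so far, the fitted stage transforms the data, and the result is handed to the next stage. All class names here (Indexer, IndexerModel, SimplePipeline, SimplePipelineModel) are hypothetical stand-ins, and the alphabetical tie-break in the indexer is an assumption of this sketch.

```python
from collections import Counter


class IndexerModel:
    """Fitted stage (Transformer): applies a learned label -> index mapping."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col, self.output_col, self.mapping = input_col, output_col, mapping

    def transform(self, rows):
        return [dict(row, **{self.output_col: self.mapping[row[self.input_col]]}) for row in rows]


class Indexer:
    """Unfitted stage (Estimator): learns the mapping from the data in fit()."""

    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # most frequent label first; ties broken alphabetically (assumption of this sketch)
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        mapping = {label: float(i) for i, label in enumerate(ordered)}
        return IndexerModel(self.input_col, self.output_col, mapping)


class SimplePipelineModel:
    """Fitted pipeline: applies every fitted stage in order."""

    def __init__(self, fitted_stages):
        self.fitted_stages = fitted_stages

    def transform(self, rows):
        for stage in self.fitted_stages:
            rows = stage.transform(rows)
        return rows


class SimplePipeline:
    """Unfitted pipeline: an ordered list of stages."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted_stages = []
        for stage in self.stages:
            # Estimators are fitted first; stages that only transform are used as they are
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # later stages see the columns added so far
            fitted_stages.append(model)
        return SimplePipelineModel(fitted_stages)


sample_rows = [
    {"id": 1, "category_1": "L101", "category_2": "R"},
    {"id": 2, "category_1": "L201", "category_2": "C"},
    {"id": 3, "category_1": "D111", "category_2": "R"},
    {"id": 4, "category_1": "F210", "category_2": "R"},
    {"id": 5, "category_1": "D110", "category_2": "C"},
]

pipeline = SimplePipeline([
    Indexer("category_1", "category_1_index"),
    Indexer("category_2", "category_2_index"),
])
result = pipeline.fit(sample_rows).transform(sample_rows)
for row in result:
    print(row)
```

Because the fitted pipeline keeps the learned mappings, the same object can later transform new data without repeating the fitting step.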


Now, let’s take a more complex example of setting up a pipeline. Here, we will do transformations on the data and build a logistic regression model.

For this, we will create a sample dataframe which will be our training dataset with four features and the target label:

	from pyspark.ml.classification import LogisticRegression

	# create a sample dataframe with 4 features and 1 label column
	sample_data_train = spark.createDataFrame([
	(2.0, 'A', 'S10', 40, 1.0),
	(1.0, 'X', 'E10', 25, 1.0),
	(4.0, 'X', 'S20', 10, 0.0),
	(3.0, 'Z', 'S10', 20, 0.0),
	(4.0, 'A', 'E10', 30, 1.0),
	(2.0, 'Z', 'S10', 40, 0.0),
	(5.0, 'X', 'D10', 10, 1.0),
	], ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'label'])

	# view the data
	sample_data_train.show()


Now, suppose this is the order of our pipeline:

stage_1: Label Encode or String Index the column feature_2
stage_2: Label Encode or String Index the column feature_3
stage_3: One Hot Encode the indexed column of feature_2 and feature_3
stage_4: Create a vector of all the features required to train a Logistic Regression model
stage_5: Build a Logistic Regression model

We have to define the stages by providing the input column name and output column name. The final stage would be to build a logistic regression model. And in the end, when we run the pipeline on the training dataset, it will run the steps in a sequence and add new columns to the dataframe (like rawPrediction, probability, and prediction).

	# define stage 1: transform the column feature_2 to numeric
	stage_1 = StringIndexer(inputCol= 'feature_2', outputCol= 'feature_2_index')
	# define stage 2: transform the column feature_3 to numeric
	stage_2 = StringIndexer(inputCol= 'feature_3', outputCol= 'feature_3_index')
	# define stage 3: one hot encode the numeric versions of feature 2 and 3 generated from stage 1 and stage 2
	stage_3 = OneHotEncoderEstimator(inputCols=[stage_1.getOutputCol(), stage_2.getOutputCol()],
	outputCols= ['feature_2_encoded', 'feature_3_encoded'])
	# define stage 4: create a vector of all the features required to train the logistic regression model
	stage_4 = VectorAssembler(inputCols=['feature_1', 'feature_2_encoded', 'feature_3_encoded', 'feature_4'],
	outputCol='features')
	# define stage 5: logistic regression model
	stage_5 = LogisticRegression(featuresCol='features',labelCol='label')

	# setup the pipeline
	regression_pipeline = Pipeline(stages= [stage_1, stage_2, stage_3, stage_4, stage_5])

	# fit the pipeline on the training data
	model = regression_pipeline.fit(sample_data_train)
	# transform the data
	sample_data_train = model.transform(sample_data_train)

	# view some of the columns generated
	sample_data_train.select('features', 'label', 'rawPrediction', 'probability', 'prediction').show()
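
The three output columns are closely related. As I understand it, for a binary logistic regression the raw prediction is the pair [-margin, margin], where the margin is the log-odds of class 1, the probability applies the sigmoid function to that margin, and the prediction compares the class 1 probability with a threshold (0.5 by default). The sketch below illustrates that relationship in plain Python; the function name predict_binary is hypothetical and the exact layout should be treated as an assumption, not as something this article's output confirms.

```python
import math

def predict_binary(margin, threshold=0.5):
    """Turn a raw margin (log-odds of class 1) into a probability pair and a class label."""
    p1 = 1.0 / (1.0 + math.exp(-margin))        # sigmoid of the margin
    raw_prediction = [-margin, margin]           # one score per class
    probability = [1.0 - p1, p1]                 # probabilities of class 0 and class 1
    prediction = 1.0 if p1 > threshold else 0.0  # class 1 only if it clears the threshold
    return raw_prediction, probability, prediction

for margin in (2.0, -1.0):
    raw, prob, pred = predict_binary(margin)
    print(raw, [round(p, 4) for p in prob], pred)
```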


Congrats! We have successfully set up the pipeline. Let’s create a sample test dataset without the labels and this time, we do not need to define all the steps again. We will just pass the data through the pipeline and we are done!

	# create a sample data without the labels
	sample_data_test = spark.createDataFrame([
	(3.0, 'Z', 'S10', 40),
	(1.0, 'X', 'E10', 20),
	(4.0, 'A', 'S20', 10),
	(3.0, 'A', 'S10', 20),
	(4.0, 'X', 'D10', 30),
	(1.0, 'Z', 'E10', 20),
	(4.0, 'A', 'S10', 30),
	], ['feature_1', 'feature_2', 'feature_3', 'feature_4'])

	# transform the data using the pipeline
	sample_data_test = model.transform(sample_data_test)

	# see the prediction on the test data
	sample_data_test.select('features', 'rawPrediction', 'probability', 'prediction').show()
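
One thing to keep in mind when reusing a fitted pipeline: the test data above only contains categories that were present in the training data. StringIndexer has a handleInvalid parameter that controls what happens otherwise; it defaults to 'error' and also accepts 'skip' and 'keep'. The plain-Python sketch below illustrates the three policies. The function name index_label, the fitted mapping, and the label 'Q99' are hypothetical.

```python
def index_label(label, mapping, handle_invalid="error"):
    """Look up a label in a fitted mapping, applying a policy for unseen labels."""
    if label in mapping:
        return mapping[label]
    if handle_invalid == "keep":
        # all unseen labels share one extra index after the known ones
        return float(len(mapping))
    if handle_invalid == "skip":
        # signal to the caller that the row should be dropped
        return None
    raise ValueError(f"Unseen label: {label!r}")

# Hypothetical mapping learned from training data
fitted = {"S10": 0.0, "E10": 1.0, "D10": 2.0, "S20": 3.0}

print(index_label("E10", fitted))
print(index_label("Q99", fitted, handle_invalid="keep"))
print(index_label("Q99", fitted, handle_invalid="skip"))
```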


Perfect!

End Notes

This was a short but intuitive article on how to build machine learning pipelines using PySpark. I’ll reiterate it because it’s that important – you need to know how these pipelines work. This is a big part of your role as a data scientist.

Have you worked on an end-to-end machine learning project before? Or been a part of a team that built these pipelines in an industry setting? Let’s connect in the comments section below and discuss.

I’ll see you in the next article in this PySpark for beginners series. Happy learning!

Lakshay arora

Ideas have always excited me. The fact that we could dream of something and bring it to reality fascinates me. Computer Science provides me a window to do exactly that. I love programming and use it to solve problems, and I am a beginner in the field of Data Science.


Vijay

I am getting : IllegalArgumentException: 'Data type string of column Isboundary is not supported.\nData type string of column Iswicket is not supported.\nData type string of column Over is not supported.\nData type string of column Runs is not supported.' for # transform the data final_data = assembler.transform(my_data) Is there a recommendation to solve this error.

Lokesh

Excellent Article. Very clear to understand each data cleaning step even for a newbie in analytics. Thanks a lot for much informative article :)

Purnima Sharma

Thanks for the article, very well explained indeed. I was wondering if you could post the building of pipeline using the same example of cricket match.
