In this article, we will be working with PySpark's MLlib, commonly known as the machine learning library of PySpark. With it, we can use many of the ML algorithms we are used to from scikit-learn, and we can perform all the operations required in a complete ML pipeline.
The very first step before playing with PySpark is to set up and start the Spark session, and for that we will first import the SparkSession class from the pyspark.sql package.
from pyspark.sql import SparkSession

df_ml = SparkSession.builder.appName('Machine learning example').getOrCreate()
df_ml
Output:
Inference: After importing SparkSession, we used the builder attribute to build our session and named it with the appName() function, which is available on the builder. Finally, we created (or reused) the session with the getOrCreate() function.
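For readers who want a little more control over the session, here is a minimal sketch (not part of the original walkthrough) that also sets the master URL explicitly; local[*] is just an assumption for a local setup.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")                       # run Spark locally, using all available cores
    .appName("Machine learning example")
    .getOrCreate()                            # reuse an existing session if one is already running
)
print(spark.version)                          # confirm which Spark version the session is using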
Before heading towards reading the data, let’s understand what our dataset is!
This dataset is the banknote authentication dataset from Kaggle, and it holds statistical details of both real notes and fake notes. If you want to know more about this dataset, follow this link.
Feature columns: variance, skewness, curtosis, entropy
Target column: Target (a 0/1 label indicating whether the note is genuine or fake)
Now, after creating and setting up the SparkSession, it's time to read the dataset on which we will apply the machine learning operations and, before that, the data preprocessing steps using PySpark.
training_dataset = df_ml.read.csv('/content/bank_notes.csv', header=True, inferSchema=True)
training_dataset
Output:
DataFrame[variance: double, skewness: double, curtosis: double, entropy: double, Target: int]
Inference: Here, with the help of the read.csv() function, we have read the CSV-formatted dataset, passing header=True so that the first row is used for the column names and inferSchema=True so that the real data type of each column is detected.
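As a hedged alternative to inferSchema=True, we can declare the schema ourselves; this is only a sketch, and the column names below are taken from the schema printed later in the article.

from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType

# Explicit schema instead of letting Spark infer it (a sketch, not from the original code)
banknote_schema = StructType([
    StructField("variance", DoubleType(), True),
    StructField("skewness", DoubleType(), True),
    StructField("curtosis", DoubleType(), True),
    StructField("entropy", DoubleType(), True),
    StructField("Target", IntegerType(), True),
])

training_dataset = df_ml.read.csv('/content/bank_notes.csv', header=True, schema=banknote_schema)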
Let’s see our dataset now
training_dataset.show()
Output:
Inference: Here, we can see the top 20 rows of the dataset with the help of the show() function
Now we will look at the schema of our banknote authentication dataset, i.e. we will see what data type each column holds and whether it is allowed to contain null values. Let's answer this question with the printSchema() function.
training_dataset.printSchema()
Output:
Inference: After calling the printSchema() method, we can see the data type of each column: the four feature columns (variance, skewness, curtosis, entropy) are of type double, the Target column is an integer, and nullable = true means a column is allowed to contain null values.
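printSchema() only tells us whether a column is allowed to hold nulls; if we actually want to count the nulls, a small sketch like the following (not part of the original article) does the job.

from pyspark.sql.functions import col, count, when

# Count the null values in every column of the DataFrame
training_dataset.select(
    [count(when(col(c).isNull(), c)).alias(c) for c in training_dataset.columns]
).show()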
So far we have seen the complete schema of our dataset, but this is not something we want to print every time; often we just want to see which columns are there, so let's figure that out.
training_dataset.columns
Output:
['variance', 'skewness', 'curtosis', 'entropy', 'Target']
Inference: By using the columns attribute, we can see which columns are in the data, returned as a plain Python list.
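Two quick follow-ups on the same idea, shown here as a small sketch: the length of that list gives the column count, and dtypes pairs each column with its data type.

print(len(training_dataset.columns))   # number of columns, 5 in this dataset
print(training_dataset.dtypes)         # list of (column name, data type) pairs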
VectorAssembler is the feature transformer that brings all the feature (independent) columns together into one column; in short, it stacks the feature columns into a single vector-typed column. Instead of dealing with multiple columns, we then only need to care about that one column, because it holds all the data we need to train our model.
# ["variance", "skewness","curtosis", "entropy"] -------> new feature -------> independent feature from pyspark.ml.feature import VectorAssembler featassembler = VectorAssembler(inputCols=["variance", "skewness","curtosis", "entropy"], outputCol = "Independent Features" ) featassembler
Output:
VectorAssembler_835df549891f
Code breakdown: We imported VectorAssembler from pyspark.ml.feature, passed the four feature columns as inputCols, and named the combined vector column "Independent Features" via outputCol. Printing the assembler object simply shows its identifier.
In this section, we will transform our dataset, i.e. we will add the assembled Independent Features column to the original dataset. As we all know, an ML algorithm needs clean and well-arranged data, so this transformation step is a must.
result = featassembler.transform(training_dataset)
result.show()
Output:
Inference: By calling transform() on the assembler object, we have successfully added the Independent Features column as the last column of the DataFrame.
Technically, our dataset should now hold one more column, i.e. the Independent Features column; let's check that by using the columns attribute on the variable that holds the resulting DataFrame.
result.columns
Output:
['variance', 'skewness', 'curtosis', 'entropy', 'Target', 'Independent Features']
Yes, it does! We have our new column at the end of the dataset, but do we still need the individual columns like variance, skewness, curtosis, and entropy?
No, because their values are already packed into the column we created with the VectorAssembler. So, in the end, we should only keep 2 columns from the dataset: Independent Features and Target.
That is exactly what we do here: we simply build a final dataset consisting of only these 2 columns.
final_data = result.select("Independent Features", "Target")
final_data.show()
Output:
Inference: Here, with the help of select(), we filtered out just the grouped feature column and the target column, so our dataset now has only 2 columns, and these are the only ones we care about from now on.
Now, as we know, the train-test split is one of the standard steps in the machine learning pipeline: we divide the data into a training set and a testing set so that the model can be checked against data it has never seen. If we trained and evaluated the model on the whole dataset, we would have no way of spotting overfitting, hence we should always split the data into training and testing sets.
In PySpark, we will be using the randomSplit() function to divide the data into training and testing sets.
train_data, test_data = final_data.randomSplit([0.75, 0.25])
Inference: Here we are breaking the data into 75% training and 25% testing data using the randomSplit() function, and the two pieces are stored in the train_data and test_data variables respectively.
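If we want the split to be reproducible between runs, randomSplit() also accepts a seed; the snippet below is a small sketch, and seed=42 is just an arbitrary choice.

# Seeded split so the 75/25 partition is the same on every run (sketch)
train_data, test_data = final_data.randomSplit([0.75, 0.25], seed=42)
print(train_data.count(), test_data.count())   # rough sanity check of the split sizes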
Now that we have split our dataset and have our training set, it's time to build our model on the training data and then test that same model on the testing data. Although the Target is a 0/1 label, for this walkthrough we treat it as a numeric value and use the Linear Regression algorithm.
from pyspark.ml.regression import LinearRegression

model = LinearRegression(featuresCol='Independent Features', labelCol='Target')
model = model.fit(train_data)
Code breakdown: We imported LinearRegression from pyspark.ml.regression, pointed featuresCol at the assembled "Independent Features" column and labelCol at "Target", and then fitted the model on train_data with the fit() method.
Anyone who knows the mathematical intuition behind linear regression can easily pick up what the coefficients and intercept represent. For those who are not aware of it, we will discuss it in a nutshell.
Equation of linear regression: y = c + b*x, where:
y is the predicted (target) value,
c is the intercept,
b is the coefficient (slope) of the feature, and
x is the feature value.
With several features, as in our case, there is one coefficient b per feature.
Coefficients
model.coefficients
Output:
DenseVector([-0.142, -0.0786, -0.1014, 0.0])
Inference: In the output, we can see that it has given us the regression coefficients as a DenseVector, i.e. all the coefficients in vector format, one per feature column.
Intercepts
model.intercept
Output:
0.7994252652531315
Inference: In the output, we can see the intercept of our model; it is the value the model predicts for the target when all the feature values are zero.
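To see the equation y = c + b*x in action, here is a small sketch (not from the original article) that reproduces the model's prediction for a single test row from the coefficients and the intercept.

# Manually rebuild the prediction for one row: intercept + coefficients · features
row = test_data.first()
features = row["Independent Features"]                     # the assembled feature vector
manual_prediction = model.intercept + float(model.coefficients.dot(features))
print(manual_prediction)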
So here we have come to the section where we will see how our model performed after all the training it went through, and we call it the Prediction section.
prediction_result = model.evaluate(test_data)
prediction_result
Output:
Inference: Here, to predict the results, we pass the testing data to the evaluate() method so the predictions are made on unseen data; in the output we can see that it returns a pyspark.ml.regression.LinearRegressionSummary object.
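The LinearRegressionSummary object also carries the usual regression metrics for the test data; the sketch below prints a few of them.

print(prediction_result.r2)                      # coefficient of determination
print(prediction_result.rootMeanSquaredError)    # RMSE on the test data
print(prediction_result.meanAbsoluteError)       # MAE on the test data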
prediction_result.predictions.show()
Output:
Inference: Now, with the help of the show() method, we can easily compare how close the predicted values are to the actual values: it returns a DataFrame where the predicted and actual values sit side by side, based on the DataFrame we built previously.
In this section, we summarize in a nutshell what we have covered so far: first we started our Spark session and imported the banknote authentication dataset; then we learned about VectorAssembler and implemented it; and finally we followed the remaining steps of the ML pipeline to make predictions.
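For reference, the same workflow can be condensed into a single pyspark.ml Pipeline; the sketch below mirrors the steps covered above (the file path and column names are the ones assumed throughout this article).

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName('Machine learning example').getOrCreate()
data = spark.read.csv('/content/bank_notes.csv', header=True, inferSchema=True)

# The same two stages used in the article, chained as one Pipeline
assembler = VectorAssembler(
    inputCols=["variance", "skewness", "curtosis", "entropy"],
    outputCol="Independent Features",
)
lr = LinearRegression(featuresCol="Independent Features", labelCol="Target")

train_data, test_data = data.randomSplit([0.75, 0.25], seed=42)
pipeline_model = Pipeline(stages=[assembler, lr]).fit(train_data)
pipeline_model.transform(test_data).select("Target", "prediction").show(5)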
Here’s the repo link to this article. I hope you liked my article on Introduction to PySpark MLlib. If you have any opinions or questions, then comment below.
Connect with me on LinkedIn for further discussion.