This article was published as a part of the Data Science Blogathon.
In this article, we will learn about Linear Regression using MLlib, and everything will be hands-on: we will build an end-to-end linear regression model that predicts a customer's yearly spend on the company's products. The dataset is a dummy dataset, generated purely to illustrate the concepts of model building for continuous data with MLlib.
Before getting into the machine learning process and following the steps to predict the customer's yearly spending, we first need to initialize the Spark session and read our dummy dataset of an e-commerce website, which has all the relevant features.
In this particular section, we will set up the Spark object so that we can create an environment to perform the operations supported and managed by it.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('E-commerce').getOrCreate()
Inference: In the above two lines of code we imported the SparkSession object from PySpark's SQL package and then created the session using the getOrCreate() function. Note that before creating it we built it using the builder function and gave it the name "E-commerce".
In this section, we will be reading the dummy dataset which I’ve created to perform the ML operations along with Data Preprocessing using PySpark.
data = spark.read.csv("Ecommerce_Customers.csv",inferSchema=True,header=True)
Inference: In the above line of code we read the e-commerce data, keeping the inferSchema parameter as True so that the real data type of each column is returned, and the header parameter as True so that the first record is treated as the header.
Here the Schema of the dataset will be shown so that one could get the inference of what kind of data each column holds and then the analysis could be done with more precision.
data.printSchema()
Output:
Inference: We used the printSchema() function to show information about each column that our dataset holds, and from the output one can see what data type each column has.
Now we will go through the dataset using three different ways so that one could also know all the methods to investigate it.
Looking at the data using the show() function, which returns the top 20 rows of the complete data.
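That call is a one-liner (show() can also take a number, e.g. data.show(5), if you only want a few rows):

# Display the first 20 rows of the DataFrame in tabular form
data.show()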
Now the head() function needs to be introduced, which is quite similar to the head function in pandas. In the output of the code below we can see that head() returns a Row object, which holds one complete record/tuple.
data.head()
Output:
Row(Email='[email protected]', Address='835 Frank TunnelWrightmouth, MI 82180-9605', Avatar='Violet', Avg Session Length=34.49726772511229, Time on App=12.65565114916675, Time on Website=39.57766801952616, Length of Membership=4.0826206329529615, Yearly Amount Spent=587.9510539684005)
Now let's look at a cleaner way of getting into the data, where each item is iterated over by combining a for loop with the head function; the output shown is a more readable version of the Row object output.
for item in data.head():
    print(item)
Output:
[email protected] 835 Frank TunnelWrightmouth, MI 82180-9605 Violet 34.49726772511229 12.65565114916675 39.57766801952616 4.0826206329529615 587.9510539684005
As mentioned earlier, we are going to predict the customer's yearly expenditure on products, so we already know we are dealing with continuous data, and when working with this type of data we use the linear regression model.
For that reason, we will be importing the Linear Regression package from the ML library of PySpark.
from pyspark.ml.regression import LinearRegression
In this section, we perform all the data preprocessing required to make the dataset ready for the ML pipeline, so that the model can adapt to it easily and be built efficiently.
Importing the Vectors and VectorAssembler libraries so that we can easily separate the feature columns from the label column, i.e. all the independent columns will be stacked together as the features column and the dependent column will serve as the label column.
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
Let’s have a look at which columns are present in our dataset.
data.columns
Output:
Inference: From the above output all the columns are listed in the form of a Python list, but this alone does not give us enough information about which columns to select, so we will use the describe method.
data.describe()
Output:
DataFrame[summary: string, Email: string, Address: string, Avatar: string, Avg Session Length: string, Time on App: string, Time on Website: string, Length of Membership: string, Yearly Amount Spent: string]
Inference: If you go through the output closely, you will find that the columns with a string data type have no role in the model development phase, since machine learning involves mathematical calculations where only numbers are accepted; hence only the integer and double columns are kept.
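If you also want to see the actual summary statistics (count, mean, stddev, min, max) rather than just the schema of the describe output, chaining show() onto describe() will print them, as we do for the train/test splits later on:

# Print summary statistics for every column of the dataset
data.describe().show()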
Based on the above discussion the columns which are selected to be part of the machine learning pipeline are as follows:
assembler = VectorAssembler(
    inputCols=["Avg Session Length", "Time on App", "Time on Website", "Length of Membership"],
    outputCol="features")
Output:
Inference: In the above code we used the VectorAssembler to stack all our feature columns together and return them as a single column named "features" via the outputCol parameter.
output = assembler.transform(data)
Here, the transform function applies the assembler to the real data, so that the stacked "features" column is added to the dataset.
output.select("features").show()
Output:
Now with the select function we have selected only the features column from the dataset and shown it as a DataFrame using the show() function.
final_data = output.select("features",'Yearly Amount Spent')
In the above code, we select the stacked feature vector (named "features") together with the label column and name the result final_data; this DataFrame will be analyzed further in the process.
In this step of the model building, we divide our data into a training set and a testing set: the training data is what the model will be built on, while the testing data is what we use to check how well the model performs.
In MLlib, to divide the data into testing and training sets we use the randomSplit() function, which takes its input in the form of a list.
train_data,test_data = final_data.randomSplit([0.7,0.3])
Inference: With the help of tuple unpacking, we store the training set (70%) in train_data and, similarly, 30% of the dataset in test_data. Note that randomSplit() is passed a list of weights.
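As a side note, randomSplit() also accepts an optional seed if you want the split to be reproducible across runs (the value 42 below is arbitrary):

# Passing a seed makes the 70/30 split reproducible between runs
train_data,test_data = final_data.randomSplit([0.7,0.3], seed=42)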
train_data.describe().show()
Output:
test_data.describe().show()
Output:
Inference: The describe method is a convenient way to analyze and compare the training and testing data; we can see that the training set has 349 records while the testing set has 151.
Finally, we have reached the step where we build our Linear Regression model. For that, the LinearRegression object (which, if you remember, we imported at the start) is used, and we pass the "Yearly Amount Spent" column to the labelCol parameter, as it is our target (dependent) column.
lr = LinearRegression(labelCol='Yearly Amount Spent')
Now that we have created our Linear Regression object, we can fit our data, i.e. train the model by passing the training data to the fit method.
lrModel = lr.fit(train_data)
Now, let's print the coefficients of each feature and the intercept of the model trained on the training dataset; this information tells you how much weight the model gives each independent variable.
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))
Output:
Coefficients: [25.324513354618116,38.880247333555445,0.20347373150823037,61.82593066961652] Intercept: -1031.8607952442187
In this step we will evaluate our model, i.e. analyze how well it performed; at this stage of model building we decide whether the existing model is good enough to move forward to the deployment stage.
For evaluation we use the evaluate function, storing its result in the test_results variable, which we will use for further analysis.
test_results = lrModel.evaluate(test_data)
Anyone who knows the mathematical intuition behind linear regression will be aware that residual = original result − predicted result, i.e. the difference between the actual value of the label column and the value predicted by the model.
test_results.residuals.show()
Output:
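Besides the residuals, the summary object returned by evaluate() also exposes the standard regression metrics; here is a minimal sketch printing a few of them:

# Standard regression metrics from the evaluation summary
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))
print("R2: {}".format(test_results.r2))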
Now it's time to make predictions from our model. For that we first store the unlabelled data, i.e. only the features column, and then transform it so that the predictions are generated.
unlabeled_data = test_data.select('features')
predictions = lrModel.transform(unlabeled_data)
predictions.show()
Output:
Inference: From the above output we can see that it returned a DataFrame with two columns: the stacked features column and the prediction column.
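If you would rather see the predicted values next to the actual labels, one option is to transform the labelled test set directly and select both columns (a small sketch using the same trained model):

# Show predictions alongside the actual yearly spend for the test set
lrModel.transform(test_data).select('Yearly Amount Spent','prediction').show()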
So, in this section let's recap what we have learned in this article: in a nutshell, we have gone through a complete machine learning pipeline for the linear regression algorithm.
Here’s the repo link to this article. I hope you liked my article on Introduction to Linear Regression using MLIB. If you have any opinions or questions, then comment below.
Connect with me on LinkedIn for further discussion.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.