Suppose you are working on a practice problem about house rent, with many data points and input features. The usual workflow is to perform EDA, preprocess the data (possibly creating additional features), and feed the result to a model. In this scenario, even the simplest multivariable Linear Regression model can grow large because of all the input features and parameters, and re-training it every time you need it is time-consuming.
So the simplest thing to do is to save the model once and load it later for inference or prediction. Keras provides model.save() for saving deep learning models, but it is limited to the realm of deep learning, and many beginners in ML find it confusing to save other kinds of models. Because estimators can have a huge number of parameters, saving them is advisable.
In this article, we will explore quick hacks to save machine learning models using Pickle and Joblib—two efficient methods to easily save and load models for future use.
This article was published as a part of the Data Science Blogathon
We are going to use a house price prediction dataset with a single feature, area (for demonstration purposes). Our job is to predict the price given the area. To keep things simple, we will have only 4-5 data points, and the model will be a Linear Regression model, which fits a straight line to the dataset by minimizing the squared difference between predicted and actual values, summed over all data points.*
* Squaring in the cost function ensures that negative and positive errors do not cancel each other out.
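To make the cost function concrete, here is a minimal sketch that computes the mean squared error for two candidate lines. The data values and the helper name mean_squared_error_cost are made up for illustration; they are not the article's dataset.

```python
import numpy as np

# Hypothetical toy data (assumed values, not the article's dataset)
area = np.array([1000.0, 2000.0, 3000.0, 4000.0])
price = np.array([300000.0, 450000.0, 600000.0, 750000.0])


def mean_squared_error_cost(m, b, x, y):
    """Average of the squared differences between predicted and actual values."""
    predicted = m * x + b
    errors = predicted - y
    return np.mean(errors ** 2)


# A line that passes through every point has zero cost
perfect = mean_squared_error_cost(150.0, 150000.0, area, price)

# A line with the wrong slope under-predicts every point; squaring keeps
# those negative errors from cancelling against positive ones
worse = mean_squared_error_cost(100.0, 150000.0, area, price)

print(perfect, worse)
```

Fitting a Linear Regression model amounts to searching for the slope and intercept that make this cost as small as possible.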
We will now quickly create our model in 5 steps, so that we can save it for later use.
1. We will start by loading all the required dependencies.
# loading dependencies
import pandas as pd
import numpy as np
from sklearn import linear_model
2. Now we will load our data into a pandas dataframe (train_df) using the pd.read_csv() function and use the train_df.head() method to print the first 5 rows.
# loading our data
train_df = pd.read_csv('train.csv')
# viewing the first few rows
print(train_df.head())
3. To create our model, we first create a model object, which is a LinearRegression estimator, and then fit it with our training samples and training labels. The model's job is to find the best-fitting straight line.
# creating the model object
model = linear_model.LinearRegression() # y = mx+b
# fitting model with X_train - area, y_train - price
model.fit(train_df[['area']], train_df.price)
After executing the above code, the output will look something like this (the exact text depends on your scikit-learn version):
>> LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
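Since train.csv itself is not shown, the steps so far can be reproduced with an in-memory dataframe. This is a sketch under an assumption: the five rows below are stand-in values, chosen so that the fitted line comes out close to the coefficient and intercept reported in this article.

```python
import pandas as pd
from sklearn import linear_model

# Stand-in for train.csv (assumed values; the real file is not shown)
train_df = pd.DataFrame({
    "area": [2600, 3000, 3200, 3600, 4000],
    "price": [550000, 565000, 610000, 680000, 725000],
})

# Create the estimator and fit it: X must be 2D, so we select [["area"]]
model = linear_model.LinearRegression()
model.fit(train_df[["area"]], train_df["price"])

slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)
```

Selecting the feature with double brackets keeps it as a one-column dataframe, which is the 2D shape the estimator expects.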
4. A straight line has a coefficient (slope) and an intercept in its equation, and sklearn exposes both as handy attributes on the fitted model. They can be checked as follows:
# checking coefficient - m
model.coef_
>> array([135.78767123])
# checking intercept - b
model.intercept_
>> 180616.43835616432
5. Finally, for completeness' sake, we can test the model by predicting the price of a house with an area of 5000 sqft.
# predict model values - area = 5000
model.predict([[5000]])
>> array([859554.79452055])
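As a quick sanity check, the prediction is just the line equation y = mx + b evaluated at x = 5000. The sketch below plugs in the coefficient and intercept printed in step 4; the variable names are only illustrative.

```python
# Values reported by model.coef_ and model.intercept_ in step 4
m = 135.78767123
b = 180616.43835616432

area = 5000
price = m * area + b  # y = mx + b

print(round(price, 2))  # 859554.79
```

The result agrees with model.predict([[5000]]) up to the rounding of the printed coefficient.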
It's now time to save the model we created. We are going to look at 2 quick hacks for saving a model. As a bonus, I will also provide guidelines on when to use which method.
Many of you will be familiar with the pickle module. If not, it's good to know that pickle saves an object using serialization, which simply means converting the object, including its attributes such as the coefficient and intercept we saw above, into a byte stream that can be written to a file. De-serialization is the reverse: rebuilding the object from those bytes.
To save a model using pickle, one needs to open a file in binary write mode, bind it to a name, and dump the model into it. This can be achieved using the code below:
# loading library
import pickle
# open model_pkl in binary write mode ('wb') and dump the model into it
with open('model_pkl', 'wb') as files:
    pickle.dump(model, files)
After the above steps, one can see a file with the name model_pkl in the directory:
[Figure: the directory as shown in Google Colab]
One can load this file back into a model using the same logic. Here we use the lr variable to reference the loaded model and then use it to predict the price for 5000 sqft:
# load saved model
with open('model_pkl', 'rb') as f:
    lr = pickle.load(f)
# check prediction
lr.predict([[5000]]) # same result as before
>> array([859554.79452055])
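The save and load steps can be combined into one self-contained round trip. This is a minimal sketch, assuming a tiny made-up dataset and a temporary directory (so nothing is left on disk); it checks that the restored model matches the original.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn import linear_model

# Tiny assumed dataset, only for demonstration
X = np.array([[1000], [2000], [3000], [4000]])
y = np.array([300000, 450000, 600000, 750000])
model = linear_model.LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_pkl")

    # Serialize the fitted model to disk
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Deserialize it back into a new object
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model carries the same learned parameters
same_coef = np.allclose(model.coef_, restored.coef_)
same_prediction = np.allclose(
    model.predict([[5000]]), restored.predict([[5000]])
)
print(same_coef, same_prediction)
```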
Benefits:
Joblib is an alternative for saving models that is designed to work efficiently with objects that carry large NumPy arrays, such as estimators with many parameters. It is a standalone package that scikit-learn itself depends on. Older scikit-learn releases also exposed a bundled copy as sklearn.externals.joblib, but that was deprecated in version 0.21 and removed in 0.23, so we will import the standalone package directly.
-> First, we will import joblib
# loading dependency
import joblib
To save the model, we will use its dump function to write the model to the model_jlib file.
# saving our model
# model - model, filename - model_jlib
joblib.dump(model, 'model_jlib')
After running the above code, a file named model_jlib will be created, and its contents will be similar to the pickle file.
Note: We didn't open a file ourselves here because joblib.dump works on file names and writes the data to disk directly. However, it also accepts file-like objects.
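The note above can be illustrated with a small sketch: joblib.dump and joblib.load work with either a path or an open binary file object. The dataset, the temporary directory, and the second file name are assumptions made for this example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import linear_model

# Tiny assumed dataset, only for demonstration
X = np.array([[1000], [2000], [3000], [4000]])
y = np.array([300000, 450000, 600000, 750000])
model = linear_model.LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Option 1: pass a path and let joblib handle the file
    path = os.path.join(tmp_dir, "model_jlib")
    joblib.dump(model, path)
    from_path = joblib.load(path)

    # Option 2: pass an already opened binary file object
    obj_path = os.path.join(tmp_dir, "model_jlib_fileobj")
    with open(obj_path, "wb") as f:
        joblib.dump(model, f)
    with open(obj_path, "rb") as f:
        from_file_obj = joblib.load(f)

# Both routes restore the same learned parameters
paths_agree = np.allclose(from_path.coef_, from_file_obj.coef_)
print(paths_agree)
```

Passing a path is usually the simpler choice; the file-object form is useful when the destination is already open, for example inside other I/O code.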
To load the model, we pass the file path or a file object to the load function and store the result in the m_jlib variable, which we can later use for prediction.
# opening the file- model_jlib
m_jlib = joblib.load('model_jlib')
Finally, to predict, we call the predict method on m_jlib and pass it a 2D array containing the value 5000.
# check prediction
m_jlib.predict([[5000]]) # similar
>> array([859554.79452055])
Note: the predict method expects data in a 2D format (samples × features), so we used [[5000]], which represents 5000 as a 2D array with one row and one column.
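The 2D requirement is easiest to see with NumPy shapes. The following sketch uses an assumed toy dataset and shows how to reshape a single value or a list of values into the (samples, features) layout, and that a plain 1D input is rejected.

```python
import numpy as np
from sklearn import linear_model

# Tiny assumed dataset: price = 150 * area + 150000
X = np.array([[1000], [2000], [3000], [4000]])
y = np.array([300000, 450000, 600000, 750000])
model = linear_model.LinearRegression().fit(X, y)

# One sample with one feature: reshape (1,) into (1, 1)
single = np.array([5000])
as_2d = single.reshape(1, -1)
pred = model.predict(as_2d)

# Several samples with one feature each: reshape (2,) into (2, 1)
batch = np.array([3000, 5000]).reshape(-1, 1)
preds = model.predict(batch)

# A 1D input is rejected with a ValueError
try:
    model.predict([5000])
    rejected_1d = False
except ValueError:
    rejected_1d = True

print(as_2d.shape, batch.shape, rejected_1d)
```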
Benefits:
Because training large models takes a long time, saving them has become a crucial part of data science. In this article, I introduced a few quick hacks to save a machine learning model using Pickle and Joblib. Both work on the same concept of serialization (converting an object into a storable byte stream) and deserialization (restoring the object from that stream). Deserializing can execute arbitrary code, so only load pickle or joblib files that come from a trusted source.
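The advice about trusted sources matters because loading a pickle can run code chosen by whoever created the file. The sketch below demonstrates this with a deliberately harmless, made-up class (Demo is hypothetical): its __reduce__ method tells pickle to call print when the data is loaded.

```python
import pickle


class Demo:
    """Hypothetical class showing that unpickling can trigger a function call."""

    def __reduce__(self):
        # pickle stores "call print(...) with this argument" instead of the object
        return (print, ("this ran during pickle.loads",))


payload = pickle.dumps(Demo())

# Loading does not give back a Demo instance: it calls print(...) and
# returns whatever that call returned (None)
result = pickle.loads(payload)
```

Here the call is only print, but a malicious file could name any importable function. The same caution applies to joblib files, since joblib builds on pickle.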
For simplicity, we have used a Linear Regression model, but you can apply the same approach to save different types of models like Logistic Regression, Decision Trees, SVMs, and many more.
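As a small illustration of that point, the sketch below round-trips two other scikit-learn estimators through pickle in memory. The toy data and the dictionary keys are assumptions made for the example.

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Tiny assumed classification dataset
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

models = {
    "logistic_regression": LogisticRegression().fit(X, y),
    "decision_tree": DecisionTreeClassifier(random_state=0).fit(X, y),
}

round_trip_ok = {}
for name, fitted in models.items():
    # Serialize to bytes and restore, without touching the disk
    restored = pickle.loads(pickle.dumps(fitted))
    round_trip_ok[name] = np.array_equal(
        fitted.predict(X), restored.predict(X)
    )

print(round_trip_ok)
```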
Hope you have enjoyed reading the article and learned something in the process. Those who want to dive deeper can refer to the reference section and work along.
A. Scikit-learn (sklearn) is a popular machine learning library for Python. To save a trained sklearn model, you can use the "joblib" module, a separate package that scikit-learn depends on.
The “joblib” module provides a simple way to save and load Python objects, including trained sklearn models. Saving the model enables you to reuse the model for making predictions on new data, without having to retrain the model from scratch.
To save a trained sklearn model using joblib, you can use the “dump” function, which takes two arguments: the trained model object and the filename for saving the model.
A. To save a trained model in Python, you can use the “pickle” or “joblib” module. Both modules provide functions for serializing and deserializing Python objects, including trained machine learning models.
A. In Python, the “pickle” module provides a way to serialize and deserialize Python objects, including trained machine learning models. By saving a trained model using the pickle module, you can reuse the model for making predictions on new data, without having to retrain the model from scratch.
A. In machine learning, pickle is a Python module used to serialize and save trained models as byte streams. This allows models to be stored and later loaded for predictions without retraining, using functions like pickle.dump() and pickle.load().