Build your first Machine Learning pipeline using scikit-learn!

Lakshay arora Last Updated : 31 Dec, 2024

11 min read

For building any machine learning model, it is important to have a sufficient amount of data to train the model. The data is often collected from various resources and might be available in different formats. Due to this reason, data cleaning and preprocessing become a crucial step in the machine learning project.

Whenever new data points are added to the existing data, we need to perform the same preprocessing steps again before we can use the machine learning model to make predictions. This becomes a tedious and time-consuming process!

Overview

Understand the structure of a Machine Learning Pipeline
Build an end-to-end ML pipeline on a real-world data
Train a Random Forest Regressor for sales prediction

An alternate to this is creating a machine learning pipeline that remembers the complete set of preprocessing steps in the exact same order. So that whenever any new data point is introduced, the machine learning pipeline performs the steps as defined and uses the machine learning model to predict the target variable.

This is exactly what we are going to cover in this article – design a machine learning pipeline and automate the iterative processing steps.

Overview
Understanding Problem Statement
Building a prototype model
Pipeline Design
Building Pipeline
Predict the target
Live Coding Window
Conclusion

Understanding Problem Statement

In order to make the article intuitive, we will learn all the concepts while simultaneously working on a real world data – BigMart Sales Prediction.

As a part of this problem, we are provided with the information about the stores (location, size, etc), products (weight, category, price, etc) and historical sales data. Using this information, we have to forecast the sales of the products in the stores.

You can read the detailed problem statement and download the dataset from here. Below is the complete set of features in this data.The target variable here is the Item_Outlet_Sales.

I encourage you to go through the problem statement and data description once before moving to the next section so that you have a fair understanding of the features present in the data.

Building a prototype model

To build a machine learning pipeline, the first requirement is to define the structure of the pipeline. In other words, we must list down the exact steps which would go into our machine learning pipeline.

In order to do so, we will build a prototype machine learning model on the existing data before we create a pipeline. The main idea behind building a prototype is to understand the data and necessary preprocessing steps required before the model building process. Based on our learning from the prototype model, we will design a machine learning pipeline that covers all the essential preprocessing steps.

The focus of this section will be on building a prototype that will help us in defining the actual machine learning pipeline for our sales prediction project. Let’s get started!

Data Exploration and Preprocessing

What is the first thing you do when you are provided with a dataset? You would explore the data, go through the individual variables, and clean the data to make it ready for the model building process.

That is exactly what we will be doing here. We will explore the variables and find out the mandatory preprocessing steps required for the given data. Let us start by checking if there are any missing values in the data. We will use the isnull().sum() function here

There are only two variables with missing values – Item_Weight and Outlet_Size.

Since Item_Weight is a continuous variable, we can use either mean or median to impute the missing values. On the other hand, Outlet_Size is a categorical variable and hence we will replace the missing values by the mode of the column. You can try different methods to impute missing values as well.

Have you built any machine learning models before? If yes, then you would know that most machine learning models cannot handle missing values on their own. Thus imputing missing values becomes a necessary preprocessing step.

Additionally, machine learning models cannot work with categorical (string) data as well, specifically scikit-learn. Before building a machine learning model, we need to convert the categorical variables into numeric types. Let us do that.

Encode the categorical variables

To check the categorical variables in the data, you can use the train_data.dtypes() function. This will give you a list of the data types against each variable. For the BigMart sales data, we have the following categorical variable –

Item_Fat_Content
Item_Type, Outlet_Identifier
Outlet_Size, Outlet_Location_Type, and
Outlet_Type

There are a number of ways in which we can convert these categories into numerical values. You can read about the same in this article – Simple Methods to deal with Categorical Variables. We are going to use the categorical_encoders library in order to convert the variables into binary columns.

Note: that in this example I am not going to encode Item_Identifier since it will increase the number of feature to 1500. This feature can be used in other ways (read here), but to keep the model simple, I will not use this feature here.

Scale the data:

So far we have taken care of the missing values and the categorical (string) variables in the data. Next we will work with the continuous variables. Often the continuous variables in the data have different scales, for instance, a variable V1 can have a range from 0 to 1 while another variable can have a range from 0-1000.

Based on the type of model you are building, you will have to normalize the data in such a way that the range of all the variables is almost similar. You can do this easily in python using the StandardScaler function.

Model Building

Now that we are done with the basic pre-processing steps, we can go ahead and build simple machine learning models over this data. We will try two models here – Linear Regression and Random Forest Regressor to predict the sales.

To compare the performance of the models, we will create a validation set (or test set). Here I have randomly split the data into two parts using the train_test_split() function, such that the validation set holds 25% of the data points while the train set has 75%

Great, we have our train and validation sets ready. Let us train a linear regression model on this data and check it’s performance on the validation set. To check the model performance, we are using RMSE as an evaluation metric.

Note: If you are not familiar with Linear regression, you can go through the article below-

7 Regression Techniques you should know!

The linear regression model has a very high RMSE value on both training and validation data. Let us see if a tree-based model performs better in this case. Here we will train a random forest and check if we get any improvement in the train and validation errors.

Note: To learn about the working of Random forest algorithm, you can go through the article below-

A complete tutorial to tree-based models from scratch!

As you can see, there is a significant improvement on is the RMSE values. You can train more complex machine learning models like Gradient Boosting and XGBoost, and see of the RMSE value further improves.

A very interesting feature of the random forest algorithm is that it gives you the ‘feature importance’ for all the variables in the data. Let us see how can we use this attribute to make our model simpler and better!

Feature Importance

After the preprocessing and encoding steps, we had a total of 45 features and not all of these may be useful in forecasting the sales. Alternatively we can select the top 5 or top 7 features, which had a major contribution in forecasting sales values.

If the model performance is similar in both the cases, that is – by using 45 features and by using 5-7 features, then we should use only the top 7 features, in order to keep the model more simple and efficient.

The idea is to have a less complex model without compromising on the overall model performance.

Following is the code snippet to plot the n most important features of a random forest model.

Now, we are going to train the same random forest model using these 7 features only and observe the change in RMSE values for the train and the validation set.

Now, this is amazing! Using only 7 features has given almost the same performance as the previous model where we were using 45 features. Let us identify the final set of features that we need and the preprocessing steps for each of them.

Identifying features to build the ML pipeline

As discussed initially, the most important part of designing a machine leaning pipeline is defining its structure, and we are almost there!

We are now familiar with the data, we have performed required preprocessing steps, and built a machine learning model on the data. At this stage we must list down the final set of features and necessary preprocessing steps (for each of them) to be used in the machine learning pipeline.

Selected Features and Preprocessing Steps

Item_MRP: It holds the price of the products. During the preprocessing step we used a standard scaler to scale this values.
Outlet_Type_Grocery_Store: A binary column which indicates if the outlet type is a grocery store or not. To use this information in the model building process, we will add a binary feature in the existing data that contains 1 (if outlet type is a grocery store) and 0 ( if outlet type is something else).
Item_Visibility: Denotes visibility of products in the store. Since this variable had a small value range and no missing values, we didn’t apply any preprocessing steps on this variable.
Outlet_Type_Supermarket_Type3: Another binary column indicating if the outlet type is a “supermarket_type_3” or not. To capture this information we will create binary feature that stores 1 (if outlet type is supermarket_type_3) and 0 (othewise).
Outlet_Identifier_OUT027: This feature specifies whether the outlet identifier is “OUT027” or not. Similar to the last previous example, we will create a separate column that carries 1 (if outlet type is grocery store) and 0 (otherwise).
Outlet_Establishment_Year: The Outlet_Establishment_Year describes year of establishment of the stores. Since we did not perform any transformation on values in this column, we will not preprocess it in the pipeline as well.
Item_Weight: During the preprocessing steps we observed that Item_Weight had missing values. These missing values were imputed using the average of the column. This has to be taken into account while building the machine learning pipeline.

Apart from these 7 columns, we will drop the rest of the columns since we will not use them to train the model. Let us go ahead and design our ML pipeline!

Pipeline Design

In the last section we built a prototype to understand the preprocessing requirement for our data. It is now time to form a pipeline design based on our learning from the last section. We will define our pipeline in three stages:

Create the required binary features
Perform required data preprocessing and transformations
Build a model to predict the sales

1. Create the required binary features

We will create a custom transformer that will add 3 new binary columns to the existing data.

Outlet_Type : Grocery Store
Outlet_Type : Supermarket Type3
Outlet_Identifier_OUT027

2. Data Preprocessing and transformations.

We will use a ColumnTransformer to do the required transformations. It will contain 3 steps.

Drop the columns that are not required for model training
Impute missing values in the column Item_Weight using the average
Scale the column Item_MRP using StandardScaler()

3. Use the model to predict the target on the cleaned data

This will be the final step in the pipeline. In the last two steps we preprocessed the data and made it ready for the model building process. Finally, we will use this data and build a machine learning model to predict the Item Outlet Sales.

Let’s code each step of the pipeline on the BigMart Sales data.

Building Pipeline

First of all, we will read the data set and separate the independent and target variable from the training dataset. You can download the dataset from here.

Now, as a first step, we need to create 3 new binary columns using a custom transformer. Here are the steps we need to follow to create a custom transformer.

Define a class OutletTypeEncoder
Add the parameter BaseEstimator while defining the class
The class must contain fit and transform methods

In the transform method, we will define all the 3 columns that we want after the first stage in our ML pipeline.

Next we will define the pre-processing steps required before the model building process.

Drop the columns – Item_Identifier, Outlet_Identifier, Item_Fat_Content, Item_Type, Outlet_Identifier, Outlet_Size, Outlet_Location_Type and Outlet_Establishment_Year
Impute missing values in column Item_Weight with mean
Scale the column Item_MRP using StandardScaler().

This will be the second step in our machine learning pipeline. After this step, the data will be ready to be used by the model to make predictions.

Predict the target

This will be the final block of the machine learning pipeline – define the steps in order for the pipeline object! As you can see in the code below we have specified three steps – create binary columns, preprocess the data, train a model.

When we use the fit() function with a pipeline object, all three steps are executed. Post the model training process, we use the predict() function that uses the trained model to generate the predictions.

Now, we will read the test data set and we call predict function only on the pipeline object to make predictions on the test data.

Live Coding Window

You can try the above code in the following coding window. Try different transformations on the dataset and also evaluate how good your model is.

# importing required libraries
import pandas as pd
from sklearn.compose import ColumnTransformer 
from sklearn.impute import SimpleImputer
import category_encoders as ce
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

# read the training data set
data = pd.read_csv('train_kOBLwZA.csv')

# top rows of the data
data.head()

# seperate the independent and target variables
train_x = data.drop(columns=['Item_Outlet_Sales'])
train_y = data['Item_Outlet_Sales']


# import the BaseEstimator
from sklearn.base import BaseEstimator

# define the class OutletTypeEncoder
# This will be our custom transformer that will create 3 new binary columns
# custom transformer must have methods fit and transform
class OutletTypeEncoder(BaseEstimator):

    def __init__(self):
        pass

    def fit(self, documents, y=None):
        return self

    def transform(self, x_dataset):
        x_dataset['outlet_grocery_store'] = (x_dataset['Outlet_Type'] == 'Grocery Store')*1
        x_dataset['outlet_supermarket_3'] = (x_dataset['Outlet_Type'] == 'Supermarket Type3')*1
        x_dataset['outlet_identifier_OUT027'] = (x_dataset['Outlet_Identifier'] == 'OUT027')*1
        
        return x_dataset

# pre-processsing step
# Drop the columns - 
# Impute the missing values in column Item_Weight by mean
# Scale the data in the column Item_MRP
pre_process = ColumnTransformer(remainder='passthrough',
                                transformers=[('drop_columns', 'drop', ['Item_Identifier',
                                                                        'Outlet_Identifier',
                                                                        'Item_Fat_Content',
                                                                        'Item_Type',
                                                                        'Outlet_Identifier',
                                                                        'Outlet_Size',
                                                                        'Outlet_Location_Type',
                                                                        'Outlet_Type'
                                                                       ]),
                                              ('impute_item_weight', SimpleImputer(strategy='mean'), ['Item_Weight']),
                                              ('scale_data', StandardScaler(),['Item_MRP'])])

# Define the Pipeline
"""
Step1: get the oultet binary columns
Step2: pre processing
Step3: Train a Random Forest Model
"""
print('\n\nBuilding Pipeline\n\n')
model_pipeline = Pipeline(steps=[('get_outlet_binary_columns', OutletTypeEncoder()), 
                                 ('pre_processing',pre_process),
                                 ('random_forest', RandomForestRegressor(max_depth=10,random_state=2))
                                 ])
# fit the pipeline with the training data
print('Fitting the pipeline with the training data')
model_pipeline.fit(train_x,train_y)

# predict target values on the training data
print('\n\nPredict target on the train data\n\n')
print(model_pipeline.predict(train_x))

print('Reading the test data: ')
# read the test data
test_data = pd.read_csv('test_t02dQwI.csv')

# predict target variables on the test data 
print('\n\nPredict on the test data\n\n')
print(model_pipeline.predict(test_data))

Conclusion

Having a well-defined structure before performing any task often helps in efficient execution of the same. And this is true even in case of building a machine learning model. Once you have built a model on a dataset, you can easily break down the steps and define a structured Machine learning pipeline.

In this article, I covered the process of building an end-to-end Machine Learning pipeline and implemented the same on the BigMart sales dataset. If you have any more ideas or feedback on the same, feel free to reach out to me in the comment section below.

Lakshay arora

Ideas have always excited me. The fact that we could dream of something and bring it to reality fascinates me. Computer Science provides me a window to do exactly that. I love programming and use it to solve problems and a beginner in the field of Data Science.

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Ankur Bhargava

Wonderful Article. It would be great if you could elucidate on the Base Estimator part of the code. Unable to fathom the meaning of fit & _init_.

Chung Lee

Great article but I have an error with the same code as you wrote - ModuleNotFoundError: No module named 'category_encoders' Do you happen to know why?

Show 1 reply

LAKSHAY ARORA

Install the library: !pip3 install category_encoders

Bhavya

Very much useful article, i learnt how to use pipelines but I didn't get why we need OutletTypeEncoder class and why it needs fit and transform methods, what is the need of using BaseEstimator there?

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

Build your first Machine Learning pipeline using scikit-learn!

Overview

Table of contents

Understanding Problem Statement

Building a prototype model

Data Exploration and Preprocessing

Encode the categorical variables

Scale the data:

Model Building

Feature Importance

Identifying features to build the ML pipeline

Selected Features and Preprocessing Steps

Pipeline Design

1. Create the required binary features

2. Data Preprocessing and transformations.

3. Use the model to predict the target on the cleaned data

Building Pipeline

Predict the target

Live Coding Window

Conclusion

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID