Build a Predictive Model in 10 Minutes (using Python)

Sunil Ray Last Updated : 14 Nov, 2024

I came across this strategic virtue from Sun Tzu recently:

What does this have to do with a data science blog? This is the essence of how you win competitions and hackathons. You come into the competition better prepared than the competitors, execute quickly, learn, and iterate to bring out your best.

Last week, we published “Perfect Way to Build a Predictive Model in Less than 10 minutes using R”. A quick follow-up was easy to guess. Given the rise of Python in the last few years and its simplicity, it makes sense to have this toolkit ready for the Pythonistas in the data science world. I will follow a similar structure to the previous article, with my additional inputs at different stages of model building. These two articles will help you build your first predictive model faster and with better power. Most top data scientists and Kagglers build and submit their first effective model quickly. This helps them get a head start on the leaderboard and provides a benchmark solution to beat.

Overview:

Learn about the essentials of predictive modeling in Python, from data preparation to model performance evaluation, using efficient and easy-to-follow steps.
Discover effective techniques for quickly building your first predictive model and setting a benchmark for iterative improvement.
Explore the advantages of data exploration and smart data treatment to streamline the modeling process and enhance accuracy.
Understand which algorithms and methods work best for different prediction tasks with practical Python code examples.
Master time-saving strategies and automation techniques that top data scientists use to succeed in competitions and real-world applications.

Breaking Down the process of Predictive Modeling
Let’s Start Putting This Into Action
Conclusion
Frequently Asked Questions

Breaking Down the process of Predictive Modeling

I always invest quality time in the initial phase of model building: hypothesis generation, brainstorming sessions, discussions, and understanding the domain. These activities help me relate to the problem and eventually lead me to design more robust business solutions. There are good reasons why you should spend this time up front:

You have enough time to invest, and you are fresh (this has an impact)
You are not biased by other data points or thoughts (I always suggest doing hypothesis generation before diving deep into the data)
At a later stage, you will be in a hurry to complete the project and will not be able to spend quality time

This stage needs quality time, so I am not giving it a timeline here; I recommend you make it a standard practice. It will help you build better predictive models and result in fewer iterations of work at later stages. Let’s look at the remaining stages in the first model build with timelines:

Descriptive analysis on the Data – 50% time
Data treatment (Missing value and outlier fixing) – 40% time
Data Modelling – 4% time
Estimation of performance – 6% time

P.S. This is the split of time spent only for the first model build

Let’s go through the process step by step (with estimates of time spent in each step):

Stage 1: Descriptive Analysis / Data Exploration:

In my initial days as a data scientist, data exploration used to take a lot of time. Over time, I have automated many operations on the data. The benefits of automation are obvious, given that data prep takes up 50% of the work in building a first model. You can look at “7 Steps of Data Exploration” for the most common data exploration operations.

Tavish mentioned in his article that the time taken to perform this task has been significantly reduced with advanced machine-learning tools coming into the race. Since this is our first benchmark model, we do away with feature engineering. Hence, the descriptive analysis you need is restricted to missing values and big features that are directly visible. In my methodology, you will need 2 minutes to complete this step (assuming 100,000 observations in the data set).

The operations I perform for my first model include:

Identify ID, Input and Target features
Identify categorical and numerical features
Identify columns with missing values
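
The three operations above can be sketched with pandas as follows. This is a minimal illustration on a hypothetical toy data frame; the column names (id, age, region, target) are assumptions and not part of the challenge data used later in this article.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for a real challenge file
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [34.0, np.nan, 51.0, 28.0],
    "region": ["north", "south", None, "north"],
    "target": [0, 1, 0, 1],
})

id_col = ["id"]
target_col = ["target"]

# Input features are everything that is neither an ID nor the target
inputs = [c for c in df.columns if c not in id_col + target_col]

# Split the inputs by dtype: non-numeric columns are treated as categorical
num_cols = df[inputs].select_dtypes(include="number").columns.tolist()
cat_cols = df[inputs].select_dtypes(exclude="number").columns.tolist()

# Count missing values per input and keep the columns that have any
missing_counts = df[inputs].isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(num_cols, cat_cols, cols_with_missing)
```

Splitting by dtype is only a first pass: numerically coded categories (postal codes, for example) still need to be moved to the categorical list by hand.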

Stage 2: Data Treatment (Missing values treatment):

There are various ways to deal with missing values. For our first model, we will focus on smart and quick techniques (Tavish already discusses these in his article; I am adding a few methods).

Create dummy flags for missing values: This works because missing values themselves sometimes carry a lot of information.
Impute missing values with the mean, median, or another simple method: Mean and median imputation both perform well. Most people prefer to impute with the mean, but for a skewed distribution I would suggest the median. More intelligent methods include imputing with the mean or median of similar cases (grouped by other relevant features) or building a model to predict the missing value. For example, in the Titanic survival challenge, you can impute missing values of Age using the salutations in passengers’ names, such as “Mr.”, “Miss.”, “Mrs.”, and “Master”, and this has shown a good impact on model performance.
Impute missing values of categorical variables: Create a new level so that all missing values are coded as a single value, say “New_Cat”. Alternatively, you can look at the frequency mix and impute the missing value with the most frequent level.
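
The Titanic example can be sketched as below: a missing-value flag is created first, and each missing Age is then filled with the median Age of passengers who share the same salutation. The six rows are hypothetical data written in the style of the Titanic file, and the regular expression is one assumed way of pulling the salutation out of the name.

```python
import numpy as np
import pandas as pd

# Hypothetical rows in the style of the Titanic data set
df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen", "Heikkinen, Miss. Laina", "Allen, Mr. William",
             "Palsson, Master. Gosta", "Moran, Mr. James", "Rice, Master. Eugene"],
    "Age": [22.0, 26.0, 35.0, 2.0, np.nan, np.nan],
})

# Flag missing ages before imputing, so the model can still use that signal
df["Age_NA"] = df["Age"].isnull().astype(int)

# Extract the salutation: the text between the comma and the first period
df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False).str.strip()

# Fill each missing age with the median age of passengers sharing the title
df["Age"] = df["Age"].fillna(df.groupby("Title")["Age"].transform("median"))

print(df[["Title", "Age", "Age_NA"]])
```

On real data, rare salutations may have no observed Age at all, so a fallback to the overall median is usually still needed after the group-wise fill.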

With such simple data treatment methods, the time to treat data can be reduced to 3-4 minutes.

Stage 3. Data Modelling :

I recommend using either GBM or Random Forest, depending on the business problem. These two techniques are extremely effective in creating a benchmark solution. I have seen data scientists often use one of them as their first model; in some cases, it also ends up as the final model. This will take the maximum time (~4-5 minutes).
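
As a rough sketch of how the two techniques can be compared for a benchmark, the snippet below fits both on synthetic data and scores them by validation AUC. The data comes from scikit-learn’s make_classification, and the sample size and hyperparameters are arbitrary assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration only)
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gbm": GradientBoostingClassifier(random_state=0),
}

# Fit each model and record its AUC on the held-out validation rows
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

print(scores)
```

Which of the two comes out ahead depends on the data, so for a first benchmark it is reasonable to try both and keep the better one.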

Stage 4. Estimation of Performance:

There are various methods to validate your model’s performance. I suggest you divide your train data set into Train and Validate (ideally 70:30) and build the model on the 70% Train portion. Then validate it on the remaining 30% and evaluate the performance using an evaluation metric. This finally takes 1-2 minutes to execute and document.
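
One way to get a 70:30 split is scikit-learn’s train_test_split, sketched below on hypothetical data. Stratifying on the target is an addition of mine rather than something the steps in this article do; it keeps the class mix the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labelled data: 1,000 rows with a 20% positive rate
rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.normal(size=1000), "target": [1] * 200 + [0] * 800})

# Hold out 30% of the rows for validation, preserving the class proportions
train_part, validate_part = train_test_split(
    data, test_size=0.30, random_state=42, stratify=data["target"]
)

print(len(train_part), len(validate_part))
print(train_part["target"].mean(), validate_part["target"].mean())
```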

The intent of this article is not to win the competition but to establish a benchmark for ourselves. Let’s look at the Python code to perform the above steps and build your first model with a higher impact.

Let’s Start Putting This Into Action

I assume you have done all the hypothesis generation first and are good with basic data science using Python. I am illustrating this with an example of a data science challenge. Let’s look at the structure:

Step 1

Import the required libraries, read the train and test data sets, and append them.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
import random
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve, auc #Used in Step 9 to score the model

train=pd.read_csv('C:/Users/Analytics Vidhya/Desktop/challenge/Train.csv')
test=pd.read_csv('C:/Users/Analytics Vidhya/Desktop/challenge/Test.csv')
train['Type']='Train' #Create a flag for Train and Test Data set
test['Type']='Test'
fullData = pd.concat([train,test],axis=0) #Combined both Train and Test Data set

Step 2

Step 2 of the framework (as laid out in the R article) is not required in Python.

Step 3

Next, view the data set’s column names and summary statistics.

fullData.columns # This will show all the column names
fullData.head(10) # Show first 10 records of dataframe
fullData.describe() #You can look at summary of numerical fields by using describe() function

Step 4

Identify the a) ID variables, b) Target variables, c) Categorical Variables, d) Numerical Variables, e) Other Variables

ID_col = ['REF_NO']
target_col = ["Account.Status"]
cat_cols = ['children','age_band','status','occupation','occupation_partner','home_status','family_income','self_employed', 'self_employed_partner','year_last_moved','TVarea','post_code','post_area','gender','region']
other_col = ['Type'] #Test and Train Data set identifier
#Numerical columns are whatever remains after removing the categorical, ID, target and identifier columns
num_cols = list(set(fullData.columns)-set(cat_cols)-set(ID_col)-set(target_col)-set(other_col))

Step 5

Identify the variables with missing values and create a flag for those

fullData.isnull().any() #Returns True for each column that has missing values, else False

num_cat_cols = num_cols+cat_cols # Combined numerical and categorical variables

#For each variable with missing values, create a new variable named VariableName_NA
#that flags missing values with 1 and all others with 0

for var in num_cat_cols:
    if fullData[var].isnull().any():
        fullData[var+'_NA'] = fullData[var].isnull().astype(int)

Step 6

Next, impute the missing values.

#Impute numerical missing values with the column mean
#(no inplace=True here: with inplace=True, fillna returns None and the assignment would wipe the columns)
fullData[num_cols] = fullData[num_cols].fillna(fullData[num_cols].mean())

#Impute categorical missing values with -9999
fullData[cat_cols] = fullData[cat_cols].fillna(value = -9999)
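
Stage 2 suggested the median for skewed distributions and a new level for categorical variables. The toy example below sketches why that matters; the income and occupation columns and the “Missing” label are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [20.0, 22.0, 25.0, np.nan, 400.0],  # one extreme value skews the mean
    "occupation": ["clerk", None, "nurse", "clerk", None],
})

mean_fill = df["income"].mean()      # pulled upward by the extreme value
median_fill = df["income"].median()  # stays close to the typical values

# Use the median for the skewed numerical column
df["income"] = df["income"].fillna(median_fill)

# Code every missing category as one new level
df["occupation"] = df["occupation"].fillna("Missing")

print(mean_fill, median_fill)
print(df)
```

Here the mean (116.75) is far from any typical income in the column, while the median (23.5) is not, which is why the median is the safer default when a few extreme values are present.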

Step 7

Create label encoders for the categorical variables, split the data set back into train and test, and then split the train data set further into Train and Validate.

#Create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))

#The target variable is also categorical, so encode it with its own encoder.
#If the test file has no target column, its rows hold NaN here and become an extra 'nan' class.
target_encoder = LabelEncoder()
fullData["Account.Status"] = target_encoder.fit_transform(fullData["Account.Status"].astype('str'))

#Copy the slices so that new columns can be added without pandas warnings
train = fullData[fullData['Type']=='Train'].copy()
test = fullData[fullData['Type']=='Test'].copy()

#Randomly hold out about 25% of the train rows for validation
train['is_train'] = np.random.uniform(0, 1, len(train)) <= .75
Train, Validate = train[train['is_train']], train[~train['is_train']]
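
To make the encoding step concrete, here is a small sketch of what LabelEncoder does to a column of labels. The label values are hypothetical. Encoding is a format conversion (each distinct label becomes an integer code) rather than something that improves prediction by itself.

```python
from sklearn.preprocessing import LabelEncoder

status = ["Gold", "Silver", "Gold", "Bronze"]  # hypothetical labels

encoder = LabelEncoder()
codes = encoder.fit_transform(status)

print(list(encoder.classes_))                  # distinct labels, in sorted order
print(list(codes))                             # integer code for each row
print(list(encoder.inverse_transform(codes)))  # back to the original labels
```

Keeping the fitted encoder around, as with inverse_transform here, lets you translate predicted codes back into the original label names.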

Step 8

Pass the imputed and dummy (missing values flags) variables into the modelling process. I am using the random forest to predict the class.

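#Keep the column order of fullData so the feature matrix is built the same way on every run
excluded = set(ID_col) | set(target_col) | set(other_col)
features = [col for col in fullData.columns if col not in excluded]

x_train = Train[features].values
y_train = Train["Account.Status"].values
x_validate = Validate[features].values
y_validate = Validate["Account.Status"].values
x_test = test[features].values

#scikit-learn ignores Python's random.seed, so fix the seed through random_state
rf = RandomForestClassifier(n_estimators=1000, random_state=100)
rf.fit(x_train, y_train)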

Step 9

Finally, check performance on the validation set and make predictions for the test set.

status = rf.predict_proba(x_validate)
fpr, tpr, _ = roc_curve(y_validate, status[:,1])
roc_auc = auc(fpr, tpr)
print(roc_auc)

final_status = rf.predict_proba(x_test)
test["Account.Status"]=final_status[:,1]
test.to_csv('C:/Users/Analytics Vidhya/Desktop/model_output.csv',columns=['REF_NO','Account.Status'])

And Submit!
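
A side note on the AUC computed in Step 9: scikit-learn also provides roc_auc_score, which returns the same number in a single call. The sketch below checks the two routes against each other on four hypothetical predictions.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Route 1: build the ROC curve, then integrate it (as in Step 9)
fpr, tpr, _ = roc_curve(y_true, y_score)
auc_from_curve = auc(fpr, tpr)

# Route 2: compute the area directly
auc_direct = roc_auc_score(y_true, y_score)

print(auc_from_curve, auc_direct)
```

The curve-based route is still useful when you want to plot the ROC curve; the direct call is shorter when only the score is needed.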

Conclusion

In conclusion, building a predictive model effectively hinges on a strategic approach that prioritizes data exploration, smart data treatment, and rapid model development. Focusing on an initial benchmark model sets a foundation for swift iteration and improvement in future modelling stages. This guide underscores the importance of breaking down complex modelling tasks into manageable steps, ultimately enabling you to build robust, accurate models more efficiently. With the step-by-step Python code provided, you’re well-equipped to implement these methods in your predictive modelling projects, gaining a competitive edge in data science competitions and real-world applications.

Happy modelling, and may your models be both insightful and impactful!

Frequently Asked Questions

Q1. What is a prediction model in Python?

A. A prediction model in Python is a mathematical or statistical algorithm used to make predictions or forecasts based on input data. It utilizes machine learning or statistical techniques to analyze historical data and learn patterns, which can then be used to predict future outcomes or trends. Python provides various libraries and frameworks for building and deploying prediction models efficiently.

Q2. What are the three types of prediction?

A. The three types of prediction are:
1. Classification: Predicting the class or category that an input belongs to based on training data with labelled examples.
2. Regression: Predicting a continuous numerical value as an output to find a relationship between input and target variables.
3. Time series forecasting: Predicting future values based on the patterns and trends observed in historical time series data.

Q3. What is a predictive model in Python?

A. A predictive model in Python is a statistical or machine learning algorithm designed to forecast outcomes based on data input. Using libraries like scikit-learn or TensorFlow, predictive models analyze historical data to learn patterns and make future predictions, supporting decisions across various domains.

Q4. Can Python be used for modeling?

A. Yes, Python is great for modeling! It has libraries for:
Mathematical modeling: NumPy, SciPy, SymPy
Statistical modeling: Statsmodels, Scikit-learn
Machine learning: Scikit-learn, TensorFlow, PyTorch
Data modeling: Pandas, SQLAlchemy, App Engine Datastore
Simulation: SimPy, PyDSTool

Q5. Can Python be used for predictive analytics?

A. Yes, Python is great for predictive analytics. It has libraries for:
Data cleaning and analysis: Pandas, NumPy
Machine learning: Scikit-learn, TensorFlow, PyTorch
Visualization: Matplotlib, Seaborn
You can use Python to build and deploy predictive models for various applications.

Sunil Ray

Sunil Ray is Chief Content Officer at Analytics Vidhya, India's largest Analytics community. I am deeply passionate about understanding and explaining concepts from first principles. In my current role, I am responsible for creating top notch content for Analytics Vidhya including its courses, conferences, blogs and Competitions.

I thrive in fast paced environment and love building and scaling products which unleash huge value for customers using data and technology. Over the last 6 years, I have built the content team and created multiple data products at Analytics Vidhya.

Prior to Analytics Vidhya, I have 7+ years of experience working with several insurance companies like Max Life, Max Bupa, Birla Sun Life & Aviva Life Insurance in different data roles.

Industry exposure: Insurance, and EdTech

Major capabilities: Content Development, Product Management, Analytics, Growth Strategy.

Gianni

Hello I'm completely new, and I'm a bit lost. 1. By example, where I can find the train.csv and test.csv ? 2. This instruction "fullData.describe() #You can look at summary of numerical fields by using describe() function" ought to show me a resume of dataset but I can't see nothing. 3. When I try the code I get an error in line num_cols= list(set(list(fullData.columns))-set(cat_cols)-set(ID_col)-set(target_col)-set(data_col)) because the data_col is not defined. It's an error ? Sorry for my silly question.

Johnny Johnson

Beauuuuuuuutiful! Just what I was looking for; practical application. I can't wait to give it a try. Thank you

Pronojit

Hi Sunil, Thanks for the neat workflow, which I am sure will be helpful to many. But I couldnt get the logic behind encoding the target variable with LabelEncoder as well. How does it help in better prediction? Can you explain the same please? Thanks.
