Whenever we build a machine learning model, we first feed it training data. We then feed it unknown (test) data to see how well the model performs and generalizes to unseen data. If the model performs well on this unseen data and predicts with good accuracy across a wide range of inputs, we call it a stable model.
But this is not always the case! Machine learning models are not always stable, so we have to evaluate their stability. That is where Cross-Validation comes into the picture.
“In simple terms, Cross-Validation is a technique used to assess how well our Machine learning models perform on unseen data”
According to Wikipedia, Cross-Validation is the process of assessing how the results of a statistical analysis will generalize to an independent data set. There are many ways to perform Cross-Validation and we will learn about 4 methods in this article.
Let’s first understand the need for Cross-Validation!
Suppose you build a machine learning model to solve a problem, and you have trained the model on a given dataset. When you check the accuracy of the model on the training data, it is close to 95%. Does this mean that your model has trained very well, and it is the best model because of the high accuracy?
No, it’s not! Because your model has trained on the given data, it knows that data well, captures even the minute variations (noise), and fits it very closely. If you expose the model to completely new, unseen data, it might not predict with the same accuracy and might fail to generalize over the new data. This problem is called over-fitting.
Sometimes the model doesn’t train well on the training set because it is not able to find patterns. In that case, it won’t perform well on the test set either. This problem is called under-fitting.
Image Source: fireblazeaischool.in
To overcome over-fitting problems, we use a technique called Cross-Validation.
Cross-Validation is a resampling technique built on the fundamental idea of splitting the dataset into two parts: training data and test data. The training data is used to train the model, and the unseen test data is used for prediction. If the model performs well on the test data and gives good accuracy, it means the model hasn’t overfitted the training data and can be used for prediction.
Let’s dive deep and learn about some of the model evaluation techniques.
This is the simplest evaluation method and is widely used in Machine Learning projects. Here the entire dataset (population) is divided into two sets: a train set and a test set. The split can be 70-30, 60-40, 75-25, 80-20, or even 50-50 depending on the use case. As a rule, the proportion of training data has to be larger than that of the test data.
Image Source: DataVedas
There are some drawbacks to this method. The data split happens randomly, and unless we specify a random_state we can’t be sure which observations end up in the train and test buckets. This can lead to extremely high variance in the error estimate: every time the split changes, the accuracy changes as well.
On the other hand, one of the major advantages of this method is its computational efficiency compared to other cross-validation techniques, since the model is trained only once.
Quick implementation of Hold Out method in Python
Python Code:
from sklearn.model_selection import train_test_split
X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
X_train, X_test= train_test_split(X, test_size=0.3, random_state=1)
print('Train:', X_train, 'Test:', X_test)
Output
Train: [50, 10, 40, 20, 80, 90, 60] Test: [30, 100, 70]
Here, random_state is the seed used for reproducibility.
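To show how the split is actually used end to end, here is a minimal sketch of the Hold Out workflow: train on the 70% portion and score on the held-out 30%. The dataset (scikit-learn’s built-in breast cancer data) and the model (a scaled Logistic Regression) are assumptions for illustration, not part of the original example.

# Minimal hold-out workflow (dataset and model are illustrative assumptions)
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 70-30 split; a different random_state gives a different split and,
# usually, a slightly different accuracy (the drawback discussed above)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print("Hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))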
In this method, we divide the data into train and test sets – but with a twist. Instead of dividing the data into two subsets, we select a single observation as test data and label everything else as training data to train the model. Next, we select the 2nd observation as test data and train the model on the remaining data.
Image Source: ISLR
This process continues ‘n’ times, and we calculate the average of all these iterations to estimate the test set error.
When it comes to test-error estimates, LOOCV gives unbiased estimates (low bias). But bias is not the only matter of concern in estimation problems. We should also consider variance.
LOOCV has extremely high variance because averaging the output of n models fitted on almost identical sets of observations results in highly positively correlated outputs.
And you can clearly see that this method is computationally expensive because the model runs ‘n’ times to test every observation in the data. Our next method will tackle this problem and give us a good balance between bias and variance.
Quick implementation of Leave One Out Cross-Validation in Python
from sklearn.model_selection import LeaveOneOut
X = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
loo = LeaveOneOut()
for train, test in loo.split(X):
    print("%s %s" % (train, test))
Output:
[1 2 3 4 5 6 7 8 9] [0]
[0 2 3 4 5 6 7 8 9] [1]
[0 1 3 4 5 6 7 8 9] [2]
[0 1 2 4 5 6 7 8 9] [3]
[0 1 2 3 5 6 7 8 9] [4]
[0 1 2 3 4 6 7 8 9] [5]
[0 1 2 3 4 5 7 8 9] [6]
[0 1 2 3 4 5 6 8 9] [7]
[0 1 2 3 4 5 6 7 9] [8]
[0 1 2 3 4 5 6 7 8] [9]
This output clearly shows how LOOCV keeps one observation aside as test data and all the other observations go to train data.
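To make the “average of n iterations” idea concrete, the sketch below fits one model per observation and averages the n single-point errors into the LOOCV estimate. The dataset and model are illustrative assumptions.

# LOOCV test-error estimate as the average of n single-point errors
# (dataset and model are illustrative assumptions)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
loo = LeaveOneOut()

errors = []
for train_idx, test_idx in loo.split(X):
    # one model per observation: n fits in total, which is why LOOCV is expensive
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X[train_idx], y[train_idx])
    # 0/1 error on the single held-out point
    errors.append(int(model.predict(X[test_idx])[0] != y[test_idx][0]))

print("LOOCV test error estimate:", np.mean(errors))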
In this resampling technique, we divide the whole dataset into k sets of almost equal size. We select the first set as the test set, train the model on the remaining k-1 sets, and then calculate the test error rate on the held-out set.
In the second iteration, we select the 2nd set as the test set and use the remaining k-1 sets to train the model while calculating the error. This process continues for all the k sets.
Image Source: Wikipedia
The mean of errors from all the iterations is calculated as the CV test error estimate.
In K-Fold CV, the number of folds k is less than the number of observations in the data (k < n), and we are averaging the outputs of k fitted models that are somewhat less correlated with each other, since the overlap between their training sets is smaller. This leads to lower variance than LOOCV.
The best part about this method is that each data point appears in the test set exactly once and in the training set k-1 times. As the number of folds k grows towards n, the training sets overlap more and the variance of the estimate creeps back up towards that of LOOCV, so keeping k well below n keeps the variance lower. The bias is intermediate because each training set contains (k-1)n/k observations: fewer than in the Leave One Out method but more than in the Hold Out method.
Typically, K-fold Cross Validation uses k=5 or k=10 because empirical evidence shows these values yield test error estimates with low bias and variance.
The major disadvantage of this method is that the model must be trained from scratch k times, making it more computationally expensive than the Hold Out method, although still far cheaper than the Leave One Out method.
Simple implementation of K-Fold Cross-Validation in Python
from sklearn.model_selection import KFold
X = ["a",'b','c','d','e','f'] kf = KFold(n_splits=3, shuffle=False, random_state=None) for train, test in kf.split(X): print("Train data",train,"Test data",test)
Output
Train: [2 3 4 5] Test: [0 1]
Train: [0 1 4 5] Test: [2 3]
Train: [0 1 2 3] Test: [4 5]
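The snippet above only prints the fold indices; the sketch below goes one step further and averages the k fold errors into a single CV estimate, as described earlier. The dataset, the model, and k=5 are illustrative assumptions.

# Averaging the k fold errors into one CV estimate
# (dataset, model, and k=5 are illustrative assumptions)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=1)

fold_errors = []
for train_idx, test_idx in kf.split(X):
    model = make_pipeline(StandardScaler(), LogisticRegression())
    model.fit(X[train_idx], y[train_idx])
    # misclassification rate on this fold's held-out data
    fold_errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))

print("5-Fold CV test error estimate:", np.mean(fold_errors))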
This is a slight variation of K-Fold Cross-Validation that uses ‘stratified sampling’ instead of ‘random sampling.’
Let’s quickly understand what stratified sampling is and how it differs from random sampling.
Suppose your data contains reviews for a cosmetic product used by both the male and female population. When we perform random sampling to split the data into train and test sets, most of the male data might end up in the test set and be under-represented in the training set. When we train the model on sample training data that is not a correct representation of the actual population, the model will not predict the test data with good accuracy.
This is where stratified sampling comes to the rescue. Here, the data is split so that each part represents all the classes in the same proportion as the population.
Let’s consider the above example of a cosmetic product reviewed by 1,000 customers, of which 60% are female and 40% are male. I want to split the data into train and test sets in an 80:20 proportion. 80% of 1,000 customers is 800, which should include 480 reviews from the female population and 320 from the male population. Similarly, the remaining 20% of customers form the test data, maintaining the same female and male representation.
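As a quick sanity check of those numbers, the sketch below uses train_test_split with its stratify argument to perform exactly such a stratified 80:20 split; the labels are made up to mirror the 1,000-customer example.

# Stratified 80:20 split that preserves the 60% / 40% class mix
# (labels are made up to mirror the 1,000-customer example)
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array(["female"] * 600 + ["male"] * 400)   # 60% female, 40% male
X = np.arange(len(y)).reshape(-1, 1)              # dummy feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Expected: 480 female / 320 male in train, 120 female / 80 male in test
print("Train:", np.sum(y_train == "female"), "female,", np.sum(y_train == "male"), "male")
print("Test: ", np.sum(y_test == "female"), "female,", np.sum(y_test == "male"), "male")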
Image Source: stackexchange.com
This is exactly what Stratified K-Fold CV does: it creates k folds while preserving the percentage of samples for each class. This solves the problem of random sampling associated with the Hold Out and K-Fold methods.
Quick implementation of Stratified K-Fold Cross-Validation in Python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]])
y = np.array([0, 0, 1, 0, 1, 1])

skf = StratifiedKFold(n_splits=3, shuffle=False)
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Output
Train: [1 3 4 5] Test: [0 2]
Train: [0 2 3 5] Test: [1 4]
Train: [0 1 2 4] Test: [3 5]
The output clearly shows the stratified split done based on the classes ‘0’ and ‘1’ in ‘y’.
When we consider test error rate estimates, K-Fold Cross-Validation gives more accurate estimates than Leave One Out Cross-Validation, whereas the Hold Out method usually overestimates the test error rate because it uses only a portion of the data to train the machine learning model.
When it comes to bias, the Leave One Out method gives nearly unbiased estimates because each training set contains n-1 observations (which is pretty much all of the data). K-Fold CV leads to an intermediate level of bias, depending on the number of folds k, when compared to LOOCV, but its bias is much lower than that of the Hold Out method.
To conclude, the Cross-Validation technique that we choose highly depends on the use case and bias-variance trade-off.
If you have read this article this far, here is a quick bonus for you 👏
sklearn.model_selection has a function called cross_val_score which simplifies the process of cross-validation. Instead of iterating through the complete data using the split function, we can use cross_val_score directly and check the accuracy score for the chosen cross-validation method.
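Here is a minimal sketch of that shortcut; the model, the dataset, and the choice of 5 stratified folds are illustrative assumptions.

# cross_val_score does the splitting, fitting, and scoring in one call
# (model, dataset, and CV scheme are illustrative assumptions)
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression())

# one accuracy score per fold, then the mean as the CV estimate
scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=5), scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean CV accuracy:", scores.mean())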
You can check out my Github for python implementation of different cross-validation methods on the UCI Breast Cancer data from Kaggle.
Q. What is cross-validation and what are its types?
A. Cross-validation serves as a technique in machine learning and statistical modeling to assess model performance and prevent overfitting. It involves dividing the dataset into multiple subsets, using some for training the model and the rest for testing, multiple times to obtain reliable performance metrics.
Types of Cross-validation:
1. K-Fold Cross-Validation
2. Leave-One-Out Cross-Validation (LOOCV)
3. Stratified K-Fold Cross-Validation
4. Time Series Cross-Validation
5. Shuffle-Split Cross-Validation
6. Group K-Fold Cross-Validation
Cross-validation provides a more robust assessment of a model’s performance and helps in selecting hyperparameters, assessing generalization, and avoiding overfitting by evaluating the model on different data points than those used for training.
Q. What is K-fold cross-validation?
A. K-fold cross-validation assesses the performance and generalization of a machine learning model. It divides the dataset into K subsets and trains the model K times, using K-1 folds for training and one fold for testing in each iteration. This method helps obtain more reliable performance metrics, choose hyperparameters, and prevent overfitting by evaluating the model on different data subsets.
If you would like to share your thoughts, you can connect with me on LinkedIn.
Happy Learning!
Analytics Vidhya does not own the media shown in this article, and the author uses it at their discretion.