Is Jasprit Bumrah a Genius Bowler? Using AutoEncoders for Anomaly Detection in Cricket

Balajisr SR 26 Jun, 2024

Introduction

During one of the matches of the ICC T20 World Cup, Rohit Sharma, captain of the Indian cricket team, applauded Jasprit Bumrah as a genius bowler. I decided to run an experiment and test this claim using publicly available data. Even though it is a fun project, I was pleasantly surprised by the results. Let us get started.

Problem Definition

To analyze data or build models, we first need to convert the business problem into a data problem. How do we make a model understand the meaning of "genius"? Genius can be defined as "way over, or head and shoulders above, the rest". Can we formulate this as an anomaly detection problem? Yes.

There are many ways to solve an anomaly detection problem. We will stick to AutoEncoders, implemented in PyTorch.

We will use publicly available T20 player statistics from the cricketdata R package to train our AutoEncoder. If the AutoEncoder struggles to reconstruct a player's values, the Mean Squared Error (MSE) will be high, and an MSE above a threshold marks an anomaly.

In our case, the MSE for Jasprit Bumrah should be sufficiently high for him to be flagged as an anomaly, i.e., a genius.

Learning Objectives

  • Grasp the basic architecture of AutoEncoders, including the roles of the encoder and decoder networks.
  • Understand how to utilize reconstruction error (Mean Squared Error) from AutoEncoders to identify anomalies.
  • Learn to preprocess data, create datasets, and set up data loaders for training and testing models.
  • Understand the process of training an AutoEncoder, including setting hyperparameters, loss functions, and optimizers.
  • Explore real-world applications of anomaly detection, such as customer management and fraud detection.

This article was published as a part of the Data Science Blogathon.

What are AutoEncoders?

AutoEncoders are composed of two networks: an encoder and a decoder. The encoder receives a D-dimensional vector V and encodes it into a vector X of dimension M, where M < D; the encoder therefore compresses the input. The decoder decompresses X and tries to recreate V as closely as possible. Let us call the output of the decoder Z.


Typically, the decoder is able to recreate V for most rows, i.e., Z is close to V. But for certain rows the decoder struggles, and the difference between Z and V is large. We call these rows anomalies; they have a high Mean Squared Error (MSE).
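
To make this concrete, here is a minimal sketch of the idea with made-up arrays (V standing in for the original data and Z for the decoder's reconstruction; nothing here comes from the actual cricket data):

import numpy as np

# Illustrative stand-ins: V is the original data, Z the reconstruction.
rng = np.random.default_rng(0)
V = rng.random((100, 7))                     # 100 rows, D = 7 features
Z = V + rng.normal(scale=0.1, size=V.shape)  # imperfect reconstruction

# Per-row reconstruction error: mean of squared differences over the D features
mse = np.mean((V - Z) ** 2, axis=1)

# Rows whose error sits far above the rest get flagged as anomalies
threshold = mse.mean() + 3 * mse.std()
print(np.where(mse > threshold)[0])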

Real World Applications of AutoEncoder

Let us now explore real-world applications of AutoEncoders.

Customer Management

Suppose an organization deals with a lot of customers and has a methodology to label them as good or bad, clean or risky, wealthy or not wealthy. An AutoEncoder trained only on good, clean, or wealthy customers can decipher the patterns of these top, or ideal, customers. When a new customer comes in, we then have a reliable way to know how different they are from the ideal customer. You may argue that this can be done manually, but humans are limited by the number of variables and the amount of data they can handle; machines do not have this limitation.

Fraud Management

Similarly, if an organization has a methodology to label transactions as fraudulent or non-fraudulent, we can train our AutoEncoder on non-fraudulent transactions alone. In production, this gives us a reliable mechanism to measure how different a new transaction is from an ideal one.

The above is not an exhaustive list of AutoEncoder applications.

Let us now go back to our original problem.

Data Collection, Cleaning and Feature Engineering

I collected the T20 career statistics of bowlers here.

I used the R library cricketdata to download player T20 career statistics, as a Python version is not available as far as I know. These T20 statistics do not include franchise leagues such as the IPL.

library(cricketdata)
# T20 Career Data
t20_career <- fetch_cricinfo("T20", "men", "Bowling",'career')
# T20 Innings Data
t20_innings <- fetch_cricinfo("T20", "men", "Bowling",'innings')

We need to join both datasets to create the final input dataset for training the AutoEncoder in Python. Before saving the file to disk, we keep only Test-playing countries for our analysis.

final<-final[Country %in% c('Australia','West Indies','South Africa'
                            ,'Pakistan','Afghanistan','India'
                            ,'Sri Lanka','England','New Zealand'
                            ,'BAN')]
fwrite(final,'T20_Stats_Career.txt',sep="|")

We name the final dataset “T20_Stats_Career.txt”.

Now we will use Python for the rest of our analysis.

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from sklearn.preprocessing import StandardScaler
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import random

We have imported all necessary libraries. We would now read Player’s data.

df = pd.read_csv('T20_Stats_Career.txt',sep='|')
df.head()

For every player, we have the number of innings, overs, maidens, runs, wickets, bowling average, economy, and strike rate.

Feature Engineering

I have added two new features:

  • Maiden Percentage: (number of maidens / number of overs) × 100
  • Wickets Per Over: number of wickets / number of overs

df['Maiden_PCT'] = df['Maidens'] / df['Overs'] * 100
df['Wickets_Per_over'] = df['Wickets'] / df['Overs']

We also drop players with fewer than 15 innings, so that we analyze only players with sufficient match experience.
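
The filter itself is a one-liner; a sketch, assuming the innings column is named Innings as in the cricketdata output:

# Keep only players with sufficient match experience (column name assumed)
df = df[df['Innings'] >= 15].reset_index(drop=True)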

Train and Test Dataset

Train Dataset: T20 statistics of players from all nationalities other than India.

Test Dataset: Indian players only.

# Create Train and Test Dataset
# (.copy() avoids pandas SettingWithCopyWarning when we add columns later)
test = df[df['Country'] == 'India'].copy()
train = df[df['Country'] != 'India'].copy()

We use the following features to train our Model:

  • Average
  • Economy
  • Strike Rate
  • No of Four Wickets
  • No of Five Wickets
  • Maiden Percentage
  • Wickets Per Over

Drop Unnecessary Features

features = ['Average','Economy','StrikeRate','FourWickets','FiveWickets'
,'Maiden_PCT','Wickets_Per_over']
X_train = train[features]
X_test = test[features]
print("Number of Players in Train Dataset",X_train.shape)
print("Number of Players in Test Dataset",X_test.shape)
X_train.head()

Data Standardization

We now have the train and test datasets. Next, we standardize the data, fitting the scaler on the train set only and applying it to both.

sc = StandardScaler()
sc.fit(X_train)
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)

Model Training

We now set the appropriate device and create data loaders with a batch size of 16.

# Create Tensor Dataset and Dataloders
device = 'cuda' if torch.cuda.is_available() else 'cpu'

torch.manual_seed(13)
x_train_tensor = torch.as_tensor(X_train).float().to(device)
y_train_tensor = torch.as_tensor(X_train).float().to(device)

x_test_tensor = torch.as_tensor(X_test).float().to(device)
y_test_tensor = torch.as_tensor(X_test).float().to(device)

train_dataset = TensorDataset(x_train_tensor,y_train_tensor)
test_dataset = TensorDataset(x_test_tensor,y_test_tensor)

train_loader = DataLoader(dataset=train_dataset,batch_size=16,shuffle=True)
test_loader = DataLoader(dataset=test_dataset,batch_size=16)

We set the AutoEncoder architecture as below:

7 -> 4 -> 2 -> 4 -> 7

As we are dealing with very little data, we build a simple model.

We use a learning rate of 0.001 and Adam as the optimizer.

# AutoEncoder Architecture

class AutoEncoder(nn.Module):
    def __init__(self):
        super(AutoEncoder, self).__init__()
        # Encoder: 7 -> 4 -> 2
        self.encoder = nn.Sequential()
        self.encoder.add_module('Hidden1', nn.Linear(7, 4))
        self.encoder.add_module('Relu1', nn.ReLU())
        self.encoder.add_module('Hidden2', nn.Linear(4, 2))

        # Decoder: 2 -> 4 -> 7
        self.decoder = nn.Sequential()
        self.decoder.add_module('Hidden3', nn.Linear(2, 4))
        self.decoder.add_module('Relu2', nn.ReLU())
        self.decoder.add_module('Hidden4', nn.Linear(4, 7))

    def forward(self, x):
        encoded = self.encoder(x)
        return self.decoder(encoded)


# Standalone helper: reconstruct a numpy array and return it as numpy
def predict(model, x):
    model.eval()
    x_tensor = torch.as_tensor(x).float()
    y_hat = model(x_tensor.to(device))
    model.train()
    return y_hat.detach().cpu().numpy()


# Standalone helper: plot per-epoch train and test losses
def plot_losses(train_losses, test_losses=None):
    fig = plt.figure(figsize=(10, 4))
    plt.plot(train_losses, label='training loss', c='b')
    if test_losses is not None:
        plt.plot(test_losses, label='test loss', c='r')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.tight_layout()
    return fig

# Model Loss and Optimizer
lr = 0.001
torch.manual_seed(21)
model = AutoEncoder().to(device)
optimizer = optim.Adam(model.parameters(),lr = lr)
loss_fn = nn.MSELoss()

We train our model for 250 epochs.

num_epochs=250
train_loss=[]
test_loss=[]
seed=42
torch.backends.cudnn.deterministic=True
torch.backends.cudnn.benchmark=False
torch.manual_seed(seed)
np.random.seed(seed)
random.seed(seed)
for epoch in range(num_epochs):
    mini_batch_train_loss=[]
    mini_batch_test_loss=[]
    for train_batch,y_train in train_loader:
        train_batch =train_batch.to(device)
        model.train()
        yhat = model(train_batch)
        loss = loss_fn(yhat,y_train)
        mini_batch_train_loss.append(loss.cpu().detach().numpy())
        
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
  
    train_epoch_loss = np.mean(mini_batch_train_loss)
    train_loss.append(train_epoch_loss)

    with torch.no_grad():
        for test_batch,y_test in test_loader:
            test_batch = test_batch.to(device)
            model.eval()
            yhat = model(test_batch)
            loss = loss_fn(yhat,y_test)
            mini_batch_test_loss.append(loss.cpu().detach().numpy())
        test_epoch_loss = np.mean(mini_batch_test_loss)
        test_loss.append(test_epoch_loss)
        
fig = plot_losses(train_loss,test_loss)
fig.savefig('Train_Test_Loss.png')

The train and test loss plot looks OK.

Mean Squared Error (MSE)

Using the predict function, we can generate reconstructions for the train dataset. We then compute the Mean Squared Error by averaging the squared differences between actual and predicted values across the features. We also compute a Z-score for each player using the mean and standard deviation of the train MSE.

# Predict Train Dataset and get error
train_pred = predict(model,X_train)
print(train_pred.shape)
error = np.mean(np.power(X_train - train_pred,2),axis=1)
print(error.shape)
train['error'] = error
mean_error = np.mean(train['error'])
std_error =np.std(train['error'])
train['zscore'] = (train['error'] - mean_error) / std_error
train = train.sort_values(by='error').reset_index()
train.to_csv('Train_Output.txt',sep="|",index=None)

fig = plt.figure(figsize=(10,4))
plt.title('Distribution of MSE in Train Dataset')
train['error'].plot(kind='line')
plt.ylabel('MSE')
plt.show()
fig.savefig('Train_MSE.png')

We can infer that there is a steep increase in MSE for certain players.


The majority of players have an MSE within 1. Beyond an MSE of 1.2, there are only a few players.

The top 3 players in the train dataset with the highest MSE are:

train.tail(3)

Please keep in mind that we are using the tail function, so the rows appear in ascending order of MSE.

We can infer that for some players the AutoEncoder struggles to reconstruct the original values, resulting in high MSE.

Looking at the above plots, we can set the threshold to 1.2.

Ideally, we would split the data into train and validation sets and use the validation set to choose the threshold, but with only about 200 rows we are forced to take this simpler approach.
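
With that caveat, flagging anomalous players in the train dataset is then straightforward; a sketch, assuming the error column created above and a Player column as in the cricketdata output:

# Flag players whose reconstruction error exceeds the chosen threshold
threshold = 1.2
anomalies = train[train['error'] > threshold]
print(anomalies[['Player', 'error', 'zscore']])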

Test Dataset – Indian Players or Bowlers

Let us now compute the Mean Squared Error and Z-score for the test data.

# Predict Test Dataset and get error
test_pred = predict(model,X_test)
test_error = np.mean(np.power(X_test - test_pred,2),axis=1)
test['error'] = test_error
test['zscore'] = (test['error'] - mean_error) / std_error
test = test.sort_values(by='error').reset_index()
test.to_csv('Test_Output.txt',sep="|",index=None)

fig = plt.figure(figsize=(10,4))
plt.title('Distribution of MSE in Test Dataset')
test['error'].plot(kind='line')
plt.ylabel('MSE')
plt.show()
fig.savefig('Test_MSE.png')

Similar to the train dataset, there is a steep increase in MSE for certain Indian players.

test.tail(3)

Please keep in mind that we are using the tail function; hence the correct order is Kuldeep Yadav, JJ Bumrah, and Harbhajan Singh.

As in the train dataset, we create a new column named error in the test dataset, which holds the MSE values. Similarly, the AutoEncoder struggles to reconstruct the original values for some Indian players.

Using the train MSE, we computed the mean and standard deviation. For each player in the test dataset, the Z-score is computed as (test error − train mean error) / train error standard deviation.

We can verify that the Z-score for Bumrah is more than 3, which signifies an anomaly, or in our framing, a genius.

MSE Breakdown or Drill Down

Let us now learn about the MSE breakdown for the players.

Jasprit Bumrah

Let us now understand why the MSE is high for Jasprit Bumrah. His MSE is 1.36. Let us drill down on the MSE at the variable level to understand the contributing factors.

For each variable, the squared error is calculated as (Actual − Predicted)²; the overall MSE is the average of these squared errors across the variables.
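
Here is a minimal sketch of that drill-down, run before the test frame is re-sorted so that rows still line up with X_test (the Player column name and the 'JJ Bumrah' label are assumed to match the data shown earlier):

# Per-variable squared errors; each row averages to that player's overall MSE
per_feature_se = pd.DataFrame(np.power(X_test - test_pred, 2), columns=features)
per_feature_se['Player'] = df.loc[df['Country'] == 'India', 'Player'].values
# Contribution of each variable for one player
print(per_feature_se[per_feature_se['Player'] == 'JJ Bumrah'].T)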


Please note that we are dealing with standardized values. The high MSE is mostly driven by his high maiden percentage. This suggests Bumrah is an outstanding bowler in the 19th or 20th over of an innings: a high maiden percentage creates pressure on the batsmen, which can result in other bowlers taking wickets in the following overs. Note that the Maiden Percentage variable was created through feature engineering.

Kuldeep Yadav


Kuldeep Yadav has an uncanny ability to pick up wickets, which is especially useful in the middle overs. The AutoEncoder over-predicted his four-wickets variable.

Overall Statistics of Top 2 Indian Bowlers


Jasprit Bumrah has 2 variables above the 90th percentile, and Kuldeep Yadav has 4 variables above the 99th percentile.
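
A sketch of how such percentile figures can be computed (assuming a Player column; note that for bowling average, economy, and strike rate lower values are better, so ascending ranks may be preferable for those columns):

# Percentile rank of each feature across all players in both datasets
combined = pd.concat([train, test], ignore_index=True)
pct_ranks = (combined[features].rank(pct=True) * 100).round(1)
pct_ranks['Player'] = combined['Player']
print(pct_ranks[pct_ranks['Player'].isin(['JJ Bumrah', 'Kuldeep Yadav'])].T)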

Hope to see Kuldeep Yadav in action soon.

You can find full code here.

Conclusion

AutoEncoders are a powerful tool in one's arsenal for anomaly detection, but they are not the only method; we could also use ML algorithms such as Isolation Forest or other simpler methods. Coming back to our problem, we can infer that the AutoEncoder correctly identifies anomalies. The toughest part of anomaly detection is convincing stakeholders of the reasons behind an anomaly. Here we drilled down into the MSE to identify those reasons; these insights are as important as detecting the anomaly itself. Explainable AI matters.
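
For comparison, a minimal Isolation Forest sketch on the same standardized features might look like this (parameters are illustrative, not tuned):

from sklearn.ensemble import IsolationForest

# Fit on non-Indian players, score Indian players; lower = more anomalous
iso = IsolationForest(n_estimators=100, contamination='auto', random_state=42)
iso.fit(X_train)
print(iso.decision_function(X_test))  # negative scores indicate outliers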

Key Takeaways

  • We used the R package cricketdata to download T20 player statistics for Test-playing nations and saved the data to disk.
  • We engineered two features: Maiden Percentage and Wickets Per Over.
  • Using PyTorch, we trained an AutoEncoder for 250 epochs on the train dataset, with the Adam optimizer and a learning rate of 0.001.
  • We computed the Mean Squared Error from the differences between actual and predicted values in both the train and test datasets.
  • We treat a reconstruction error beyond a certain threshold as an anomaly; in our case the threshold is 1.2.
  • By looking at the breakup of the MSE, we can infer that Bumrah excels in bowling maidens, which is gold in T20.

Frequently Asked Questions

Q1. Do we really need to use AutoEncoder for this small dataset?

A. Not necessarily. We can use ML methods like Isolation Forest or other simpler methods to solve this problem. I have used an AutoEncoder purely for illustration.

Q2. Suppose we train model on different seeds, do we get different output?

A. Yes, the output differs when trained with different seeds, as the dataset is small. The main motive of this blog is to demonstrate an application of AutoEncoders and how they can be used to generate insights that aid decision making.

Q3. What is the training time of the model?

A. Training time is less than a minute.

Q4. Why PyTorch?

A. The deep learning framework should not matter; it all depends on the framework one is comfortable with.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.


Data enthusiast. I love to bridge the gap between data and strategy and work on anything in between, including data science.
