Cricket has embraced data analytics for strategic advantage. With franchise leagues such as the IPL and BBL, teams rely on statistical models and tools for a competitive edge. This article explores how data analytics can optimize strategy by leveraging player performance and opposition weaknesses: we use Python to predict player performance, which in turn informs team selection and game tactics. The analysis also benefits fantasy cricket enthusiasts and shows how machine learning and predictive modeling are reshaping the sport.
This project demonstrates how to use Python and machine learning to predict player performance in T20 matches. By the end of this article, you will be able to:
- Scrape recent player statistics from the web with Selenium.
- Prepare the data and create lagged target variables.
- Train ridge regression models to forecast a player's next-match performance.
- Post-process the predictions and report them with confidence intervals.
This article was published as a part of the Data Science Blogathon.
We aim to predict the player performance for an upcoming IPL match using Python and data analytics. The project includes collecting, processing, and analyzing data on player and team performance in previous T20 matches. It also involves building a predictive model that can forecast player performance in the next match.
The problem we aim to solve is to provide IPL coaches, team management, and fantasy-league enthusiasts with a tool that helps them make data-driven decisions about player selection and game tactics. Traditionally, player selection and game tactics in cricket have been based on subjective assessment and experience. With the advent of data-driven analytics, however, statistical models can now provide insights into player performance and inform decisions about team selection and game strategy.
Our solution consists of building a predictive model that can accurately forecast player performance based on historical data. This will help individuals and teams identify the best players for the next match and devise strategies to maximize their chances of winning.
With the IPL 2023 season reaching its climax, cricket enthusiasts eagerly await the final league match between Gujarat Titans and Royal Challengers Bangalore. The outcome of this encounter depends heavily on how each player performs. In pursuit of insights into potential performances, we've curated a lineup of players who have consistently demonstrated their skills throughout the tournament:
We will attempt to predict the performances of these players for this crucial game by using advanced statistical models and historical data.
We will begin the data collection and preparation by scraping cricmetric.com for the most recent statistics of the relevant players. We structure and organize the collected data for model construction.
To begin, we import the required libraries: time, pandas, numpy, and selenium. We use the Selenium library to control the Chrome browser for web scraping.
import time
import pandas as pd
import numpy as np
from selenium import webdriver
We configure the Chrome driver by specifying the path to its executable (chrome_driver_path); the directory containing the driver is stored in webdriver_path.
# Setting up the Chrome driver
chrome_driver_path = r"{Your Chromedriver path}\chromedriver.exe"
webdriver_path = r"{Your Webdriver path}\Chromedriver"
%cd {webdriver_path}
driver = webdriver.Chrome(chrome_driver_path)
We then initialize an empty DataFrame named final_data which will be used to store the collected player statistics. Next, we perform a loop that iterates over the list of our relevant player names.
For each player, the code performs the following steps:
1. Opens the player's T20 stats page on cricmetric.com for the chosen date range.
2. Scrolls the page and waits for the statistics tables to load.
3. Parses the batting table into a DataFrame and renames its columns.
4. Switches to the bowling tab and parses the bowling table; if the player has no bowling record, an all-zero bowling frame is created instead.
5. Merges the batting and bowling stats on the Match column.
6. Adds lagged "next_*" columns, so each row's target is the following match's figure.
7. Appends the player's rows to final_data.
Once we have collected the required data, we will apply the following transformations:
1. Drop rows where Match is 0.
2. Replace "-" entries in the Bowl Avg and Bowl SR columns with 0.
3. Reorder the columns so that the batting stats, bowling stats, and lagged targets sit together.
4. Replace any remaining "-" values with 0.
# Setting up the Chrome driver
%cd "C:\Users\akshi\OneDrive\Desktop\ISB\Data Collection\Chromedriver"
driver = webdriver.Chrome("C:\\Users\\akshi\\OneDrive\\Desktop"
                          "\\ISB\\Data Collection\\Chromedriver\\chromedriver.exe")
# Extracting recent stats of the players
final_data = pd.DataFrame()  # Final dataframe to store all the player data

# Looping through all the players ('players' holds the names listed above)
for i in players:
    # Accessing the web page for the current player's stats
    driver.get(("http://www.cricmetric.com/playerstats.py?player={}"
                "&role=all&format=TWENTY20&groupby=match"
                "&start_date=2022-01-01&end_date=2023-05-18"
                "&start_over=0&end_over=9999").format(i.replace(' ', '+')))

    # Scrolling down to load all the stats
    driver.execute_script("window.scrollTo(0, 1080)")
    driver.maximize_window()
    time.sleep(3)

    # Extracting batting stats of the player
    batting_table = driver.find_element_by_xpath(
        '//*[@id="TWENTY20-Batting"]/div/table')
    stats = pd.DataFrame(batting_table.text.split('\n'))[0].str.split(
        ' ', expand=True)[0:-1]
    stats.columns = stats.iloc[0]
    stats = stats[1:]
    del stats['%']
    stats = stats[['Match', 'Runs', 'Balls', 'Outs', 'SR',
                   '50', '100', '4s', '6s', 'Dot']]
    stats.columns = ['Match', 'Runs Scored', 'Balls Played', 'Out', 'Bat SR',
                     '50', '100', '4s Scored', '6s Scored', 'Bat Dot%']

    try:
        # Switching to the bowling stats tab
        bowling_tab = driver.find_element_by_xpath(
            '//*[@id="TWENTY20-Bowling-tab"]')
        bowling_tab.click()
        time.sleep(5)

        # Extracting bowling stats of the player
        bowling_table = driver.find_element_by_xpath(
            '//*[@id="TWENTY20-Bowling"]/div/table')
        stats2 = pd.DataFrame(bowling_table.text.split('\n'))[0].str.split(
            ' ', expand=True)[0:-1]
        stats2.columns = stats2.iloc[0]
        stats2 = stats2[1:]
        stats2 = stats2[['Match', 'Overs', 'Runs', 'Wickets', 'Econ',
                         'Avg', 'SR', '5W', '4s', '6s', 'Dot%']]
        stats2.columns = ['Match', 'Overs Bowled', 'Runs Given', 'Wickets Taken',
                          'Econ', 'Bowl Avg', 'Bowl SR', '5W',
                          '4s Given', '6s Given', 'Bowl Dot%']
    except Exception:
        # If bowling stats for the current player are not found,
        # create an all-zero bowling dataframe so the merge still works
        stats2 = pd.DataFrame({'Match': pd.Series(stats['Match'][0:1]),
                               'Overs Bowled': [0], 'Runs Given': [0],
                               'Wickets Taken': [0], 'Econ': [0],
                               'Bowl Avg': [0], 'Bowl SR': [0], '5W': [0],
                               '4s Given': [0], '6s Given': [0],
                               'Bowl Dot%': [0]})

    # Merge batting and bowling stats
    merged_stats = pd.merge(stats, stats2, on='Match', how='outer').fillna(0)
    merged_stats = merged_stats.sort_values(by=['Match'])

    # Create lagged variables for future performance prediction
    merged_stats.insert(loc=0, column='Player', value=i)
    merged_stats['next_runs'] = merged_stats['Runs Scored'].shift(-1)
    merged_stats['next_balls'] = merged_stats['Balls Played'].shift(-1)
    merged_stats['next_overs'] = merged_stats['Overs Bowled'].shift(-1)
    merged_stats['next_runs_given'] = merged_stats['Runs Given'].shift(-1)
    merged_stats['next_wkts'] = merged_stats['Wickets Taken'].shift(-1)
    final_data = pd.concat([final_data, merged_stats])
final_data = final_data[final_data['Match']!=0]
final_data['Bowl Avg'] = np.where(final_data['Bowl Avg']=='-',
0,final_data['Bowl Avg'])
final_data['Bowl SR'] = np.where(final_data['Bowl SR']=='-',
0,final_data['Bowl SR'])
final_data = final_data[['Player','Match', 'Runs Scored',
'Balls Played', 'Out', 'Bat SR',
'50', '100', '4s Scored',
'6s Scored','Bat Dot%',
'Overs Bowled','Runs Given',
'Wickets Taken', 'Econ',
'Bowl Avg', 'Bowl SR', '5W',
'4s Given', '6s Given',
'Bowl Dot%', 'next_runs',
'next_balls', 'next_overs',
'next_runs_given', 'next_wkts']]
final_data = final_data.replace('-',0)
final_data
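To make the lagged-target step concrete, here is a toy example (the scores are hypothetical) of how shift(-1) builds the next-match columns: each row receives the following match's value, and the final match has no target (NaN), which is why those rows are dropped before training.

```python
import pandas as pd

# Toy frame with made-up scores for three matches
df = pd.DataFrame({
    "Match": ["m1", "m2", "m3"],
    "Runs Scored": [34, 51, 12],
})

# Each row's target becomes the next match's runs;
# the last row has no following match, so it is NaN
df["next_runs"] = df["Runs Scored"].shift(-1)
```

The row for m1 gets 51.0, m2 gets 12.0, and m3 gets NaN.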
When it comes to building the model, we first create an empty data frame called models. This DataFrame will be used to store the predictions for each player.
The above steps are repeated for each player in the players_list, resulting in a models DataFrame that contains the predictions and confidence intervals for all players.
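Before the full listing, here is a compact, self-contained sketch of the alpha search the loop performs for each target, run on synthetic data (the coefficients and sample size here are made up): fit Ridge for every integer alpha from 0 to 100 and keep the alpha whose average of train and test R-squared is highest.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a player's match stats
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(80, 5)))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=80)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=123)

# Sweep alpha over 0..100 and score each fit on train and test splits
scores = []
for alpha in range(0, 101):
    m = linear_model.Ridge(alpha=alpha).fit(X_tr, y_tr)
    scores.append((alpha, (m.score(X_tr, y_tr) + m.score(X_te, y_te)) / 2))

# Keep the alpha with the best average score
best_alpha = max(scores, key=lambda t: t[1])[0]
```

The same pattern repeats below for runs, balls, overs, runs given, and wickets.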
# Additional imports used in the modelling step
import math
import scipy.stats
from statistics import stdev
from sklearn import linear_model
from sklearn.model_selection import train_test_split

models = pd.DataFrame()

# Iterate over the list of players
for player_name in players_list:
    # Filter the data for the current player
    player_data = final_data[final_data['Player'] == player_name]
    # Remove rows with missing values (the latest match has no targets yet)
    player_new = player_data.dropna()

    # Predict next runs
    X_runs = player_new[player_new.columns[2:11]]
    y_runs = player_new[player_new.columns[21:22]]
    X_train_runs, X_test_runs, y_train_runs, y_test_runs = train_test_split(
        X_runs, y_runs, random_state=123)
    # Iterate over a range of alpha values, recording train and test scores
    ridge_runs = pd.DataFrame()
    for j in range(0, 101):
        points_runs = linear_model.Ridge(alpha=j).fit(X_train_runs, y_train_runs)
        ridge_runs = pd.concat([ridge_runs, pd.DataFrame(
            {'Alpha': [j],
             'Train': [points_runs.score(X_train_runs, y_train_runs)],
             'Test': [points_runs.score(X_test_runs, y_test_runs)]})])
    # Find the alpha value with the highest average score
    ridge_runs['Average'] = ridge_runs[['Train', 'Test']].mean(axis=1)
    k_runs = ridge_runs.sort_values('Average', ascending=False)['Alpha'].iloc[0]
    # Train the model with the best alpha value
    next_runs = linear_model.Ridge(alpha=k_runs).fit(X_train_runs, y_train_runs)
    sd_next_runs = stdev(X_train_runs['Runs Scored'].astype('float'))

    # Predict next balls
    X_balls = player_new[player_new.columns[2:11]]
    y_balls = player_new[player_new.columns[22:23]]
    X_train_balls, X_test_balls, y_train_balls, y_test_balls = train_test_split(
        X_balls, y_balls, random_state=123)
    ridge_balls = pd.DataFrame()
    for j in range(0, 101):
        points_balls = linear_model.Ridge(alpha=j).fit(X_train_balls, y_train_balls)
        ridge_balls = pd.concat([ridge_balls, pd.DataFrame(
            {'Alpha': [j],
             'Train': [points_balls.score(X_train_balls, y_train_balls)],
             'Test': [points_balls.score(X_test_balls, y_test_balls)]})])
    ridge_balls['Average'] = ridge_balls[['Train', 'Test']].mean(axis=1)
    k_balls = ridge_balls.sort_values('Average', ascending=False)['Alpha'].iloc[0]
    next_balls = linear_model.Ridge(alpha=k_balls).fit(X_train_balls, y_train_balls)
    sd_next_balls = stdev(X_train_balls['Balls Played'].astype('float'))

    # Predict next overs
    X_overs = player_new[player_new.columns[11:21]]
    y_overs = player_new[player_new.columns[23:24]]
    X_train_overs, X_test_overs, y_train_overs, y_test_overs = train_test_split(
        X_overs, y_overs, random_state=123)
    ridge_overs = pd.DataFrame()
    for j in range(0, 101):
        points_overs = linear_model.Ridge(alpha=j).fit(X_train_overs, y_train_overs)
        ridge_overs = pd.concat([ridge_overs, pd.DataFrame(
            {'Alpha': [j],
             'Train': [points_overs.score(X_train_overs, y_train_overs)],
             'Test': [points_overs.score(X_test_overs, y_test_overs)]})])
    ridge_overs['Average'] = ridge_overs[['Train', 'Test']].mean(axis=1)
    k_overs = ridge_overs.sort_values('Average', ascending=False)['Alpha'].iloc[0]
    next_overs = linear_model.Ridge(alpha=k_overs).fit(X_train_overs, y_train_overs)
    sd_next_overs = stdev(X_train_overs['Overs Bowled'].astype('float'))

    # Predict next runs given
    X_runs_given = player_new[player_new.columns[11:21]]
    y_runs_given = player_new[player_new.columns[24:25]]
    X_train_runs_given, X_test_runs_given, y_train_runs_given, y_test_runs_given = \
        train_test_split(X_runs_given, y_runs_given, random_state=123)
    ridge_runs_given = pd.DataFrame()
    for j in range(0, 101):
        points_runs_given = linear_model.Ridge(alpha=j).fit(
            X_train_runs_given, y_train_runs_given)
        ridge_runs_given = pd.concat([ridge_runs_given, pd.DataFrame(
            {'Alpha': [j],
             'Train': [points_runs_given.score(X_train_runs_given, y_train_runs_given)],
             'Test': [points_runs_given.score(X_test_runs_given, y_test_runs_given)]})])
    ridge_runs_given['Average'] = ridge_runs_given[['Train', 'Test']].mean(axis=1)
    k_runs_given = ridge_runs_given.sort_values(
        'Average', ascending=False)['Alpha'].iloc[0]
    next_runs_given = linear_model.Ridge(alpha=k_runs_given).fit(
        X_train_runs_given, y_train_runs_given)
    sd_next_runs_given = stdev(X_train_runs_given['Runs Given'].astype('float'))

    # Predict next wickets
    X_wkts = player_new[player_new.columns[11:21]]
    y_wkts = player_new[player_new.columns[25:26]]
    X_train_wkts, X_test_wkts, y_train_wkts, y_test_wkts = train_test_split(
        X_wkts, y_wkts, random_state=123)
    ridge_wkts = pd.DataFrame()
    for j in range(0, 101):
        points_wkts = linear_model.Ridge(alpha=j).fit(X_train_wkts, y_train_wkts)
        ridge_wkts = pd.concat([ridge_wkts, pd.DataFrame(
            {'Alpha': [j],
             'Train': [points_wkts.score(X_train_wkts, y_train_wkts)],
             'Test': [points_wkts.score(X_test_wkts, y_test_wkts)]})])
    ridge_wkts['Average'] = ridge_wkts[['Train', 'Test']].mean(axis=1)
    k_wkts = ridge_wkts.sort_values('Average', ascending=False)['Alpha'].iloc[0]
    next_wkts = linear_model.Ridge(alpha=k_wkts).fit(X_train_wkts, y_train_wkts)
    sd_next_wkts = stdev(X_train_wkts['Wickets Taken'].astype('float'))

    # Get the latest match row for the player
    latest = player_data.groupby('Player').tail(1).copy()

    # Predict next runs, balls, overs, runs given, and wickets
    latest['next_runs'] = next_runs.predict(latest[latest.columns[2:11]]).ravel()
    latest['next_balls'] = next_balls.predict(latest[latest.columns[2:11]]).ravel()
    latest['next_overs'] = next_overs.predict(latest[latest.columns[11:21]]).ravel()
    latest['next_runs_given'] = next_runs_given.predict(
        latest[latest.columns[11:21]]).ravel()
    latest['next_wkts'] = next_wkts.predict(latest[latest.columns[11:21]]).ravel()

    # Calculate confidence bounds for each prediction
    z = scipy.stats.norm.ppf(.95)
    latest['next_runs_ll_95'] = \
        latest['next_runs'] - z * sd_next_runs / math.sqrt(len(X_train_runs))
    latest['next_runs_ul_95'] = \
        latest['next_runs'] + z * sd_next_runs / math.sqrt(len(X_train_runs))
    latest['next_balls_ll_95'] = \
        latest['next_balls'] - z * sd_next_balls / math.sqrt(len(X_train_balls))
    latest['next_balls_ul_95'] = \
        latest['next_balls'] + z * sd_next_balls / math.sqrt(len(X_train_balls))
    latest['next_overs_ll_95'] = \
        latest['next_overs'] - z * sd_next_overs / math.sqrt(len(X_train_overs))
    latest['next_overs_ul_95'] = \
        latest['next_overs'] + z * sd_next_overs / math.sqrt(len(X_train_overs))
    latest['next_runs_given_ll_95'] = latest['next_runs_given'] - \
        z * sd_next_runs_given / math.sqrt(len(X_train_runs_given))
    latest['next_runs_given_ul_95'] = latest['next_runs_given'] + \
        z * sd_next_runs_given / math.sqrt(len(X_train_runs_given))
    latest['next_wkts_ll_95'] = \
        latest['next_wkts'] - z * sd_next_wkts / math.sqrt(len(X_train_wkts))
    latest['next_wkts_ul_95'] = \
        latest['next_wkts'] + z * sd_next_wkts / math.sqrt(len(X_train_wkts))

    # Append the latest predictions to the models dataframe
    models = pd.concat([models, latest])
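In isolation, the interval construction above is a normal-approximation band, prediction ± z · sd / √n. The numbers below (pred, sd, n) are made up for illustration. Note that scipy.stats.norm.ppf(.95) is the one-sided 95% quantile (about 1.645); a symmetric two-sided 95% interval would use ppf(.975) instead.

```python
import math
import scipy.stats

# Illustrative values: a predicted score, the spread of the training
# sample, and the training sample size
pred, sd, n = 42.0, 18.0, 25

# One-sided 95% normal quantile, as used in the article's code
z = scipy.stats.norm.ppf(.95)

# Half-width of the band, then the lower and upper bounds
half_width = z * sd / math.sqrt(n)
lower, upper = pred - half_width, pred + half_width
```

With these numbers the band is roughly 42 ± 5.9.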
In this section of the code, we adjust and round the values produced by the models. These adjustments reflect the specific rules of T20 cricket (for example, a bowler may bowl at most four overs) and keep the figures within plausible bounds. Let us walk through each step:
These post-processing steps help in refining the predicted values obtained from the models by aligning them with the constraints and rules of T20 cricket. By making adjustments and rounding the values, we ensure that they are within meaningful ranges and suitable for practical interpretation in the context of the game.
# Adjusting values based on conditions and rounding
# Scaling next_runs_given down to a four-over quota when next_overs exceeds 4
models['next_runs_given'] = np.where(
models['next_overs'] > 4,
models['next_runs_given'] / models['next_overs'] * 4,
models['next_runs_given']
)
models['next_runs_given_ll_95'] = np.where(
models['next_overs'] > 4,
models['next_runs_given_ll_95'] / models['next_overs'] * 4,
models['next_runs_given_ll_95']
)
models['next_runs_given_ul_95'] = np.where(
models['next_overs'] > 4,
models['next_runs_given_ul_95'] / models['next_overs'] * 4,
models['next_runs_given_ul_95']
)
# Limiting next_overs to a maximum of 4
models['next_overs'] = np.where(
models['next_overs'] > 4,
4,
models['next_overs']
)
models['next_overs_ll_95'] = np.where(
models['next_overs_ll_95'] > 4,
4,
models['next_overs_ll_95']
)
models['next_overs_ul_95'] = np.where(
models['next_overs_ul_95'] > 4,
4,
models['next_overs_ul_95']
)
# Zeroing next_runs when the predicted next_balls is negative
models['next_runs'] = np.where(
models['next_balls'] < 0,
0,
models['next_runs']
)
models['next_runs_ll_95'] = np.where(
models['next_balls'] < 0,
0,
models['next_runs_ll_95']
)
models['next_runs_ul_95'] = np.where(
models['next_balls'] < 0,
0,
models['next_runs_ul_95']
)
# Replacing negative next_runs predictions with 1
models['next_runs'] = np.where(
models['next_runs'] < 0,
1,
models['next_runs']
)
models['next_runs_ll_95'] = np.where(
models['next_runs_ll_95'] < 0,
1,
models['next_runs_ll_95']
)
models['next_runs_ul_95'] = np.where(
models['next_runs_ul_95'] < 0,
1,
models['next_runs_ul_95']
)
# Rescaling next_runs when next_balls is implausibly large (> 100)
models['next_runs'] = np.where(
models['next_balls'] > 100,
models['next_runs'] / models['next_balls'] * 5,
models['next_runs']
)
models['next_runs_ll_95'] = np.where(
models['next_balls'] > 100,
models['next_runs_ll_95'] / models['next_balls'] * 5,
models['next_runs_ll_95']
)
models['next_runs_ul_95'] = np.where(
models['next_balls'] > 100,
models['next_runs_ul_95'] / models['next_balls'] * 5,
models['next_runs_ul_95']
)
# Replacing implausibly large next_balls (> 100) with 5
models['next_balls'] = np.where(
models['next_balls'] > 100,
5,
models['next_balls']
)
models['next_balls_ll_95'] = np.where(
models['next_balls_ll_95'] > 100,
5,
models['next_balls_ll_95']
)
models['next_balls_ul_95'] = np.where(
models['next_balls_ul_95'] > 100,
5,
models['next_balls_ul_95']
)
# Replacing negative next_balls predictions with 1
models['next_balls'] = np.where(
models['next_balls'] < 0,
1,
models['next_balls']
)
models['next_balls_ll_95'] = np.where(
models['next_balls_ll_95'] < 0,
1,
models['next_balls_ll_95']
)
models['next_balls_ul_95'] = np.where(
models['next_balls_ul_95'] < 0,
1,
models['next_balls_ul_95']
)
# Replacing negative next_wkts predictions with 1
models['next_wkts'] = np.where(
models['next_wkts'] < 0,
1,
models['next_wkts']
)
models['next_wkts_ll_95'] = np.where(
models['next_wkts_ll_95'] < 0,
1,
models['next_wkts_ll_95']
)
models['next_wkts_ul_95'] = np.where(
models['next_wkts_ul_95'] < 0,
1,
models['next_wkts_ul_95']
)
# Rounding values to 0 decimal places
models['next_runs'] = round(models['next_runs'], 0)
models['next_runs_ll_95'] = round(models['next_runs_ll_95'], 0)
models['next_runs_ul_95'] = round(models['next_runs_ul_95'], 0)
models['next_balls'] = round(models['next_balls'], 0)
models['next_balls_ll_95'] = round(models['next_balls_ll_95'], 0)
models['next_balls_ul_95'] = round(models['next_balls_ul_95'], 0)
models['next_wkts'] = round(models['next_wkts'], 0)
models['next_wkts_ll_95'] = round(models['next_wkts_ll_95'], 0)
models['next_wkts_ul_95'] = round(models['next_wkts_ul_95'], 0)
models['next_runs_given'] = round(models['next_runs_given'], 0)
models['next_runs_given_ll_95'] = round(models['next_runs_given_ll_95'], 0)
models['next_runs_given_ul_95'] = round(models['next_runs_given_ul_95'], 0)
models['next_overs'] = round(models['next_overs'], 0)
models['next_overs_ll_95'] = round(models['next_overs_ll_95'], 0)
models['next_overs_ul_95'] = round(models['next_overs_ul_95'], 0)
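As a design note, much of the repeated np.where bounding above can also be expressed with pandas clip(), which caps or floors a column in one call. A minimal sketch on toy values (the column names follow the article; the numbers are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical predictions for two bowlers
preds = pd.DataFrame({
    "next_overs": [2.7, 5.3],
    "next_wkts": [-0.4, 1.8],
})

# Cap overs at the T20 maximum of four per bowler
preds["next_overs"] = preds["next_overs"].clip(upper=4)

# Replace negative wicket predictions with 1, as in the article
preds["next_wkts"] = np.where(preds["next_wkts"] < 0, 1, preds["next_wkts"])

# Round to whole numbers
preds = preds.round(0)
```

clip() keeps the intent ("stay within this range") visible at a glance, while np.where is needed for replacements that are not simple caps or floors, such as swapping negatives for 1.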
The resulting ‘models’ DataFrame with the predicted values looks as follows:
While the predictive model described in this article provides valuable insights into T20 cricket, its limitations must be acknowledged. These limitations stem from the nature of the model and the data used for training and prediction, and understanding them is essential to ensure that the model’s predictions are correctly interpreted and applied.
1. Dependence on Historical Data: The model’s training and predictions depend heavily on historical data, so its accuracy and dependability hinge on the quality, quantity, and relevance of that data. Changes in team composition, player form, pitch conditions, or match dynamics over time can significantly impair the model’s ability to predict outcomes accurately. The model must therefore be routinely updated with the most recent data to remain applicable.
2. Sensitivity to Playing Conditions: T20 cricket is played across a variety of stadiums, pitches, weather conditions, and tournaments, and the model may not capture the nuances of every specific setting, leading to variation in its predictions. Factors such as humidity, pitch deterioration, and ground dimensions can significantly affect match outcomes but may not be adequately represented in the model. Its predictions should therefore be weighed alongside contextual factors and expert opinion.
In this article, we explored developing and applying a predictive model for T20 cricket. By leveraging historical match data and using advanced machine learning techniques, we demonstrated the potential of such a model to predict player performance and provide valuable insights into the game. As we conclude, let’s summarize the key learnings from this endeavour:
Q1. What are the three types of predictive models?
A. The three types of predictive models are classification models, regression models, and clustering models. Classification predicts categorical outcomes, regression predicts numerical values, and clustering identifies patterns or groups in data.
Q2. What are the two main families of predictive models?
A. The two main families of predictive models are machine learning models and statistical models. Machine learning models use algorithms to learn patterns from data, while statistical models are based on mathematical equations and assumptions.
Q3. What is predictive modeling used for?
A. Predictive modeling is used to make predictions or forecasts about future events or outcomes based on historical data and patterns. It is applied in fields such as finance, healthcare, marketing, weather forecasting, and risk analysis.
Q4. What are the main predictive modeling techniques?
A. There are several predictive modeling techniques, including decision trees, random forests, neural networks, support vector machines, logistic regression, time series analysis, and ensemble methods. The choice of technique depends on the specific problem, the data characteristics, and the desired outcomes.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.