End-to-End Case Study: Bike Sharing Demand Prediction

kajal Last Updated : 13 Jun, 2023
11 min read

Introduction

Bike-sharing demand analysis refers to the study of factors that impact the usage of bike-sharing services and the demand for bikes at different times and locations. The purpose of this analysis is to understand the patterns and trends in bike usage and make predictions about future demand. This post will examine how statistical machine-learning methods can analyze the given data.

This article will use a small subset of this dataset and just focus on the functionality. Please note that the chances of inaccuracy are high for such a small subset of the dataset. Feel free to use the complete dataset for your analysis.

Learning Objectives:

  • Accurately predict the number of bike rentals for a given time period and location based on historical data and other relevant factors.
  • Identify and analyze the key factors that influence bike rental demand, like weather conditions, holidays, and events.
  • Develop and evaluate predictive models that can effectively forecast bike rental demand using techniques like regression analysis, time series analysis, and machine learning algorithms.
  • Use the forecasting results to optimize bike inventory and resources, ensuring that bike-sharing companies can meet customer demand and maximize revenue.
  • Continuously monitor and evaluate the forecasting accuracy, refine the models, and improve accuracy and reliability.

Dataset on Kaggle: https://www.kaggle.com/c/bike-sharing-demand

This article was published as a part of the Data Science Blogathon.

What is Bike Sharing Demand Forecasting?

Bike-sharing demand forecasting aims to provide bike-sharing companies with the insights and tools they need to make data-driven decisions and effectively manage their operations.

Factors often considered during bike sharing demand analysis include weather conditions, seasonality, day of the week, holiday periods, and events. Demographic information about users, like age, gender, and income. It can be used to understand usage patterns.

Methods used in bike-sharing demand analysis include statistical models like time-series analysis, regression analysis, and machine learning algorithms. Bike-sharing companies can use the analysis results to optimize their operations, distribution, pricing strategies, and marketing campaigns. Additionally, the findings can inform city planners in developing bike-friendly infrastructure and policies.

Why Bike Sharing System?

Bike-sharing systems have become increasingly popular in recent years due to their many benefits, which include:

  1. Affordable and Sustainable Transportation: Bike-sharing systems provide an affordable and sustainable mode of transportation, especially for short trips. They are a low-cost alternative to owning a personal bike and can help reduce reliance on private cars and sharing cars, which can positively impact the environment.
  2. Health and comfort: Bike-sharing systems promote physical action and exercise, positively impacting health and comfort. Regular cycling can help reduce the risk of heart disease, stroke, and other chronic diseases.
  3. Convenience: Bike-sharing systems are often located in densely populated city areas, making them a convenient mode of transportation for short trips. They can be easily accessed, making them a flexible and convenient option for commuters and tourists alike.
  4. Reduced Traffic Congestion: Bike-sharing systems can help reduce traffic congestion by providing an alternative mode of transportation for short trips. This can have a positive impact on town mobility.

In summary, bike-sharing systems provide multiple benefits, including affordable and sustainable transportation, health and comfort, convenience, reduced traffic congestion, and tourism and economic development. These benefits have contributed to the popularity of bike-sharing systems in many cities around the world.

Problem Statement

The problem statement for bike-sharing demand is predicting the number of bikes that will be rented from a bike-sharing system at a given time based on factors such as weather, day of the week, and time of day. The purpose is to build a predictive model that can accurately forecast bike rental demand to optimize bike allocation and improve the bike-sharing system’s overall efficiency.

The problem statement may involve answering specific questions such as:

  • What is the expected demand for bikes during peak hours, weekdays, or weekends?
  • How does weather (e.g., wind, temperature, precipitation) affect bike rental demand?
  • Are any specific locations or routes with higher or lower demand for bikes?
  • How can we optimize the bike-sharing system to meet fluctuating demand and minimize operational costs?
  • Can the bike-sharing system expand or improve to better serve users’ needs and promote sustainable transportation?

The problem statement for bike-sharing demand analysis typically involves predicting bike rental demand and optimizing bike allocation to improve the bike-sharing system’s efficiency and sustainability.

The company management wants:

  • To create a model of the demand for shared bikes with the available independent variables.
  • To understand the demand dynamics of the market using the model.

Reading and Understanding the Data

To build a bike-sharing demand forecasting model, it’s important to start by reading and understanding the data. The key steps involved in this process are loading, exploring, cleaning, preprocessing, and visualizing the data. By following these steps, analysts can gain a deeper understanding of the data and identify any issues that need addressing before building the bike-sharing demand forecasting model. This helps ensure the model is accurate and reliable, which is essential for optimizing bike-sharing operations.

import pandas as pd

bikeshare_df = pd.read_csv("day.csv")

print(bikeshare_df.head())
loading the dataset | bike-sharing demand analysis & prediction
bike_sharing.info()
dataset to test the analysis and prediction of bike rental services
bike_sharing.describe()
bike-sharing demand prediction dataset

Visualizing the Data

Visualizing the data is an important step in the bike-sharing demand forecasting process. It can help identify patterns and trends that may not be immediately apparent from raw data.

import matplotlib.pyplot as plt
import seaborn as sns


#Plotting pairplot of all the numeric variables

sns.pairplot(bike_sharing[["temp","atemp","hum","windspeed","casual","registered","cnt"]])
plt.show()

plotting pairplot of all the numeric variables
#Plotting box plot of continuous variables

plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
plt.boxplot(bike_sharing["temp"])
plt.subplot(2,3,2)
plt.boxplot(bike_sharing["atemp"])
plt.subplot(2,3,3)
plt.boxplot(bike_sharing["hum"])
plt.subplot(2,3,4)
plt.boxplot(bike_sharing["windspeed"])
plt.subplot(2,3,5)
plt.boxplot(bike_sharing["casual"])
plt.subplot(2,3,6)
plt.boxplot(bike_sharing["registered"])
plt.show()
plotting box plots of continuous variables

Visualising Categorical Variables

#Plotting box plot of categorical variables

plt.figure(figsize=(20, 12))
plt.subplot(3,3,1)
sns.boxplot(x = 'season', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,2)
sns.boxplot(x = 'yr', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,3)
sns.boxplot(x = 'mnth', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,4)
sns.boxplot(x = 'holiday', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,5)
sns.boxplot(x = 'weekday', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,6)
sns.boxplot(x = 'workingday', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,7)
sns.boxplot(x = 'weathersit', y = 'cnt', data = bike_sharing)
plt.show()
Plotting box plot of categorical variables

Data Preparation

Data preparation is a crucial step in bike-sharing demand forecasting, as it involves cleaning, transforming, and organizing the data to make it suitable for analysis. By preparing the data in this way, analysts can ensure that the data is suitable for analysis and that any biases or errors in the data are addressed. This can lead to more accurate and reliable forecasting models that can help bike-sharing companies optimize their operations and better meet customer demand.

Dropping unnecessary columns instant, dteday, casual & registered

  • instant – It is just a sequence number of rows
  • dteday – It is not required since columns for year & month already exists
  • casual – This variable cannot be predicted.
  • registered – This variable cannot be predicted.
bike_sharing.drop(columns=["instant","dteday","casual","registered"],axis=1,inplace =True)
bike_sharing.head()
data preparation | bike-sharing demand analysis & prediction

Dummy Variables

season_type = pd.get_dummies(bike_sharing['season'], drop_first = True)
season_type.rename(columns={2:"season_summer", 3:"season_fall", 4:"season_winter"},inplace=True)
season_type.head()
dummy variables | bike-sharing demand analysis & prediction
weather_type = pd.get_dummies(bike_sharing['weathersit'], drop_first = True)
weather_type.rename(columns={2:"weather_mist_cloud", 3:"weather_light_snow_rain"},inplace=True)
weather_type.head()
weather data | bike-sharing demand analysis & prediction
#Concatenating new dummy variables to the main dataframe

bike_sharing = pd.concat([bike_sharing, season_type, weather_type], axis = 1)

#Dropping columns season & weathersit since we have already created dummies for them

bike_sharing.drop(columns=["season", "weathersit"],axis=1,inplace =True)

#Analysing dataframe after dropping columns

bike_sharing.info()
"

Creating derived variables for the categorical variable month

#Creating year_quarter derived columns from month columns.
#Note that last quarter has not been created since we need only 3 columns to define the four quarters.

bike_sharing["Quarter_JanFebMar"] = bike_sharing["mnth"].apply(lambda x: 1 if x<=3 else 0)
bike_sharing["Quarter_AprMayJun"] = bike_sharing["mnth"].apply(lambda x: 1 if 4<=x<=6 else 0)
bike_sharing["Quarter_JulAugSep"] = bike_sharing["mnth"].apply(lambda x: 1 if 7<=x<=9 else 0)

#Dropping column mnth since we have already created dummies.

bike_sharing.drop(columns=["mnth"],axis=1,inplace =True)
bike_sharing["weekend"] = bike_sharing["weekday"].apply(lambda x: 0 if 1<=x<=5 else 1)
bike_sharing.drop(columns=["weekday"],axis=1,inplace =True)
bike_sharing.drop(columns=["workingday"],axis=1,inplace =True)
bike_sharing.head()

monthly weather variables | bike-sharing demand analysis & prediction
#Analysing dataframe after dropping columns weekday & workingday

bike_sharing.info()
analyzing data frame | bike-sharing demand analysis & prediction
#Plotting correlation heatmap to analyze the linearity between the variables in the dataframe 

plt.figure(figsize = (16, 10))
sns.heatmap(bike_sharing.corr(), annot = True, cmap="Greens")
plt.show()
heatmap plotting | bike-sharing demand analysis & prediction
#Dropping column temp since it is very highly collinear with the column atemp.
#Further,the column atemp is more appropriate for modelling compared to column temp from human perspective.

bike_sharing.drop(columns=["temp"],axis=1,inplace =True)
bike_sharing.head()
seasonal data | bike-sharing demand analysis & prediction

Splitting the Data into Training and Testing Sets

Splitting the data into training and testing sets is a critical step in bike-sharing demand forecasting. It enables analysts to evaluate the performance of their forecasting models on unseen data. The general approach is to use historical data to train the model and then test the model’s performance on a separate, holdout set of data.

#Importing library

from sklearn.model_selection import train_test_split

# We specify this so that the train and test data set always have the same rows, respectively
np.random.seed(0)
bike_sharing_train, bike_sharing_test = train_test_split(bike_sharing, train_size = 0.7, test_size = 0.3, random_state = 100)

Rescaling the training dataframe using the MinMax scaling function after the split to achieve optimum beta coefficients for all features.

#importing library
from sklearn.preprocessing import MinMaxScaler

#assigning variable to scaler
scaler = MinMaxScaler()
# Applying scaler to all the columns except the derived and 'dummy' variables that are already in 0 & 1.

numeric_var = ['atemp','hum','windspeed','cnt']
bike_sharing_train[numeric_var] = scaler.fit_transform(bike_sharing_train[numeric_var])

# Analysing the train dataframe after scaling
bike_sharing_train.head()
sample data frame | bike-sharing demand analysis & prediction

By splitting the data into training and testing sets, analysts can evaluate the performance of their forecasting models on unseen data and ensure that the models are robust and reliable. This can help bike-sharing companies optimize their operations and better meet customer demand.

y_train = bike_sharing_train.pop('cnt')
X_train = bike_sharing_train

print (y_train.head())
print (X_train.head())
training dataset | bike-sharing demand analysis & prediction

Building the Linear Model

Building a linear model for bike-sharing demand forecasting involves creating a model that uses linear regression to predict bike rental demand based on a set of input variables. The linear regression model is trained using the training set, with the input variables used to predict the target variable (bike rental demand). The model is optimized to minimize the error between the predicted and actual demands in the training set.

Using the LinearRegression function from SciKit Learn and Recursive Feature Elimination (RFE):


# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
# Running RFE with the output number of the variable equal to 12
lm = LinearRegression()
lm.fit(X_train, y_train)

rfe = RFE(lm, 12)             # running RFE
rfe = rfe.fit(X_train, y_train)

list(zip(X_train.columns,rfe.support_,rfe.ranking_))
list of columns for model building

By building a linear model for bike-sharing demand forecasting, analysts can develop a simple yet effective forecasting system to optimize bike-sharing operations and improve customer satisfaction. However, it’s important to note that linear models may have limitations in capturing more complex patterns and relationships in the data, so other modeling techniques (such as decision trees or neural networks) can be more accurate predictions.

# Creating X_test dataframe with RFE selected variables
X_train_rfe = X_train[columns_rfe]
X_train_rfe
dataset for building the model | bike-sharing demand analysis & prediction

Residual Analysis of the Training Data

Residual analysis is an essential step in evaluating the performance of a linear model for bike-sharing demand forecasting. Residuals are the difference between the predicted demand and the actual demand, and analyzing these residuals can help identify any patterns or biases in the model’s predictions.

#using the final model lr5 on train data to predict y_train_cnt values
y_train_cnt = lr5.predict(X_train_lr5)
# Plotting the histogram of the error terms
fig = plt.figure()
sns.distplot((y_train - y_train_cnt), bins = 20)
fig.suptitle('Error Terms', fontsize = 20)                   
plt.xlabel('Errors', fontsize = 18)
error terms
plt.scatter(y_train,(y_train - y_train_cnt))
plt.show()
residual analysis of training data

Making Predictions Using the Final Model lr5

To make predictions using the final linear model for bike-sharing demand forecasting (lr5), you will need to provide values for the input variables and use the model to generate a prediction for the target variable (bike rental demand).

#Applying the scaling on the test sets

numeric_vars = ['atemp','hum','windspeed','cnt']
bike_sharing_test[numeric_vars] = scaler.transform(bike_sharing_test[numeric_vars])

bike_sharing_test.describe()
bike-sharing test data | bike-sharing demand analysis & prediction

Dividing into X_test and y_test

y_test = bike_sharing_test.pop('cnt')
X_test = bike_sharing_test

# Adding constant variable to test dataframe
X_test_lr5 = sm.add_constant(X_test)

# Updating X_test_lr5 dataframe by dropping the variables as analyzed from the above models

X_test_lr5 =X_test_lr5.drop(["atemp", "hum", "season_fall", "Quarter_AprMayJun", "weekend","Quarter_JanFebMar"], axis = 1)

# Making predictions using the fifth model

y_pred = lr5.predict(X_test_lr5)

Model Evaluation

Model evaluation is a critical step in assessing the performance of a bike-sharing demand forecasting model. Use various metrics to evaluate the performance of a model, including mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R-squared).

# Plotting y_test and y_pred to understand the spread

fig = plt.figure()
plt.scatter(y_test, y_pred)
fig.suptitle('y_test vs y_pred', fontsize = 20)               
plt.xlabel('y_test', fontsize = 18)                          

plt.ylabel('y_pred', fontsize = 16)
bike-sharing demand analysis & prediction model evaluation

You should evaluate the model’s performance using metrics such as MAE, RMSE, and R-squared. MAE and RMSE measure the average magnitude of the errors between the predicted and actual values. R-squared measures the proportion of variance in the target variable, explained by the input variables.

#importing library and checking mean squared error
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print('Mean_Squared_Error :' ,mse)

#importing library and checking R2

from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
"
"

Conclusion

This study aimed to improve the bike-sharing activities of Capital Bikeshare and support the reinvention of the city transportation system. This comprehensive exploratory data analysis on their publicly available data helped us understand and analyze the underlying patterns and characteristics of the bike-share network and to work on this data to achieve data-driven results.

We performed an analysis on the growth in popularity of bike-share over the two years, 2011–2012, and the effect of the seasonal and day factors on the ridership patterns. The impacts of seasonal and weather parameters were to understand the ridership pattern in Washington, DC. Analysis of the trip data helped to understand the characteristics of the locality where the stations are located.

Keeping these inferences in mind, we could suggest the following recommendations:

  1. Most of the rentals are for commuting to workshops and colleges on a daily basis. So CaBi should launch more stations near these landmarks to reach out to their main customers.
  2. Planning for more sharing bikes to stations must consider the peak rental hours, i.e., 7–9 am and 5–6 pm.
  3. The offer should not be a fixed price. Instead, it should be based on seasonal variations to promote bike usage during the fall and winter seasons.
  4. Data about the most used routes can help build roads/lanes dedicated to bikes specifically.
  5. Due to the low usage of bikes at night, it would be better to do bike maintenance at night. Removing some bikes from the streets at night time will not cause trouble for the customers.
  6. Converting registered customers to casual customers on the weekends by providing them with discounts and coupons.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Frequently Asked Questions

Q1. What is bike sharing demand prediction?

A. Bike sharing demand prediction refers to the process of forecasting the number of bicycles that will be rented within a specific time period, aiding in resource allocation and system optimization.

Q2. What is the trend of bike share?

A. The trend of bike share is experiencing steady growth worldwide, with an increasing number of cities implementing bike sharing programs to promote sustainable transportation and reduce traffic congestion.

Q3. Is bike share profitable?

A. The profitability of bike share systems can vary depending on factors such as user demand, operational costs, pricing strategies, and partnerships with local businesses. Careful planning and efficient management are crucial for long-term profitability.

Q4. Why is bike sharing popular?

A. Bike sharing is popular for several reasons. It offers a convenient and flexible mode of transportation, promotes physical activity and health, reduces carbon emissions, alleviates parking congestion, and provides an affordable alternative for short-distance travel in urban areas.

Hi, I am Kajal Kumari. have completed my Master’s from IIT(ISM) Dhanbad in Computer Science & Engineering. As of now, I am working as Machine Learning Engineer in Hyderabad.
hope that you have enjoyed the article. If you like it, share it with your friends also. Please feel free to comment if you have any thoughts that can improve my article writing.

If you want to read my previous blogs, you can read Previous Data Science Blog posts here. Connect with me

Responses From Readers

Congratulations, You Did It!
Well Done on Completing Your Learning Journey. Stay curious and keep exploring!

We use cookies essential for this site to function well. Please click to help us improve its usefulness with additional cookies. Learn about our use of cookies in our Privacy Policy & Cookies Policy.

Show details