Bike-sharing demand analysis refers to the study of factors that impact the usage of bike-sharing services and the demand for bikes at different times and locations. The purpose of this analysis is to understand the patterns and trends in bike usage and make predictions about future demand. This post will examine how statistical machine-learning methods can analyze the given data.
This article will use a small subset of this dataset and just focus on the functionality. Please note that the chances of inaccuracy are high for such a small subset of the dataset. Feel free to use the complete dataset for your analysis.
Learning Objectives:
Dataset on Kaggle: https://www.kaggle.com/c/bike-sharing-demand
This article was published as a part of the Data Science Blogathon.
Bike-sharing demand forecasting aims to provide bike-sharing companies with the insights and tools they need to make data-driven decisions and effectively manage their operations.
Factors often considered during bike-sharing demand analysis include weather conditions, seasonality, day of the week, holiday periods, and events. Demographic information about users, such as age, gender, and income, can also be used to understand usage patterns.
Methods used in bike-sharing demand analysis include statistical models like time-series analysis, regression analysis, and machine learning algorithms. Bike-sharing companies can use the analysis results to optimize their operations, distribution, pricing strategies, and marketing campaigns. Additionally, the findings can inform city planners in developing bike-friendly infrastructure and policies.
Bike-sharing systems have become increasingly popular in recent years due to their many benefits, which include:
In summary, bike-sharing systems provide multiple benefits, including affordable and sustainable transportation, health and comfort, convenience, reduced traffic congestion, and tourism and economic development. These benefits have contributed to the popularity of bike-sharing systems in many cities around the world.
The problem statement for bike-sharing demand is predicting the number of bikes that will be rented from a bike-sharing system at a given time based on factors such as weather, day of the week, and time of day. The purpose is to build a predictive model that can accurately forecast bike rental demand to optimize bike allocation and improve the bike-sharing system’s overall efficiency.
The problem statement may involve answering specific questions such as:
The problem statement for bike-sharing demand analysis typically involves predicting bike rental demand and optimizing bike allocation to improve the bike-sharing system’s efficiency and sustainability.
The company management wants:
To build a bike-sharing demand forecasting model, it’s important to start by reading and understanding the data. The key steps involved in this process are loading, exploring, cleaning, preprocessing, and visualizing the data. By following these steps, analysts can gain a deeper understanding of the data and identify any issues that need addressing before building the bike-sharing demand forecasting model. This helps ensure the model is accurate and reliable, which is essential for optimizing bike-sharing operations.
import pandas as pd
# Loading the daily bike-sharing data into a dataframe
bike_sharing = pd.read_csv("day.csv")
print(bike_sharing.head())
bike_sharing.info()
print(bike_sharing.describe())
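Beyond `info()` and `describe()`, it is often worth checking for missing values and duplicate rows before plotting. The sketch below runs on a tiny hand-made frame (the values are made up for illustration, and `summarize_quality` is a hypothetical helper, not part of the original analysis); the same calls should apply to the full dataframe.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame that mimics a few columns of day.csv (illustrative values only)
sample = pd.DataFrame({
    "season": [1, 1, 2, 2, 2],
    "temp": [0.34, 0.36, np.nan, 0.52, 0.52],
    "cnt": [985, 801, 1349, 1562, 1562],
})

def summarize_quality(df):
    """Return missing-value counts per column and the number of duplicated rows."""
    missing = df.isnull().sum()
    duplicates = int(df.duplicated().sum())
    return missing, duplicates

missing_counts, duplicate_rows = summarize_quality(sample)
print(missing_counts)
print("Duplicate rows:", duplicate_rows)
```

If either check reports a non-zero count on the real data, the affected rows would need to be imputed or dropped before modelling.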
Visualizing the data is an important step in the bike-sharing demand forecasting process. It can help identify patterns and trends that may not be immediately apparent from raw data.
import matplotlib.pyplot as plt
import seaborn as sns
#Plotting pairplot of all the numeric variables
sns.pairplot(bike_sharing[["temp","atemp","hum","windspeed","casual","registered","cnt"]])
plt.show()
#Plotting box plot of continuous variables
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
plt.boxplot(bike_sharing["temp"])
plt.subplot(2,3,2)
plt.boxplot(bike_sharing["atemp"])
plt.subplot(2,3,3)
plt.boxplot(bike_sharing["hum"])
plt.subplot(2,3,4)
plt.boxplot(bike_sharing["windspeed"])
plt.subplot(2,3,5)
plt.boxplot(bike_sharing["casual"])
plt.subplot(2,3,6)
plt.boxplot(bike_sharing["registered"])
plt.show()
#Plotting box plot of categorical variables
plt.figure(figsize=(20, 12))
plt.subplot(3,3,1)
sns.boxplot(x = 'season', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,2)
sns.boxplot(x = 'yr', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,3)
sns.boxplot(x = 'mnth', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,4)
sns.boxplot(x = 'holiday', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,5)
sns.boxplot(x = 'weekday', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,6)
sns.boxplot(x = 'workingday', y = 'cnt', data = bike_sharing)
plt.subplot(3,3,7)
sns.boxplot(x = 'weathersit', y = 'cnt', data = bike_sharing)
plt.show()
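The box plots of the categorical variables can be complemented with a numeric summary of the same groups. This is a minimal sketch on made-up numbers, assuming a `season` column and a `cnt` target as in the dataset; it only illustrates the `groupby` pattern and does not reproduce the article's results.

```python
import pandas as pd

# Made-up daily counts for three seasons (illustrative values only)
sample = pd.DataFrame({
    "season": [1, 1, 2, 2, 3, 3],
    "cnt": [1000, 1200, 3000, 3400, 5000, 5200],
})

# Median and mean rentals per season: the numbers behind a box plot's center line
season_summary = sample.groupby("season")["cnt"].agg(["median", "mean"])
print(season_summary)
```

Swapping `"season"` for another categorical column such as `"weathersit"` gives the matching summary for that box plot.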
Data preparation is a crucial step in bike-sharing demand forecasting, as it involves cleaning, transforming, and organizing the data to make it suitable for analysis. By preparing the data in this way, analysts can ensure that the data is suitable for analysis and that any biases or errors in the data are addressed. This can lead to more accurate and reliable forecasting models that can help bike-sharing companies optimize their operations and better meet customer demand.
Dropping unnecessary columns instant, dteday, casual & registered
bike_sharing.drop(columns=["instant","dteday","casual","registered"],axis=1,inplace =True)
bike_sharing.head()
Dummy Variables
season_type = pd.get_dummies(bike_sharing['season'], drop_first = True)
season_type.rename(columns={2:"season_summer", 3:"season_fall", 4:"season_winter"},inplace=True)
season_type.head()
weather_type = pd.get_dummies(bike_sharing['weathersit'], drop_first = True)
weather_type.rename(columns={2:"weather_mist_cloud", 3:"weather_light_snow_rain"},inplace=True)
weather_type.head()
#Concatenating new dummy variables to the main dataframe
bike_sharing = pd.concat([bike_sharing, season_type, weather_type], axis = 1)
#Dropping columns season & weathersit since we have already created dummies for them
bike_sharing.drop(columns=["season", "weathersit"],axis=1,inplace =True)
#Analysing dataframe after dropping columns
bike_sharing.info()
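To see what `drop_first = True` does, here is a small sketch on a made-up season column. The `prefix` and `dtype` arguments are an alternative to renaming the columns by hand; using them here is an assumption for illustration, not the approach taken above.

```python
import pandas as pd

# Made-up season codes 1..4 (illustrative values only)
season = pd.Series([1, 2, 3, 4, 1], name="season")

# drop_first=True removes the column for the first category (season 1),
# so a row of all zeros means "season 1"
dummies = pd.get_dummies(season, prefix="season", drop_first=True, dtype=int)
print(dummies)
```

Three columns are enough to encode four categories, which avoids a perfectly collinear set of dummy variables in the regression.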
Creating derived variables for the categorical variable month
#Creating year_quarter derived columns from month columns.
#Note that last quarter has not been created since we need only 3 columns to define the four quarters.
bike_sharing["Quarter_JanFebMar"] = bike_sharing["mnth"].apply(lambda x: 1 if x<=3 else 0)
bike_sharing["Quarter_AprMayJun"] = bike_sharing["mnth"].apply(lambda x: 1 if 4<=x<=6 else 0)
bike_sharing["Quarter_JulAugSep"] = bike_sharing["mnth"].apply(lambda x: 1 if 7<=x<=9 else 0)
#Dropping column mnth since we have already created dummies.
bike_sharing.drop(columns=["mnth"],axis=1,inplace =True)
bike_sharing["weekend"] = bike_sharing["weekday"].apply(lambda x: 0 if 1<=x<=5 else 1)
bike_sharing.drop(columns=["weekday"],axis=1,inplace =True)
bike_sharing.drop(columns=["workingday"],axis=1,inplace =True)
bike_sharing.head()
#Analysing dataframe after dropping columns weekday & workingday
bike_sharing.info()
#Plotting correlation heatmap to analyze the linearity between the variables in the dataframe
plt.figure(figsize = (16, 10))
sns.heatmap(bike_sharing.corr(), annot = True, cmap="Greens")
plt.show()
#Dropping column temp since it is very highly collinear with the column atemp.
#Further,the column atemp is more appropriate for modelling compared to column temp from human perspective.
bike_sharing.drop(columns=["temp"],axis=1,inplace =True)
bike_sharing.head()
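The heatmap shows pairwise correlation only. A common complementary check is the variance inflation factor (VIF), which measures how well each feature is explained by all the others together. The sketch below computes VIF from the diagonal of the inverse correlation matrix on synthetic data; the column names mirror the dataset, but the values are generated, and `vif_table` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
temp = rng.uniform(0, 1, n)

# Synthetic features: atemp is nearly a copy of temp, windspeed is unrelated
features = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 0.02, n),
    "windspeed": rng.uniform(0, 1, n),
})

def vif_table(df):
    """VIF per column, read from the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=df.columns, name="VIF")

vif = vif_table(features)
print(vif.round(2))
```

In this synthetic example the two temperature columns get very large VIF values while `windspeed` stays close to 1, which is the kind of pattern that would justify dropping one of the temperature columns.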
Splitting the data into training and testing sets is a critical step in bike-sharing demand forecasting. It enables analysts to evaluate the performance of their forecasting models on unseen data. The general approach is to use historical data to train the model and then test the model’s performance on a separate, holdout set of data.
#Importing libraries
import numpy as np
from sklearn.model_selection import train_test_split
# We fix the seed and random_state so that the train and test data sets always have the same rows, respectively
np.random.seed(0)
bike_sharing_train, bike_sharing_test = train_test_split(bike_sharing, train_size = 0.7, test_size = 0.3, random_state = 100)
After the split, rescale the numeric columns of the training dataframe with MinMax scaling so that all features share the same 0-to-1 range and their beta coefficients are comparable. The scaler is fitted on the training data only.
#importing library
from sklearn.preprocessing import MinMaxScaler
#assigning variable to scaler
scaler = MinMaxScaler()
# Applying scaler to all the columns except the derived and 'dummy' variables that are already in 0 & 1.
numeric_var = ['atemp','hum','windspeed','cnt']
bike_sharing_train[numeric_var] = scaler.fit_transform(bike_sharing_train[numeric_var])
# Analysing the train dataframe after scaling
bike_sharing_train.head()
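The reason for fitting the scaler on the training set only can be seen in a small sketch. The numbers below are made up; the point is that `transform` reuses the minimum and maximum learned from the training data, so test values may fall outside the 0-to-1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data (illustrative values only)
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10 and max=30 from train only
test_scaled = scaler.transform(test)        # reuses the train min/max, no refitting

print(train_scaled.ravel())
print(test_scaled.ravel())
```

Values below 0 or above 1 in the scaled test set are expected and simply mean the test data contains observations outside the range seen during training.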
By splitting the data into training and testing sets, analysts can evaluate the performance of their forecasting models on unseen data and ensure that the models are robust and reliable. This can help bike-sharing companies optimize their operations and better meet customer demand.
y_train = bike_sharing_train.pop('cnt')
X_train = bike_sharing_train
print (y_train.head())
print (X_train.head())
Building a linear model for bike-sharing demand forecasting involves creating a model that uses linear regression to predict bike rental demand based on a set of input variables. The linear regression model is trained using the training set, with the input variables used to predict the target variable (bike rental demand). The model is optimized to minimize the error between the predicted and actual demands in the training set.
Using the LinearRegression function from SciKit Learn and Recursive Feature Elimination (RFE):
# Importing RFE and LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
# Running RFE with the number of output variables equal to 12
lm = LinearRegression()
rfe = RFE(lm, n_features_to_select=12) # running RFE
rfe = rfe.fit(X_train, y_train)
list(zip(X_train.columns,rfe.support_,rfe.ranking_))
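To make the behaviour of RFE easier to see, here is a minimal sketch on synthetic data in which only two of five features drive the target. The feature names `f0` to `f4` and the coefficients are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)

# Synthetic data: only f0 and f2 influence the target
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=["f0", "f1", "f2", "f3", "f4"])
y = 5.0 * X["f0"] - 4.0 * X["f2"] + rng.normal(0, 0.1, 300)

# RFE repeatedly fits the model and removes the weakest feature until 2 remain
selector = RFE(LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

selected = list(X.columns[selector.support_])
print(selected)
print(dict(zip(X.columns, selector.ranking_)))
```

Features with `ranking_` equal to 1 are the ones kept; higher ranks were eliminated earlier.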
By building a linear model for bike-sharing demand forecasting, analysts can develop a simple yet effective forecasting system to optimize bike-sharing operations and improve customer satisfaction. However, it’s important to note that linear models may have limitations in capturing more complex patterns and relationships in the data, so other modeling techniques (such as decision trees or neural networks) may produce more accurate predictions.
# Creating X_train dataframe with RFE selected variables
columns_rfe = X_train.columns[rfe.support_]
X_train_rfe = X_train[columns_rfe]
X_train_rfe
Residual analysis is an essential step in evaluating the performance of a linear model for bike-sharing demand forecasting. Residuals are the difference between the predicted demand and the actual demand, and analyzing these residuals can help identify any patterns or biases in the model’s predictions.
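#using the final model lr5 on train data to predict y_train_cnt values
y_train_cnt = lr5.predict(X_train_lr5)
# Plotting the histogram of the error terms
fig = plt.figure()
sns.histplot((y_train - y_train_cnt), bins = 20, kde = True)
fig.suptitle('Error Terms', fontsize = 20)
plt.xlabel('Errors', fontsize = 18)
plt.show()
# Plotting the error terms against the actual values in a separate figure
plt.figure()
plt.scatter(y_train,(y_train - y_train_cnt))
plt.xlabel('y_train', fontsize = 18)
plt.ylabel('Errors', fontsize = 18)
plt.show()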
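The plots can be backed by two simple numeric checks. The sketch below fits a least squares line with NumPy on synthetic data and computes the mean of the residuals and their correlation with the fitted values. For an OLS model with an intercept, both are zero on the training data by construction, so this mainly serves as a sanity check that the residuals were computed against the right predictions.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 150

# Synthetic data with a linear relationship plus noise
x = rng.uniform(0, 1, n)
y = 0.5 + 2.0 * x + rng.normal(0, 0.1, n)

# Ordinary least squares with an intercept column
A = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
fitted = A @ coef
residuals = y - fitted

residual_mean = residuals.mean()
resid_fitted_corr = np.corrcoef(fitted, residuals)[0, 1]
print("Mean of residuals:", residual_mean)
print("Correlation with fitted values:", resid_fitted_corr)
```

On a holdout set the same quantities are not guaranteed to be zero, so a clearly non-zero residual mean there would point to systematic over- or under-prediction.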
To make predictions using the final linear model for bike-sharing demand forecasting (lr5), you will need to provide values for the input variables and use the model to generate a prediction for the target variable (bike rental demand).
#Applying the scaling on the test sets
numeric_vars = ['atemp','hum','windspeed','cnt']
bike_sharing_test[numeric_vars] = scaler.transform(bike_sharing_test[numeric_vars])
bike_sharing_test.describe()
Dividing into X_test and y_test
y_test = bike_sharing_test.pop('cnt')
X_test = bike_sharing_test
# Adding constant variable to test dataframe
import statsmodels.api as sm
X_test_lr5 = sm.add_constant(X_test)
# Updating X_test_lr5 dataframe by dropping the variables as analyzed from the above models
X_test_lr5 =X_test_lr5.drop(["atemp", "hum", "season_fall", "Quarter_AprMayJun", "weekend","Quarter_JanFebMar"], axis = 1)
# Making predictions using the fifth model
y_pred = lr5.predict(X_test_lr5)
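Because `cnt` was scaled together with the other numeric columns, the predictions are in scaled units rather than bike counts. One way to convert them back is to reuse the minimum and range that the fitted `MinMaxScaler` stores for the `cnt` column. The sketch below uses made-up numbers and a two-column scaler as an assumption for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up training data (illustrative values only)
train = pd.DataFrame({
    "atemp": [0.2, 0.5, 0.8],
    "cnt": [1000.0, 4000.0, 7000.0],
})
cols = ["atemp", "cnt"]
scaler = MinMaxScaler().fit(train[cols])

# Look up the min and range learned for the cnt column
cnt_idx = cols.index("cnt")
cnt_min = scaler.data_min_[cnt_idx]
cnt_range = scaler.data_range_[cnt_idx]

# Hypothetical predictions in scaled units, mapped back to counts
scaled_predictions = np.array([0.0, 0.25, 1.0])
predicted_counts = scaled_predictions * cnt_range + cnt_min
print(predicted_counts)
```

Reporting predictions in actual bike counts usually makes the results easier to interpret for operations teams than values between 0 and 1.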
Model evaluation is a critical step in assessing the performance of a bike-sharing demand forecasting model. Use various metrics to evaluate the performance of a model, including mean absolute error (MAE), root mean squared error (RMSE), and coefficient of determination (R-squared).
# Plotting y_test and y_pred to understand the spread
fig = plt.figure()
plt.scatter(y_test, y_pred)
fig.suptitle('y_test vs y_pred', fontsize = 20)
plt.xlabel('y_test', fontsize = 18)
plt.ylabel('y_pred', fontsize = 16)
plt.show()
You should evaluate the model’s performance using metrics such as MAE, RMSE, and R-squared. MAE and RMSE measure the average magnitude of the errors between the predicted and actual values, with RMSE penalizing large errors more heavily. R-squared measures the proportion of variance in the target variable that is explained by the input variables.
#importing library and checking mean squared error
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print('Mean_Squared_Error :' ,mse)
#importing library and checking R2
from sklearn.metrics import r2_score
r2 = r2_score(y_test, y_pred)
print('R2_Score :', r2)
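The text mentions MAE and RMSE as well as R-squared, so here is a small sketch that computes all three on a made-up set of actual and predicted values. The arrays `y_true` and `y_hat` are assumptions for illustration; with the real model they would be `y_test` and `y_pred`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values (illustrative only)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

mae = mean_absolute_error(y_true, y_hat)
rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # RMSE is the square root of MSE
r2 = r2_score(y_true, y_hat)

print("MAE :", mae)
print("RMSE:", rmse)
print("R2  :", r2)
```

RMSE comes out larger than MAE here because the single error of size 2 is weighted more heavily once squared.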
This study aimed to improve the bike-sharing activities of Capital Bikeshare and support the reinvention of the city transportation system. This comprehensive exploratory data analysis on their publicly available data helped us understand and analyze the underlying patterns and characteristics of the bike-share network and to work on this data to achieve data-driven results.
We analyzed the growth in popularity of bike-share over the two years, 2011–2012, and the effect of seasonal and day factors on ridership patterns. The impacts of seasonal and weather parameters were examined to understand the ridership pattern in Washington, DC. Analysis of the trip data helped to understand the characteristics of the localities where the stations are located.
Keeping these inferences in mind, we could suggest the following recommendations:
A. Bike sharing demand prediction refers to the process of forecasting the number of bicycles that will be rented within a specific time period, aiding in resource allocation and system optimization.
A. The trend of bike share is experiencing steady growth worldwide, with an increasing number of cities implementing bike sharing programs to promote sustainable transportation and reduce traffic congestion.
A. The profitability of bike share systems can vary depending on factors such as user demand, operational costs, pricing strategies, and partnerships with local businesses. Careful planning and efficient management are crucial for long-term profitability.
A. Bike sharing is popular for several reasons. It offers a convenient and flexible mode of transportation, promotes physical activity and health, reduces carbon emissions, alleviates parking congestion, and provides an affordable alternative for short-distance travel in urban areas.