According to a report by the International Energy Agency (IEA), the lifecycle of buildings, from construction to demolition, was responsible for 37% of global energy-related and process-related CO2 emissions in 2020. Yet it is possible to drastically reduce buildings’ energy consumption by combining easy-to-implement fixes with state-of-the-art strategies. In this article, we will develop a data science solution to predict Site EUI using machine learning techniques and then deploy the model on Streamlit Cloud for inference. Such a solution can be used by government agencies or construction firms to keep energy intensity within a certain threshold and help reduce carbon emissions. So let’s get started with this exciting project.
Source: streamlit
Site energy usage intensity is an important metric for any building: it captures how the building uses energy and helps identify opportunities for energy efficiency. It also gives us an indication of the building’s overall CO2 emissions and helps us control them.
Site EUI is a function of a building’s energy consumption and its size; it expresses how much energy a building uses relative to its floor area. Site EUI depends on various characteristics of the building as well as weather-related factors at its location.
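As a point of reference, Site EUI is commonly expressed as annual site energy consumption per unit of gross floor area (for example, kBtu per square foot per year); the exact units used in this dataset are an assumption here. A minimal illustration:

# illustrative only: a commonly used definition of Site EUI (units assumed, not taken from the dataset docs)
annual_site_energy_kbtu = 1_200_000   # hypothetical total annual site energy
gross_floor_area_sqft = 15_000        # hypothetical gross floor area
site_eui = annual_site_energy_kbtu / gross_floor_area_sqft
print(site_eui)                       # 80.0 kBtu per sq ft per year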
In this article, we aim to explore and identify the factors behind the site energy usage intensity of different types of buildings located in various states of the USA. Measuring Site EUI helps benchmark a building’s energy performance and identify where efficiency measures will have the most impact.
In this project, the dataset consists of building characteristics (e.g., floor area, facility type, building class), weather data for the building’s location (e.g., annual average temperature, annual total precipitation), and the building’s energy usage for the given year, measured as Site Energy Usage Intensity (Site EUI). Each row in the data corresponds to a single building observed in a given year.
We will use various data science techniques to explore the data and find insights into the energy intensity levels of the buildings, followed by machine learning techniques to predict Site EUI. We will also use an Explainable AI (XAI) technique to understand the predictions of the machine learning model. After developing the model, we will deploy it on Streamlit Cloud for real-time and batch predictions.
You are provided with two datasets:
(1) The ‘train_dataset’, where the observed values of Site EUI for each row are provided.
(2) The ‘x_test’ dataset, where the observed values of Site EUI for each row have been removed and are provided separately in ‘y_test’.
Your task is to predict the Site EUI for each row (using the complete training dataset), given the characteristics of the building and the weather data for its location. Use the test sets for validation and testing. The target variable for this predictive analytics problem is site_eui.
Evaluation Metric: Root Mean Squared Error (RMSE)
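For reference, RMSE is simply the square root of the mean squared error between the observed and predicted Site EUI values; here is a minimal sketch using scikit-learn with hypothetical arrays:

# minimal RMSE sketch with hypothetical values
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 150.0, 80.0])   # hypothetical observed site_eui values
y_pred = np.array([110.0, 140.0, 90.0])   # hypothetical predictions
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)                               # 10.0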
This project is for intermediate and advanced learners who are looking to increase their knowledge about explainable AI and the deployment of machine learning models. Beginners in the data science field can also work on this project if they are familiar with Python and the core data science libraries used throughout the project (pandas, scikit-learn, XGBoost, SHAP, and Streamlit).
The dataset is taken from WiDS Datathon 2022, which consists of 100k observations of buildings in the USA over 7 years. Let’s look at the dataset description before moving ahead with the project workflow.
Dataset source: Click Here
Now that we understand the project’s problem statement, let’s jump into loading the dataset and performing exploratory data analysis.
I have used the Kaggle environment to work on this project, and the dataset was loaded from the above-mentioned source in the Kaggle input directory. You may use your local environment or any cloud-based IDE to work on this project.
Note: I will also attach the GitHub repository at the end of this article so that you can look at it for your reference.
# importing all the Python libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from statsmodels.stats.outliers_influence import variance_inflation_factor
#importing algorithms
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBRegressor
import xgboost
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score
import sklearn
import shap
import optuna
import joblib
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# Load the dataset
df = pd.read_csv("/kaggle/input/site-energy-intensity-dataset/train_dataset.csv")
test_df = pd.read_csv("/kaggle/input/site-energy-intensity-dataset/x_test.csv")
df.head()
The output of the above code
Our train dataset has about 75K rows and 64 columns, while the test dataset contains about 25K rows. We have stored the test dataset in the test_df variable for this project.
Now that we have the dataset loaded into our environment, we can use some of pandas’ built-in functions to get a description of the dataset.
# columns with null values
df.isnull().sum()[df.isnull().sum() != 0]
The output of the above code
# find the correlation between numerical columns
df.corr(numeric_only=True).T  # numeric_only=True is needed on newer pandas; older versions drop non-numeric columns automatically
The output of the above code
Using the corr() function, we compute correlation coefficients for the numerical columns, which lets us see which columns influence the target variable.
# descriptive statistics of dataframe
df.describe()
The output of the above code
Due to the large output size, I have shown only a portion of the outputs to convey their purpose. You can check out the GitHub repository linked at the end of the article for the complete results.
Some observations from the data understanding phase:
In the next step, we will do exploratory data analysis to find some actionable insights from the dataset, which can further help us to develop accurate machine learning models.
Heatmap of Dataframe
# heatmap to visualize multi collinearity in dataset
plt.figure(figsize=(30,20))
sns.set(font_scale=0.8)
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap=plt.cm.CMRmap_r)  # numeric_only=True is needed on newer pandas
The output of the above code
Using Seaborn’s heatmap function, we can visualize the correlations between columns. The heatmap above shows the relationships between the various columns and highlights highly correlated columns on which we can take further action.
Note: for clear code outputs and images, please refer to the code repository linked at the end of the article.
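As a complement to the visual heatmap, a small sketch like the one below can list the most strongly correlated column pairs programmatically (the 0.9 threshold is an illustrative choice, not part of the original analysis):

# list column pairs with very high absolute correlation (0.9 threshold chosen for illustration)
corr_abs = df.corr(numeric_only=True).abs()
upper = corr_abs.where(np.triu(np.ones(corr_abs.shape, dtype=bool), k=1))  # keep upper triangle only
high_pairs = upper.stack().sort_values(ascending=False)
print(high_pairs[high_pairs > 0.9])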
Observations:
# create obj_cols list with two object columns from the dataset
obj_cols = ['State_Factor','building_class']
# Using for loop plot the stripplot for both object columns
for col in obj_cols:
    plt.figure(figsize=(5,5))
    sns.stripplot(x=col, y='site_eui', data=df)
    plt.show()
The output of the above code
Relations Between Numerical Columns and Target
# create a new dataframe to extract all the columns from the dataset
ndf = df.drop(['id','site_eui'], axis=1)
target = df['site_eui']
col_list = [col for col in ndf.columns if ndf[col].dtype != 'object']
# plot the scatterplot of all numerical columns with target variable
fig, ax = plt.subplots(30,2, figsize=(12,60))
for idx, col in enumerate(col_list):
    sns.scatterplot(x=col, y='site_eui', data=df, ax=ax[idx//2, idx%2])
    ax[idx//2, idx%2].grid(visible=True, linestyle='--', color='#0b1324')
    ax[idx//2, idx%2].set_title(f'{col} ', pad=12)
plt.suptitle(f'Relations between columns and target', fontsize=15, fontweight='bold',
color='#0b1324')
plt.show()
The output of the above code
Observations:
After EDA, the next step in any data science project lifecycle is data pre-processing and feature engineering, which help us develop accurate, high-quality machine learning models. Let’s start with handling missing values.
# the last 4 columns have more than 50% missing values
(ndf.isnull().sum()[ndf.isnull().sum() != 0]/len(target))*100
The output of the above code
The above code shows the percentage of missing values in each column of our dataset. In the next code block, we will drop noisy features and fill in the missing values.
# remove noisy features
remove_col = ['Year_Factor','facility_type','year_built','days_above_110F']
n_df = ndf.drop(columns=remove_col, axis=1)
# fill the missing values of the columns
n_df['energy_star_rating'] = n_df['energy_star_rating'].fillna(n_df['energy_star_rating'].mean())
n_df['direction_max_wind_speed'] = n_df['direction_max_wind_speed'].fillna(1.0)
n_df['direction_peak_wind_speed'] = n_df['direction_peak_wind_speed'].fillna(1.0)
n_df['max_wind_speed'] = n_df['max_wind_speed'].fillna(1.0)
n_df['days_with_fog'] = n_df['days_with_fog'].fillna(n_df['days_with_fog'].mean())
Using the fillna() method, we fill the missing values with either the column mean or a constant value. One can consider many approaches to handling missing values depending on the problem type; an alternative using KNNImputer (imported earlier) is sketched below.
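For example, here is a minimal sketch of KNN-based imputation applied to the numeric columns of ndf (the dataframe before the fillna() calls above). This is an alternative approach, not the one used for the results reported in this article:

# minimal sketch: KNN-based imputation of ndf's numeric columns (alternative approach, not used for the reported results)
num_cols = ndf.select_dtypes(include='number').columns
knn_imputer = KNNImputer(n_neighbors=5)   # KNNImputer was imported at the top of the notebook
ndf_knn = ndf.copy()
ndf_knn[num_cols] = knn_imputer.fit_transform(ndf_knn[num_cols])
# note: KNN imputation can be slow on ~75K rows; consider sampling or a simpler strategy if needed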
In our dataset, many columns are highly correlated with each other, which can lead to unstable or inaccurate estimates for our target variable. To detect this multicollinearity, we will use the Variance Inflation Factor (VIF), which can be calculated with a built-in function from the statsmodels library. Let’s identify multicollinear features and remove them before modeling.
# VIF calculations
def get_vif(X):
    X_matrix = np.array(X)
    vif = [variance_inflation_factor(X_matrix, i) for i in range(X_matrix.shape[1])]
    vif_factors = pd.DataFrame()
    vif_factors['columns'] = X.columns
    vif_factors['VIF'] = vif
    return vif_factors
X = n_df.drop(columns=['State_Factor','building_class'], axis=1)
vif_factors = get_vif(X)
# total 56 columns
vif_factors.info()
# find the columns with large VIF (vif > 4)
cols_with_large_vif = vif_factors[vif_factors.VIF > 4]['columns']
cols_with_large_vif
As per the book ‘Machine Learning Using Python’ by U Dinesh Kumar and Manaranjan Pradhan, if a column’s VIF is higher than 4, it is highly correlated with the other columns, and we should address that before applying machine learning algorithms. Now we will do feature engineering while keeping multicollinearity in mind.
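To make the VIF intuition concrete: the VIF of a feature equals 1 / (1 - R²), where R² comes from regressing that feature on all the other features. Below is a small illustrative sketch for a single column (the column choice is arbitrary, and this textbook version fits an intercept, so it may differ slightly from the statsmodels helper above unless a constant column is added there):

# illustrative sketch: textbook VIF of one feature via 1 / (1 - R^2)
from sklearn.linear_model import LinearRegression

feature = 'energy_star_rating'            # arbitrary column chosen for illustration
others = X.drop(columns=[feature])
r2 = LinearRegression().fit(others, X[feature]).score(others, X[feature])
print(1 / (1 - r2))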
Since some of the monthly weather columns within each season are highly correlated, we will aggregate the values for each season to reduce the total number of columns.
# Aggregation of seasonal (i.e., winter and summer) columns to reduce the column count and multicollinearity
n_df['Avg_min_temp_winter'] = (n_df['january_min_temp'] + n_df['february_min_temp']
+ n_df['march_min_temp'] +
n_df['april_min_temp'] + n_df['october_min_temp']
+ n_df['november_min_temp'] + n_df['december_min_temp'])/7
n_df['Avg_max_temp_winter'] = (n_df['january_max_temp'] + n_df['february_max_temp'] +
n_df['march_max_temp'] +
n_df['april_max_temp'] + n_df['october_max_temp'] +
n_df['november_max_temp'] + n_df['december_max_temp'])/7
n_df['Avg_temp_winter'] = (n_df['january_avg_temp'] + n_df['february_avg_temp'] +
n_df['march_avg_temp'] +
n_df['april_avg_temp'] + n_df['october_avg_temp'] +
n_df['november_avg_temp'] + n_df['december_avg_temp'])/7
n_df['Avg_min_temp_summer'] = (n_df['may_min_temp'] + n_df['june_min_temp'] +
n_df['july_min_temp'] + n_df['august_min_temp'] + n_df['september_min_temp'])/5
n_df['Avg_max_temp_summer'] = (n_df['may_max_temp'] + n_df['june_max_temp'] +
n_df['july_max_temp'] + n_df['august_max_temp'] + n_df['september_max_temp'])/5
n_df['Avg_temp_summer'] = (n_df['may_avg_temp'] + n_df['june_avg_temp'] + n_df['july_avg_temp']
+ n_df['august_avg_temp'] + n_df['september_avg_temp'])/5
n_df['Avg_days_below30F'] = (n_df['days_below_30F'] + n_df['days_below_20F'] +
n_df['days_below_10F'] + n_df['days_below_0F'])/4
# columns to remove from the dataset
months_cols = list(n_df.iloc[:,5:41].columns) + ['days_below_30F','days_below_20F',
'days_below_10F','days_below_0F','direction_max_wind_speed',
'direction_peak_wind_speed','snowdepth_inches','avg_temp','days_above_90F']
xdf = n_df.drop(columns=months_cols, axis=1)
xdf.info()
# final dataframe of 18 features to continue with for modeling and evaluation
zdf = xdf.drop(columns=['State_Factor','building_class'], axis=1)
The output of the above code
In the above code, we performed feature engineering using aggregation and then removed unnecessary columns from the dataframe. Finally, we stored the resulting features in a variable called ‘zdf’, on which we will apply machine learning modeling techniques for our regression task.
In this section, we will apply regression techniques such as XGBoost and Random Forest, followed by hyperparameter tuning, to build accurate regression models that predict Site EUI and help optimize a building’s energy usage.
# zdf and target are the final dataset for the time being for baseline modeling
X_train, X_val, y_train, y_val = train_test_split(zdf, target,
test_size=0.3, random_state=42)
print(X_train.shape, X_val.shape, y_train.shape, y_val.shape)
--------------------------------[output]--------------------------------------
(53029, 18) (22728, 18) (53029,) (22728,)
In the above code, we split the data into training and validation sets for modeling and evaluation. Now we will write a modeling function to train models using the XGBoost and Random Forest algorithms.
# modelling function for xgb and rf
def modeling(X_train, y_train, X_test, y_test, **kwargs):
    scores = {}
    models = []
    if 'xgb' in kwargs.keys() and kwargs['xgb']:
        xgb = XGBRegressor()
        xgb.fit(X_train._get_numeric_data(), np.ravel(y_train, order='C'))
        y_pred = xgb.predict(X_test._get_numeric_data())
        scores['xgb'] = [np.sqrt(mean_squared_error(y_test, y_pred))]
        models.append(xgb)
    if 'rf' in kwargs.keys() and kwargs['rf']:
        rf = RandomForestRegressor(n_estimators=200, max_depth=8)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_test)
        scores['rf'] = [np.sqrt(mean_squared_error(y_test, y_pred))]
        models.append(rf)
    return scores
# call the function using train and val datasets
modeling(X_train, y_train, X_val, y_val, xgb=True, rf=True)
------------------------------[Output]-----------------------------------
{'xgb': [48.27921059144114], 'rf': [49.49613354383458]}
By training the models with the default hyperparameters, we get an RMSE of 48.27 with XGBoost and 49.49 with Random Forest. We will now optimize the models using hyperparameter tuning.
gkf = KFold(n_splits=3, shuffle=True, random_state=42).split(X=X_train, y=y_train)
# A parameter grid for random forest
params = {
    'n_estimators': range(100, 200, 100),
    'ccp_alpha': [0.0, 0.1],
    'criterion': ['squared_error'],
    'max_depth': [5, 6],
    'min_samples_split': [2, 3, 5],
}
rf_estimator = RandomForestRegressor()
gsearch = GridSearchCV(
    estimator=rf_estimator,
    param_grid=params,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    cv=gkf,
    verbose=1,
)
rf_model = gsearch.fit(X=X_train, y=y_train)
(gsearch.best_params_, gsearch.best_score_)
--------------------------------[output]---------------------------------------
Fitting 3 folds for each of 12 candidates, totaling 36 fits
({'ccp_alpha': 0.1,
'criterion': 'squared_error',
'max_depth': 6,
'min_samples_split': 3,
'n_estimators': 100},
-2754.1303968681623)
In the above code, we used the grid search technique to search the parameter grid and print the best parameters after fitting all the different combinations of parameters to the model.
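Since GridSearchCV was set to score with neg_mean_squared_error, the best_score_ above is a negative MSE; to compare it with the RMSE values reported earlier, we negate it and take the square root:

# convert the (negative) MSE returned by GridSearchCV into an RMSE
best_rmse = np.sqrt(-gsearch.best_score_)
print(best_rmse)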
Using grid search, we found the best parameters for the model. We then retrained the model with the optimal hyperparameters to reduce the error, bringing the RMSE down to 46.51 from the previous score of 48.27, a reasonable improvement (the retraining step is not shown here; see the linked repository for the complete code). Now we can save the model using joblib for inference. We will also look at explainable AI to understand how the model arrives at a given prediction.
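As a rough sketch of how the tuned XGBoost model (referred to as xgb_estimator in the SHAP section below) could be obtained and saved, here is one possible approach using the Optuna library imported earlier; the search space, trial count, and file name are illustrative assumptions and not the exact tuning used in the original code:

# hedged sketch: tune XGBoost with Optuna, then refit and save the model with joblib
# (search space, trial count, and file name are illustrative assumptions)
def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
    }
    model = XGBRegressor(**params, random_state=42)
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    return np.sqrt(mean_squared_error(y_val, preds))

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=25)

xgb_estimator = XGBRegressor(**study.best_params, random_state=42)
xgb_estimator.fit(X_train, y_train)
joblib.dump(xgb_estimator, 'xgb_model.pkl')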
Source: mathpix
Machine learning model explainability has become a very important area recently, as it helps explain complex models and make them interpretable. In this section, we will use the shap library to explain our trained XGBoost model.
# initialize shap module
shap.initjs()
sample_set = X_val.sample(100)
# shap values
shap_values = shap.TreeExplainer(xgb_estimator).shap_values(sample_set)
# note: this may print an 'ntree_limit is deprecated' warning from XGBoost; it can be safely ignored
# plot the summary plot for shaply values
shap.summary_plot(shap_values, sample_set, plot_type="bar")
Summary plot of the model
The above code outputs the summary plot, which tells us which features are the major contributors to the prediction. In our case, we can see that ‘energy_star_rating’ and ‘floor_area’ are the top contributors, followed by the ‘snowfall_inches’ and ‘ELEVATION’ features. Similarly, we can plot the summary plot without bars to understand the impact of each feature in more detail.
# summary plot without bars (beeswarm plot)
shap.summary_plot(shap_values, sample_set)
The output of the above code
Now, let’s look at the decision plot to understand how the model’s output varies across predictions of the target variable.
# plot the decision plot to understand the predictions
shap.decision_plot(shap.TreeExplainer(xgb_estimator).expected_value[0],
shap_values[50:100],
feature_names=sample_set.columns.tolist()
)
The output of the above code
For more information on SHAP values, I’d recommend visiting the official documentation, which you can find here.
Source: thenewstack.io
In this section, we will look at deploying the saved model in a cloud environment using the streamlit library. If you are new to Streamlit, I highly recommend going through my previous article published on Analytics Vidhya, a detailed guide on Streamlit; you can read it by visiting the link here.
Now, let’s get started with Streamlit App deployment.
1. Create a project repository structure as shown in the below image:
We need to create a project repository structure that enables web application deployment on Streamlit Cloud. In the above repo, we have the ‘app.py’ file, which is the Python script that will run on Streamlit’s managed cloud service (a minimal sketch of such a script is shown after these steps). Once deployed and running, it will automatically generate a web URL where you can access the app and make inferences.
2. Log in to your streamlit account to push the repo to the cloud:
As shown in the above image, you must specify the repository where your project files live, the branch name, and the main file path (i.e., the Python file name). Finally, just click the ‘Deploy!’ button, and your web application will be live in a few minutes.
It’s that simple to deploy apps using Streamlit. For more information on the project code files, you can refer to the GitHub repository here.
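For reference, here is a minimal, hypothetical sketch of what such an app.py could look like for batch predictions; the model file name and the CSV-upload workflow are assumptions and will differ from the actual deployed app:

# hypothetical app.py sketch: load the saved model and serve batch predictions from an uploaded CSV
# (model file name and input workflow are assumptions, not the deployed app's actual code)
import joblib
import pandas as pd
import streamlit as st

model = joblib.load('xgb_model.pkl')   # assumed file name from the tuning sketch above

st.title('Site EUI Prediction')
uploaded = st.file_uploader('Upload a CSV with the same feature columns used in training', type='csv')

if uploaded is not None:
    features = pd.read_csv(uploaded)
    preds = model.predict(features)
    st.write('Predicted Site EUI values:')
    st.dataframe(features.assign(predicted_site_eui=preds))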
Here are some of the website screenshots from the deployed app.
Github repository link for detailed code: Click here
In conclusion, this article covered the end-to-end project lifecycle for predicting site energy usage intensity, which helps control the energy usage of buildings and supports energy efficiency measures that cut costs and consumption while helping to conserve and preserve the environment at large. Let’s summarise the lessons learned from this article.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.