During Covid, the hospitality industry suffered a massive drop in revenue. Even now that people are traveling more, attracting customers remains a challenge. To counter this problem, we will develop an ML tool that sets a fitting room price to attract more customers. Using a hotel booking dataset, we will build an AI tool that selects the right room price in order to increase the occupancy rate and the hotel's revenue.
This article was published as a part of the Data Science Blogathon.
The hotel booking dataset contains data from different sources and includes columns such as hotel type, number of adults, stay time, special requirements, etc. These values can help predict the hotel room price and increase hotel revenue.
In Hotel room price analysis, we will analyze the dataset’s pattern and trend. Using this information, we will make decisions related to pricing and operation. These things will depend upon several factors.
Setting the room price is essential to increase revenue and profit. The importance of setting the right hotel price is as follows:
Data collection and preprocessing are an essential part of hotel room price analysis. The data is collected from hotel websites, booking websites, and public datasets. This dataset is then converted to the required format for visualization purposes. In preprocessing, the dataset undergoes data cleaning and transformation. The transformed dataset is then used for visualization and model building.
Visualizing the dataset helps to gain insight and find patterns that support better decisions. Below are the Python tools that provide better visualization.
The hotel booking dataset has multiple use cases and applications as described below:
Hotel room booking data can pose several challenges for various reasons:
Best practices in hotel room data analysis:
As consumer spending increases, it greatly benefits the hotel & tourism industry. This creates new trends and data to analyze customer spending and behavior. The increase in AI tools creates an opportunity to explore and maximize the industry. With the help of an AI tool, we can gather the required data and remove unwanted data, i.e., performing data preprocessing.
On top of this data, we can train our model to generate valuable insight and produce real-time analysis. This also helps in providing personalized experiences based on individual customers and guests. This highly benefits the hotel and the customer.
Data analysis also helps the management team to understand their customer and inventory. This will help in setting dynamic room pricing based on demand. Better inventory management helps in reducing the cost.
Let us perform a basic data analysis with a Python implementation on a dataset from Kaggle. To download the dataset, click here.
The Hotel Booking dataset includes information on different hotel types, such as Resort Hotels and City Hotels, and on market segmentation.
#Importing the Library
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
#Read the file and convert to dataframe
df = pd.read_csv('data/hotel_bookings.csv')
#Display the dataframe shape
df.shape
(119390, 32)
#Checking the data sample
df.head()
#Checking the dataset info
df.info()
#Checking null values
df.isna().sum()
#Boxplot Distribution of Nights Spent at Hotels by Market Segment and Hotel Type
plt.figure(figsize = (15,8))
sns.boxplot(x = "market_segment", y = "stays_in_week_nights", data = df, hue = "hotel",
palette = 'Set1')
#Plotting box plot for market segment vs stay in weekend night
plt.figure(figsize=(12,5))
sns.boxplot(x = "market_segment", y = "stays_in_weekend_nights", data = df,
hue = "hotel", palette = 'Set1');
The above plots show that most groups are roughly normally distributed, while some are highly skewed. Most people tend to stay less than a week. Customers from the Aviation segment do not seem to stay at resort hotels and have a relatively lower average stay.
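To put a number on the skewness visible in the box plots, one option is the Fisher-Pearson moment coefficient. The sketch below is a dependency-free illustration on made-up stay lengths, not values taken from the dataset, and the helper name `skewness` is hypothetical.

```python
from statistics import fmean


def skewness(values):
    """Fisher-Pearson moment coefficient of skewness: m3 / m2 ** 1.5."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant data has no skew
    return m3 / m2 ** 1.5


# Illustrative stay lengths in nights (made up, not from the dataset)
symmetric_stays = [1, 2, 3, 4, 5]
long_tail_stays = [1, 1, 2, 2, 3, 14]

print(skewness(symmetric_stays))  # zero for symmetric data
print(skewness(long_tail_stays))  # positive: a few long stays stretch the right tail
```

A clearly positive value for a market segment would back up the impression that a few long stays pull its distribution to the right.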
#Barplot of average daily rate (adr) vs Month
plt.figure(figsize = (12,5))
sns.barplot(x = 'arrival_date_month', y = 'adr', data = df);
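Unless an explicit order is given, seaborn generally draws the months in the order they appear in the data, which is not necessarily calendar order. A minimal sketch of computing the mean `adr` per month in calendar order, using only the standard library and made-up rows (the helper `mean_adr_by_month` is hypothetical):

```python
import calendar
from collections import defaultdict
from statistics import fmean


def mean_adr_by_month(rows):
    """Return (month, mean adr) pairs sorted in calendar order."""
    by_month = defaultdict(list)
    for row in rows:
        by_month[row["arrival_date_month"]].append(row["adr"])
    # calendar.month_name[0] is an empty string, so skip it
    month_order = list(calendar.month_name)[1:]
    return [(m, fmean(by_month[m])) for m in month_order if m in by_month]


# Made-up rows for illustration only
rows = [
    {"arrival_date_month": "August", "adr": 150.0},
    {"arrival_date_month": "January", "adr": 60.0},
    {"arrival_date_month": "August", "adr": 170.0},
    {"arrival_date_month": "January", "adr": 80.0},
]
print(mean_adr_by_month(rows))  # [('January', 70.0), ('August', 160.0)]
```

The same month list can be passed to the `order` parameter of `sns.barplot` to get the bars in calendar order.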
In the implementation part, I will show how I used a ZenML pipeline to create a model that uses historical booking data to predict the hotel room price. I also deployed a Streamlit application to present the end product.
ZenML is an open-source MLOps framework that streamlines the creation of production-ready ML pipelines. A pipeline is a series of interconnected steps, where the output of one step serves as an input to another step, leading to the creation of a finished product. Below are the reasons for selecting a ZenML pipeline:
Building a model is not enough; we have to deploy the model into production and monitor its performance over time and how it interacts with real-world data. An end-to-end machine learning pipeline is a series of interconnected steps where the output of one step serves as an input to another step. The entire machine learning workflow can be automated through this process, from data preparation to model training and deployment. This helps us continuously predict and confidently deploy machine learning models, and it lets us track our production-ready model. I highly suggest you refer to the ZenML documentation for more details.
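The idea of steps feeding one another can be illustrated without any framework. The sketch below is plain Python, not ZenML's API; the step names, the hard-coded records, and the toy mean-price "model" are assumptions made purely for illustration.

```python
from statistics import fmean


def ingest():
    """Step 1: return raw records (hard-coded here instead of read from a file)."""
    return [
        {"nights": 2, "adr": 90.0},
        {"nights": 3, "adr": None},
        {"nights": 1, "adr": 110.0},
    ]


def clean(records):
    """Step 2: drop records with a missing price."""
    return [r for r in records if r["adr"] is not None]


def train(records):
    """Step 3: 'train' a trivial model that always predicts the mean price."""
    mean_price = fmean(r["adr"] for r in records)
    return lambda record: mean_price


def evaluate(model, records):
    """Step 4: mean absolute error of the model on the given records."""
    return fmean(abs(model(r) - r["adr"]) for r in records)


def run_pipeline():
    """Chain the steps: each output is the next step's input."""
    raw = ingest()
    cleaned = clean(raw)
    model = train(cleaned)
    return evaluate(model, cleaned)


print(run_pipeline())  # 10.0
```

A framework such as ZenML adds tracking, caching, and deployment on top of this basic chaining.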
The first pipeline we create consists of the following steps:
With the steps outlined above, we will now focus on the coding part.
import logging

import pandas as pd
from zenml import step


class IngestData:
    """
    Ingesting data from the data_path
    """

    def __init__(self, data_path: str) -> None:
        """
        Args:
            data_path: Path at which the data file is located
        """
        self.data_path = data_path

    def get_data(self) -> pd.DataFrame:
        """
        Ingesting the data from data_path
        Returns the ingested data
        """
        logging.info(f"Ingesting data from {self.data_path}")
        return pd.read_csv(self.data_path)


@step
def ingest_df(data_path: str) -> pd.DataFrame:
    """
    Ingesting data from the data_path.
    Args:
        data_path: path to the data
    Returns:
        pd.DataFrame: the ingested data
    """
    try:
        ingest_data = IngestData(data_path)
        df = ingest_data.get_data()
        return df
    except Exception as e:
        # Log the cause before re-raising so the failing step is easy to trace
        logging.error(f"Error occurred while ingesting data: {e}")
        raise e
Above, we have defined an ingest_df() method, which takes the file path as an argument and returns the dataframe. Here @step is a zenml decorator. It is used to register the function as a step in a pipeline.
#Data cleaning: fill missing values by assignment instead of chained inplace calls
data["agent"] = data["agent"].fillna(data["agent"].median())
data["children"] = data["children"].fillna(0)
#Drop rows whose average daily rate (adr) is below 50 or above 5000
data = data.drop(data[data['adr'] < 50].index)
data = data.drop(data[data['adr'] > 5000].index)
data["total_stay"] = data['stays_in_week_nights'] + data['stays_in_weekend_nights']
data["total_person"] = data["adults"] + data["children"] + data["babies"]
#Feature Engineering
#Encode each categorical column as integer codes (the encoder is refit per column)
categorical_columns = ['hotel', 'arrival_date_month', 'meal', 'country', 'market_segment',
                       'reserved_room_type', 'assigned_room_type', 'deposit_type',
                       'customer_type']
le = LabelEncoder()
for column in categorical_columns:
    data[column] = le.fit_transform(data[column])
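To see what the encoding step does to a column, here is a dependency-free sketch that mimics `LabelEncoder` on a toy list: each distinct label receives an integer code, assigned in sorted label order. The helper `label_encode` and the sample values are illustrative and not part of the project.

```python
def label_encode(values):
    """Map each distinct label to an integer code, in sorted label order."""
    classes = sorted(set(values))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[v] for v in values], classes


hotels = ["Resort Hotel", "City Hotel", "City Hotel", "Resort Hotel"]
codes, classes = label_encode(hotels)
print(codes)    # [1, 0, 0, 1]
print(classes)  # ['City Hotel', 'Resort Hotel']
```

These integer codes impose an arbitrary order on the categories. Tree-based models such as Random Forest usually tolerate this, while a linear model may do better with one-hot encoding.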
from zenml import pipeline

#ingest_df, clean_df, train_model and evaluate_model are the project's step functions
@pipeline(enable_cache=False)
def train_pipeline(data_path: str):
    df = ingest_df(data_path)
    X_train, X_test, y_train, y_test = clean_df(df)
    model = train_model(X_train, X_test, y_train, y_test)
    r2_score, rmse = evaluate_model(model, X_test, y_test)
We use the ZenML @pipeline decorator to define train_pipeline(), which takes the file path as an argument. After data ingestion and splitting the data into training and test sets, train_model() is called. It trains different algorithms on the dataset, such as LightGBM, Random Forest, XGBoost, and Linear Regression.
We will use the RMSE, R2 score, and MSE of the different algorithms to determine the best one. In the code below, we have defined the evaluate_model() method to compute these evaluation metrics.
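import logging
from typing import Tuple

import mlflow
import pandas as pd
from sklearn.base import RegressorMixin
from typing_extensions import Annotated
from zenml import step
from zenml.client import Client

#MSE, R2 and RMSE are the project's own metric classes (their import is not shown here)
#Assumption: the experiment tracker is taken from the active ZenML stack
experiment_tracker = Client().active_stack.experiment_tracker


@step(experiment_tracker=experiment_tracker.name)
def evaluate_model(
    model: RegressorMixin,
    X_test: pd.DataFrame,
    y_test: pd.DataFrame,
) -> Tuple[
    Annotated[float, "r2_score"],
    Annotated[float, "rmse"],
]:
    """
    Evaluates the model on the ingested data.
    Args:
        model: RegressorMixin
        X_test: pd.DataFrame
        y_test: pd.DataFrame
    Returns:
        r2: R2 score
        rmse: RMSE
    """
    try:
        prediction = model.predict(X_test)
        #Compute each metric and log it to MLflow
        mse_class = MSE()
        mse = mse_class.calculate_scores(y_test, prediction)
        mlflow.log_metric("mse", mse)
        r2_class = R2()
        r2 = r2_class.calculate_scores(y_test, prediction)
        mlflow.log_metric("r2", r2)
        rmse_class = RMSE()
        rmse = rmse_class.calculate_scores(y_test, prediction)
        mlflow.log_metric("rmse", rmse)
        return r2, rmse
    except Exception as e:
        logging.error(f"Error in evaluating model: {e}")
        raise e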
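The project's MSE, R2, and RMSE classes are not shown in this article. As a rough guide to what they would compute, the sketch below implements the standard definitions of the three metrics in plain Python; the function names and the sample prices are assumptions for illustration.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE, in the target's units."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up room prices for illustration only
y_true = [100.0, 120.0, 80.0, 100.0]
y_pred = [110.0, 110.0, 90.0, 90.0]

print(mse(y_true, y_pred))   # 100.0
print(rmse(y_true, y_pred))  # 10.0
print(r2(y_true, y_pred))    # 0.5
```

Because RMSE is the square root of MSE, the two always rank models identically. R2 cannot exceed 1, which makes it a quick sanity check on reported scores.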
Create the virtual environment using Python or Anaconda.
#Command to create virtual environment
python3 -m venv <virtual_environment_name>
You must install some Python packages in your environment using the command below.
cd zenml-project/hotel-room-booking
pip install -r requirements.txt
For running the run_deployment.py script, you will also need to install some integrations using ZenML:
zenml init
zenml integration install mlflow -y
In this project, we have created two pipelines
run_pipeline.py takes the file path as an argument and executes train_pipeline(). Below is the pictorial view of the different operations performed by the pipeline. It can be viewed in the dashboard provided by ZenML.
Dashboard URL: http://127.0.0.1:8237/workspaces/default/pipelines/95881272-b1cc-46d6-9f73-7b967f28cbe1/runs/803ae9c5-dc35-4daa-a134-02bccb7d55fd/dag
run_deployment.py: In this file, we execute the continuous_deployment_pipeline and the inference_pipeline.
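#run_deployment.py
from pipelines.deployment_pipeline import continuous_deployment_pipeline, inference_pipeline
from zenml.integrations.mlflow.model_deployers.mlflow_model_deployer import MLFlowModelDeployer

#DEPLOY, PREDICT and DEPLOY_AND_PREDICT are constants defined elsewhere in the script
def main(config: str, min_accuracy: float):
    mlflow_model_deployment_component = MLFlowModelDeployer.get_active_model_deployer()
    deploy = config == DEPLOY or config == DEPLOY_AND_PREDICT
    predict = config == PREDICT or config == DEPLOY_AND_PREDICT
    if deploy:
        continuous_deployment_pipeline(
            data_path="data/hotel_bookings.csv",  #assumed: the dataset path used earlier
            min_accuracy=min_accuracy,
            workers=3,
            timeout=60,
        )

#pipelines/deployment_pipeline.py
#The source shows only the pipeline body. The signature below is reconstructed from the
#call above, and the decorator is assumed to mirror inference_pipeline.
@pipeline(enable_cache=False, settings={"docker": docker_settings})
def continuous_deployment_pipeline(data_path: str, min_accuracy: float,
                                   workers: int, timeout: int):
    df = ingest_df(data_path=data_path)
    X_train, X_test, y_train, y_test = clean_df(df)
    model = train_model(X_train, X_test, y_train, y_test)
    r2_score, rmse = evaluate_model(model, X_test, y_test)
    deployment_decision = deployment_trigger(r2_score)
    mlflow_model_deployer_step(model=model,
                               deploy_decision=deployment_decision,
                               workers=workers,
                               timeout=timeout)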
In the above code, we create a continuous deployment pipeline that takes the data and performs data ingestion, splitting, and model training. Once the model is trained, it is evaluated, and the deployment trigger decides whether it gets deployed.
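The deployment_trigger step itself is not shown in this article. A plausible minimal version simply compares the evaluation score with a threshold; the sketch below is a plain Python function rather than a ZenML step, and the `min_accuracy` parameter with its default of 0.8 is an assumption.

```python
def deployment_trigger(r2_score: float, min_accuracy: float = 0.8) -> bool:
    """Deploy only when the model's R2 score reaches the threshold."""
    return r2_score >= min_accuracy


print(deployment_trigger(0.894))  # True: the model would be deployed
print(deployment_trigger(0.325))  # False: the model would be held back
```

Gating deployment on a metric like this keeps a poorly performing retrained model from replacing the one already in service.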
@pipeline(enable_cache=False, settings={"docker": docker_settings})
def inference_pipeline(pipeline_name: str, pipeline_step_name: str):
    # Link all the steps artifacts together
    batch_data = dynamic_importer()
    model_deployment_service = prediction_service_loader(
        pipeline_name=pipeline_name,
        pipeline_step_name=pipeline_step_name,
        running=False,
    )
    predictor(service=model_deployment_service, data=batch_data)
In inference_pipeline, we make predictions once the model has been trained on the training dataset. The code above uses dynamic_importer, prediction_service_loader, and predictor, each of which has a different function.
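The bodies of these steps are not shown here, so the following is only a sketch of how the data could flow between them, assuming the batch is passed as a JSON string. `StubService`, the feature columns, and the flat predicted price are hypothetical stand-ins, and the functions are plain Python rather than ZenML steps.

```python
import json


class StubService:
    """Stand-in for a deployed model service; always predicts a flat price."""

    def predict(self, rows):
        return [100.0 for _ in rows]


def dynamic_importer() -> str:
    """Return a small batch of booking features serialized as JSON."""
    batch = {
        "columns": ["total_stay", "total_person"],
        "data": [[2, 2], [5, 3]],
    }
    return json.dumps(batch)


def predictor(service, data: str):
    """Decode the JSON batch and ask the service for predictions."""
    batch = json.loads(data)
    rows = [dict(zip(batch["columns"], values)) for values in batch["data"]]
    return service.predict(rows)


print(predictor(StubService(), dynamic_importer()))  # [100.0, 100.0]
```

In the real pipeline, prediction_service_loader would supply the running model service in place of the stub.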
Now we will visualize the pipelines using the ZenML dashboard to get a clear view.
Dashboard URL: http://127.0.0.1:8237/workspaces/default/pipelines/9eb06aba-d7df-43ef-a017-8cb5bb13cd89/runs/e4208fa5-48c8-4a8c-91f1-011c5e1ddbf9/dag
Dashboard URL: http://127.0.0.1:8237/workspaces/default/pipelines/07351bb1-6b0d-400e-aeea-551159346f0e/runs/c1ce61f8-dd12-4244-a4d6-514e5520b879/dag
We have deployed a Streamlit app that uses the latest model service asynchronously from the pipeline. It can be done quickly with ZenML within the Streamlit code. To run this Streamlit app in your local system, use the below command:
# command to run the streamlit app locally
streamlit run streamlit_app.py
You can get the complete end-to-end implementation code here
We have experimented with multiple algorithms and compared the performance of each model. The results are as follows:
Models | MSE | RMSE | R2_Score |
---|---|---|---|
XGBoost | 267.465 | 16.354 | n/a |
LightGBM | 319.477 | 17.873 | 0.839 |
RandomForest | 209.837 | 14.485 | 0.894 |
Linear Regression | 1338.777 | 36.589 | 0.325 |
The Random Forest model performs best, with the lowest MSE and the highest R^2 score. This means that it predicts the target variable most accurately and explains the most variance in it. By MSE and RMSE, XGBoost is the second-best model, followed by LightGBM. The Linear Regression model performs worst.
A live demo application of this project is built with Streamlit. It takes some input features for the booking and predicts the room price using our trained models.
The hotel room booking sector is rapidly evolving as internet accessibility has increased in different parts of the world. As a result, the demand for online hotel room booking has grown. Hotel management wants to know how to retain guests and improve products and services in order to make better decisions. Machine learning is vital for various business tasks, such as customer segmentation, demand forecasting, product recommendation, and guest satisfaction.
Several features determine the room price. Some of them are hotel_type, room_type, arrival_date, departure_date, number_of_guests, etc.
The model aims to set the correct room price so the hotels can keep the occupancy rate as high as possible. Multiple parties, such as hotels, travel websites, and businesses, can use this data.
A hotel room price optimization model is an ML tool that predicts the room price based on total stay days, room type, any special request, etc. Hotels can use this tool to set competitive prices and maximize profit.
In hotels, the prediction of room prices relies on several factors, including data type and quality. If the model undergoes training with additional parameters, it improves its ability to predict prices more accurately.
This model can be used in hotels to establish competitive prices, attract more customers, and increase occupancy rates. Travelers can utilize it to secure the best deals at reasonable rates without hotels overcharging them. This also helps in travel budget planning.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.