The Problem Statement
The hotel industry faces a perennial problem of losing revenue to room cancellations. Cancellations occur largely because customers seek the best deals, giving up a prior reservation in favour of a better price or offer at another hotel.
As hotels are affected by the service characteristic of perishability, where room capacity within a hotel is largely fixed and cannot be stored for later sale or use, revenue lost from an unsold room can never be earned back. Potential cancellations and no-shows result in opportunity costs from unoccupied rooms, which in turn affect revenue generation. To illustrate the severity of cancellations, our case study, a hotel chain in Portugal, faces cancellation rates of 41% and 27% across all bookings made at its city and resort hotels respectively, resulting in a corresponding loss of potential revenue of $10.5 million and $5.7 million.

Hence, hotels tend to adopt overbooking tactics to minimize potential losses in revenue, accepting guest reservations beyond the available room capacity. However, such a strategy is risky: customers who have made a booking could be turned away due to insufficient rooms, which can negatively affect a hotel’s reputation.
The Dataset
A hotel chain in Portugal was keen to explore ways to increase revenue and provided a customer dataset of key booking attributes such as lead time, total nights stayed, average daily room rate, country of origin, distribution channel and room type booked, as well as the main outcome variable: reservation status (cancelled/confirmed). The chain operates two types of hotels, namely resort hotels and city hotels. Total revenue generated was a function of average daily room rate (ADR), the price of a room, multiplied by occupancy rate.

From a revenue management perspective, optimal revenue generation can be achieved through increasing demand and occupancy rates with ADR, selecting the ideal mix of distribution channels and targeting the right market and customer segments. These concepts were explored in detail for this project and will be highlighted during the presentation. Also, in the context of the dataset used, as we do not have any information on operational costs, we have decided to focus on occupancy rates instead of ADR as a driver of total revenue.
Exploratory Data Analysis
Seasonality
Revenue is subject to seasonal fluctuations in demand, where occupancy rates and ADR experience peaks and troughs through the different seasons. For example, ADR tends to be high in peak seasons, when demand surges as guests seek out holidays. Conversely, ADR is typically lower in the off-season as hotels seek to increase occupancy amidst low demand.

Across the two hotels in the dataset, seasonality has two distinct effects. For the city hotel, seasonality has a stronger impact on demand and hence bookings fluctuate, peaking during the months of March to October and dipping thereafter. Revenue follows the same trend as bookings, influenced by the highs and lows of seasonality.
Bookings for the resort hotel tend to be more consistent throughout the year and are visibly less affected by seasonality, though bookings are higher from February to May and peak suddenly in October. However, revenue earned reflects a strong fluctuation in room prices, as the resort hotel achieves high revenue in the months of July and August, suggesting extremely high ADR during those two months.
Distribution Channels
In a highly connected world, customers are often presented with a wide array of distribution channels, such as direct bookings with the hotel (direct channel), online travel agents (OTAs) such as Booking.com and Expedia, and Global Distribution Systems for B2B bookings.
In recent years, hotels have come to depend heavily on OTAs as a channel to gain better visibility on the internet. However, non-direct channels often entail commissions, with OTAs commanding rates of up to 30%. In the dataset, most bookings for the city and resort hotels are made through the TA/TO (travel agent/tour operator) channel, at 82% and 68% respectively. Assuming a commission rate of 30%, revenue lost to commission amounted to $3.6 million for the city hotel and $2.6 million for the resort hotel. To worsen the situation, the OTA commission model often includes a parity clause, preventing hotels from offering lower prices to guests who book directly. This illustrates the need to redirect guests to make reservations directly with the hotel to ensure maximum profitability.

Data Preprocessing
Upon analysing the dataset, our group noticed a few characteristics:
- Dataset is well-balanced for binary classes Cancel and No Cancel
- Features are not on the same scale
- Missing data values
- Both categorical and numerical columns

Hence, we took the following steps to pre-process the data:
- Feature importance was analysed and columns that did not contribute much to cancellation rate were removed, e.g. meal type, guest’s company, number of babies etc.
- Resort Hotel data and City Hotel data were separated into two datasets
- Rows with missing data were dropped
- Encoding done for categorical columns
- Scaling done for numerical columns
- Training and test sets created for both Resort and City Hotel datasets, with an 80-20 ratio
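
The steps above can be sketched in plain Python as follows. This is a minimal illustration under assumptions, not the group's actual pipeline: the field names (`lead_time`, `adr`, `channel`, `is_canceled`), the toy rows and the helper functions are all hypothetical, and one-hot encoding and standardisation are assumed because the text does not name the encoder or scaler used.

```python
import random
from statistics import mean, pstdev

# Hypothetical toy bookings; None marks a missing value.
rows = [
    {"lead_time": 10, "adr": 80.0, "channel": "Direct", "is_canceled": 0},
    {"lead_time": 200, "adr": 120.0, "channel": "TA/TO", "is_canceled": 1},
    {"lead_time": None, "adr": 95.0, "channel": "TA/TO", "is_canceled": 0},
    {"lead_time": 45, "adr": 60.0, "channel": "Corporate", "is_canceled": 0},
    {"lead_time": 300, "adr": 150.0, "channel": "TA/TO", "is_canceled": 1},
    {"lead_time": 5, "adr": 70.0, "channel": "Direct", "is_canceled": 0},
]

NUMERIC = ["lead_time", "adr"]   # assumed numerical columns
CATEGORICAL = ["channel"]        # assumed categorical columns


def drop_missing(data):
    """Keep only rows in which every field has a value."""
    return [r for r in data if all(v is not None for v in r.values())]


def one_hot(data, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({r[column] for r in data})
    encoded = []
    for r in data:
        new = {k: v for k, v in r.items() if k != column}
        for c in categories:
            new[f"{column}={c}"] = 1 if r[column] == c else 0
        encoded.append(new)
    return encoded


def train_test_split(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the rows, then split them into (train, test)."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


def standardise(train, test, column):
    """Scale a numeric column in place to zero mean and unit variance,
    using statistics computed on the training rows only."""
    values = [r[column] for r in train]
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # guard against a constant column
    for r in train + test:
        r[column] = (r[column] - mu) / sigma


clean = drop_missing(rows)
for col in CATEGORICAL:
    clean = one_hot(clean, col)
train, test = train_test_split(clean)  # 80-20 split
for col in NUMERIC:
    standardise(train, test, col)

print(len(train), "training rows,", len(test), "test rows")
```

In this sketch the scaler is fitted on the training rows only, so that no information from the test rows leaks into training. That ordering is a design choice of the example rather than something stated in the steps above.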
Data Science Solution
Classification
Predicting hotel booking cancellations is crucial to revenue and resource management for hotels. We therefore propose the use of supervised binary classification models to predict hotel booking cancellation. With the model output and analysis, we will then be able to:
- Predict Cancellation Rate: By using historical data, we train a classification model that can predict the cancellation outcome (Cancel or No Cancel) given certain attributes of customers.
- Identify Cancellation Reasons: From the model, we can derive the importance of each attribute behind the overall decision.
From there, we will be able to improve future revenue and optimize resource management for hotels.
Clustering
By clustering the customer base, we can identify unique characteristics of each customer group. Therefore, we propose the use of unsupervised clustering models to perform customer segmentation. With the model output and analysis, we will then be able to:
- Identify potential partnerships with external vendors to increase customer retention based on each customer group’s habits
- Encourage direct bookings with the hotel to reduce costs from commissions paid to OTAs
Classification
Different machine learning classifiers, namely Logistic Regression, Random Forest, Gradient Tree Boosting and XGBoost, were employed, with ROC score and Accuracy as the metrics to compare model performance.
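
As a reference for how the two metrics are computed, here is a minimal sketch in plain Python. It assumes that "ROC score" refers to the area under the ROC curve (ROC AUC); the labels and predicted probabilities are hypothetical and are not results from the project.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve via its rank interpretation: the
    probability that a randomly chosen positive is scored above a
    randomly chosen negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs both classes to be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = cancelled) and predicted cancellation probabilities.
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"ROC AUC  = {roc_auc(y_true, scores):.3f}")
```

Accuracy depends on the chosen probability cut-off, whereas ROC AUC summarises ranking quality across all cut-offs, which is one reason to report both.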


After doing the model training and evaluation, it was noted that the Gradient Boosting model has the best performance for Resort Hotels, while XGBoost has the best performance for City Hotels.
Besides training a model to predict cancellations, our group also ran a feature importance analysis to identify the attributes that contribute most to predicting the outcome variable.
Reasons to Cancel
Amongst the attributes in the dataset, we found that several attributes led to higher odds of cancellation than others.
- For both Resort and City Hotel, certain room types have higher odds of cancellation than other room types
- For both Resort and City Hotel, customers with a history of cancellations, and in particular a higher number of previous cancellations, were more likely to cancel
- Both longer lead time and higher room rate also led to a higher chance of cancellation. However, room rate mattered more to the cancellation decision at the resort hotel, while lead time mattered more at the city hotel
- And finally, both Resort and City Hotel saw a higher chance of cancellation for the Transient customer type, whose bookings are not part of a group or contract.
Reasons to Stay
Amongst the attributes in the dataset, we found that several attributes led to lower odds of cancellation than others.
- Customers who required parking spaces have almost no chance of cancelling
- Certain room types reduce the odds of cancellation by a factor of 0.3 or more
- Bookings made for the summer season have their odds reduced by a factor of 0.7 at resort hotels
- And finally, the odds of cancellation are roughly 0.25 times the original odds for bookings made by foreigners
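
Because these effects are stated in terms of odds rather than probabilities, it can help to convert them. The sketch below is illustrative only: the function name is hypothetical, and the 41% baseline simply reuses the city hotel's overall cancellation rate as a stand-in, whereas in a fitted model the odds multiplier applies with all other attributes held fixed.

```python
def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability obtained after multiplying the baseline odds by odds_ratio."""
    if not 0 < baseline_prob < 1:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    odds = baseline_prob / (1 - baseline_prob)   # probability -> odds
    new_odds = odds * odds_ratio                 # apply the multiplier
    return new_odds / (1 + new_odds)             # odds -> probability


# Illustrative baseline: 41% cancellation probability.
# Multiplier: 0.25 times the original odds, as quoted for foreign bookings.
baseline = 0.41
adjusted = apply_odds_ratio(baseline, 0.25)
print(f"baseline {baseline:.0%} -> adjusted {adjusted:.1%}")
```

Note that multiplying the odds by 0.25 does not multiply the probability by 0.25; the two only coincide approximately when the baseline probability is small.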
Clustering
We split the dataset by hotel type and looked only at bookings with completed stays, because these reflect the characteristics of the customers that the hotel needs to retain and attract. We then used the elbow method to determine the number of clusters for each hotel, before performing K-means clustering on the dataset.
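
The elbow method can be sketched as follows: run K-means for a range of cluster counts and look for the point where the within-cluster sum of squares (inertia) stops falling sharply. This is a minimal plain-Python illustration, not the project's code. The toy points are hypothetical, already-scaled two-feature bookings, and the deterministic farthest-first initialisation is an assumption made here to keep the example reproducible.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-first start: begin with the first point, then
    repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm. Returns (centroids, inertia)."""
    centroids = init_centroids(points, k)
    for _ in range(n_iter):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


# Hypothetical scaled features for nine completed stays (three loose groups).
points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

inertias = {k: kmeans(points, k)[1] for k in range(1, 6)}
for k, value in inertias.items():
    print(f"k={k}: inertia={value:.2f}")
```

On this toy data the inertia drops steeply up to k=3 and changes little afterwards, which is the "elbow" that suggests three clusters. On real booking data the bend is usually less clear-cut and calls for judgment.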

We were able to identify certain customer profiles for both hotels. For example, the Resort Luxury cluster consists mostly of locals with a higher ADR than the City Luxury cluster. In the corporate clusters, we see that both are budget-conscious and have short stays, but the Corporate City cluster could be said to be more price-sensitive, as at least 20% of its reservations requested no meal catering.
Formulating a Business Plan
With the results from Classification and Clustering, we were able to translate the data science results into a business plan.
A Closed-Loop Classification Data Ecosystem
We propose setting up a data hub that incorporates the hotel's existing data warehouse together with the newly developed classification model. The data warehouse would then feed the historical booking data into the model to predict future cancellation rates.

First, the data hub stores all the historical data on customer bookings. The classification model is then trained on the data and new bookings can then be evaluated for their most probable outcome: either cancel or no cancel. As the data continuously flows into the data hub and its classification model, top attributes that lead to higher cancellation rates, as well as top reasons that entice customers not to cancel can be identified.
This information can then be fed to the business stakeholders, who can calculate the overall expected rate of cancellation and decide the threshold percentage by which they would like to overbook the hotel. This would make up for some of the revenue lost to future cancellations. In addition, with the top reasons for cancellation identified, the hotel management can rethink possible discount packages to offer to customers, so as to defray the cost of the room and incentivise them to keep their booking.
The second purpose of the data hub would be to leverage the existing strengths of the hotels, so that opportunities to capture greater revenue can be secured. For example, for both resort and city hotels, certain room types and the requirement of parking spaces were key factors behind non-cancellations. This could lead to partnerships with external partners such as tour vendors to offer a holistic tour package involving car rentals, hotel rooms and day tours, strengthening booking rates during lull periods and leading to higher revenue.
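
One simple way to turn model outputs into an overbooking figure is sketched below. It is an illustration under assumptions rather than the proposed system: the function names, the interpretation of the threshold as the share of expected cancellations to cover, and the sample probabilities are all hypothetical.

```python
import math


def expected_cancellation_rate(cancel_probs):
    """Average predicted cancellation probability across open bookings."""
    return sum(cancel_probs) / len(cancel_probs)


def overbooking_allowance(capacity, expected_rate, threshold):
    """Extra reservations to accept beyond capacity.

    threshold is the share (0 to 1) of the expected cancellations that
    management is willing to cover by overbooking; rounding down keeps
    the allowance on the cautious side.
    """
    if not 0 <= threshold <= 1:
        raise ValueError("threshold must be between 0 and 1")
    return math.floor(capacity * expected_rate * threshold)


# Hypothetical model outputs for four bookings and a 200-room hotel.
cancel_probs = [0.25, 0.5, 0.25, 0.5]
rate = expected_cancellation_rate(cancel_probs)
extra = overbooking_allowance(capacity=200, expected_rate=rate, threshold=0.5)
print(f"expected cancellation rate: {rate:.1%}, overbook by {extra} rooms")
```

A lower threshold reduces the risk of turning away guests at the cost of leaving more rooms unsold, so the value is a business decision rather than a model output.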
Overall, the outcome of this data ecosystem would be to improve customer experience, and as the hotel improves on its services, new customers would then come in and provide new top reasons for cancellation, leading to a continuous process of improvement and closing the data loop.
Tailored Promotion Mechanics for each customer profile
One of the main sources of lost revenue is bookings made through online travel agents (OTAs), some of which charge commissions as high as 30%. To reduce this loss, promotions and marketing campaigns can be run to entice customers to book directly with the hotel instead. These marketing campaigns can be tailored to target the individual customer profiles identified by the Clustering model.
For example, customers who fall under the Resort Luxury profile can be enticed with discounts on luxury resort rooms coupled with spa packages when they make a direct booking with a resort hotel. To determine the amount of discount to give, a sensitivity analysis can be done for each cluster, based on its ADR, to calculate the incremental revenue that could be earned for each percentage of discount given and each percentage of conversion.
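
A sensitivity analysis of this kind could look like the sketch below. It assumes a simple model in which a converted booking earns the discounted ADR instead of the ADR net of commission; the function name, the cluster's ADR, the room-night volume and the grid of discount and conversion rates are hypothetical, and only the 30% commission figure comes from the discussion above.

```python
def incremental_revenue(adr, room_nights, discount, conversion, commission=0.30):
    """Extra revenue from moving a share of OTA room nights to direct booking.

    Each converted room night earns adr * (1 - discount) instead of
    adr * (1 - commission), so the gain per night is
    adr * (commission - discount).
    """
    converted_nights = room_nights * conversion
    return converted_nights * adr * (commission - discount)


# Hypothetical cluster: ADR of 150 and 10,000 room nights booked via OTAs.
ADR, ROOM_NIGHTS = 150.0, 10_000
discounts = [0.05, 0.10, 0.15, 0.20]
conversions = [0.05, 0.10, 0.20]

grid = {
    (d, c): incremental_revenue(ADR, ROOM_NIGHTS, d, c)
    for d in discounts
    for c in conversions
}

print("rows: discount, columns: conversion of", conversions)
for d in discounts:
    row = "  ".join(f"{grid[(d, c)]:>10,.0f}" for c in conversions)
    print(f"discount {d:.0%}: {row}")
```

The sketch treats discount and conversion as independent inputs. In practice a deeper discount would be expected to raise conversion, so the realistic outcomes lie along a path through the grid rather than across all of it, and any discount at or above the commission rate leaves no gain.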