Air travel is an essential part of modern life, allowing people to travel quickly and easily from one location to another. However, working out the cost of air travel can be a confusing and frustrating experience. Ticket prices are not fixed like the prices of many other goods; they are flexible and depend on the individual case. Prices can rise and fall drastically overnight, and people often have to change their travel plans because of price differences between dates.
For example, last summer I was looking for tickets to attend a summer school at the University of Chicago, and the price differences between flights on different days were substantial. On some days I would have had to pay 500 euros for a return ticket, while on other days the same route would cost 900 euros.
Moreover, when I was about to board my flight back home, I was told that it was overbooked, and I was asked to give up my seat for compensation of 600 euros. This was one of the most surprising things that has ever happened to me, and as a curious mathematics student, I was intrigued to understand how this could be profitable for the airline and how airlines balance the risks and benefits of overbooking. It became clear that they use mathematics to maximize their revenue, so I decided to use this Internal Assessment as an opportunity to explore at least a fraction of the complex calculations airlines undertake before actually putting tickets on sale.
The aim of this research is to explore mathematical models that predict ticket prices based on simple factors such as route distance, day of the week, demand, and competition on the market.
To accomplish this, I will use statistical techniques such as linear regression to identify correlations between ticket prices and those various factors, and statistical analysis to examine the distribution of prices and identify any trends or outliers in the data.
I will also take overbooking into account and investigate the overbooking strategies that airlines use to maximize revenue while minimizing the risk of empty seats, which in turn can lower the prices of tickets sold. To analyze this strategy, I will create a model that uses the probability of passengers not showing up for a flight and data on compensation paid by airlines, in order to see how many passengers an airline can overbook without harming its profitability.
The research question that will guide my work is: How can a mathematical model that predicts the flight prices of Air Serbia be created, taking into account the route distance, demand and competition on the market, and how do overbooking strategies adhere to this model?
The data used for this analysis was collected from Air Serbia's website (Air Serbia, n.d.) and covers the base fares of six intra-European routes (Belgrade to Amsterdam, Paris, London, Athens, Banja Luka and Moscow) and one transcontinental route (Belgrade to New York) for the month of May 2023. We will use IATA (International Air Transport Association) codes to label the routes, as outlined in the table below (IATA, n.d.):
Route | Code |
---|---|
Belgrade - Amsterdam | BEG - AMS |
Belgrade - Paris | BEG - CDG |
Belgrade - London | BEG - LHR |
Belgrade - Athens | BEG - ATH |
Belgrade - Banja Luka | BEG - BNX |
Belgrade - Moscow | BEG - SVO |
Belgrade - New York City | BEG - JFK |
It is worth noting that all flights were operated by the same airline and departed from the same airport. They therefore faced similar taxes and fees, which are outside the airline's control but are still included in the ticket price, so we can exclude those costs from our calculations.
Therefore, the prices that we will look at are based almost exclusively on the demand pattern, the competition, and the distance travelled, and we will approach the task from the perspective of planners in the aviation industry.
One of the most important quantities in our calculations is the average price, calculated as the mean of our data set:
\[ \text{Average price (mean)} = \frac{p(1)+p(2)+p(3)+\dots+p(30)+p(31)}{31} \]
Where \(p(i)\) is the base fare on day \(i\) of May 2023.
The exceptions are the routes for which we do not have prices for every day. In those cases, we simply divide the sum of all available prices by the number of prices in that sum. For example, for the route Belgrade - Amsterdam (BEG-AMS), the calculation would be as follows:
\[ \text{Average price}(\text{BEG-AMS}) = \frac{162.84+162.84+112.84+\dots+187.84+112.84+112.84}{28} \]
Route: | BEG-AMS | BEG-SVO | BEG-ATH | BEG-CDG | BEG-LHR | BEG-BNX | BEG-JFK |
---|---|---|---|---|---|---|---|
Mean | 132.52 | 559.33 | 67.19 | 122.29 | 105.74 | 32.28 | 523.18 |
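
The averaging rule above (sum the available fares and divide by how many there are) can be sketched in a few lines of Python. This is only an illustration: the function name `average_price` and the use of `None` for a day without a flight are my own assumptions, and the sample list is a short placeholder built from fares that appear in the worked example, not the full May data set.

```python
from statistics import fmean


def average_price(daily_prices):
    """Mean of the available fares; days without a flight are None and are skipped."""
    available = [p for p in daily_prices if p is not None]
    if not available:
        raise ValueError("no prices available for this route")
    return fmean(available)


# Placeholder sample in EUR (not the complete data set); None marks a day
# on which the route is assumed not to operate.
sample = [162.84, 162.84, None, 112.84, 187.84, 112.84, 112.84]
print(round(average_price(sample), 2))
```

Skipping the missing days inside the function means the divisor automatically becomes the number of available prices (28 for BEG-AMS) rather than 31.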
Standard deviation is a useful tool that we will use to measure how much prices are spread out around the mean, that is, how much they differ from it (Devore and Berk 2018). This is important because routes with a higher standard deviation have more variation in price and are therefore more interesting for this investigation. Standard deviation is calculated as follows:
\[ \text{Std. Deviation} = \sqrt{\frac{\sum_{i}(p(i)-\text{mean})^2}{N}} \]
Where \(p(i)\) is the price on day \(i\), mean is the average price of the route, and \(N\) is the number of prices in the data set.
So for example, for route Belgrade - Athens (BEG-ATH), the standard deviation is calculated as follows:
\[ \text{Std. Deviation} = \sqrt{\frac{(102.84-67.19)^2+(102.84-67.19)^2+(68.84-67.19)^2+...+(85.84-67.19)^2+(85.84-67.19)^2}{31}} \]
Route: | BEG-AMS | BEG-SVO | BEG-ATH | BEG-CDG | BEG-LHR | BEG-BNX | BEG-JFK |
---|---|---|---|---|---|---|---|
Std Dev | 39.11 | 67.26 | 32.16 | 42.45 | 27.89 | 5.39 | 56.35 |
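
As a minimal sketch of the standard deviation formula above, the Python function below divides by \(N\) (the population form used in this investigation) rather than by \(N-1\). The function name `population_std` is my own, and the sample list contains only the five BEG-ATH fares visible in the worked example, so its result will not match the value in the table, which uses all 31 days.

```python
import math


def population_std(prices):
    """Population standard deviation: sqrt(sum((p - mean)^2) / N)."""
    n = len(prices)
    mean = sum(prices) / n
    variance = sum((p - mean) ** 2 for p in prices) / n
    return math.sqrt(variance)


# Only the fares shown explicitly in the worked example (EUR), not the full month.
sample = [102.84, 102.84, 68.84, 85.84, 85.84]
print(round(population_std(sample), 2))
```

The standard library function `statistics.pstdev` computes the same quantity and could be used instead.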
For each route in our data set, the minimum and maximum prices have been identified. This will help us establish the boundaries of price variability and identify any unusual or unexpected patterns in pricing.
Here, we can calculate another important parameter, Range. This is the difference between the highest and lowest value in a data set. So for example, for the route Belgrade - New York, the range will be calculated as follows:
\[ \text{Range}(\text{BEG-JFK}) = 643.64-508.64 = 135\ \text{EUR} \]
In the table below, collected data is presented:
Route: | BEG-AMS | BEG-SVO | BEG-ATH | BEG-CDG | BEG-LHR | BEG-BNX | BEG-JFK |
---|---|---|---|---|---|---|---|
Min | 58.84 | 449.18 | 48.84 | 75.84 | 78.84 | 31.84 | 508.64 |
Max | 217.84 | 624.18 | 168.84 | 230.84 | 169.84 | 41.84 | 643.64 |
Range | 159.00 | 175.00 | 120.00 | 155.00 | 91.00 | 10.00 | 135.00 |
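
The range row can be checked directly from the minimum and maximum rows. The short Python sketch below uses the values from the table; the dictionary layout and the helper name `price_range` are my own choices for illustration.

```python
# Minimum and maximum base fares in EUR, taken from the Min and Max rows.
min_max = {
    "BEG-AMS": (58.84, 217.84),
    "BEG-SVO": (449.18, 624.18),
    "BEG-ATH": (48.84, 168.84),
    "BEG-CDG": (75.84, 230.84),
    "BEG-LHR": (78.84, 169.84),
    "BEG-BNX": (31.84, 41.84),
    "BEG-JFK": (508.64, 643.64),
}


def price_range(lowest, highest):
    """Range = highest price - lowest price, rounded to cents."""
    return round(highest - lowest, 2)


ranges = {route: price_range(lo, hi) for route, (lo, hi) in min_max.items()}
for route, value in ranges.items():
    print(f"{route}: {value:.2f} EUR")
```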
To analyze the relationship between distance and price of flights, I will use the data from the previous section. Specifically, I will use a linear regression model to estimate the average price of flights based on their distance (Devore and Berk 2018). The first step is to plot the data on a scatter plot:
Now, we can fit a linear regression line to it. The slope of the line represents the average price per kilometre of Air Serbia flights from Belgrade:
The trendline passes above most of the data points. One of the main insights I gained from the scatter plot above is the presence of an outlier: the route from Belgrade to Moscow (BEG-SVO). This route is significantly more expensive than the other routes in our data set, with an average price of over 500 euros, compared to an average of around 100 euros for the other routes of similar distance.
One possible explanation for this anomaly is the sanctions imposed on Russia by most European countries (BBC, 2022), which reduced the number of flights there and consequently raised prices, as demand exceeded the remaining supply. To investigate this possibility further, we could perform a regression analysis to test whether there is a statistically significant relationship between the sanctions and the price of flights on this route. Alternatively, we could look at other factors that could be driving up the price, such as a lack of competition or higher operating costs for airlines. In either case, this route is affected by specific external factors, which are not the topic of this investigation. To keep the analysis accurate and reliable, I will therefore exclude this route from the calculations and focus solely on the remaining routes. This approach should give a more accurate picture of the underlying patterns and trends in flight pricing. The trendline of the remaining routes:
After removing the Belgrade to Moscow route from our analysis, we notice a significant change in the trendline, whose equation is given by:
\[ \operatorname{price}(EUR) = 0.0889 \times \text{distance}(\mathrm{km})+36.561 \]
This trendline fits the remaining data points much more closely. This has resulted in a higher R value (correlation coefficient) for our linear regression, which now stands at 0.99. The correlation coefficient measures how closely the data follow a straight line: it takes values between -1 and 1, and the closer its magnitude is to 1, the better the fit. An R value this close to 1 therefore indicates a very strong positive correlation between distance and price for the remaining routes, suggesting that the model describes these routes well.
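To show how the fitted line, price = 0.0889 × distance + 36.561, can be used, here is a minimal Python sketch. The slope and intercept come from the regression above; the function names and the 1000 km example distance are hypothetical and chosen only for illustration.

```python
SLOPE_EUR_PER_KM = 0.0889   # slope of the fitted trendline
INTERCEPT_EUR = 36.561      # intercept of the fitted trendline


def predicted_price(distance_km):
    """Average base fare (EUR) predicted by the linear model for a given distance."""
    return SLOPE_EUR_PER_KM * distance_km + INTERCEPT_EUR


def implied_distance(price_eur):
    """Distance (km) at which the model predicts the given average fare."""
    return (price_eur - INTERCEPT_EUR) / SLOPE_EUR_PER_KM


# Hypothetical 1000 km route, not one of the routes in the data set.
print(round(predicted_price(1000), 2))
```

The intercept can be read as the part of the fare that does not depend on distance, while the slope adds roughly 8.9 euro cents per kilometre.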
Now that I have identified the correlation between route length and price, I decided to look more closely at the routes and identify how prices change over time. Here is the graph of the data for the 5 medium-distance intra-European routes from our data set (the BEG-BNX and BEG-JFK routes are not relevant here because they do not operate frequently enough for this part of the investigation):
After analysing this graph, I realized that ticket prices vary on a smaller scale than a monthly one: they seem to change over the course of a week, which makes sense given the nature of demand. Most people tend to travel on the same days of the week, depending on the purpose of their trip (family visits at weekends, business trips at the start of the week, and so on).
Therefore, I decided to examine those prices at the level of the week. To find the average ticket price for every day of the week, I will try to model the variation in ticket prices over time by fitting a function to the data (regression). This function will help me identify any patterns or trends in ticket prices that repeat over this fixed period (a week).
For this purpose, I chose to use the route from Belgrade to Paris. The main reason is that it operates every single day of the month and fits the previous linear regression (distance vs price), meaning that it should not have any exceptional external factors influencing the price.
To implement this approach, I will rearrange the data on ticket prices by day of the week. I will then create a scatter plot in which the numbers 1-7 on the x-axis represent the days of the week, while the y-axis represents the variation in prices. As a first attempt I fit a straight line, so the price is given by \(p(x)=ax+b\):
The slope of this line is 1.887, meaning that the fitted price rises steadily through the week. However, we can clearly see that Monday is almost always more expensive than the other weekdays, so the linear fit does not represent the weekly trend in prices well. My next attempt will be a quadratic function of the form:
\[ p(x)=ax^2+bx+c \]
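A quadratic of this form can be fitted by least squares. The sketch below is one possible way to do it in plain Python by solving the three normal equations with Cramer's rule; it is not necessarily the method used by the graphing software in this investigation. The function names are my own, and the weekly prices are placeholder numbers, not the BEG-CDG data.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(xs, ys):
    """Least-squares coefficients (a, b, c) of p(x) = a*x**2 + b*x + c."""
    s = [sum(x ** k for x in xs) for k in range(5)]                  # s[k] = sum of x^k
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]  # t[k] = sum of y*x^k
    # Normal equations: matrix @ (a, b, c) = rhs
    matrix = [[s[4], s[3], s[2]],
              [s[3], s[2], s[1]],
              [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    d = det3(matrix)
    if d == 0:
        raise ValueError("need at least three distinct x values")
    coeffs = []
    for col in range(3):
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]   # Cramer's rule: swap one column for rhs
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)


days = [1, 2, 3, 4, 5, 6, 7]                 # 1 = Monday, ..., 7 = Sunday
prices = [150, 120, 100, 95, 105, 125, 155]  # placeholder fares in EUR, not real data
a, b, c = fit_quadratic(days, prices)
print(f"p(x) = {a:.3f}x^2 + {b:.3f}x + {c:.3f}")
```

If NumPy is available, `numpy.polyfit(days, prices, 2)` returns the same three coefficients with less code.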