

Airline itinerary choice modeling using machine learning
Journal of Choice Modelling (IF 4.164), Pub Date: 2019-06-01, DOI: 10.1016/j.jocm.2018.02.002
Alix Lhéritier, Michael Bocamazo, Thierry Delahaye, Rodrigo Acuna-Agost

This paper deals with the airline itinerary choice problem. Consider, for example, a customer searching for flights from London to New York, departing next Tuesday and returning on Saturday. The search request is processed by a travel provider (e.g., an online travel agent), which proposes between 50 and 100 different alternatives (itineraries) to the customer. The itineraries differ in attributes such as number of stops, total trip duration, and price. The question is: “which one is (probably) going to be selected by the customer?” There is growing interest within the travel industry in better understanding how customers choose between itineraries when searching for flights. Such an understanding can help travel providers, whether airlines or travel agents, adapt their offer to market conditions and customer needs, and thus increase their revenue. It can be used to filter alternatives, to sort them, or even to change some attributes in real time (e.g., the price).

The field of customer choice modelling is dominated by traditional statistical approaches, such as the Multinomial Logit (MNL) model, that are linear with respect to features and tightly bound to their assumptions about the distribution of the error term (Gaussian, Gumbel, etc.). While these models offer the dual advantages of simplicity and readability, they lack the flexibility to handle correlations between alternatives or non-linearity in the effect of alternative attributes. A large part of the existing modelling work focuses on adapting these modelled distributions so that they match observed behaviour. Nested (NL) and Cross-Nested (CNL) Logit models are good examples: they add terms for highly specific feature interactions so that substitution patterns between sub-groups of alternatives can be captured. In this work, we present an alternative modelling approach based on machine learning techniques.
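For reference, the MNL baseline mentioned above assigns each alternative a utility that is linear in its attributes and turns the utilities of one choice set into probabilities with a softmax. The sketch below is only an illustration: the attribute names, the coefficient values, and the three itineraries are invented, and in practice the coefficients would be estimated from booking data.

```python
import math


def mnl_probabilities(alternatives, coefficients):
    """Return MNL choice probabilities for the alternatives of one choice set.

    alternatives: list of dicts mapping attribute name -> numeric value.
    coefficients: dict mapping attribute name -> weight (hypothetical values).
    """
    # Utility of each alternative is linear in its attributes.
    utilities = [
        sum(coefficients[name] * alt[name] for name in coefficients)
        for alt in alternatives
    ]
    # Subtract the largest utility before exponentiating, for numerical stability.
    top = max(utilities)
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


# Invented choice set: price is expressed in hundreds of currency units.
itineraries = [
    {"stops": 0, "duration_h": 8.0, "price_100": 9.0},
    {"stops": 1, "duration_h": 11.0, "price_100": 6.0},
    {"stops": 2, "duration_h": 15.0, "price_100": 5.0},
]
# Invented coefficients: more stops, longer trips and higher prices lower utility.
beta = {"stops": -0.8, "duration_h": -0.2, "price_100": -0.3}

probs = mnl_probabilities(itineraries, beta)
for alt, p in zip(itineraries, probs):
    print(alt, round(p, 3))
```

Because the utility is a fixed weighted sum, the effect of each attribute is the same for every customer and every value range, which is the linearity limitation the abstract refers to.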
The selected machine learning methods do not require any assumption about the distribution of errors. They can also capture non-linear relationships between feature values and the target class, accommodate collinear features, and offer more modelling flexibility in general. In particular, we have chosen to work with Random Forests. Random Forests are ensembles of decision trees which aggregate their predictions. A decision tree is a tree of nodes, each of which applies a linear threshold to a single variable. Each decision tree in a random forest receives a random subset of the input features and then generates a tree deterministically from those features. Random Forests are well adapted to our problem because the model bifurcations (branches) automatically partition the customers into segments while also capturing non-linear relationships among the attributes of alternatives and the characteristics of the decision maker.

Indeed, two main segments must be taken into account for our particular problem: business and leisure air passengers behave very differently when booking flights. Business passengers tend to favour alternatives with convenient schedules, such as shorter connection times and preferred travel times, while leisure passengers are very price sensitive; in other words, they will accept a longer connection time if it is reflected in a lower ticket price. The problem is that the “segment” is not explicitly known when the customer is searching; however, it can be inferred by combining different factors. For example, industry experts know that business passengers tend to book closer to departure and are less inclined to stay over a Saturday night. These are not “black or white” rules, however, which reinforces the need for a model able to detect such rules from the data and actual customer behaviour.
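To make the segmentation idea concrete, the sketch below hand-writes two tiny threshold trees and averages them the way an ensemble would. It is not a trained Random Forest and not the authors' model: every feature name, threshold, and leaf score is invented, and rescaling the scores to sum to one within a choice set is an assumption made here for illustration.

```python
def tree_a(search, alt):
    """Hypothetical tree: split on booking lead time, then on one itinerary attribute."""
    if search["days_to_departure"] < 7:       # late booking: business-like branch
        return 0.8 if alt["connection_h"] < 2.0 else 0.2
    return 0.7 if alt["price"] < 500 else 0.3  # early booking: leisure-like branch


def tree_b(search, alt):
    """Hypothetical tree: split on Saturday-night stay, then on one itinerary attribute."""
    if not search["saturday_night_stay"]:      # no Saturday night: business-like branch
        return 0.9 if alt["stops"] < 1 else 0.1
    return 0.6 if alt["price"] < 450 else 0.4  # Saturday night: leisure-like branch


def forest_shares(search, alternatives, trees):
    """Average the tree scores per alternative, then rescale them to sum to 1."""
    raw = [sum(tree(search, alt) for tree in trees) / len(trees) for alt in alternatives]
    total = sum(raw)
    return [score / total for score in raw]


# Invented choice set: an expensive direct flight and a cheap one-stop flight.
alternatives = [
    {"stops": 0, "connection_h": 0.0, "price": 900},
    {"stops": 1, "connection_h": 3.5, "price": 400},
]
business_search = {"days_to_departure": 3, "saturday_night_stay": False}
leisure_search = {"days_to_departure": 45, "saturday_night_stay": True}

business = forest_shares(business_search, alternatives, [tree_a, tree_b])
leisure = forest_shares(leisure_search, alternatives, [tree_a, tree_b])
print("business-like search:", business)
print("leisure-like search:", leisure)
```

Each node tests a single variable against a threshold, and the first split on a search-level feature acts as an implicit segment. A real Random Forest would learn such splits from the data rather than rely on expert rules.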
Another observed advantage is that Random Forests are fairly quick to train and very quick to predict, which enables fast iteration. In our numerical experiments, the machine learning methods were faster to train than MNL, with both using public libraries with default parameters.

We trained and tested our models on a dataset of flight searches and bookings on six European markets, extracted from GDS (global distribution system) logs. Each choice set consists of the results of one flight search request and includes between 10 and 250 different itineraries, one of which was booked by the customer. Choice sets are grouped by origin and destination market. Our main experiments compare MNL with machine learning models. We evaluate the models by comparing their predicted fractional shares of choices against the actual distribution of choices. Because the alternatives in our dataset are not fixed from one shopping session (choice situation) to the next, we compare the predicted and actual market shares of groups of alternatives, such as flight numbers, flights from one airline, or flights departing during a specific time window. These KPIs are particularly useful, as they are often the final result expected from the model.

Our general finding is that, on most origin and destination markets, machine learning models outperform MNL on all relevant metrics. In particular, we found a reduction of more than 70% in the airline prediction error on four of the six markets we considered. We also find that training a model on several markets at once results in similar performance, which could greatly help in scaling the models to a large number of markets.
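The group-level comparison described above can be sketched as follows. The abstract does not say which error measure is used, so the sum of absolute share differences below is an assumption, and the airline codes, field names (`p_pred`, `booked`), and probabilities are invented for illustration.

```python
from collections import defaultdict


def group_shares(sessions, key, weight):
    """Average, over sessions, the total `weight` falling on each group value of `key`."""
    totals = defaultdict(float)
    for session in sessions:
        for alt in session:
            totals[alt[key]] += alt[weight]
    return {group: value / len(sessions) for group, value in totals.items()}


def share_error(sessions, key):
    """Sum of absolute differences between predicted and actual group shares."""
    predicted = group_shares(sessions, key, "p_pred")
    actual = group_shares(sessions, key, "booked")
    groups = set(predicted) | set(actual)
    return sum(abs(predicted.get(g, 0.0) - actual.get(g, 0.0)) for g in groups)


# Two invented shopping sessions with different alternatives in each.
# "p_pred" is a model's predicted choice probability; "booked" is 1 for the chosen itinerary.
sessions = [
    [
        {"airline": "XA", "p_pred": 0.6, "booked": 1},
        {"airline": "XB", "p_pred": 0.4, "booked": 0},
    ],
    [
        {"airline": "XA", "p_pred": 0.3, "booked": 0},
        {"airline": "XB", "p_pred": 0.5, "booked": 1},
        {"airline": "XC", "p_pred": 0.2, "booked": 0},
    ],
]

predicted_shares = group_shares(sessions, "airline", "p_pred")
actual_shares = group_shares(sessions, "airline", "booked")
error = share_error(sessions, "airline")
print(predicted_shares, actual_shares, round(error, 3))
```

Grouping by airline, flight number, or departure window only changes the `key` argument, which is why the comparison still works when the individual alternatives differ from one session to the next.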
The main conclusion of our work is thus that non-linear machine learning methods such as the ones we present here can provide clear benefits to some choice modelling applications, such as air travel itinerary choice by passengers, notably thanks to their better handling of non-linearity, their greater overall flexibility, and their fast training.


Updated: 2019-06-01
