Information theoretic-based sampling of observations
Introduction
In numerous fields, recent technological advances have led to a surge in the amount of data being collected. These emerging data sources are changing the data landscape as well as the methods by which data are analysed. For instance, in the field of transport, mobile phone, GPS, Wi-Fi, and public transport smartcard data (Iqbal et al., 2014; Jánošíková et al., 2014; Prato et al., 2014; Farooq et al., 2015) nowadays complement or fully replace traditional travel survey methods (Rieser-Schüssler, 2012), and data-driven methods (such as machine learning), as opposed to theory-driven methods, are increasingly becoming part of the standard toolbox of transport analysts (Wong et al., 2017). Moreover, it is widely believed that the amount of data being collected will continue to increase rapidly in the decades to come (Witlox, 2015).
Although these emerging data sources are widely believed to assist in understanding and solving numerous societal problems, they pose all sorts of new challenges to analysts. For choice modellers, one major challenge relates to the computational burden. In particular, the size of these new data sets renders estimation of sophisticated state-of-the-art discrete choice models, such as Mixed Logit models (Revelt and Train, 1998), computationally burdensome, or even technically infeasible (Vlahogianni et al., 2015). This, in turn, limits the use of these emerging data sources in the numerous fields where choice models are applied. Moreover, even when model estimation is technically feasible on the large data set, many different model specifications are often tested. Long estimation times (which for current data sets may already easily run to several days) therefore quickly become prohibitive.
To deal with increasingly large choice data sets, two types of approaches are commonly taken by analysts. The first approach is to revert to less computationally demanding models, such as Multinomial Logit (McFadden, 1974) and Nested Logit (Daly and Zachary, 1978) models. However, despite being very effective in reducing the computational effort, this approach severely limits the analyst's ability to adequately model complex types of choice behaviour. As such, it is far from desirable. The second commonly taken approach is to modify estimation code in order to increase computational power, e.g. by taking advantage of parallel computing or cluster computing facilities. Recent technological advances have made it easier to employ cloud computing facilities and high-performance computation clusters. However, many analysts do not have access to such facilities, and existing widely available estimation software is typically not designed to take full advantage of these technologies.
To the best of the authors' knowledge, no efforts have been undertaken to scale down large choice data sets, such that initially large data sets can be used with standard discrete choice estimation software packages, such as Biogeme, Nlogit, and Alogit. While removing valid observations is considered a sin by many analysts working with discrete choice models, in fields like machine learning, down-scaling of data sets is more common practice (e.g. Arnaiz-González et al., 2016; Loyola et al., 2016). As we will argue in this paper, using a carefully sampled subset of choice observations can yield nearly identical estimation results to those obtained using the complete data set. Hence, we believe that this approach is worth exploring from a practical point of view.
This study proposes a new information theoretic-based method that lowers the computational burden to estimate sophisticated discrete choice models based on large data sets. The method – which we call Sampling of Observations (SoO) – is inspired by, and closely related to, efficient experimental design. It reduces the size of the data by combining practices from the field of experimental design in Stated Choice (SC) studies (see e.g. Rose and Bliemer, 2009), with established notions from information theory (Shannon and Weaver, 1949). SoO constructs a subset of the data that consists of a manageable number of observations that are jointly highly informative on the behaviour that is being studied. It does so by sampling observations from the full data set in such a way that the D-error statistic of the subset is minimised, which means that Fisher information is maximised. The D-error is evaluated using what we call the sampling model. This model is a ‘simplified’ version of the sophisticated choice model that the analyst ultimately wishes to estimate. By using a simple model in the sampling stage, SoO is computationally cheap and fast to conduct.
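To make the core idea concrete, the sketch below greedily grows a subset of observations that minimises the D-error of a simple MNL sampling model evaluated at prior parameter values. This is our own minimal illustration under stated assumptions, not the paper's implementation: the function names, the greedy growth strategy, and the random initialisation are all assumptions, and the priors are taken as given.

```python
import numpy as np

def mnl_information(X, beta):
    """Fisher information contribution of one observation.

    X: (J, K) attribute matrix of the J alternatives; beta: (K,) priors."""
    v = X @ beta
    p = np.exp(v - v.max())
    p /= p.sum()
    return X.T @ (np.diag(p) - np.outer(p, p)) @ X

def d_error(info, n_params):
    # D-error = det(inverse information)^(1/K); lower means more informative
    return np.linalg.det(np.linalg.inv(info)) ** (1.0 / n_params)

def sample_observations(obs, beta, n_keep, n_init=None):
    """Greedy SoO sketch: grow a subset that minimises the D-error of a
    simple MNL 'sampling model' evaluated at prior values beta."""
    K = len(beta)
    contrib = [mnl_information(X, beta) for X in obs]
    rng = np.random.default_rng(0)
    # start from a few random observations so the information matrix is invertible
    chosen = list(rng.choice(len(obs), size=n_init or K + 1, replace=False))
    info = sum(contrib[i] for i in chosen)
    while len(chosen) < n_keep:
        remaining = [i for i in range(len(obs)) if i not in chosen]
        # add the observation whose contribution lowers the D-error most
        best = min(remaining, key=lambda i: d_error(info + contrib[i], K))
        chosen.append(best)
        info = info + contrib[best]
    return chosen
```

In practice, a swap-based search (as in the Modified Fedorov algorithm used for experimental designs) could replace the simple greedy growth used here; the principle of evaluating candidate subsets by their D-error is the same.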
The remaining part of this paper is organised as follows. Section 2 presents the methodology on information theoretic-based sampling of observations. Section 3 explores the efficacy of the method using Monte Carlo analyses. Finally, Section 4 closes with conclusions and a discussion.
Preliminary: the effect of sample size
For asymptotically consistent estimators the standard errors associated with the estimates decrease with increasing sample size. Specifically, in case the observations are randomly drawn from the target population, standard errors decrease at a rate of 1/√N (Fisher, 1925), where N denotes the sample size, see Fig. 1. This implies that the slope flattens out fast: at a rate of N^(−3/2). This reflects the fact that, relatively speaking, less and less new information is revealed on the data generating process.
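The 1/√N decay can be checked in a few lines (an illustrative computation of our own, not taken from the paper): quadrupling the sample size only halves the standard error, and the marginal precision gain per extra observation shrinks at rate N^(−3/2).

```python
import numpy as np

# Standard error of a consistent estimator scales as sigma / sqrt(N).
def standard_error(n, sigma=1.0):
    return sigma / np.sqrt(n)

# Quadrupling N halves the standard error, regardless of the starting N.
ratios = [standard_error(n) / standard_error(4 * n) for n in (100, 1000, 10000)]

# The gain from one extra observation shrinks rapidly (at rate N^(-3/2)).
gains = [standard_error(n) - standard_error(n + 1) for n in (100, 1000)]
```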
Monte Carlo analyses
This section puts the proposed information theoretic-based sampling method to the test. To be able to draw conclusions regarding the merits of the method, we conduct a series of Monte Carlo experiments. Ultimately, we are interested in the reduction in estimation time that can be achieved using the method, while taking into account its statistical cost (in terms of the precision with which the parameter estimates are recovered). Furthermore, although there is a priori no
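As a rough illustration of what such a Monte Carlo experiment involves (our own sketch, not the paper's actual experimental design), one can simulate choices from a known MNL model and compare estimates obtained on the full data set with those from a subset. The hand-rolled Newton-Raphson estimator below stands in for a proper estimation package such as Biogeme; the subset here is simply the first 1,000 observations, whereas the paper's method selects them by D-error.

```python
import numpy as np

def estimate_mnl(X, y, n_iter=25):
    """Newton-Raphson MNL estimation. X: (N, J, K) attributes; y: (N,) chosen index."""
    N, J, K = X.shape
    beta = np.zeros(K)
    for _ in range(n_iter):
        v = X @ beta                                  # (N, J) utilities
        p = np.exp(v - v.max(axis=1, keepdims=True))  # stable softmax
        p /= p.sum(axis=1, keepdims=True)
        # score: chosen attributes minus probability-weighted attributes
        grad = (X[np.arange(N), y] - (p[:, :, None] * X).sum(axis=1)).sum(axis=0)
        # Fisher information matrix of the sample
        info = np.einsum('njk,nj,njl->kl', X, p, X) - np.einsum(
            'nj,njk,nm,nml->kl', p, X, p, X)
        beta = beta + np.linalg.solve(info, grad)
    return beta

# Monte Carlo sketch: simulate choices with a known beta, then compare
# estimates from the full data set and from a subset.
rng = np.random.default_rng(1)
true_beta = np.array([-1.0, 0.8])
X = rng.normal(size=(5000, 3, 2))
u = X @ true_beta + rng.gumbel(size=(5000, 3))  # utilities with EV1 errors
y = u.argmax(axis=1)
beta_full = estimate_mnl(X, y)
beta_sub = estimate_mnl(X[:1000], y[:1000])
```

In a full experiment one would repeat this over many replications and report both the parameter recovery (bias, standard errors) and the wall-clock estimation time of the full versus the scaled-down data set.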
Conclusions and discussion
This paper presented a new method to lower the computational burden of estimating sophisticated discrete choice models based on large data sets. This method – which we call Sampling of Observations (SoO) – scales down the size of the choice data set based on information-theoretic principles. SoO extracts a subset of observations from the full data set which is much smaller in volume than the original data set, yet produces nearly identical results. The method is inspired by, and closely related to, efficient experimental design.
Statement of contribution
Due to the surge in the amount of data being collected, analysts are increasingly faced with very large data sets. Estimation of sophisticated discrete choice models (such as Mixed Logit models) based on these large data sets can be computationally burdensome, or even technically infeasible. This research contributes to the state-of-the-art on challenges in the field of choice modelling associated with utilising the full potential of emerging data sources. It develops a new sampling method.
Acknowledgements
The authors would like to thank Prof. Carlo G. Prato, Dr. Thomas K. Rasmussen and Prof. Otto A. Nielsen for sharing their data with us.
References (36)
- et al. (2016). Instance selection of linear complexity for big data. Knowl. Base Syst.
- (2009). Efficient stated choice experiments for estimating nested logit models. Transp. Res. Part B Methodol.
- et al. (2016). On determining priors for the generation of efficient stated choice experimental designs. J. Choice Model.
- et al. (2017). Detecting dominancy in stated choice data and accounting for dominancy-based scale differences in logit models. Transp. Res. Part B Methodol.
- et al. (2007). Designs with a priori information for nonmarket valuation with choice experiments: a Monte Carlo study. J. Environ. Econ. Manag.
- et al. (2006). On the use of a modified Latin Hypercube sampling (MLHS) method in the estimation of a mixed logit model for vehicle choice. Transp. Res. Part B Methodol.
- et al. (2014). Development of origin–destination matrices using mobile phone call data. Transport. Res. C Emerg. Technol.
- (2000). A stochastic transit assignment model considering differences in passengers' utility functions. Transp. Res. Part B Methodol.
- et al. (2018). On the robustness of efficient experimental designs towards the underlying decision rule. Transport. Res. Pol. Prac.
- et al. (2015). Big data in transportation and traffic engineering. Transport. Res. C Emerg. Technol.
- Stochastic route choice set generation: behavioral and probabilistic foundations. Transportmetrica.
- A comparison of algorithms for constructing exact D-optimal designs. Technometrics.
- Improved multiple choice models.
- Sample size requirements for discrete-choice experiments in healthcare: a practical guide. The Patient - Patient-Cent. Outcomes Res.
- Ubiquitous Monitoring of Pedestrian Dynamics: Exploring Wireless Ad Hoc Networks of Multi-sensor Technologies.
- Theory of Optimal Experiments.
- Statistical Methods for Research Workers.
- The importance of utility balance in efficient choice designs. J. Market. Res.
2019, Journal of Choice ModellingCitation Excerpt :After that, it will start searching the solution space using a Modified Fedorov algorithm (Fedorov, 1972), and will indicate this in the output message box. The Fedorov algorithm starts by taking a randomly drawn design (this is termed ‘an iteration’), and continues by replacing choice tasks in the design with those from the candidate set and evaluating its impact (see e.g. Van Cranenburgh and Bliemer, 2018). After having exhausted the first series of replacements, a first efficient design is found.