Elsevier

Journal of Choice Modelling

Volume 31, June 2019, Pages 181-197

Information theoretic-based sampling of observations

https://doi.org/10.1016/j.jocm.2018.02.003

Abstract

Due to the surge in the amount of data that are being collected, analysts are increasingly faced with very large data sets. Estimation of sophisticated discrete choice models (such as Mixed Logit models) based on these typically large data sets can be computationally burdensome, or even infeasible. Hitherto, analysts have tried to overcome these computational burdens by reverting to less computationally demanding choice models or by taking advantage of the increase in computational resources. In this paper we take a different approach: we develop a new method called Sampling of Observations (SoO) which scales down the size of the choice data set, prior to the estimation. More specifically, based on information-theoretic principles this method extracts a subset of observations from the data which is much smaller in volume than the original data set, yet produces statistically nearly identical results. We show that this method can be used to estimate sophisticated discrete choice models based on data sets that were originally too large to conduct sophisticated choice analysis.

Introduction

In numerous fields, recent technological advances have led to a surge in the amount of data that are being collected. These emerging data sources are changing the data landscape as well as the methods by which data are analysed. For instance, in the field of transport, mobile phone, GPS, Wi-Fi, and public transport smartcard data (Iqbal et al., 2014; Jánošíková et al., 2014; Prato et al., 2014; Farooq et al., 2015) are nowadays complementing or fully replacing traditional travel survey methods (Rieser-Schüssler, 2012), and data-driven methods (such as machine learning), as opposed to theory-driven methods, are increasingly becoming part of the standard toolbox of transport analysts (Wong et al., 2017). Moreover, it is widely believed that the amount of data that are being collected will continue to increase rapidly in the decades to come (Witlox, 2015).

Although these emerging data sources are widely believed to assist in understanding and solving numerous societal problems, they pose all sorts of new challenges to analysts. For choice modellers one major challenge relates to the computational burden. In particular, the size of these new data renders estimation of sophisticated state-of-the-art discrete choice models, such as Mixed Logit models (Revelt and Train, 1998), computationally burdensome, or even technically infeasible (Vlahogianni et al., 2015). This, in turn, is limiting the use of these emerging data sources in numerous fields where choice models are used. Moreover, even if model estimation is technically feasible on the large data set, many different model specifications are often tested. Therefore, long estimation times (which for current data sets may already easily take several days) quickly become prohibitive.

To deal with increasingly large choice data sets two types of approaches are commonly taken by analysts. The first approach is to revert to less computationally demanding models, such as Multinomial Logit (McFadden, 1974) and Nested Logit (Daly and Zachary, 1978) models. However, despite being very effective in reducing the computational effort, this approach severely limits the analyst's ability to adequately model complex types of choice behaviour. As such, this approach is far from desirable. The second commonly taken approach is to modify estimation code in order to increase computational power, e.g. by taking advantage of parallel computing or cluster computing facilities. Recent technological advances have made it easier to employ cloud computing facilities and high performance computation clusters. However, many analysts do not have access to such facilities and existing widely available estimation software is typically not ready to take full advantage of these technologies.
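The computational gap between the two model families above can be made concrete with a small sketch (illustrative only, not from the paper): the MNL log-likelihood is closed-form and needs one utility evaluation per observation, whereas a Mixed Logit log-likelihood must be simulated by averaging choice probabilities over R random draws of the taste parameters, making each evaluation roughly R times as expensive. All function names and shapes here are assumptions for the illustration.

```python
import numpy as np

def mnl_probs(V):
    # Softmax over alternatives; V has shape (N, J)
    e = np.exp(V - V.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def mnl_loglik(beta, X, y):
    # Closed-form MNL log-likelihood: one utility evaluation per observation.
    # X: (N, J, K) alternative attributes, y: (N,) indices of chosen alternatives.
    P = mnl_probs(X @ beta)
    return np.log(P[np.arange(len(y)), y]).sum()

def mixed_logit_loglik(mu, sigma, X, y, R=500, seed=0):
    # Simulated log-likelihood for a Mixed Logit with beta ~ N(mu, diag(sigma^2)):
    # choice probabilities are averaged over R draws, so each evaluation costs
    # roughly R times as much as the plain MNL above.
    rng = np.random.default_rng(seed)
    idx = np.arange(len(y))
    P = np.zeros(len(y))
    for _ in range(R):
        beta_r = rng.normal(mu, sigma)   # one draw of the random tastes
        P += mnl_probs(X @ beta_r)[idx, y]
    return np.log(P / R).sum()
```

With sigma set to zero the simulated likelihood collapses to the MNL one, which makes the sketch easy to sanity-check; in practice the R-fold cost is what makes Mixed Logit estimation on large data sets burdensome.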

To the best of the authors’ knowledge, no efforts have been undertaken to scale down large choice data sets,1 such that initially large data sets can be used with standard discrete choice estimation software packages, such as Biogeme, Nlogit, and Alogit. While removing valid observations is considered a sin by many analysts working with discrete choice models, in fields like machine learning down-scaling of data sets is a more common practice (e.g. Arnaiz-González et al., 2016; Loyola et al., 2016). As we will argue in this paper, using a carefully sampled subset of choice observations can give nearly identical estimation results as compared to using the complete data set. Hence, we believe that this approach is worth exploring from a practical point of view.

This study proposes a new information theoretic-based method that lowers the computational burden to estimate sophisticated discrete choice models based on large data sets. The method – which we call Sampling of Observations (SoO) – is inspired by, and closely related to, efficient experimental design. It reduces the size of the data by combining practices from the field of experimental design in Stated Choice (SC) studies (see e.g. Rose and Bliemer, 2009), with established notions from information theory (Shannon and Weaver, 1949). SoO constructs a subset of the data that consists of a manageable number of observations that are jointly highly informative on the behaviour that is being studied. It does so by sampling observations from the full data set in such a way that the D-error statistic of the subset is minimised, which means that Fisher information is maximised. The D-error is evaluated using what we call the sampling model. This model is a ‘simplified’ version of the sophisticated choice model that the analyst ultimately wishes to estimate. By using a simple model in the sampling stage, SoO is computationally cheap and fast to conduct.
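The selection step described above can be sketched as a greedy D-optimal subset search: under an MNL sampling model with prior parameters beta, repeatedly add the observation that most increases the determinant of the accumulated Fisher information matrix (equivalently, that most reduces the D-error). This is a minimal illustration of the idea, not the paper's actual algorithm; all names and the greedy strategy are assumptions.

```python
import numpy as np

def mnl_probs(X, beta):
    # X: (J, K) attributes of the J alternatives in one observation
    v = X @ beta
    e = np.exp(v - v.max())
    return e / e.sum()

def obs_information(X, beta):
    # Fisher information contribution of one MNL choice observation
    p = mnl_probs(X, beta)
    xbar = p @ X                          # probability-weighted mean attributes
    return (X * p[:, None]).T @ X - np.outer(xbar, xbar)

def d_error(info, K):
    # D-error: determinant of the information matrix, normalised by dimension
    return np.linalg.det(info) ** (-1.0 / K)

def sample_observations(data, beta, n_keep):
    # Greedy D-optimal subset: at each step add the observation that most
    # increases det of the accumulated information (i.e. lowers the D-error).
    K = len(beta)
    infos = [obs_information(X, beta) for X in data]
    total = 1e-9 * np.eye(K)              # small ridge so det is defined early on
    chosen, remaining = [], set(range(len(data)))
    for _ in range(n_keep):
        best = max(remaining, key=lambda i: np.linalg.det(total + infos[i]))
        total += infos[best]
        chosen.append(best)
        remaining.remove(best)
    return chosen, d_error(total, K)
```

Because the sampling model is a simple MNL, each information matrix is cheap to compute, which mirrors why the SoO sampling stage itself is computationally light compared to estimating the final sophisticated model.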

The remaining part of this paper is organised as follows. Section 2 presents the methodology on information theoretic-based sampling of observations. Section 3 explores the efficacy of the method using Monte Carlo analyses. Finally, Section 4 closes with conclusions and a discussion.

Section snippets

Preliminary: the effect of sample size

For asymptotically consistent estimators the standard errors associated with the estimates decrease with increasing sample size. Specifically, when the observations are randomly drawn from the target population, standard errors decrease at a rate of 1/√N (Fisher, 1925), where N denotes the sample size (see Fig. 1). This implies that the slope flattens out fast: at a rate of 1/N^1.5. This reflects the fact that, relatively speaking, less and less new information is revealed on the data generating process.
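The 1/√N rate is easy to verify empirically. The sketch below (illustrative, using the sample mean as a stand-in for any asymptotically consistent estimator) measures the standard error at N = 100 and N = 400: quadrupling the sample size should roughly halve the standard error.

```python
import numpy as np

rng = np.random.default_rng(0)

def se_of_mean(n, reps=2000):
    # Empirical standard error of the sample mean at sample size n,
    # estimated from `reps` independent replications.
    draws = rng.normal(0.0, 1.0, size=(reps, n))
    return draws.mean(axis=1).std()

se_100 = se_of_mean(100)
se_400 = se_of_mean(400)
ratio = se_100 / se_400   # close to 2, since SE shrinks at a 1/sqrt(N) rate
```

The same diminishing-returns pattern is what SoO exploits: beyond a certain point, additional observations add little information relative to their computational cost.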

Monte Carlo analyses

This section puts the proposed information theoretic-based sampling method to the test. To be able to draw conclusions regarding the merits of the method we conduct a series of Monte Carlo experiments. Ultimately, we are interested in the reduction in estimation time that can be achieved using the method, while taking into account its statistical cost (in terms of the precision with which the parameter estimates are recovered). Furthermore, although there is a priori no

Conclusions and discussion

This paper presented a new method to lower the computational burden of estimating sophisticated discrete choice models based on large data sets. This method – which we call Sampling of Observations (SoO) – scales down the size of the choice data set based on information-theoretic principles. SoO extracts a subset of observations from the full data set which is much smaller in volume than the original data set, yet produces nearly identical results. The method is inspired by, and closely related to, efficient experimental design.

Statement of contribution

Due to the surge in the amount of data that are being collected, analysts are increasingly faced with very large data sets. Estimation of sophisticated discrete choice models (such as Mixed Logit models) based on these large data sets can be computationally burdensome, or even technically infeasible. This research contributes to the state-of-the-art on challenges in the field of choice modelling associated with utilising the full potential of emerging data sources. It develops a new sampling method.

Acknowledgements

The authors would like to thank Prof. Carlo G. Prato, Dr. Thomas K. Rasmussen and Prof. Otto A. Nielsen for sharing their data with us.

References (36)

  • P.H.L. Bovy et al.

    Stochastic route choice set generation: behavioral and probabilistic foundations

    Transportmetrica

    (2007)
  • R.D. Cook et al.

A comparison of algorithms for constructing exact D-optimal designs

    Technometrics

    (1980)
  • A. Daly et al.

    Improved multiple choice models

  • E.W. de Bekker-Grob et al.

    Sample size requirements for discrete-choice experiments in healthcare: a practical guide

    The Patient - Patient-Cent. Outcomes Res.

    (2015)
  • B. Farooq et al.

    Ubiquitous Monitoring of Pedestrian Dynamics: exploring Wireless ad hoc Network of Multi-sensor Technologies

    (2015)
  • V.V. Fedorov

    Theory of Optimal Experiments

    (1972)
  • R.A. Fisher

    Statistical Methods for Research Workers

    (1925)
  • J. Huber et al.

    The importance of utility balance in efficient choice designs

    J. Market. Res.

    (1996)