Estimation of regional transition probabilities for spatial dynamic microsimulations from survey data lacking in regional detail

doi:10.1016/j.csda.2020.107048

Computational Statistics & Data Analysis

Volume 154, February 2021, 107048

https://doi.org/10.1016/j.csda.2020.107048

Abstract

Spatial dynamic microsimulations allow for the multivariate analysis of complex systems with geographic segmentation. A synthetic replica of the system is stochastically projected into future periods using micro-level transition probabilities. These should accurately represent the dynamics of the system to allow for reliable simulation outcomes. In practice, transition probabilities are unknown and must be estimated from suitable survey data. This can be challenging when the dynamics vary locally. Survey data often lacks in regional detail due to confidentiality restrictions and limited sampling resources. In that case, transition probability estimates may misrepresent regional dynamics due to insufficient local observations and coverage problems. The simulation process subsequently fails to provide an authentic evolution of the system. A constrained maximum likelihood approach for probability alignment to solve these issues is proposed. It accounts for regional heterogeneity in transition dynamics through the consideration of external benchmarks from administrative records. It is proven that the method is consistent. A parametric bootstrap for uncertainty estimation is presented. Simulation experiments are conducted to compare the approach with an existing method for probability alignment. Furthermore, an empirical application to labor force estimation based on the German Microcensus is provided.

Introduction

Microsimulations are powerful tools for the multivariate analysis of complex systems, such as economic markets or medical care infrastructures. They differ from the more established macrosimulations in terms of the objects that are considered in the simulation process. While in macrosimulations the behavior of aggregated system-intrinsic entities is modeled, microsimulations target the smallest entities of the system (units) directly. This allows for the investigation of multidimensional interactions and nonlinear dependencies within the system that cannot be studied by macrosimulations. Examples of microsimulation models can be found in Klevmarken (2010), Lawson (2011), and O’Donoghue et al. (2011), as well as Markham et al. (2017). Microsimulations are often conducted according to a basic procedure. First, a base population as a synthetic replica of the system of interest is constructed. In practice, this may be either artificially generated data or real-world observations from administrative records and surveys (Li and O’Donoghue, 2014). Next, multiple parameters that characterize the system in its initial state are altered in scenarios. The alterations are designed to target important properties of the system in light of the research objectives. The effects of the alterations are projected into future periods, such that every scenario initializes an individual branch in the system’s evolution. After a given number of periods (simulation horizon), the branches are compared, giving insights into important dynamics and dependencies within the system (Burgard et al., 2019c).

There are different types of microsimulations. They mainly differ with respect to the manner in which the mentioned alterations are projected. An important distinction is between static and dynamic microsimulations (Li and O’Donoghue, 2013). Static microsimulations are characterized by the constancy of unit characteristics over time. When constructing the synthetic replica, every unit is provided with a set of characteristics that determines its behavior and interaction with other units. In static microsimulations, these characteristics do not change over the simulation horizon. Only specific simulation inputs are altered, depending on the research objectives. Examples of static microsimulation models can be found in Peichl et al. (2010) as well as Sutherland and Figari (2013). Dynamic microsimulations, on the other hand, are characterized by stochastic changes of unit characteristics (state transitions) over time. Since a unit’s behavior within the synthetic replica is driven by its respective characteristics, the interactions between units are also subject to temporal variation. Examples of dynamic microsimulation models can be found in O’Donoghue et al. (2009) and Fialka et al. (2011). If the dynamic microsimulation is time-discrete, state transitions can only appear periodically at distinct points in time (Schmid et al., 2018). If the simulation is time-continuous, they can appear at any given time and thus are modeled via survival functions (Willekens, 2009).

Hereafter, we focus on dynamic microsimulations with discrete time. In particular, we look at microsimulations in socio-economic research, where polytomous variables are of interest. This conceptual delimitation differentiates the topic from other fields where corresponding simulations are also used, such as particle physics (Elvira, 2017) or cancer research (Jayasekera et al., 2018). The simulation setup requires the definition of transition probability sets for every unit of the synthetic replica. These sets provide the conditional likelihood of a state transition for some unit characteristic given its current state and other characteristics. The probabilities constitute stochastic processes within the synthetic replica over the simulation horizon, which need to represent real-world dynamics as accurately as possible to obtain valid simulation results. In practice, transition probabilities are usually unknown and must be estimated from survey data. This is often done via statistical models, such as the multinomial logit model (McCullagh and Nelder, 1989, Greene, 2002, Forcina, 2017).
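To make the role of transition probabilities concrete, the following minimal sketch computes multinomial logit probabilities for a single unit and draws its state for the next period. It is an illustration under stated assumptions, not the implementation used in the paper: the function names, the covariate vector, and the coefficient values are hypothetical, and state 0 is assumed to be the reference category.

```python
import math
import random

def mnl_probabilities(x, betas):
    """Multinomial logit probabilities; state 0 is the (assumed) reference category."""
    # Linear predictors: the reference state has predictor 0 by convention.
    etas = [0.0] + [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    m = max(etas)  # subtract the maximum for numerical stability
    weights = [math.exp(e - m) for e in etas]
    total = sum(weights)
    return [w / total for w in weights]

def simulate_step(x, betas, rng):
    """Draw the unit's state in the next period from its transition probabilities."""
    probs = mnl_probabilities(x, betas)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical unit: intercept, age in decades, indicator for being employed now.
x = [1.0, 3.5, 1.0]
# Hypothetical coefficient vectors for states 1 and 2.
betas = [[0.2, -0.1, 1.0], [-0.5, 0.05, -0.3]]

probs = mnl_probabilities(x, betas)
rng = random.Random(42)  # fixed seed so that the draw is reproducible
next_state = simulate_step(x, betas, rng)
```

Because the current state enters through the covariates, the returned probabilities are conditional on that state, as described above. In a full simulation, this step would be repeated for every unit and every period of the simulation horizon.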

Microsimulations that account for regional data structures are often referred to as small area or spatial microsimulations (Rahman et al., 2010, Rahman and Harding, 2016, Tanton et al., 2018). Transition probability estimation can be challenging in these settings, as there may be heterogeneous transition dynamics across regions. The estimation method must explicitly account for corresponding differences to accurately reproduce the system’s evolution. However, in practice, we often encounter the problem that the survey data used for transition probability estimation lacks in regional detail. Due to confidentiality restrictions, regional identifiers that would allow for a localization of the sample elements may be censored. Thus, regional heterogeneity cannot be observed as spatial aggregates are indistinguishable. Further, even if regional identifiers are available, the majority of survey samples often contain only a few observations per region due to limited resources. In that case, observed regional transition frequencies may be inaccurate or even biased as a result of coverage problems. Ignoring these issues may cause only small deviations in the initial phase of the simulation. Yet, due to the complex interactions between units, the inaccuracies accumulate and self-reinforce over the simulation horizon. Hence, local transition dynamics are misrepresented over time and the simulation fails to provide an authentic evolution regarding the real system. The simulation outcomes are then no longer reliable (Chin and Harding, 2006, Tanton, 2014). Thus, if the survey data lacks in regional detail, methodological adjustments are required.

In this paper, we discuss so-called alignment methods (Bækgaard, 2002, Kelly and Percival, 2009, Li and O’Donoghue, 2014, Stephensen, 2016) to address these issues. Alignment is commonly used in microsimulations to calibrate transition dynamics to external benchmarks to prevent unrealistic projections over time. However, we show that these methods can also be used to recover local heterogeneity in transition dynamics when the primary database lacks in regional detail. For this, we consider a situation in which external benchmarks on regional transition dynamics are available (e.g. from census data). The general idea is to incorporate this regional information in the multinomial logit model such that resulting micro-level probability estimates reproduce the benchmarks when aggregated. This allows us to recover the unobserved regional heterogeneity in transition dynamics on the micro-level and calibrate the data generating process such that the simulated evolution is genuine.
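In its simplest (exact) form, the requirement that micro-level estimates reproduce the benchmarks when aggregated can be written as follows. The notation is an assumption made for illustration only and need not match the symbols used in the paper: $U_r$ denotes the units located in region $r$, $\hat{p}_{ik}$ the estimated probability that unit $i$ moves to state $k$, and $\tau_{rk}$ the external benchmark for the number of units in region $r$ that end up in state $k$.

$$\sum_{i \in U_r} \hat{p}_{ik} = \tau_{rk}, \qquad \sum_{k=1}^{K} \hat{p}_{ik} = 1, \qquad r = 1, \dots, R, \quad k = 1, \dots, K.$$

The first condition ties the expected regional totals to the benchmarks, while the second ensures that each unit's estimates remain a valid probability distribution over the $K$ states.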

Two alignment methods are studied for this purpose. The first method is called logit scaling and was originally proposed by Stephensen (2016). It is an ex-post approach based on iterative proportional fitting (Bishop et al., 1970). After the initial model parameter estimation has been performed, the transition probability estimates under the model are adjusted sequentially until they reproduce the external benchmarks. Stephensen (2016) showed that the method minimizes the Kullback–Leibler divergence between original and adjusted probability estimates. The second method is new and draws from constrained maximum likelihood theory (e.g. Dong and Wets, 2000, Chatterjee et al., 2016). The regional benchmarks are used to modify the fitting process in the multinomial logit model directly. This is done by imposing a set of box constraints on transition probabilities resulting from the parameter estimates. The underlying optimization is solved via sequential quadratic programming (Kraft, 1994). We present a parametric bootstrap (Reynolds and Templin, 2004, Zoubir and Iskander, 2004) to estimate the variance of model parameter estimates and the mean squared error (MSE) of the transition probability estimates. Further, we prove that this alignment method is consistent in model parameter estimation.
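The ex-post adjustment behind the first method can be illustrated with a simplified sketch in the spirit of iterative proportional fitting. It is not claimed to be Stephensen's exact algorithm. The function name, the tolerance, and the example numbers are hypothetical, the units are assumed to belong to a single region, and the benchmark totals are assumed to sum to the number of units.

```python
def logit_scaling(probs, targets, tol=1e-10, max_iter=1000):
    """Adjust unit-level probabilities until their column sums match the benchmarks.

    probs   : list of rows, one per unit, each a strictly positive probability vector
    targets : benchmark totals per state (assumed to sum to the number of units)
    """
    p = [row[:] for row in probs]  # work on a copy
    n_states = len(targets)
    for _ in range(max_iter):
        col_sums = [sum(row[k] for row in p) for k in range(n_states)]
        # Stop once the aggregated probabilities reproduce the benchmarks.
        if max(abs(c - t) for c, t in zip(col_sums, targets)) < tol:
            break
        for row in p:
            # Column step: scale each state towards its benchmark total.
            for k in range(n_states):
                row[k] *= targets[k] / col_sums[k]
            # Row step: renormalize so the unit's probabilities sum to one.
            s = sum(row)
            for k in range(n_states):
                row[k] /= s
    return p

# Hypothetical model-based estimates for four units and three states.
initial = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.6, 0.3, 0.1],
]
# Hypothetical regional benchmark totals (sum equals the number of units).
targets = [1.6, 1.6, 0.8]
aligned = logit_scaling(initial, targets)
```

The alternating column and row steps leave the relative ordering of the units within each state largely intact while shifting the aggregate to the benchmark, which is what makes the approach attractive as an after-the-fact correction.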

Both methods are described and discussed in theory. Next, they are tested in simulation experiments to evaluate their performance in a controlled environment. Finally, both methods are applied to labor force probability estimation based on the German Microcensus 2012 (Statistisches Bundesamt, 2017). We find that the inclusion of aggregated regional benchmarks allows for the recovery of local micro-level transition dynamics despite a lack in regional detail. The remainder of the paper is organized as follows. In Section 2, the basic methodology is described. This includes the presentation of a suitable statistical framework, the multinomial logit model, as well as logit scaling as a standard method for alignment. Section 3 introduces constrained maximum likelihood as an alternative alignment approach. Section 4 contains the simulation experiments, while Section 5 contains the application. Section 6 closes with some concluding remarks. Note that a preprint of this paper has been published as a working paper by Burgard et al. (2019a).

Section snippets

Statistical framework

This section introduces a statistical framework to describe transition dynamics in microsimulations. Based on these descriptions, we can present transition probability estimation methods in later sections. For illustrative purposes, we assume that the system of interest is a population of individuals and the objective is to study its labor market in regional detail. In what follows, three representations of the population are considered to conduct the simulation study.

First, let $U$ denote the

Method

Hereafter, we introduce a new method for aligning transition probability estimates to regional benchmarks in the sense of Section 2.3. It can be viewed as a special case of constrained maximum likelihood estimation (Dong and Wets, 2000, Chatterjee et al., 2016). Recall the maximum likelihood problem in (9). We modify it by adding a set of regional inequality constraints with respect to the benchmarks. The negative log-likelihood is minimized while the set of feasible solutions is limited to a

Simulation experiments

Hereafter, we present several simulation experiments in order to demonstrate the effectiveness of the proposed methods within a controlled environment. We focus on three performance aspects: (i) model parameter estimation, (ii) predictive inference, as well as (iii) uncertainty estimation.

Setup

In what follows, we apply the three methods (Mod, LS, CML) in both the binary and the polytomous case to regional transition probability estimation on real data. For this, we consider observations obtained from all individuals aged 15–85 from the German Microcensus 2012. The overall setup is similar to the (model-based) simulation study from the last section. However, note that the data is used directly rather than as a basis for an artificial population as in Section 4. The advantage of this

Conclusion

The estimation of regional transition probabilities from survey data lacking in regional detail is a major challenge when conducting dynamic spatial microsimulations. Missing regional observations can either lead to coverage problems in local samples or prevent a spatial localization of the sample observations. We showed that common estimation methods yield inefficient or even biased results in these cases. We discussed two methods that are able to account for regional heterogeneity by

Acknowledgments

We would like to thank two anonymous reviewers for their very helpful and constructive comments. Their guidance significantly improved the overall quality and readability of the paper.

Funding

This work was supported by the research project REMIKIS — Regionale Mikrosimulationen und Indikatorsysteme, funded by the Nikolaus Koch Foundation; the research unit MikroSim — Sektorenübergreifendes kleinräumiges Mikrosimulationsmodell (FOR 2559), funded by the German Research Foundation; and the

References (49)

Forcina A. A Fisher-scoring algorithm for fitting latent class models with individual covariates. Econom. Stat. (2017)
Klevmarken A. Dynamic microsimulation for policy analysis: Problems and solutions
Li J. et al. Evaluating binary alignment methods in microsimulation models. J. Artif. Soc. Soc. Simul. (2014)
Markham F. et al. Improving spatial microsimulation estimates of health outcomes by including geographic indicators of health behaviour: The example of problem gambling. Health Place (2017)
Schmid M. et al. Discrimination measures for discrete time-to-event predictions. Econom. Stat. (2018)
Adjei L.A. et al. An application of bootstrapping in logistic regression model. Open Access Libr. J. (2016)
Bækgaard H. Micro-macro linkage and the alignment of transition processes. Technical Report 25 (2002)
Barasa K.S. et al. Incorporating survey weights into binary and multinomial logistic regression models. Sci. J. Appl. Math. Stat. (2015)
Bishop C.M. Pattern Recognition and Machine Learning (2006)
Bishop Y. et al. Discrete Multivariate Analysis: Theory and Practice (1970)
Böhning D. Multinomial logistic regression algorithm. Ann. Inst. Statist. Math. (1992)
Burgard J.P. et al. Regularized area-level modelling for robust small area estimation in the presence of unknown covariate measurement errors. Res. Papers Econ. (2019)
Burgard J.P. et al. Conducting a dynamic microsimulation for care research: Data generation, transition probabilities and sensitivity analysis
Burgard J. et al.