Abstract
Traditionally, model averaging has been viewed as an alternative to model selection, with the ultimate goal of incorporating the uncertainty associated with the model selection process into standard errors and confidence intervals by using a weighted combination of candidate models. In recent years, a new class of model averaging estimators has emerged in the literature, which combines models such that the squared risk, or another risk function, is minimized. We argue that, contrary to popular belief, these estimators do not necessarily address the challenges induced by model selection uncertainty; rather, they should be regarded as attractive complements to the machine learning and forecasting literature, as well as tools for identifying causal parameters. We illustrate our point by means of several targeted simulation studies.
References
Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Proceedings of the second international symposium on information theory, Budapest, pp 267–281
Bang H, Robins JM (2005) Doubly robust estimation in missing data and causal inference models. Biometrics 61(4):962–973
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Buckland ST, Burnham KP, Augustin NH (1997) Model selection: an integral part of inference. Biometrics 53:603–618
Burnham K, Anderson D (2002) Model selection and multimodel inference. A practical information-theoretic approach. Springer, New York
Chatfield C (1995) Model uncertainty, data mining and statistical inference. J R Stat Soc A 158:419–466
Cheng TCF, Ing CK, Yu SH (2015) Toward optimal model averaging in regression models with time series errors. J Econometr 189(2):321–334
Daniel RM, Cousens SN, De Stavola BL, Kenward MG, Sterne JA (2013) Methods for dealing with time-dependent confounding. Stat Med 32(9):1584–1618
Draper D (1995) Assessment and propagation of model uncertainty. J R Stat Soc B 57:45–97
Fletcher D, Dillingham PW (2011) Model-averaged confidence intervals for factorial experiments. Comput Stat Data Anal 55:3041–3048
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
Gao Y, Zhang XY, Wang SY, Zou GH (2016) Model averaging based on leave-subject-out cross-validation. J Econometr 192(1):139–151
Gelman A, Su YS (2016) arm: data analysis using regression and multilevel/hierarchical models. R package version 1.9-3. https://CRAN.R-project.org/package=arm. Accessed 12 Sept 2018
Gruber S, van der Laan MJ (2012) tmle: an R package for targeted maximum likelihood estimation. J Stat Softw 51(13):1–35
Hansen BE (2007) Least squares model averaging. Econometrica 75:1175–1189
Hansen BE (2008) Least squares forecast averaging. J Econometr 146:342–350
Hansen BE, Racine J (2012) Jackknife model averaging. J Econometr 167:38–46
Hjort NL, Claeskens G (2003) Frequentist model average estimators. J Am Stat Assoc 98:879–945
Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999) Bayesian model averaging: a tutorial. Stat Sci 14:382–417
Kabaila P, Welsh A, Abeysekera W (2016) Model-averaged confidence intervals. Scand J Stat 43:35–48
Leeb H, Pötscher BM (2005) Model selection and inference: facts and fiction. Econometr Theory 21:21–59
Leeb H, Pötscher BM (2008) Model selection. Springer, New York, pp 785–821
Lendle SD, Schwab J, Petersen ML, van der Laan MJ (2017) ltmle: an R package implementing targeted minimum loss-based estimation for longitudinal data. J Stat Softw 81(1):1–21
Liang H, Zou GH, Wan ATK, Zhang XY (2011) Optimal weight choice for frequentist model average estimators. J Am Stat Assoc 106(495):1053–1066
Liu C, Kuo B (2016) Model averaging in predictive regressions. Econometr J 19(2):203–231
Liu QF, Okui R, Yoshimura A (2016) Generalized least squares model averaging. Econometr Rev 35(8–10):1692–1752
Mallows C (1973) Some comments on \(C_p\). Technometrics 15:661–675
Petersen M, Schwab J, Gruber S, Blaser N, Schomaker M, van der Laan M (2014) Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. J Causal Inference 2:147–185
Polley E, LeDell E, Kennedy C, van der Laan M (2017) SuperLearner: super learner prediction. R package version 2.0-22. https://CRAN.R-project.org/package=SuperLearner. Accessed 12 Sept 2018
Pötscher B (2006) The distribution of model averaging estimators and an impossibility result regarding its estimation. Lect Notes Monogr Ser 52:113–129
Raftery A, Hoeting J, Volinsky C, Painter I, Yeung KY (2017) BMA: Bayesian model averaging. R package version 3.18.7. https://CRAN.R-project.org/package=BMA. Accessed 12 Sept 2018
Rao CR, Wu Y (2001) On model selection. Lect Notes Monogr Ser 38:1–64
Robins J, Hernan MA (2009) Estimation of the causal effects of time-varying exposures. In: Fitzmaurice G, Davidian M, Verbeke G, Molenberghs G (eds) Longitudinal data analysis. CRC Press, Boca Raton, pp 553–599
Sala-I-Martin X, Doppelhofer G, Miller RI (2004) Determinants of long-term growth: a Bayesian averaging of classical estimates (bace) approach. Am Econ Rev 94(4):813–835
Schomaker M (2012) Shrinkage averaging estimation. Stat Pap 53(4):1015–1034
Schomaker M (2017a) MAMI: model averaging (and model selection) after multiple imputation. R package version 0.9.10
Schomaker M (2017b) Model averaging and model selection after multiple imputation using the R-package MAMI. http://mami.r-forge.r-project.org. Accessed 12 Sept 2018
Schomaker M, Heumann C (2014) Model selection and model averaging after multiple imputation. Comput Stat Data Anal 71:758–770
Schomaker M, Heumann C (2018) Bootstrap inference when using multiple imputation. Stat Med 37(14):2252–2266
Schomaker M, Davies MA, Malateste K, Renner L, Sawry S, N’Gbeche S, Technau K, Eboua FT, Tanser F, Sygnate-Sy H, Phiri S, Amorissani-Folquet M, Cox V, Koueta F, Chimbete C, Lawson-Evi A, Giddy J, Amani-Bosse C, Wood R, Egger M, Leroy V (2016) Growth and mortality outcomes for different antiretroviral therapy initiation criteria in children aged 1–5 years: a causal modelling analysis from West and Southern Africa. Epidemiology 27:237–246
Sofrygin O, van der Laan MJ, Neugebauer R (2017) simcausal R package: conducting transparent and reproducible simulation studies of causal effect estimation with complex longitudinal data. J Stat Softw 81(2):1–47
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288
Turek D, Fletcher D (2012) Model-averaged Wald confidence intervals. Comput Stat Data Anal 56:2809–2815
Van der Laan M, Petersen M (2007) Statistical learning of origin-specific statically optimal individualized treatment rules. Int J Biostat 3:3
Van der Laan M, Rose S (2011) Targeted learning. Springer, New York
Van der Laan M, Polley E, Hubbard A (2007) Super learner. Stat Appl Genet Mol Biol 6:25
Wan ATK, Zhang X, Zou GH (2010) Least squares model averaging by Mallows criterion. J Econometr 156:277–283
Wang H, Zhou S (2012) Interval estimation by frequentist model averaging. Commun Stat Theory Methods 42(23):4342–4356
Wang H, Zhang X, Zou G (2009) Frequentist model averaging: a review. J Syst Sci Complex 22:732–748
Wood SN (2006) Generalized additive models: an introduction with R. Chapman and Hall/CRC, Boca Raton
Yan J (2007) Enjoy the joy of copulas: with package copula. J Stat Softw 21:1–21
Zhang X, Liu CA (2017) Inference after model averaging in linear regression models. IEAS working paper: academic research 17-A005. Institute of Economics, Academia Sinica, Taipei, Taiwan. https://ideas.repec.org/p/sin/wpaper/17-a005.html. Accessed 12 Sept 2018
Zhang XY, Zou GH, Liang H (2014) Model averaging and weight choice in linear mixed-effects models. Biometrika 101(1):205–218
Zhang XY, Zou GH, Carroll RJ (2015) Model averaging based on Kullback-Leibler distance. Stat Sin 25(4):1583–1598
Zhang XY, Ullah A, Zhao SW (2016a) On the dominance of the Mallows model averaging estimator over the ordinary least squares estimator. Econ Lett 142:69–73
Zhang XY, Yu DL, Zou GH, Liang H (2016b) Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models. J Am Stat Assoc 111(516):1775–1790
Appendices
Appendix A: Software
We have implemented Mallows' model averaging, jackknife model averaging, and lasso averaging in the R-package MAMI (Schomaker 2017a), available at http://mami.r-forge.r-project.org/. In addition, we have implemented several wrappers that make optimal model averaging easy to use for super learning (Polley et al. 2017) and in conjunction with causal inference packages such as tmle (Gruber and van der Laan 2012) and ltmle (Lendle et al. 2017). The available wrappers are listed by calling listSLWrappers(), and examples are given in the documentation (Schomaker 2017b).
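MAMI itself is an R package; as a language-agnostic illustration of the weight-selection idea behind Mallows model averaging (Hansen 2007; Wan et al. 2010), the following Python sketch chooses weights on the simplex by minimizing the Mallows criterion \(C(w)=\Vert y-\hat{\mu }(w)\Vert ^2+2\hat{\sigma }^2\sum _m w_m k_m\). The data-generating process and the nested candidate set are hypothetical, chosen only to keep the sketch self-contained.

```python
# Illustrative sketch of Mallows model averaging over nested OLS candidates
# (not MAMI's implementation; simulated data, hypothetical candidate set).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.normal(size=(n, p))
y = 1.0 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=n)

# Candidate models: intercept plus the first m columns of X, m = 0..p.
designs = [np.column_stack([np.ones(n)] + [X[:, :m]] if m else [np.ones(n)])
           for m in range(p + 1)]
fits, ks = [], []
for Z in designs:
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fits.append(Z @ beta)
    ks.append(Z.shape[1])
F = np.column_stack(fits)          # n x M matrix of candidate fitted values
ks = np.array(ks, dtype=float)

# Error variance estimated from the largest candidate model.
sigma2 = np.sum((y - F[:, -1]) ** 2) / (n - ks[-1])

def mallows(w):
    resid = y - F @ w
    return resid @ resid + 2.0 * sigma2 * (w @ ks)

M = F.shape[1]
res = minimize(mallows, np.full(M, 1.0 / M), method="SLSQP",
               bounds=[(0.0, 1.0)] * M,
               constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
w_hat = res.x                      # estimated model weights on the simplex
print(np.round(w_hat, 3))
```

Since the criterion is quadratic in \(w\), a quadratic-programming solver could be used instead of the general-purpose optimizer shown here.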
Appendix B: Notation and background on the sequential g-formula
Consider a sample of size n for which measurements are available both at baseline (\(t=0\)) and during a series of follow-up times \(t=1,\ldots ,T\). At each point in time we measure the outcome \(Y_t\), the intervention \(A_t\), time-dependent covariates \(\mathbf {L}_t=\{L^1_t,\ldots ,L^q_t\}\), and a censoring indicator \(C_t\). \(\mathbf {{L}}_{t}\) may include baseline variables \(\mathbf {V}=\{L^1_0,\ldots ,L^{q_V}_0\}\) and can potentially contain variables which refer to the outcome variable before time t, for instance \({Y}_{t-1}\). The treatment and covariate histories of an individual i up to and including time t are represented as \(\bar{A}_{t,i}=(A_{0,i},\ldots ,A_{t,i})\) and \(\bar{L}^s_{t,i}=(L^s_{0,i},\ldots ,L^s_{t,i})\) respectively. \(C_t\) equals 1 if a subject is censored in the interval \((t-1,t]\), and 0 otherwise. Thus, \(\bar{C}_t=0\) is the event that an individual remains uncensored until time t.
The counterfactual outcome \(Y_{t,i}^{\bar{a}_{t}}\) refers to the hypothetical outcome that would have been observed at time t if subject i had received, possibly contrary to fact, the treatment history \(\bar{A}_{t,i}=\bar{a}_t\). Similarly, \(\mathbf {L}_{t,i}^{\bar{a}_{t}}\) are the counterfactual covariates related to the intervention \(\bar{A}_{t,i}=\bar{a}_t\). This notation refers to static treatment rules; a treatment rule may, however, depend on covariates, in which case it is called dynamic. A dynamic rule \({d}_t({\bar{\mathbf {L}}}_{t})\) assigns treatment \({A}_{t,i} \in \{0,1\}\) as a function of the covariate history \({\bar{\mathbf {L}}}_{t,i}\). The vector of decisions \(d_t\), \(t=0,\ldots ,T\), is denoted by \(\bar{d}_T=\bar{d}\). The notation \(\bar{A}_t = \bar{d}\) refers to the treatment history according to the rule \(\bar{d}\). The counterfactual outcome related to a dynamic rule \(\bar{d}\) is \(Y_{t,i}^{\bar{d}}\), and the counterfactual covariates are \(\mathbf {L}_{t,i}^{\bar{d}}\).
In Sect. 3.3 the main quantity of interest is the expected value of Y at time 6, under no censoring, for a given treatment rule \(\bar{d}\); that is, \(\psi =\mathbb {E}(Y_6^{\bar{d}})\).
The sequential g-formula can be used to estimate this target quantity by sequentially marginalizing the distribution with respect to \(\mathbf {L}\), given the intervention rule of interest. It holds that
\[
\psi = \mathbb {E}\left(\mathbb {E}\left(\cdots \mathbb {E}\left(\mathbb {E}\left(Y_T \,|\, \bar{A}_{T-1}=\bar{d}_{T-1}, \bar{\mathbf {L}}_{T}\right) \,|\, \bar{A}_{T-2}=\bar{d}_{T-2}, \bar{\mathbf {L}}_{T-1}\right)\cdots \,|\, A_0=d_0, \mathbf {L}_0\right)\right), \qquad (14)
\]
see for example Bang and Robins (2005). Equation (14) is valid under several assumptions: sequential conditional exchangeability, consistency, positivity, and the time ordering \(\mathbf {L}_t \rightarrow A_t \rightarrow C_t \rightarrow Y_t\). These assumptions essentially mean that all confounders are measured, that the intervention is well-defined, and that individuals have a positive probability of continuing to receive treatment according to the assigned treatment rule, given that they have done so thus far and irrespective of the covariate history; see Daniel et al. (2013) and Robins and Hernan (2009) for more details and interpretations. Note that the two interventions defined in Sect. 3 also assign \(C_t=0\), meaning that we are interested in the effect estimate under no censoring.
To estimate \(\psi \) one proceeds as follows for \(t=T,\ldots ,0\): (i) use an appropriate model to estimate \(\mathbb {E}(Y_T|{\bar{A}}_{T-1}={\bar{d}}_{T-1}, {\bar{\mathbf{L}}}_T)\). The model is fit on all subjects that are uncensored (until \(T-1\)). Note that the outcome refers to the measured outcome for \(t=T\) and to the prediction (of the conditional outcome) from step (ii) if \(t<T\). Then, (ii) plug in \(\bar{A}_t=\bar{d}_{t}\) to predict \(Y_t\) at time t; (iii) for \(t=0\), the estimate \(\hat{\psi }\) is obtained by calculating the arithmetic mean of the predicted outcomes from step (ii).
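The three steps above can be sketched for a deliberately simplified setting (\(T=2\), a static rule \(\bar{a}=(1,1)\), no censoring, linear outcome models, and hypothetical simulated data; none of this corresponds to the paper's actual analysis):

```python
# Illustrative sketch of sequential g-computation (iterated outcome
# regressions), not the paper's analysis: hypothetical simulated data.
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Simple longitudinal structure L_0 -> A_0 -> L_1 -> A_1 -> Y.
L0 = rng.normal(size=n)
A0 = rng.binomial(1, 1.0 / (1.0 + np.exp(-L0)))
L1 = L0 + A0 + rng.normal(size=n)
A1 = rng.binomial(1, 1.0 / (1.0 + np.exp(-L1)))
Y = L1 + A1 + rng.normal(size=n)

def ols_fit(Z, y):
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta

# Step (i) for t = T: regress Y on the full history; step (ii): predict
# under the intervention A_0 = A_1 = 1.
Z = np.column_stack([np.ones(n), A1, L1, A0, L0])
beta = ols_fit(Z, Y)
Z_int = Z.copy(); Z_int[:, 1] = 1.0; Z_int[:, 3] = 1.0
Q2 = Z_int @ beta

# Step (i)/(ii) for t = T-1: regress the prediction on the earlier history
# and predict under A_0 = 1.
Z = np.column_stack([np.ones(n), A0, L0])
beta = ols_fit(Z, Q2)
Z_int = Z.copy(); Z_int[:, 1] = 1.0
Q1 = Z_int @ beta

# Step (iii): average the final predictions.
psi_hat = Q1.mean()
print(round(psi_hat, 2))  # close to the true value of 2 for this design
```

With correctly specified outcome models and large n, \(\hat{\psi }\) is consistent for \(\mathbb {E}(Y^{(1,1)})\), which equals 2 in this design; in practice the regressions in step (i) would typically use flexible learners such as super learning rather than plain OLS.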
Appendix C: Data generating process in the simulation study
Both baseline data (\(t=0\)) and follow-up data (\(t=1,\ldots ,12\)) were created using structural equations, with the R-package simcausal (Sofrygin et al. 2017). The distributions below, listed in temporal order, describe the data-generating process motivated by the analysis of Schomaker et al. (2016). Baseline data refer to region, sex, age, CD4 count, CD4%, WAZ and HAZ respectively (\(V^1\), \(V^2\), \(V^3\), \(L^1_0\), \(L^2_0\), \(L^3_0\), \(Y_0\)); see Schomaker et al. (2016) for a full explanation of the variables and the motivating question. Follow-up data refer to CD4 count, CD4%, WAZ and HAZ (\(L^1_t\), \(L^2_t\), \(L^3_t\), \(Y_t\)), as well as a treatment (\(A_t\)) and a censoring (\(C_t\)) indicator. In addition to Bernoulli (B), uniform (U) and normal (N) distributions, we also use truncated normal distributions, denoted by \(N_{[a,b]}\), where a and b are the truncation levels. Values smaller than a are replaced by a random draw from a \(U(a_1,a_2)\) distribution, and values greater than b are drawn from a \(U(b_1,b_2)\) distribution. The values of \((a_1, a_2, b_1, b_2)\) are (0, 50, 5000, 10000) for \(L^1\), (0.03, 0.09, 0.7, 0.8) for \(L^2\), and \((-10,3,3,10)\) for both \(L^3\) and Y. The notation \(\bar{\mathcal {D}}\) means "conditional on the data that has already been measured (generated) according to the time ordering".
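The truncation device can be sketched as follows (Python; the mean and standard deviation used here for \(L^1\) are hypothetical, only the truncation and replacement levels are taken from the description above):

```python
# Sketch of the truncated-normal sampling scheme described above: draws
# outside [a, b] are replaced by uniform draws from U(a1, a2) or U(b1, b2).
import numpy as np

rng = np.random.default_rng(7)

def r_trunc_normal(n, mean, sd, a, b, a1, a2, b1, b2, rng):
    x = rng.normal(mean, sd, size=n)
    low, high = x < a, x > b
    x[low] = rng.uniform(a1, a2, size=low.sum())   # replace left-tail draws
    x[high] = rng.uniform(b1, b2, size=high.sum()) # replace right-tail draws
    return x

# L^1 (CD4 count) example: [a, b] = [0, 5000], (a1, a2, b1, b2) =
# (0, 50, 5000, 10000); mean and sd are hypothetical.
x = r_trunc_normal(10_000, mean=650, sd=350, a=0, b=5000,
                   a1=0, a2=50, b1=5000, b2=10_000, rng=rng)
print(x.min() >= 0 and x.max() <= 10_000)  # True
```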
For \(t=0\):
For \(t>0\):
Cite this article
Schomaker, M., Heumann, C. When and when not to use optimal model averaging. Stat Papers 61, 2221–2240 (2020). https://doi.org/10.1007/s00362-018-1048-3