
When and when not to use optimal model averaging

  • Regular Article, published in Statistical Papers

Abstract

Traditionally, model averaging has been viewed as an alternative to model selection, with the ultimate goal of incorporating the uncertainty associated with the model selection process into standard errors and confidence intervals by using a weighted combination of candidate models. In recent years, a new class of model averaging estimators has emerged in the literature, which combines models such that the squared risk, or another risk function, is minimized. We argue that, contrary to popular belief, these estimators do not necessarily address the challenges induced by model selection uncertainty, but should instead be regarded as attractive complements for the machine learning and forecasting literature, as well as tools to identify causal parameters. We illustrate our point by means of several targeted simulation studies.


References

  • Akaike H (1973) Information theory and an extension of the maximum likelihood principle. In: Proceedings of the second international symposium on information theory, Budapest, pp 267–281

  • Bang H, Robins JM (2005) Doubly robust estimation in missing data and causal inference models. Biometrics 64(2):962–972

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Buckland ST, Burnham KP, Augustin NH (1997) Model selection: an integral part of inference. Biometrics 53:603–618

  • Burnham K, Anderson D (2002) Model selection and multimodel inference. A practical information-theoretic approach. Springer, New York

  • Chatfield C (1995) Model uncertainty, data mining and statistical inference. J R Stat Soc A 158:419–466

  • Cheng TCF, Ing CK, Yu SH (2015) Toward optimal model averaging in regression models with time series errors. J Econometr 189(2):321–334

  • Daniel RM, Cousens SN, De Stavola BL, Kenward MG, Sterne JA (2013) Methods for dealing with time-dependent confounding. Stat Med 32(9):1584–1618

  • Draper D (1995) Assessment and propagation of model uncertainty. J R Stat Soc B 57:45–97

  • Fletcher D, Dillingham PW (2011) Model-averaged confidence intervals for factorial experiments. Comput Stat Data Anal 55:3041–3048

  • Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22

  • Gao Y, Zhang XY, Wang SY, Zou GH (2016) Model averaging based on leave-subject-out cross-validation. J Econometr 192(1):139–151

  • Gelman A, Su YS (2016) arm: data analysis using regression and multilevel/hierarchical models. R package version 1.9-3. https://CRAN.R-project.org/package=arm. Accessed 12 Sept 2018

  • Gruber S, van der Laan MJ (2012) tmle: an R package for targeted maximum likelihood estimation. J Stat Softw 51(13):1–35

  • Hansen BE (2007) Least squares model averaging. Econometrica 75:1175–1189

  • Hansen BE (2008) Least squares forecast averaging. J Econometr 146:342–350

  • Hansen BE, Racine J (2012) Jackknife model averaging. J Econometr 167:38–46

  • Hjort NL, Claeskens G (2003) Frequentist model average estimators. J Am Stat Assoc 98:879–945

  • Hoeting JA, Madigan D, Raftery AE, Volinsky CT (1999) Bayesian model averaging: a tutorial. Stat Sci 14:382–417

  • Kabaila P, Welsh A, Abeysekera W (2016) Model-averaged confidence intervals. Scand J Stat 43:35–48

  • Leeb H, Pötscher BM (2005) Model selection and inference: facts and fiction. Econometr Theory 21:21–59

  • Leeb H, Pötscher BM (2008) Model selection. Springer, New York, pp 785–821

  • Lendle SD, Schwab J, Petersen ML, van der Laan MJ (2017) ltmle: an R package implementing targeted minimum loss-based estimation for longitudinal data. J Stat Softw 81(1):1–21

  • Liang H, Zou GH, Wan ATK, Zhang XY (2011) Optimal weight choice for frequentist model average estimators. J Am Stat Assoc 106(495):1053–1066

  • Liu C, Kuo B (2016) Model averaging in predictive regressions. Econometr J 19(2):203–231

  • Liu QF, Okui R, Yoshimura A (2016) Generalized least squares model averaging. Econometr Rev 35(8–10):1692–1752

  • Mallows C (1973) Some comments on \(C_p\). Technometrics 15:661–675

  • Petersen M, Schwab J, Gruber S, Blaser N, Schomaker M, van der Laan M (2014) Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. J Causal Inference 2:147–185

  • Polley E, LeDell E, Kennedy C, van der Laan M (2017) SuperLearner: super learner prediction. R package version 2.0-22. https://CRAN.R-project.org/package=SuperLearner. Accessed 12 Sept 2018

  • Pötscher B (2006) The distribution of model averaging estimators and an impossibility result regarding its estimation. Lect Notes Monogr Ser 52:113–129

  • Raftery A, Hoeting J, Volinsky C, Painter I, Yeung KY (2017) BMA: Bayesian model averaging. R package version 3.18.7. https://CRAN.R-project.org/package=BMA. Accessed 12 Sept 2018

  • Rao CR, Wu Y (2001) On model selection. Lect Notes Monogr Ser 38:1–64

  • Robins J, Hernan MA (2009) Estimation of the causal effects of time-varying exposures. In: Fitzmaurice G, Davidian M, Verbeke G, Molenberghs G (eds) Longitudinal data analysis. CRC Press, Boca Raton, pp 553–599

  • Sala-i-Martin X, Doppelhofer G, Miller RI (2004) Determinants of long-term growth: a Bayesian averaging of classical estimates (BACE) approach. Am Econ Rev 94(4):813–835

  • Schomaker M (2012) Shrinkage averaging estimation. Stat Pap 53(4):1015–1034

  • Schomaker M (2017a) MAMI: model averaging (and model selection) after multiple imputation. R package version 0.9.10

  • Schomaker M (2017b) Model averaging and model selection after multiple imputation using the R-package MAMI. http://mami.r-forge.r-project.org. Accessed 12 Sept 2018

  • Schomaker M, Heumann C (2014) Model selection and model averaging after multiple imputation. Comput Stat Data Anal 71:758–770

  • Schomaker M, Heumann C (2018) Bootstrap inference when using multiple imputation. Stat Med 37(14):2252–2266

  • Schomaker M, Davies MA, Malateste K, Renner L, Sawry S, N’Gbeche S, Technau K, Eboua FT, Tanser F, Sygnate-Sy H, Phiri S, Amorissani-Folquet M, Cox V, Koueta F, Chimbete C, Lawson-Evi A, Giddy J, Amani-Bosse C, Wood R, Egger M, Leroy V (2016) Growth and mortality outcomes for different antiretroviral therapy initiation criteria in children aged 1–5 years: a causal modelling analysis from West and Southern Africa. Epidemiology 27:237–246

  • Sofrygin O, van der Laan MJ, Neugebauer R (2017) simcausal R package: conducting transparent and reproducible simulation studies of causal effect estimation with complex longitudinal data. J Stat Softw 81(2):1–47

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288

  • Turek D, Fletcher D (2012) Model-averaged Wald confidence intervals. Comput Stat Data Anal 56:2809–2815

  • Van der Laan M, Petersen M (2007) Statistical learning of origin-specific statistically optimal individualized treatment rules. Int J Biostat 3:3

  • Van der Laan M, Rose S (2011) Targeted learning. Springer, New York

  • Van der Laan M, Polley E, Hubbard A (2008) Super learner. Stat Appl Genet Mol Biol 6:25

  • Wan ATK, Zhang X, Zou GH (2010) Least squares model averaging by Mallows criterion. J Econometr 156:277–283

  • Wang H, Zhou S (2012) Interval estimation by frequentist model averaging. Commun Stat Theory Methods 42(23):4342–4356

  • Wang H, Zhang X, Zou G (2009) Frequentist model averaging: a review. J Syst Sci Complex 22:732–748

  • Wood SN (2006) Generalized additive models: an introduction with R. Chapman and Hall/CRC, Boca Raton

  • Yan J (2007) Enjoy the joy of copulas: with package copula. J Stat Softw 21:1–21

  • Zhang X, Liu CA (2017) Inference after model averaging in linear regression models. IEAS working paper: academic research 17-A005. Institute of Economics, Academia Sinica, Taipei, Taiwan. https://ideas.repec.org/p/sin/wpaper/17-a005.html. Accessed 12 Sept 2018

  • Zhang XY, Zou GH, Liang H (2014) Model averaging and weight choice in linear mixed-effects models. Biometrika 101(1):205–218

  • Zhang XY, Zou GH, Carroll RJ (2015) Model averaging based on Kullback-Leibler distance. Stat Sin 25(4):1583–1598

  • Zhang XY, Ullah A, Zhao SW (2016a) On the dominance of Mallows model averaging estimator over ordinary least squares estimator. Econ Lett 142:69–73

  • Zhang XY, Yu DL, Zou GH, Liang H (2016b) Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models. J Am Stat Assoc 111(516):1775–1790


Author information


Correspondence to Michael Schomaker.

Appendices

Appendix A: Software

We have implemented Mallows' model averaging, jackknife model averaging, and lasso averaging in the R-package MAMI (Schomaker 2017a), available at http://mami.r-forge.r-project.org/. In addition, we have implemented several wrappers that make optimal model averaging easy to use for super learning (Polley et al. 2017) and in conjunction with causal inference packages such as tmle (Gruber and van der Laan 2012) and ltmle (Lendle et al. 2017). The available wrappers are explained by calling listSLWrappers(), and examples are given in the documentation (Schomaker 2017b).
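To convey the flavour of the estimators implemented there, the following Python sketch computes Mallows-type weights (Hansen 2007; Wan et al. 2010) for two nested least-squares models. The criterion is \(C_n(w) = \Vert y - \hat{y}(w)\Vert^2 + 2\hat{\sigma}^2 \sum_m w_m k_m\); a grid search over the weight simplex stands in for the quadratic programming used in practice, and the toy data and function names are our own:

```python
import random

# Toy data: y depends weakly on x, so neither the intercept-only model
# nor the full model is obviously preferable (our own example)
random.seed(1)
n = 100
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.3 * xi + random.gauss(0, 1) for xi in x]

def ols_fit(x, y, with_slope):
    """Least-squares fit; returns fitted values and number of parameters."""
    ybar = sum(y) / len(y)
    if not with_slope:
        return [ybar] * len(y), 1
    xbar = sum(x) / len(x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
        / sum((xi - xbar) ** 2 for xi in x)
    a = ybar - b * xbar
    return [a + b * xi for xi in x], 2

fit1, k1 = ols_fit(x, y, with_slope=False)   # intercept-only model
fit2, k2 = ols_fit(x, y, with_slope=True)    # full (largest) model

# Error variance estimated from the largest candidate model
sigma2 = sum((yi - fi) ** 2 for yi, fi in zip(y, fit2)) / (n - k2)

def mallows(w):
    """C_n(w) = ||y - yhat(w)||^2 + 2*sigma2*(w*k1 + (1-w)*k2)."""
    rss = sum((yi - (w * f1 + (1 - w) * f2)) ** 2
              for yi, f1, f2 in zip(y, fit1, fit2))
    return rss + 2 * sigma2 * (w * k1 + (1 - w) * k2)

# Grid search over the weight simplex (a QP solver is used in practice)
w_opt = min((i / 100 for i in range(101)), key=mallows)
print("weight on intercept-only model:", w_opt)
```

The averaged prediction is then \(w \hat{y}_1 + (1-w)\hat{y}_2\) evaluated at the selected weight.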

Appendix B: Notation and background on the sequential g-formula

Consider a sample of size n for which measurements are available both at baseline (\(t=0\)) and during a series of follow-up times \(t=1,\ldots ,T\). At each point in time we measure the outcome \(Y_t\), the intervention \(A_t\), time-dependent covariates \(\mathbf {L}_t=\{L^1_t,\ldots ,L^q_t\}\), and a censoring indicator \(C_t\). \(\mathbf {{L}}_{t}\) may include baseline variables \(\mathbf {V}=\{L^1_0,\ldots ,L^{q_V}_0\}\) and can potentially contain variables which refer to the outcome variable before time t, for instance \({Y}_{t-1}\). The treatment and covariate histories of an individual i up to and including time t are represented as \(\bar{A}_{t,i}=(A_{0,i},\ldots ,A_{t,i})\) and \(\bar{L}^s_{t,i}=(L^s_{0,i},\ldots ,L^s_{t,i})\), respectively. \(C_t\) equals 1 if a subject gets censored in the interval \((t-1,t]\), and 0 otherwise. Therefore, \(\bar{C}_t=0\) is the event that an individual remains uncensored until time t.

The counterfactual outcome \(Y_{t,i}^{\bar{a}_{t}}\) refers to the hypothetical outcome that would have been observed at time t if subject i had received, possibly contrary to fact, the treatment history \(\bar{A}_{t,i}=\bar{a}_t\). Similarly, \(\mathbf {L}_{t,i}^{\bar{a}_{t}}\) are the counterfactual covariates related to the intervention \(\bar{A}_{t,i}=\bar{a}_t\). The above notation refers to static treatment rules; a treatment rule may, however, depend on covariates, in which case it is called dynamic. A dynamic rule \({d}_t({\bar{\mathbf {L}}}_{t})\) assigns treatment \({A}_{t,i} \in \{0,1\}\) as a function of the covariate history \({\bar{\mathbf {L}}}_{t,i}\). The vector of decisions \(d_t, t=0,\ldots ,T\), is denoted as \(\bar{d}_T=\bar{d}\). The notation \(\bar{A}_t = \bar{d}\) refers to the treatment history according to the rule \(\bar{d}\). The counterfactual outcome related to a dynamic rule \(\bar{d}\) is \(Y_{t,i}^{\bar{d}}\), and the counterfactual covariates are \(\mathbf {L}_{t,i}^{\bar{d}}\).
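For concreteness, a dynamic rule of the kind arising in the motivating application (start treatment once the CD4 count has fallen below a threshold, and never stop it thereafter) could be sketched as follows; the threshold of 750 and the function name are purely illustrative, not part of the paper's definitions:

```python
def dynamic_rule(cd4_history, threshold=750):
    """d_t(Lbar_t): assign A_t = 1 once the CD4 count has ever fallen
    below `threshold`; treatment, once started, is never stopped.
    (Hypothetical rule for illustration only.)"""
    return 1 if any(cd4 < threshold for cd4 in cd4_history) else 0

# Treatment decisions over a short covariate history
cd4 = [820, 760, 700, 740]
decisions = [dynamic_rule(cd4[:t + 1]) for t in range(len(cd4))]
print(decisions)  # -> [0, 0, 1, 1]
```

A static rule, by contrast, would return the same decision vector regardless of the covariate history.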

In Sect. 3.3 we consider the expected value of Y at time 6, under no censoring, for a given treatment rule \(\bar{d}\) to be the main quantity of interest, that is \(\psi =\mathbb {E}(Y_6^{\bar{d}})\).

The sequential g-formula can estimate this target quantity by sequentially marginalizing the distribution with respect to \(\mathbf {L}\) given the intervention rule of interest. It holds that

$$\begin{aligned} \mathbb {E}(Y_T^{\bar{d}}) = \mathbb {E}\big (\,\mathbb {E}(\,\ldots \mathbb {E}(\,\mathbb {E}(Y_T\,|\,\bar{A}_T=\bar{d}_{T}, {\bar{\mathbf{L}}}_T) \,|\, \bar{A}_{T-1}=\bar{d}_{T-1}, {\bar{\mathbf{L}}}_{T-1}\,)\ldots \,|\,\bar{A}_{0}={d}_{0}, \mathbf {{L}}_{0}\,)\big ), \end{aligned}$$

see for example Bang and Robins (2005). This representation is valid under several assumptions: sequential conditional exchangeability, consistency, positivity, and the time ordering \(\mathbf {L}_t \rightarrow A_t \rightarrow C_t \rightarrow Y_t\). These assumptions essentially mean that all confounders need to be measured, that the intervention is well-defined, and that individuals have a positive probability of continuing to receive treatment according to the assigned treatment rule, given that they have done so thus far and irrespective of the covariate history; see Daniel et al. (2013) and Robins and Hernan (2009) for more details and interpretations. Note that the two interventions defined in Sect. 3 also assign \(C_t=0\), meaning that we are interested in the effect estimate under no censoring.

To estimate \(\psi \) one proceeds as follows for \(t=T,\ldots ,0\): (i) use an appropriate model to estimate \(\mathbb {E}(Y_t|{\bar{A}}_{t-1}={\bar{d}}_{t-1}, {\bar{\mathbf{L}}}_t)\), fitting the model on all subjects that are uncensored (until \(t-1\)). Note that the outcome refers to the measured outcome if \(t=T\), and to the prediction (of the conditional outcome) from step (ii) of the previous iteration if \(t<T\). Then, (ii) plug in \(\bar{A}_t=\bar{d}_{t}\) to predict \(Y_t\) at time t. (iii) For \(t=0\), the estimate \(\hat{\psi }\) is obtained by calculating the arithmetic mean of the predicted outcomes from the second step.
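The three steps above can be sketched for a simple setting with two time points, binary variables, no censoring, and saturated (stratified-mean) outcome models; the data-generating mechanism below is our own toy example, not the simulation of Sect. 3:

```python
import random
from collections import defaultdict

random.seed(7)
n = 2000

# Toy longitudinal data: L0 -> A0 -> L1 -> A1 -> Y, all binary;
# target is E(Y) under the static rule "always treat", i.e. A0 = A1 = 1
data = []
for _ in range(n):
    L0 = int(random.random() < 0.5)
    A0 = int(random.random() < (0.7 if L0 else 0.3))
    L1 = int(random.random() < (0.6 if A0 else 0.3))
    A1 = int(random.random() < (0.8 if A0 else 0.2))
    Y = int(random.random() < 0.2 + 0.3 * A0 + 0.2 * A1 + 0.1 * L1)
    data.append((L0, A0, L1, A1, Y))

def stratified_mean(rows, key, value):
    """Saturated model: E(value | key) via stratum-specific sample means."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        sums[key(r)] += value(r)
        counts[key(r)] += 1
    overall = sum(value(r) for r in rows) / len(rows)  # fallback
    return lambda k: sums[k] / counts[k] if counts[k] else overall

# Steps (i)/(ii) at t=1: regress Y on the history among those treated at
# both times, then predict for everyone under A0 = A1 = 1
q1_fit = stratified_mean([r for r in data if r[1] == 1 and r[3] == 1],
                         key=lambda r: (r[0], r[2]), value=lambda r: r[4])
q1 = [q1_fit((L0, L1)) for (L0, A0, L1, A1, Y) in data]

# Steps (i)/(ii) at t=0: regress the t=1 predictions on L0 among A0 = 1
rows0 = [(r, q) for r, q in zip(data, q1) if r[1] == 1]
q0_fit = stratified_mean(rows0, key=lambda rq: rq[0][0],
                         value=lambda rq: rq[1])
q0 = [q0_fit(L0) for (L0, A0, L1, A1, Y) in data]

# Step (iii): average the final predictions
psi_hat = sum(q0) / n
print("estimated E(Y) under always-treat:", round(psi_hat, 3))
```

In the paper's setting the stratified means are replaced by (model-averaged) regression models, and the rule \(\bar{d}\) may be dynamic rather than static.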

Appendix C: Data generating process in the simulation study

Both baseline data (\(t=0\)) and follow-up data (\(t=1,\ldots ,12\)) were created by means of structural equations using the R-package simcausal (Sofrygin et al. 2017). The distributions listed below, in temporal order, describe the data-generating process motivated by the analysis from Schomaker et al. (2016). Baseline data refer to region, sex, age, CD4 count, CD4%, WAZ and HAZ respectively (\(V^1\), \(V^2\), \(V^3\), \(L^1_0\), \(L^2_0\), \(L^3_0\), \(Y_0\)); see Schomaker et al. (2016) for a full explanation of the variables and the motivating question. Follow-up data refer to CD4 count, CD4%, WAZ and HAZ (\(L^1_t\), \(L^2_t\), \(L^3_t\), \(Y_t\)), as well as a treatment (\(A_t\)) and censoring (\(C_t\)) indicator. In addition to Bernoulli (B), uniform (U) and normal (N) distributions, we also use truncated normal distributions, denoted by \(N_{[a,b]}\), where a and b are the truncation levels. Values smaller than a are replaced by a random draw from a \(U(a_1,a_2)\) distribution and values greater than b are drawn from a \(U(b_1,b_2)\) distribution. Values for \((a_1, a_2, b_1, b_2)\) are (0, 50, 5000, 10000) for \(L^1\), (0.03, 0.09, 0.7, 0.8) for \(L^2\), and \((-10,3,3,10)\) for both \(L^3\) and Y. The notation \(\bar{\mathcal {D}}\) means “conditional on the data that have already been measured (generated) according to the time ordering”.
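The truncated-normal device described above can be sketched as follows (the function name and use of base `random` are our own; simcausal handles such draws via its node definitions, and we interpret the second parameter of \(N(\cdot,\cdot)\) as the standard deviation):

```python
import random

def rtnorm(mean, sd, a, b, a1, a2, b1, b2):
    """Draw from N(mean, sd); values below a are replaced by a U(a1, a2)
    draw and values above b by a U(b1, b2) draw, matching the paper's
    N_[a,b] notation with tail parameters (a1, a2, b1, b2)."""
    x = random.gauss(mean, sd)
    if x < a:
        return random.uniform(a1, a2)
    if x > b:
        return random.uniform(b1, b2)
    return x

# CD4 count at baseline for region V1 = 1: N_[0,10000](650, 350) with
# tail-replacement parameters (0, 50, 5000, 10000)
random.seed(42)
draws = [rtnorm(650, 350, 0, 10000, 0, 50, 5000, 10000)
         for _ in range(1000)]
print(min(draws) >= 0 and max(draws) <= 10000)  # -> True
```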

For \(t=0\):

$$\begin{aligned} V^1\sim & {} B(p=4392/5826) \\ V^2|\bar{\mathcal {D}}\sim & {} \left\{ \begin{array}{ll} B(p=2222/4392) &{} \quad \text{ if } \quad V^1 = 1\\ B(p=758/1434) &{} \quad \text{ if } \quad V^1 = 0\\ \end{array} \right. \\ V^3|\bar{\mathcal {D}}\sim & {} U(1,5) \\ L^1_0|\bar{\mathcal {D}}\sim & {} \left\{ \begin{array}{cl} N_{[0,10000]}(650,350) &{} \quad \text{ if } \quad V^1 = 1\\ N_{[0,10000]}(720,400) &{} \quad \text{ if } \quad V^1 = 0\\ \end{array} \right. \\ \tilde{L}^1_0|\bar{\mathcal {D}}\sim & {} N((L^1_0-671.7468)/(10\cdot 352.2788)+1,0)\\ L^2_0|\bar{\mathcal {D}}\sim & {} N_{[0.06,0.8]}(0.16+0.05\cdot (L^1_0-650)/650,0.07) \\ \tilde{L}^2_0|\bar{\mathcal {D}}\sim & {} N((L^2_0-0.1648594)/(10\cdot 0.06980332)+1,0)\\ L^3_0|\bar{\mathcal {D}}\sim & {} \left\{ \begin{array}{ll} N_{[-5,5]}(- 1.65 + 0.1 \cdot V^3 + 0.05 \cdot (L^1_0 - 650)/650 \\ +\, 0.05 \cdot (L^2_0 - 16)/16,1) &{} \quad \text{ if } \quad V^1 = 1\\ N_{[-5,5]}( -2.05 + 0.1 \cdot V^3 + 0.05 \cdot (L^1_0 - 650)/650 \\ +\, 0.05 \cdot (L^2_0 - 16)/16,1) &{} \quad \text{ if } \quad V^1 = 0\\ \end{array} \right. \\ A_0|\bar{\mathcal {D}}\sim & {} B(p=0) \\ C_0|\bar{\mathcal {D}}\sim & {} B(p=0) \\ Y_0|\bar{\mathcal {D}}\sim & {} N_{[-5,5]}(-2.6 + 0.1 \cdot I(V^3 > 2) + 0.3 \cdot I(V^1 = 0) + (L^3_0 + 1.45),1.1) \\ \end{aligned}$$

For \(t>0\):

$$\begin{aligned} L^1_t|\bar{\mathcal {D}}\sim & {} \left\{ \begin{array}{cl} &{} N_{[0,10000]}(13\cdot \log (t \cdot (1034-662)/8) + L^1_{t-1} + 2 \cdot L^2_{t-1}\\ &{} +\, 2 \cdot L^3_{t-1} + 2.5 \cdot A_{t-1},50) \qquad \text{ if } \quad t \in \{1,2,3,4\}\\ &{} N_{[0,10000]}(4\cdot \log (t \cdot (1034-662)/8) + L^1_{t-1} + 2 \cdot L^2_{t-1} \\ &{}+\, 2 \cdot L^3_{t-1} + 2.5 \cdot A_{t-1},50) \qquad \text{ if } \quad t \in \{5,6\}\\ \end{array} \right. \\ L^2_t|\bar{\mathcal {D}}\sim & {} N_{[0.06,0.8]}(L^2_{t-1} + 0.0003 \cdot (L^1_t - L^1_{t-1}) + 0.0005 \cdot (L^3_{t-1}) + 0.0005 \cdot A_{t-1} \cdot \tilde{L}^1_0,0.02) \\ L^3_t|\bar{\mathcal {D}}\sim & {} N_{[-5,5]}(L^3_{t-1} + 0.0017 \cdot (L^1_t - L^1_{t-1}) + 0.2 \cdot (L^2_t - L^2_{t-1}) + 0.005 \cdot A_{t-1} \cdot \tilde{L}^2_0,0.5) \\ A_t|\bar{\mathcal {D}}\sim & {} \left\{ \begin{array}{ll} &{}B(p=1) \qquad \text{ if } \quad A_{t-1} = 1\\ &{} B(p=1/(1+\exp (-[-2.4 + 0.015 \cdot (750 - L^1_t) + 5 \cdot (0.2 - L^2_t) \\ &{} -\, 0.8 \cdot L^3_t + 0.8 \cdot t]))) \qquad \text{ if } \quad A_{t-1} = 0\\ \end{array} \right. \\ C_t|\bar{\mathcal {D}}\sim & {} B(p=1/(1+\exp (-[-6+ 0.01 \cdot (750 - L^1_t) + 1 \cdot (0.2 - L^2_t) - 0.65 \cdot L^3_t - A_t]))) \\ Y_t|\bar{\mathcal {D}}\sim & {} N_{[-5,5]}(Y_{t-1} + 0.00005 \cdot (L^1_t - L^1_{t-1}) - 0.000001 \cdot \left( (L^1_t - L^1_{t-1})\cdot \sqrt{\tilde{L}^1_0}\right) ^2\\&+\, 0.01 \cdot (L^2_t - L^2_{t-1})- 0.0001 \cdot \left( (L^2_t - L^2_{t-1})\cdot \sqrt{\tilde{L}^2_0}\right) ^2\\&+\, 0.07 \cdot ((L^3_t-L^3_{t-1})\cdot (L^3_0+1.5135)) - 0.001 \cdot ((L^3_t-L^3_{t-1})\cdot (L^3_0+1.5135))^2 \\&+\, 0.005 \cdot A_t + 0.075 \cdot A_{t-1} + 0.05 \cdot A_t \cdot A_{t-1} ,0.01) \\ \end{aligned}$$


Cite this article

Schomaker, M., Heumann, C. When and when not to use optimal model averaging. Stat Papers 61, 2221–2240 (2020). https://doi.org/10.1007/s00362-018-1048-3
