Abstract
Stacking is a model combination technique for improving prediction accuracy. Regularization is usually necessary in stacking because some of the predictions used in the combination are similar to one another. Cross-validation is generally used to select the regularization parameter, but it incurs a high computational cost. This paper proposes two simple methods with low computational cost for selecting the regularization parameter. The effectiveness of the methods is examined in numerical experiments. Asymptotic results in a particular setting are also presented.
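As a point of reference for the computational issue described above, the following is a minimal sketch of a standard formulation of stacked regression: out-of-fold predictions from the base models are combined with ridge-regularized weights, and the regularization parameter is chosen by cross-validating the stacking stage itself, which is the expensive step this paper seeks to avoid. This is not the paper's proposed method, and all names (`cv_predictions`, `ridge_weights`, `select_lambda_by_cv`) are illustrative assumptions.

```python
# Minimal sketch (not the paper's proposed method): ridge-regularized
# stacking, with the regularization parameter lambda selected by an extra
# layer of cross-validation. All names are illustrative assumptions.
import numpy as np

def cv_predictions(models, X, y, K=10, seed=0):
    """Out-of-fold predictions of each base model: an (N, M) matrix Z."""
    rng = np.random.default_rng(seed)
    N = len(y)
    folds = np.array_split(rng.permutation(N), K)
    Z = np.empty((N, len(models)))
    for fold in folds:
        train = np.setdiff1d(np.arange(N), fold)
        for m, fit in enumerate(models):
            # each `fit` trains on (X_train, y_train) and returns a predictor
            predict = fit(X[train], y[train])
            Z[fold, m] = predict(X[fold])
    return Z

def ridge_weights(Z, y, lam):
    """Stacking weights w = (Z'Z + lam I)^{-1} Z'y."""
    M = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(M), Z.T @ y)

def select_lambda_by_cv(Z, y, lambdas, K=10, seed=0):
    """The usual but computationally expensive choice: cross-validate the stacking stage itself."""
    rng = np.random.default_rng(seed)
    N = len(y)
    folds = np.array_split(rng.permutation(N), K)
    cv_error = []
    for lam in lambdas:
        sse = 0.0
        for fold in folds:
            train = np.setdiff1d(np.arange(N), fold)
            w = ridge_weights(Z[train], y[train], lam)
            sse += np.sum((y[fold] - Z[fold] @ w) ** 2)
        cv_error.append(sse / N)
    return lambdas[int(np.argmin(cv_error))]
```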
Asymptotic Results
We assume that N is divisible by \(K^2\), and that \(L_{\alpha }=L=N/K\). In this paper, K is fixed and L tends to infinity as N tends to infinity, but we assume that K is taken to be large enough in advance; typically, \(K=10\) or 20. We also assume that the regularization parameter \(\lambda \) is chosen from (0, EN) for a fixed \(E>0\). Each model is parameterized by a finite-dimensional parameter \(\theta _m\): \(\{ f_1(x;\theta _1)\} ,\dots ,\{ f_M(x;\theta _M)\}\). The parameters are estimated by M-estimation:
where \(\Psi _m(\mathcal{D};\theta _m)=\sum _{i=1}^N\Psi _m((x_i,y_i);\theta _m)\). For example,
In this section, we use the following notation:
Accordingly,
We assume regularity conditions that make the first-order approximation valid. We will use the following abbreviation:
For asymptotic calculations, we assume that \(\mathcal{D}^{(-\alpha )}\) is divided into \(\mathcal{D}^{(-\alpha ,1)},\dots ,\mathcal{D}^{(-\alpha ,K)}\) as follows. First, each \(\mathcal{D}^{(\alpha )}\) is divided into \(\mathcal{D}^{(\alpha ,1)},\dots ,\mathcal{D}^{(\alpha ,K)}\). Second, \(\mathcal{D}^{(-\alpha ,\beta )}=\cup _{\gamma =1,\gamma \ne \alpha }^K \mathcal{D}^{(\gamma ,\beta )}\). Third, \(\mathcal{D}^{(-\alpha ,-\beta )}= \mathcal{D}^{(-\alpha )}\backslash \mathcal{D}^{(-\alpha ,\beta )}\).
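To make the nested splitting concrete, the following index-level sketch assumes, as in the usual K-fold construction, that \(\mathcal{D}^{(\alpha )}\) denotes the \(\alpha \)-th fold of \(\mathcal{D}\) and \(\mathcal{D}^{(-\alpha )}\) its complement; the function names are illustrative and not from the paper.

```python
# Index-level sketch of the nested split above (names are illustrative):
# D is cut into K folds D^(alpha); each D^(alpha) is cut into K sub-folds
# D^(alpha, beta); D^(-alpha, beta) collects the beta-th sub-folds of all
# folds other than alpha; D^(-alpha, -beta) is the rest of D^(-alpha).
import numpy as np

def nested_folds(N, K):
    idx = np.arange(N)                                   # assumes K^2 divides N
    D = np.array_split(idx, K)                           # D^(1), ..., D^(K)
    sub = [np.array_split(D[a], K) for a in range(K)]    # sub[a][b] = D^(a, b)

    def D_minus(alpha):                                  # D^(-alpha)
        return np.concatenate([D[g] for g in range(K) if g != alpha])

    def D_minus_beta(alpha, beta):                       # D^(-alpha, beta)
        return np.concatenate([sub[g][beta] for g in range(K) if g != alpha])

    def D_minus_minus(alpha, beta):                      # D^(-alpha, -beta)
        return np.setdiff1d(D_minus(alpha), D_minus_beta(alpha, beta))

    return D, sub, D_minus, D_minus_beta, D_minus_minus
```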
Let
Then,
By the Taylor expansion, we can obtain
The right-hand side of (8) is written as \(h_{m,x}^{(-\alpha )}\).
We denote the elements of \(\mathcal{D}^{(-\alpha )}\) by \((x_1^{(-\alpha )},y_1^{(-\alpha )}),...,(x_{N^{\prime }}^{(-\alpha )},y_{N^{\prime }}^{(-\alpha )})\), where \(N^{\prime }=(K-1)L\). We define \(N^{\prime }\times M\) matrices \(U^{(-\alpha )}, X^{(-\alpha )}, X_0^{(-\alpha )}\) and \(\Delta _0^{(-\alpha )}\) whose (i, j)-th elements are
Let \(y^{(-\alpha )}\) be a vector \((y_1^{(-\alpha )},..., y_{N^{\prime }}^{(-\alpha )})^T\). Then,
Here, \(A_0\) is the \(M\times M\) matrix whose (i, j)-th element is \(\text{ E }( f_i(x;\theta _i^0)f_j(x;\theta _j^0))\), \(A(\lambda ,N^{\prime }) = A_0+\lambda /N^{\prime }I\), and \(w(\lambda ,N^{\prime })=A(\lambda ,N^{\prime })^{-1}b\), where b is the M-dimensional vector whose i-th element is \(\text{ E }( yf_i(x;\theta _i^0))\).
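A convenient way to read these definitions (a standard ridge-regression identity, stated here only for orientation) is that \(w(\lambda ,N^{\prime })\) minimizes the population criterion \(\text{ E }\{ ( y-\sum _{m=1}^M w_m f_m(x;\theta _m^0))^2\} +(\lambda /N^{\prime })\Vert w\Vert ^2\): setting the gradient with respect to w to zero gives \((A_0+\lambda /N^{\prime }I)w=b\), that is, \(w(\lambda ,N^{\prime })=A(\lambda ,N^{\prime })^{-1}b\).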
By expanding \(\text{ ACV}_2\), we can obtain
The first term of (9) is \(\text{ CV }\).
The second term of (9) is
Here, we consider the following expectation:
where \(x_1,...,x_N\) are independent, \(\text{ E }( b_j(x_j)) =0\) for \(j=1,...,N\), and \(\text{ E }( d_l(x_l)) =0\) for \(l=1,...,N\). Then,
The second term of (10) is bounded by
By using (11), the expectation of the second term of (9) is bounded by
where \(c_1\) is a constant that does not depend on K and \(C_1(K)=\min [ K^2(K-1)^3/(K^2-K+1)^2,K^2(K-1)^2/\{ (K+1)^2(K-2)\} ]\).
By calculating a bound of the expectation of the third term, we can obtain
where \(C_2(K)=(K-1)^3/(K^2-K+1)\) and \(c_2\) is a constant that does not depend on K. Thus, by taking K large in advance, the bias of \(\text{ ACV}_2\) can be made close to the bias of \(\text{ CV }\).
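For concreteness, both constants grow roughly linearly in K: for example, \(C_1(10)\approx 8.4\) and \(C_2(10)\approx 8.0\), while \(C_1(20)\approx 18.2\) and \(C_2(20)\approx 18.0\). Fixing \(K=10\) or 20 in advance, as assumed at the beginning of this section, therefore already corresponds to moderately large values of these constants.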
Next, we consider \(\text{ ACV}_1\). By using
we can calculate \(\hat{w}_m(\lambda ;\mathcal{D}^{(-\alpha )}) -\hat{v}_m^{(-\alpha )}(\lambda ;\mathcal{D})\). Expanding \(\text{ ACV}_1\) as in (9), we can obtain
where \(c_1\) is a constant that does not depend on K. Thus, the bias cannot be made small simply by taking K large. This result comes from the fact that the coefficient of the term \(\Psi _m^{\prime }(\mathcal{D}^{(\alpha )};\theta _m^0)\) in \(f_m(x;\hat{\theta }_m^{(-\alpha ,-\beta _{(\alpha ,x,y)})})-f_m(x;\theta _m^{(-\alpha _{(x,y)})})\) is \(1/\{ L(K-1)\}\), while the coefficient of the term \(\Psi _m^{\prime }(\mathcal{D}^{(\alpha )};\theta _m^0)\) in \(f_m(x;\hat{\theta }_m^{(-\alpha ,-\beta _{(\alpha ,x,y)})})-g_m^{(-\alpha )}(x)\) is \(1/\{ LK(K-1)\}\).
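To see the difference in orders, note that with \(L=N/K\) the first coefficient is \(1/\{ L(K-1)\} =K/\{ N(K-1)\}\), which stays of order 1/N however large K is taken, whereas the second is \(1/\{ LK(K-1)\} =1/\{ N(K-1)\}\), which decreases as K grows; this is why a large K helps \(\text{ ACV}_2\) but not \(\text{ ACV}_1\).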