On the Selection of the Regularization Parameter in Stacking

Abstract

Stacking is a model combination technique for improving prediction accuracy. Regularization is usually necessary in stacking because some of the predictions used in the model combination are similar to one another. Cross-validation is generally used to select the regularization parameter, but it incurs a high computational cost. This paper proposes two simple, computationally inexpensive methods for selecting the regularization parameter. The effectiveness of the methods is examined in numerical experiments. Asymptotic results in a particular setting are also presented.

Author information

Correspondence to Tadayoshi Fushiki.

Asymptotic Results

We assume that N is divisible by \(K^2\) and that \(L_{\alpha }=L=N/K\). In this paper, K is fixed and L goes to infinity as N goes to infinity, but we assume that K is chosen large enough in advance; typically, \(K=10\) or 20. We also assume that the regularization parameter \(\lambda \) is chosen from (0, EN) for a fixed \(E>0\). Each model is parameterized by a finite-dimensional parameter \(\theta _m\): \(\{ f_1(x;\theta _1)\} ,\dots ,\{ f_M(x;\theta _M)\}\). The parameters are estimated by M-estimation:

$$\begin{aligned} \hat{\theta }_m(\mathcal{D}) = \mathop {\text{ argmin }}_{\theta _m\in \Theta _m}\{\Psi _m(\mathcal{D};\theta _m)\} , \end{aligned}$$

where \(\Psi _m(\mathcal{D};\theta _m)=\sum _{i=1}^N\Psi _m((x_i,y_i);\theta _m)\). For example,

$$\begin{aligned} \Psi _m(\mathcal{D};\theta _m) = \sum _{i=1}^N \{ y_i-f_m(x_i;\theta _m)\}^2. \end{aligned}$$
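
Concretely, if the base model is linear, \(f_m(x;\theta _m)=x^T\theta _m\) (an illustrative choice, not one made in the paper), this least-squares criterion reduces to ordinary least squares. A minimal sketch:

```python
import numpy as np

def m_estimate_least_squares(X, y):
    """theta_hat = argmin_theta sum_i (y_i - x_i^T theta)^2.

    Least-squares instance of Psi_m above for the illustrative linear
    model f_m(x; theta_m) = x^T theta_m; X is (N, d), y is (N,)."""
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta_hat
```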

In this section, we use the following notation:

$$\begin{aligned}&\Psi _m^{\prime }(\mathcal{D};\theta _m) = \nabla _{\theta _m}\Psi _m(\mathcal{D};\theta _m), \\&\Psi _m^{\prime \prime }(\mathcal{D};\theta _m) = \nabla _{\theta _m}\nabla _{\theta _m}^T\Psi _m(\mathcal{D};\theta _m), \\&J_m(\theta _m) = \text{ E }(\Psi _m^{\prime \prime }((x,y);\theta _m)) , \\&\theta _m^0 = \mathop {\text{ argmin }}_{\theta _m\in \Theta _m}\{\text{ E }(\Psi _m((x,y);\theta _m))\} . \end{aligned}$$

Accordingly,

$$\begin{aligned} \hat{\theta }_m(\mathcal{D})-\theta _m^0 \approx -N^{-1}J_m(\theta _m^0)^{-1}\Psi _m^{\prime }(\mathcal{D};\theta _m^0). \end{aligned}$$
(7)
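
Display (7) is the standard first-order expansion for M-estimators; a brief sketch, assuming \(\hat{\theta }_m\) solves the estimating equation \(\Psi _m^{\prime }(\mathcal{D};\hat{\theta }_m)=0\) and the usual smoothness conditions hold:

$$\begin{aligned} 0 = \Psi _m^{\prime }(\mathcal{D};\hat{\theta }_m) \approx \Psi _m^{\prime }(\mathcal{D};\theta _m^0) + \Psi _m^{\prime \prime }(\mathcal{D};\theta _m^0)(\hat{\theta }_m-\theta _m^0), \qquad N^{-1}\Psi _m^{\prime \prime }(\mathcal{D};\theta _m^0) \approx J_m(\theta _m^0), \end{aligned}$$

which yields (7) after solving for \(\hat{\theta }_m-\theta _m^0\).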

We assume regularity conditions under which the first-order approximation is valid. We use the following abbreviations:

$$\begin{aligned} \hat{\theta }_m = \hat{\theta }_m(\mathcal{D}),\quad \hat{\theta }_m^{(-\alpha )} = \hat{\theta }_m(\mathcal{D}^{(-\alpha )}),\quad \hat{\theta }_m^{(-\alpha ,-\beta )} = \hat{\theta }_m(\mathcal{D}^{(-\alpha ,-\beta )}). \end{aligned}$$

For the asymptotic calculations, we assume that each \(\mathcal{D}^{(-\alpha )}\) is divided into \(\mathcal{D}^{(-\alpha ,1)},\dots ,\mathcal{D}^{(-\alpha ,K)}\) as follows. First, each \(\mathcal{D}^{(\alpha )}\) is divided into \(\mathcal{D}^{(\alpha ,1)},\dots ,\mathcal{D}^{(\alpha ,K)}\). Second, \(\mathcal{D}^{(-\alpha ,\beta )}=\cup _{\gamma =1,\gamma \ne \alpha }^K \mathcal{D}^{(\gamma ,\beta )}\). Third, \(\mathcal{D}^{(-\alpha ,-\beta )}= \mathcal{D}^{(-\alpha )}\backslash \mathcal{D}^{(-\alpha ,\beta )}\).
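
To make the nested split concrete, here is a minimal Python sketch of the partition, assuming indices \(0,\dots ,N-1\) and contiguous blocks (the block assignment is an illustrative choice; the paper does not prescribe one):

```python
import numpy as np

def nested_folds(N, K):
    """Outer folds D^(alpha) of size L = N/K, inner blocks D^(alpha, beta)
    of size L/K, and the derived index sets used in the text.
    N must be divisible by K^2, as assumed above."""
    assert N % (K * K) == 0
    L = N // K
    idx = np.arange(N)
    outer = {a: idx[a * L:(a + 1) * L] for a in range(K)}                  # D^(alpha)
    inner = {(a, b): outer[a][b * (L // K):(b + 1) * (L // K)]             # D^(alpha, beta)
             for a in range(K) for b in range(K)}
    minus = {a: np.concatenate([outer[g] for g in range(K) if g != a])     # D^(-alpha)
             for a in range(K)}
    minus_beta = {(a, b): np.concatenate([inner[(g, b)] for g in range(K) if g != a])
                  for a in range(K) for b in range(K)}                     # D^(-alpha, beta)
    minus_minus = {(a, b): np.setdiff1d(minus[a], minus_beta[(a, b)])      # D^(-alpha, -beta)
                   for a in range(K) for b in range(K)}
    return outer, inner, minus, minus_beta, minus_minus

# Example: N = 400, K = 10 gives |D^(alpha)| = 40 and |D^(-alpha, -beta)| = (K-1)^2 L / K = 324.
```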

Let

$$\begin{aligned}&\mathcal{D}_1^{(-\alpha ,-\beta _{(\alpha ,x,y)})} = \mathcal{D}^{(-\alpha ,-\beta _{(\alpha ,x,y)})}\cap \mathcal{D}^{(-\alpha _{(x,y)})} , \\&\mathcal{D}_2^{(-\alpha ,-\beta _{(\alpha ,x,y)})} = \mathcal{D}^{(-\alpha ,-\beta _{(\alpha ,x,y)})}\cap \mathcal{D}^{(\alpha _{(x,y)})} , \\&\mathcal{D}_1^{(-\alpha ,\beta _{(\alpha ,x,y)})} = \mathcal{D}^{(-\alpha ,\beta _{(\alpha ,x,y)})}\cap \mathcal{D}^{(-\alpha _{(x,y)})} , \\&\mathcal{D}_2^{(-\alpha ,\beta _{(\alpha ,x,y)})} = \mathcal{D}^{(-\alpha ,\beta _{(\alpha ,x,y)})}\cap \mathcal{D}^{(\alpha _{(x,y)})}. \end{aligned}$$

Then,

$$\begin{aligned}&\mathcal{D}^{(-\alpha ,-\beta _{(\alpha ,x,y)})} = \mathcal{D}_1^{(-\alpha ,-\beta _{(\alpha ,x,y)})}\cup \mathcal{D}_2^{(-\alpha ,-\beta _{(\alpha ,x,y)})} , \\&\mathcal{D}^{(-\alpha )} = \mathcal{D}_1^{(-\alpha ,-\beta _{(\alpha ,x,y)})}\cup \mathcal{D}_2^{(-\alpha ,-\beta _{(\alpha ,x,y)})}\cup \mathcal{D}_1^{(-\alpha ,\beta _{(\alpha ,x,y)})}\cup \mathcal{D}_2^{(-\alpha ,\beta _{(\alpha ,x,y)})} , \\&\mathcal{D}^{(-\alpha _{(x,y)})} = \mathcal{D}^{(\alpha )}\cup \mathcal{D}_1^{(-\alpha ,-\beta _{(\alpha ,x,y)})}\cup \mathcal{D}_1^{(-\alpha ,\beta _{(\alpha ,x,y)})}. \end{aligned}$$

By a Taylor expansion, we obtain

$$\begin{aligned}&f_m(x;\hat{\theta }_m^{(-\alpha ,-\beta _{(\alpha ,x,y)})})-g_m^{(-\alpha )}(x) \nonumber \\\approx & {} -\nabla _{\theta _m}f_m(x;\theta _m^0)^T(J_m(\theta _m^0))^{-1} \left\{ \frac{1}{LK(K-1)^2}\Psi _m^{\prime }(\mathcal{D}_1^{(-\alpha ,-\beta _{(\alpha ,x,y)})};\theta _m^0)\right. \nonumber \\&\quad \left. +\frac{K^2-K+1}{LK(K-1)^2}\Psi _m^{\prime }(\mathcal{D}_2^{(-\alpha ,-\beta _{(\alpha ,x,y)})};\theta _m^0) \right. \nonumber \\&\left. -\frac{K+1}{LK(K-1)}\Psi _m^{\prime }(\mathcal{D}_1^{(-\alpha ,\beta _{(\alpha ,x,y)})};\theta _m^0) - \frac{1}{LK(K-1)}\Psi _m^{\prime }(\mathcal{D}_2^{(-\alpha ,\beta _{(\alpha ,x,y)})};\theta _m^0)\right. \nonumber \\&\quad \left. -\frac{1}{LK(K-1)}\Psi _m^{\prime }(\mathcal{D}^{(\alpha )};\theta _m^0) \right\} . \end{aligned}$$
(8)

The right-hand side of (8) is denoted by \(h_{m,x}^{(-\alpha )}\).

We denote the elements of \(\mathcal{D}^{(-\alpha )}\) by \((x_1^{(-\alpha )},y_1^{(-\alpha )}),...,(x_{N^{\prime }}^{(-\alpha )},y_{N^{\prime }}^{(-\alpha )})\), where \(N^{\prime }=(K-1)L\). We define \(N^{\prime }\times M\) matrices \(U^{(-\alpha )}, X^{(-\alpha )}, X_0^{(-\alpha )}\), and \(\Delta _0^{(-\alpha )}\) whose \((i,j)\)-th elements are

$$\begin{aligned}&(U^{(-\alpha )})_{ij}=g_j(x_i^{(-\alpha )}),\\&(X^{(-\alpha )})_{ij}=f_j(x_i^{(-\alpha )};\hat{\theta }_j^{(-\alpha ,-\beta _{(\alpha ,x_i^{(-\alpha )},y_i^{(-\alpha )})})}),\\&(X_0^{(-\alpha )})_{ij}=f_j(x_i^{(-\alpha )};\theta _j^0),\\&(\Delta _0^{(-\alpha )})_{ij}=h_{j,x_i^{(-\alpha )}}^{(-\alpha )}. \end{aligned}$$

Let \(y^{(-\alpha )}\) be the vector \((y_1^{(-\alpha )},..., y_{N^{\prime }}^{(-\alpha )})^T\). Then,

$$\begin{aligned}&\hat{u}^{(-\alpha )}(\lambda ;\mathcal{D}) \approx \hat{w}(\lambda ;\mathcal{D}^{(-\alpha )}) + A(\lambda ,N^{\prime })^{-1} [{N^{\prime }}^{-1}{\Delta _0^{(-\alpha )}}^T \{ y^{(-\alpha )}-X_0^{(-\alpha )}w(\lambda ,N^{\prime })\} \\&\quad - {N^{\prime }}^{-1}{X_0^{(-\alpha )}}^T{\Delta _0^{(-\alpha )}}w(\lambda ,N^{\prime })] . \end{aligned}$$

Here, \(A_0\) is the \(M\times M\) matrix whose \((i,j)\)-th element is \(\text{ E }( f_i(x;\theta _i^0)f_j(x;\theta _j^0))\), \(A(\lambda ,N^{\prime }) = A_0+(\lambda /N^{\prime })I\), and \(w(\lambda ,N^{\prime })=A(\lambda ,N^{\prime })^{-1}b\), where b is the M-dimensional vector whose i-th element is \(\text{ E }( yf_i(x;\theta _i^0))\).
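
At the sample level, \(w(\lambda ,N^{\prime })\) is the population analogue of a ridge-type solution. As a point of reference, a minimal sketch of the corresponding empirical computation, assuming the stacking weights take the standard ridge form \((X^TX/n+(\lambda /n)I)^{-1}X^Ty/n\) for a matrix X of base-model predictions (this mirrors the definitions above and is not taken verbatim from the paper):

```python
import numpy as np

def stacking_weights(X, y, lam):
    """Ridge-regularized stacking weights.

    X: (n, M) matrix of base-model predictions (level-one features),
    y: (n,) responses, lam: regularization parameter lambda > 0.
    Empirical analogue of w(lambda, n) = A(lambda, n)^{-1} b with
    A = X^T X / n + (lambda / n) I and b = X^T y / n."""
    n, M = X.shape
    A = X.T @ X / n + (lam / n) * np.eye(M)
    b = X.T @ y / n
    return np.linalg.solve(A, b)
```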

By expanding \(\text{ ACV}_2\), we obtain

$$\begin{aligned} \text{ ACV}_2= & {} \frac{1}{N}\sum _{\alpha =1}^K\sum _{(x,y)\in \mathcal{D}^{(\alpha )}} \left\{ y-\sum _{m=1}^M\hat{u}_m^{(-\alpha )}(\lambda ;\mathcal{D})f_m(x;\hat{\theta }_m^{(-\alpha )}) \right\} ^2 \nonumber \\= & {} \frac{1}{N}\sum _{\alpha =1}^K\sum _{(x,y)\in \mathcal{D}^{(\alpha )}} \left\{ y-\sum _{m=1}^M\hat{w}_m(\lambda ;\mathcal{D}^{(-\alpha )})f_m(x;\hat{\theta }_m^{(-\alpha )}) \right\} ^2 \nonumber \\&+ \frac{1}{N}\sum _{\alpha =1}^K\sum _{(x,y)\in \mathcal{D}^{(\alpha )}} \left[ \sum _{m=1}^M \left\{ \hat{w}_m(\lambda ;\mathcal{D}^{(-\alpha )}) - \hat{u}_m^{(-\alpha )}(\lambda ;\mathcal{D})\right\} f_m(x;\hat{\theta }_m^{(-\alpha )}) \right] ^2 \nonumber \\&\quad + \frac{2}{N}\sum _{\alpha =1}^K\sum _{(x,y)\in \mathcal{D}^{(\alpha )}} \sum _{m=1}^M \left\{ \hat{w}_m(\lambda ;\mathcal{D}^{(-\alpha )}) - \hat{u}_m^{(-\alpha )}(\lambda ;\mathcal{D})\right\} f_m(x;\hat{\theta }_m^{(-\alpha )}) \nonumber \\&\quad \times \left\{ y-\sum _{l=1}^M\hat{w}_l(\lambda ;\mathcal{D}^{(-\alpha )})f_l(x;\hat{\theta }_l^{(-\alpha )}) \right\} . \end{aligned}$$
(9)
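
The decomposition is obtained by writing each residual as \(a+b\) and expanding the square (a restatement of (9), not an additional result):

$$\begin{aligned} \left\{ y-\sum _{m=1}^M\hat{u}_m^{(-\alpha )}(\lambda ;\mathcal{D})f_m(x;\hat{\theta }_m^{(-\alpha )})\right\} ^2 = (a+b)^2 = a^2+b^2+2ab, \end{aligned}$$

where \(a = y-\sum _{m=1}^M\hat{w}_m(\lambda ;\mathcal{D}^{(-\alpha )})f_m(x;\hat{\theta }_m^{(-\alpha )})\) and \(b = \sum _{m=1}^M\{ \hat{w}_m(\lambda ;\mathcal{D}^{(-\alpha )})-\hat{u}_m^{(-\alpha )}(\lambda ;\mathcal{D})\} f_m(x;\hat{\theta }_m^{(-\alpha )})\); averaging \(a^2\), \(b^2\), and \(2ab\) over the data gives the three terms of (9).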

The first term of (9) is \(\text{ CV }\).

The second term of (9) is

$$\begin{aligned}&\frac{1}{N}\sum _{\alpha =1}^K\sum _{(x,y)\in \mathcal{D}^{(\alpha )}} \sum _{i=1}^M\sum _{j=1}^Mf_i(x;\hat{\theta }_i^{(-\alpha )}) \left\{ \hat{w}_i(\lambda ;\mathcal{D}^{(-\alpha )})-\hat{u}_i^{(-\alpha )}(\lambda ;\mathcal{D})\right\} \\&f_j(x;\hat{\theta }_j^{(-\alpha )}) \left\{ \hat{w}_j(\lambda ;\mathcal{D}^{(-\alpha )})-\hat{u}_j^{(-\alpha )}(\lambda ;\mathcal{D})\right\} \\&\quad \simeq \frac{1}{K}\sum _{\alpha =1}^K\sum _{i=1}^M\sum _{j=1}^M(A(\lambda ,N^{\prime })^{-1}A_0A(\lambda ,N^{\prime })^{-1})_{ij}\\&\times \left( \frac{1}{N^{\prime }}\sum _{k=1}^{N^{\prime }} f_i(x_k^{(-\alpha )};\theta _i^0)\sum _{l=1}^M\left[ h_{l,x_k^{(-\alpha )}}^{(-\alpha )}w_l^0 -h_{i,x_k^{(-\alpha )}}^{(-\alpha )} \left\{ y_k^{(-\alpha )}-\sum _{l=1}^Mw_l^0f_l(x_k^{(-\alpha )};\theta _l^0)\right\} \right] \right) \\&\times \left( \frac{1}{N^{\prime }}\sum _{m=1}^{N^{\prime }} f_j(x_m^{(-\alpha )};\theta _j^0)\sum _{n=1}^M\left[ h_{n,x_m^{(-\alpha )}}^{(-\alpha )}w_n^0 -h_{j,x_m^{(-\alpha )}}^{(-\alpha )} \left\{ y_m^{(-\alpha )}-\sum _{n=1}^Mw_n^0f_n(x_m^{(-\alpha )};\theta _n^0)\right\} \right] \right) . \end{aligned}$$

Here, we consider the following expectation:

$$\begin{aligned} \text{ E }\left( a(x_i)\left\{ \sum _{j=1}^Nb_j(x_j)\right\} c(x_k)\left\{ \sum _{l=1}^Nd_l(x_l)\right\} \right) , \end{aligned}$$

where \(x_1,...,x_N\) are independent, \(\text{ E }( b_j(x_j)) =0\) for \(j=1,...,N\), and \(\text{ E }( d_l(x_l)) =0\) for \(l=1,...,N\). Then,

$$\begin{aligned}&\text{ E }\left( a(x_i)\left\{ \sum _{j=1}^Nb_j(x_j)\right\} c(x_k)\left\{ \sum _{l=1}^Nd_l(x_l)\right\} \right) \nonumber \\&= \text{ E }\left( a(x_i)\left\{ \sum _{j\in \{ i,k\} }b_j(x_j)\right\} c(x_k)\left\{ \sum _{l\in \{ i,k\}}d_l(x_l)\right\} \right) \nonumber \\&\quad +\text{ E }\left( a(x_i)\left\{ \sum _{j\notin \{ i,k\}}b_j(x_j)\right\} c(x_k)\left\{ \sum _{l\notin \{ i,k\}}d_l(x_l)\right\} \right) . \end{aligned}$$
(10)

By independence and the Cauchy–Schwarz inequality, the second term of (10) is bounded by

$$\begin{aligned}&\text{ E }\left( a(x_i)\left\{ \sum _{j\notin \{ i,k\}}b_j(x_j)\right\} c(x_k)\left\{ \sum _{l\notin \{ i,k\}}d_l(x_l)\right\} \right) \nonumber \\&\le \left| \text{ E }( a(x_i)c(x_k))\right| \left( \text{ E }\left( \left\{ \sum _{j\notin \{ i,k\}}b_j(x_j)\right\} ^2\right) \right) ^{1/2} \left( \text{ E }\left( \left\{ \sum _{l\notin \{ i,k\}}d_l(x_l)\right\} ^2\right) \right) ^{1/2} \nonumber \\&\le \left| \text{ E }( a(x_i)c(x_k))\right| \left[ \text{ E }\left( \sum _{j=1}^Nb_j(x_j)^2\right) \right] ^{1/2} \left[ \text{ E }\left( \sum _{l=1}^Nd_l(x_l)^2\right) \right] ^{1/2}. \end{aligned}$$
(11)

The second inequality holds because the \(b_j(x_j)\) and the \(d_l(x_l)\) are independent with mean zero. By using (11), the expectation of the second term of (9) is bounded by

$$\begin{aligned} c_1/\{ NC_1(K)\} +o(N^{-1}), \end{aligned}$$
(12)

where \(c_1\) is a constant which does not depend on K and \(C_1(K)=\min [ K^2(K-1)^3/(K^2-K+1)^2,K^2(K-1)^2/\{ (K+1)^2(K-2)\} ]\).

By also bounding the expectation of the third term, we obtain

$$\begin{aligned} |\text{ E }(\text{ ACV}_2-\text{ CV})| \le c_2/\{ NC_2(K)\} +o(N^{-1}), \end{aligned}$$

where \(C_2(K)=(K-1)^3/(K^2-K+1)\) and \(c_2\) is a constant that does not depend on K. Thus, by taking K large in advance, the bias of \(\text{ ACV}_2\) can be made close to that of \(\text{ CV }\).
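
To illustrate how these bounds depend on K, here is a small numerical check of the constants, using only the definitions of \(C_1(K)\) and \(C_2(K)\) given above:

```python
def C1(K):
    # C_1(K) = min( K^2 (K-1)^3 / (K^2 - K + 1)^2 , K^2 (K-1)^2 / ((K+1)^2 (K-2)) )
    return min(K**2 * (K - 1)**3 / (K**2 - K + 1)**2,
               K**2 * (K - 1)**2 / ((K + 1)**2 * (K - 2)))

def C2(K):
    # C_2(K) = (K-1)^3 / (K^2 - K + 1)
    return (K - 1)**3 / (K**2 - K + 1)

for K in (10, 20):
    print(K, C1(K), C2(K))
# Both constants grow roughly linearly in K, so the O(1/(N C(K))) bias bounds
# for ACV_2 become smaller as K increases.
```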

Next, we consider \(\text{ ACV}_1\). By using

$$\begin{aligned}&f_m(x;\hat{\theta }_m^{(-\alpha ,-\beta _{(\alpha ,x,y)})})-f_m(x;\hat{\theta }_m^{(-\alpha _{(x,y)})}) \\&\approx -\nabla _{\theta _m}f_m(x;\theta _m^0)^TJ_m(\theta _m^0)^{-1} \left\{ \frac{1}{L(K-1)^2}\Psi _m^{\prime }(\mathcal{D}_1^{(-\alpha ,-\beta _{(\alpha ,x,y)})};\theta _m^0)\right. \\&\quad \left. +\frac{K}{L(K-1)^2}\Psi _m^{\prime }(\mathcal{D}_2^{(-\alpha ,-\beta _{(\alpha ,x,y)})};\theta _m^0) \right. \\&\quad \left. - \frac{1}{L(K-1)}\Psi _m^{\prime }(\mathcal{D}_1^{(-\alpha ,\beta _{(\alpha ,x,y)})};\theta _m^0) - \frac{1}{L(K-1)}\Psi _m^{\prime }(\mathcal{D}^{(\alpha )};\theta _m^0) \right\} , \end{aligned}$$

we can calculate \(\hat{w}_m(\lambda ;\mathcal{D}^{(-\alpha )}) -\hat{v}_m^{(-\alpha )}(\lambda ;\mathcal{D})\). Expanding \(\text{ ACV}_1\) as in (9), we obtain

$$\begin{aligned} |\text{ E }(\text{ ACV}_1-\text{ CV})| \ge c_1/\{ N(K-1)/K\} +o(N^{-1}), \end{aligned}$$

where \(c_1\) is a constant that does not depend on K. Thus, the bias cannot be reduced by taking K large. This result comes from the fact that the coefficient of the term \(\Psi _m^{\prime }(\mathcal{D}^{(\alpha )};\theta _m^0)\) in \(f_m(x;\hat{\theta }_m^{(-\alpha ,-\beta _{(\alpha ,x,y)})})-f_m(x;\hat{\theta }_m^{(-\alpha _{(x,y)})})\) is \(1/\{ L(K-1)\}\), while the coefficient of the same term in \(f_m(x;\hat{\theta }_m^{(-\alpha ,-\beta _{(\alpha ,x,y)})})-g_m^{(-\alpha )}(x)\) is \(1/\{ LK(K-1)\}\).
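
Equivalently, since \(L=N/K\) (a restatement of the comparison above, not an additional result),

$$\begin{aligned} \frac{1}{L(K-1)} = \frac{K}{N(K-1)}, \qquad \frac{1}{LK(K-1)} = \frac{1}{N(K-1)}, \end{aligned}$$

so the \(\mathcal{D}^{(\alpha )}\) contribution is K times larger in the expansion underlying \(\text{ ACV}_1\), and the corresponding lower bound \(c_1/\{ N(K-1)/K\} =c_1K/\{ N(K-1)\}\) does not shrink as K grows.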


Cite this article

Fushiki, T. On the Selection of the Regularization Parameter in Stacking. Neural Process Lett 53, 37–48 (2021). https://doi.org/10.1007/s11063-020-10378-6
