
Nonparametric variable selection and its application to additive models

Annals of the Institute of Statistical Mathematics

Abstract

Variable selection for multivariate nonparametric regression models usually relies on a parameterized approximation of the nonparametric functions in the objective function. This approximation, however, often increases the number of parameters substantially, leading to the “curse of dimensionality” and inaccurate estimation. In this paper, we propose a novel and easily implemented approach to variable selection in nonparametric models that requires no parameterized approximation and achieves selection consistency. The proposed method is then applied to variable selection for additive models. A two-stage procedure combining selection and adaptive estimation is proposed, and its properties are investigated. The two-stage algorithm adapts to the smoothness of the underlying components, and the estimation can attain a parametric convergence rate if the underlying model is in fact parametric. Simulation studies are conducted to examine the performance of the proposed method, and a real data example is analyzed for illustration.
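
For concreteness, the sparse additive model referred to above can be written (in illustrative notation of our own, not quoted from the paper) as

$$\begin{aligned} Y = \mu + \sum _{j=1}^{p} f_j(X_j) + \varepsilon , \qquad f_j \equiv 0 \ \text { for all but a small subset of the indices } j, \end{aligned}$$

where the first stage selects the indices with \(f_j \not \equiv 0\) and the second stage estimates the selected components adaptively.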


References

  • Candès, E., Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$ (with discussion). The Annals of Statistics, 35, 2313–2404.

  • Chaudhuri, P., Huang, M.-C., Loh, W.-Y., Yao, R. (1994). Piecewise-polynomial regression trees. Statistica Sinica, 4, 143–167.

  • Cook, R. D. (1998). Regression graphics: Ideas for studying regressions through graphics. New York: Wiley.

  • Cook, R. D., Weisberg, S. (1991). Discussion of ‘Sliced inverse regression for dimension reduction’. Journal of the American Statistical Association, 86, 328–332.

  • Cui, X., Peng, H., Wen, S. Q., Zhu, L. X. (2013). Component selection in the additive regression model. Scandinavian Journal of Statistics, 40(3), 491–510.

  • Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.

  • Fan, J., Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849–911.

  • Fan, J., Feng, Y., Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106, 544–557.

  • Guan, Y., Xie, C., Zhu, L. (2017). Sufficient dimension reduction with mixture multivariate skew elliptical distributions. Statistica Sinica, 27(1), 335–355.

  • Hall, P., Li, K.-C. (1993). On almost linearity of low dimensional projections from high dimensional data. The Annals of Statistics, 21, 867–889.

  • Härdle, W. (1990). Applied nonparametric regression. Econometric Society Monograph Series, 19. Cambridge: Cambridge University Press.

  • Härdle, W., Marron, J. S. (1985). Optimal bandwidth selection in nonparametric regression function estimation. The Annals of Statistics, 13, 1465–1481.

  • Hastie, T., Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1, 297–318.

  • Li, B., Wang, S. (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102, 997–1008.

  • Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327.

  • Li, K.-C., Duan, N. (1989). Regression analysis under link violation. The Annals of Statistics, 17(3), 1009–1052.

  • Li, K.-C., Lue, H. H., Chen, C. H. (2000). Interactive tree-structured regression via principal Hessian directions. Journal of the American Statistical Association, 95, 547–560.

  • Li, R., Zhong, W., Zhu, L. P. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129–1139.

  • Lin, L., Cui, X., Zhu, L. X. (2009). An adaptive two-stage estimation method for additive models. Scandinavian Journal of Statistics, 36, 248–269.

  • Lin, L., Sun, J., Zhu, L. X. (2013). Nonparametric feature screening. Computational Statistics and Data Analysis, 36, 162–174.

  • Lin, Y., Zhang, H. (2006). Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34, 2272–2297.

  • Meier, L., van de Geer, S., Bühlmann, P. (2009). High-dimensional additive modeling. The Annals of Statistics, 37, 3779–3821.

  • Storlie, C. B., Bondell, H. D., Reich, B. J., Zhang, H. H. (2011). Surface estimation, variable selection, and the nonparametric oracle property. Statistica Sinica, 21, 679–705.

  • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.

  • Wahba, G. (1990). Spline models for observational data. CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 59. Philadelphia: SIAM.

  • Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49–67.

  • Zhao, P., Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7, 2541–2563.

  • Zhu, L. P., Wang, T., Zhu, L. X., Ferré, L. (2010). Sufficient dimension reduction through discretization-expectation estimation. Biometrika, 97, 295–304.

  • Zhu, L. P., Li, L. X., Li, R., Zhu, L. X. (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106, 1464–1474.

  • Zhu, L. X., Miao, B. Q., Peng, H. (2006). On sliced inverse regression with high dimensional covariates. Journal of the American Statistical Association, 101, 630–643.

  • Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.


Acknowledgements

The authors thank Dr. Yang Feng at Columbia University for providing their NIS (nonparametric independence screening) code. Dr. Lixing Zhu’s work was supported by a grant from the Research Grants Council of Hong Kong, a Faculty Research Grant (FRG) from Hong Kong Baptist University, and a grant (NSFC 11671042) from the National Natural Science Foundation of China. Dr. Zhenghui Feng’s work was supported by the Natural Science Foundation of Fujian Province of China (Grant No. 2017J01006) and by the German Research Foundation (DFG) via the International Research Training Group 1792 “High Dimensional Nonstationary Time Series,” Humboldt-Universität zu Berlin. The authors thank the editor, associate editors, and referees for their constructive suggestions and comments, which led to a significant improvement of an earlier version of the manuscript.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Lixing Zhu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix


Proof of Theorem 1

Recall the definitions of \({\eta }\) and \(\mathbf {Z}\), and that \({\mathbf{A}_d }^{T}\mathbf {X}={\eta }^{T}\mathbf {Z}\). \({A}_1\) is a p-dimensional vector whose first d elements are 1 and whose remaining elements are 0. We have

$$\begin{aligned} \varvec{\Sigma }^{-1}\mathrm {E}(\mathbf {X} h(Y))= & {} \varvec{\Sigma }^{-1/2}({\mathbf{B}}_1,\eta _1/\Vert \eta _1\Vert )({\mathbf{B}}_1, \eta _1/\Vert \eta _1\Vert )^{\top } \mathrm {E}\left( \mathbf {Z} h(Y)\right) \nonumber \\= & {} \varvec{\Sigma }^{-1/2}{\mathbf{B}}_1{\mathbf{B}}_1^{\top }\mathrm {E}\left( \mathbf {Z} h(Y)\right) + \varvec{\Sigma }^{-1/2}\eta _1 \eta _1^{\top } \mathrm {E}\left( \mathbf {Z} h(Y)\right) /\Vert \eta _1\Vert ^2\nonumber \\= & {} \varvec{\Sigma }^{-1/2}{\mathbf{B}}_1{\mathbf{B}}_1^{\top }\mathrm {E}\left( \mathbf {Z} h(Y)\right) +{A}_1 {A}_1^{\top } \mathrm {E}\left( \mathbf {X} h(Y)\right) /\Vert \eta _1\Vert ^2\nonumber \\=: & {} \varvec{\Sigma }^{-1/2}{\mathbf{B}}_1\mathrm {E}\left( \mathrm {E} ({\mathbf{B}}_1^{\top }\mathbf {Z}|Y) h(Y)\right) +c_h{ A}_1. \end{aligned}$$
(24)

The first term is clearly equal to zero when \(\mathrm {E}({\mathbf{B}}_1^{\top }\mathbf {Z}|Y)=0\) almost surely. Thus, (3) implies (4). On the other hand, when (4) holds, the first term in (24) is zero. Since \(\varvec{\Sigma }^{-1/2}{\mathbf{B}}_1\mathrm {E}\big (\mathrm {E} ({\mathbf{B}}_1^{\top }\mathbf {Z}|Y) h(Y)\big ) =0\) for any transformation \(h(\cdot )\), it follows that \(\mathrm {E} ({\mathbf{B}}_1^{\top }\mathbf {Z}|Y) =0 \) almost surely; that is, (4) implies (3). When the distribution of \(\mathbf {Z}\) is elliptically symmetric, Eq. (4) can be proved similarly to Li and Duan (1989). \(\square \)
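
A brief note on the first equality in (24), for readers tracing the algebra: it rests on the identities below, under the assumptions (consistent with the display but not restated here) that \(\mathbf {Z}=\varvec{\Sigma }^{-1/2}\mathbf {X}\) and that the columns of \(({\mathbf{B}}_1,\eta _1/\Vert \eta _1\Vert )\) form an orthonormal basis of \({\mathbb {R}}^p\):

$$\begin{aligned} ({\mathbf{B}}_1,\eta _1/\Vert \eta _1\Vert )({\mathbf{B}}_1,\eta _1/\Vert \eta _1\Vert )^{\top } = {\mathbf{B}}_1{\mathbf{B}}_1^{\top } + \frac{\eta _1\eta _1^{\top }}{\Vert \eta _1\Vert ^2} = {\mathbf{I}}_p, \qquad \varvec{\Sigma }^{-1}\mathrm {E}(\mathbf {X} h(Y)) = \varvec{\Sigma }^{-1/2}\mathrm {E}(\mathbf {Z} h(Y)). \end{aligned}$$

Inserting the identity matrix from the first relation into the second gives the first line of (24).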

Proof of Theorem 3

Here, we give a sketch of the proof of Theorem 3, which is Theorem 1 in Lin et al. (2009); for details, please refer to Lin et al. (2009).

To prove Theorem 3, two lemmas are needed.

Lemma 1

Suppose the conditions of Theorem 3 hold and denote

$$\begin{aligned} \hat{f}_{1M}(x_1) = f_{1M}(x_1) \frac{\sum _{i=1}^n \{Y_i - \mu - \sum _{j=2}^d {f}_{jM}(x_{ij})\}\int _{s_{i-1}}^{s_i}K\left( \frac{t-x_1}{h}\right) f_{1M}(t)\mathrm{d}t}{\int _0^1 K\left( \frac{t-x_1}{h}\right) (f_{1M}(t))^2\mathrm{d}t}, \end{aligned}$$

where the \(f_{jM}(x_j)\) are defined in Sect. 3.1, and \(s_i\), \(i=0,\ldots ,n\), are defined by \(s_0=0\), \(s_i=(x_{i1}+x_{(i+1)1})/2\) for \(i=1,\ldots ,n-1\), and \(s_n=1\), with the first covariate values ordered as \(0\le x_{11}<x_{21}<\cdots <x_{n1}\le 1\). Then, as \(h\rightarrow 0\) and \(n\rightarrow \infty \), the bias and variance of \(\hat{f}_{1M}(x_1)\) can be expressed, respectively, as

$$\begin{aligned} \hbox {bias}(\hat{f}_{1M}(x_1))= & {} \frac{1}{2}h^2 \sigma _K^2M^{-\gamma _{12}}e_{21}(x_1)+ O(n^{-1})+ o(h^2M^{-\gamma _{12}}) + O(M^{-\gamma _{0}}),\\ \hbox {var}(\hat{f}_{1M}(x_1))= & {} \frac{\sigma ^2J_K}{nhp_1(x_1)}+ O(n^{-1})+O(n^{-2}h^{-2}), \end{aligned}$$

where \(\gamma _0 = \min \{\gamma _{j0}; j=1,2,\ldots ,d\}\), and \(e_{21}(x_1)\) is defined in condition C2, which requires \(\lim _{M\rightarrow \infty }M^{\gamma _{j2}}r_{jM}''(x_j)=e_{j2}(x_j)\) for \(j=1,\ldots ,d\).

Lemma 2

Let the conditions of Theorem 3 hold and let

$$\begin{aligned} \check{f}_{1M}(x_1)= & {} f_{1M}(x_1) \\&+ \frac{\sum _{i=1}^n \left\{ Y_i - \mu - \sum _{j=2}^d {f}_{jM}(x_{ij})\right\} \int _{s_{i-1}}^{s_i}K\left( \frac{t-x_1}{h}\right) \mathrm{d}t-\int _0^1K \left( \frac{t-x_1}{h}\right) f_{1M}(t)\mathrm{d}t}{\int _0^1 K\left( \frac{t-x_1}{h}\right) }. \end{aligned}$$

Then as \(h\rightarrow 0\) and \(n\rightarrow \infty \), the bias and variance of \(\check{f}_{1M}(x_1)\) can be expressed, respectively, as

$$\begin{aligned} \hbox {bias}(\check{f}_{1M}(x_1))= & {} \frac{1}{2}h^2 \sigma _K^2M^{-\gamma _{12}}e_{21}(x_1)+ O(n^{-1})+ o(h^2M^{-\gamma _{12}}) + O(M^{-\gamma _{0}}),\\ \hbox {var}(\check{f}_{1M}(x_1))= & {} \frac{\sigma ^2J_K}{nhp_1(x_1)}+ O(n^{-1})+O(n^{-2}h^{-2}), \end{aligned}$$

where \(\gamma _0 = \min \{\gamma _{j0}; j=1,2,\ldots ,d\}\), and \(e_{21}(x_1)\) is defined in condition C2, which requires \(\lim _{M\rightarrow \infty }M^{\gamma _{j2}}r_{jM}''(x_j)=e_{j2}(x_j)\) for \(j=1,\ldots ,d\).
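
As a side remark (not part of the original proof), the leading terms in Lemmas 1 and 2 exhibit the usual bias–variance trade-off in the bandwidth. Ignoring the remainder terms and treating \(M\) as fixed, the pointwise mean squared error is approximately

$$\begin{aligned} \hbox {MSE}(x_1) \approx \frac{1}{4}h^4 \sigma _K^4M^{-2\gamma _{12}}e_{21}^2(x_1)+ \frac{\sigma ^2J_K}{nhp_1(x_1)}, \end{aligned}$$

and minimizing over \(h\) gives \(h \propto \left( M^{2\gamma _{12}}/n\right) ^{1/5}\); the approximation residual thus shrinks the leading bias by the factor \(M^{-\gamma _{12}}\) and permits a larger bandwidth than standard one-dimensional smoothing at the same sample size.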

To prove Theorem 3, recall that in Sect. 3.2 we defined \(r_{jM}(x_j)=f_j(x_j)-f_{jM}(x_j)\); here, we further define

$$\begin{aligned} R_M(x)=\sum _{j=1}^d f_j(x_j)-\sum _{j=1}^d \sum _{l=1}^M \beta ^0_{jl}q_l(x_j). \end{aligned}$$

Then, the first-stage estimators of \(f_j(x_j)\) can be expressed as

$$\begin{aligned} \tilde{f}_{j}(x_j) = f_{jM}(x_j) + \frac{1}{n}\sum _{i=1}^n PA(x_i)(\varepsilon _i + R_M(x_i)), j= 1,\ldots ,d, \end{aligned}$$

where \(PA(x_i)\) denotes the summation component in equation (A1) of Lin et al. (2009). Using a Taylor expansion of \(\hat{f}_1(x_1)-\hat{f}_{1M}(x_1)\) at \(f_{1M}(x_1)\), we obtain

$$\begin{aligned} \hat{f}_1(x_1)-\hat{f}_{1M}(x_1)= & {} B_1(x_1)(\tilde{f}_1(x_1)-f_{1M}(x_1))\\&+ B_2(x_1)\left\{ \tilde{\mu }-\mu _0 +\sum _{j=2}^d(\tilde{f}_j(x_j)-f_{jM}(x_j))\right\} + o_p(hn^{-1}M), \end{aligned}$$

where \(B_1(x_1)=\eta _1/\eta _2+f_1(x_1)\eta _3/\eta _2-2f_1(x_1)\eta _1/\eta _4^2\) and \(B_2(x_1)=-f_1(x_1)\eta _5/\eta _2\), with \(\eta _1= \sum _{i=1}^n\{Y_i-\mu ^0-\sum _{j=2}^d f_{jM}(x_{ij})\}\int _{s_{i-1}}^{s_i}K(\frac{t-x_1}{h})f_{1M}(t)\mathrm{d}t\), \(\eta _2 = \int _0^1 K(\frac{t-x_1}{h})f^2_{1M}(t)\mathrm{d}t\), \(\eta _3=\sum _{i=1}^n\{Y_i-\mu ^0-\sum _{j=2}^d f_{jM}(x_{ij})\}\int _{s_{i-1}}^{s_i}K(\frac{t-x_1}{h})\mathrm{d}t\), \(\eta _4 = \int _0^1 K(\frac{t-x_1}{h})f_{1M}(t)\mathrm{d}t\), and \(\eta _5=\sum _{i=1}^n Y_i\int _{s_{i-1}}^{s_i}K(\frac{t-x_1}{h})f_{1M}(t)\mathrm{d}t\).

From the results above, we have

$$\begin{aligned} E\{\eta _1(\tilde{f}_1(x_1)-f_{1M}(x_1))\} = O(hM^{-\gamma _0+1})+O(hn^{-1}M). \end{aligned}$$

Similarly, \(E\{\eta _3(\tilde{f}_1(x_1)-f_{1M}(x_1))\} = O(hM^{-\gamma _0+1})+O(hn^{-1}M)\). Consequently,

$$\begin{aligned} E\{B_1(x_1)(\tilde{f}_1(x_1)-f_{1M}(x_1))\}=O(M^{-\gamma _0+1})+O(n^{-1}M). \end{aligned}$$

In addition, \(E\{B_2(x_1)[\tilde{\mu }-\mu +\sum _{j=2}^d(\tilde{f}_j(x_j)-f_{jM}(x_j))]\} =O(M^{-\gamma _0+1})+O(n^{-1}M)\). Combining these results with Lemma 1 and conditions C1 and C2 yields the expressions for \(\hbox {bias}(\hat{f}_{1}(x_1))\) and \(\hbox {var}(\hat{f}_{1}(x_1))\) given in Theorem 3.

The second part of the proof is similar and is therefore omitted. \(\square \)

About this article


Cite this article

Feng, Z., Lin, L., Zhu, R. et al. Nonparametric variable selection and its application to additive models. Ann Inst Stat Math 72, 827–854 (2020). https://doi.org/10.1007/s10463-019-00711-9

