Nonparametric variable selection and its application to additive models

Feng, Zhenghui; Lin, Lu; Zhu, Ruoqing; Zhu, Lixing

doi:10.1007/s10463-019-00711-9

Published: 29 March 2019

Annals of the Institute of Statistical Mathematics, Volume 72, pages 827–854 (2020)

Abstract

Variable selection for multivariate nonparametric regression models usually involves a parameterized approximation of the nonparametric functions in the objective function. However, this parameterized approximation often increases the number of parameters significantly, leading to the “curse of dimensionality” and inaccurate estimation. In this paper, we propose a novel and easily implemented approach to variable selection in nonparametric models without parameterized approximation, which enables selection consistency to be achieved. The proposed method is then applied to variable selection for additive models. A two-stage procedure with selection and adaptive estimation is proposed, and the properties of this method are investigated. This two-stage algorithm is adaptive to the smoothness of the underlying components, and the estimation consistency can reach a parametric rate if the underlying model is in fact parametric. Simulation studies are conducted to examine the performance of the proposed method. Furthermore, a real data example is analyzed for illustration.

References

Candès, E., Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$ (with discussion). The Annals of Statistics, 35, 2313–2404.
Chaudhuri, P., Huang, M.-C., Loh, W.-Y., Yao, R. (1994). Piecewise-polynomial regression trees. Statistica Sinica, 4, 143–167.
Cook, R. D. (1998). Regression graphics: Ideas for studying regressions through graphics. New York: Wiley.
Cook, R. D., Weisberg, S. (1991). Discussion of ‘Sliced inverse regression for dimension reduction’. Journal of the American Statistical Association, 86, 28–33.
Cui, X., Peng, H., Wen, S. Q., Zhu, L. X. (2013). Component selection in the additive regression model. Scandinavian Journal of Statistics, 40(3), 491–510.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
Fan, J., Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849–911.
Fan, J., Feng, Y., Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106, 544–557.
Guan, Y., Xie, C., Zhu, L. (2017). Sufficient dimension reduction with mixture multivariate skew elliptical distributions. Statistica Sinica, 27(1), 335–355.
Hall, P., Li, K.-C. (1993). On almost linearity of low dimensional projections from high dimensional data. The Annals of Statistics, 21, 867–889.
Härdle, W. (1990). Applied nonparametric regression. Econometric Society Monograph Series, 19. Cambridge: Cambridge University Press.
Härdle, W., Marron, J. S. (1985). Optimal bandwidth selection in nonparametric regression function estimation. The Annals of Statistics, 13, 1465–1481.
Hastie, T., Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1, 297–318.
Li, B., Wang, S. (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102, 997–1008.
Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327.
Li, K.-C., Duan, N. (1989). Regression analysis under link violation. The Annals of Statistics, 17(3), 1009–1052.
Li, K.-C., Lue, H. H., Chen, C. H. (2000). Interactive tree-structured regression via principal Hessian directions. Journal of the American Statistical Association, 95, 547–560.
Li, R., Zhong, W., Zhu, L. P. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129–1139.
Lin, L., Cui, X., Zhu, L. X. (2009). An adaptive two-stage estimation method for additive models. Scandinavian Journal of Statistics, 36, 248–269.
Lin, L., Sun, J., Zhu, L. X. (2013). Nonparametric feature screening. Computational Statistics and Data Analysis, 36, 162–174.
Lin, Y., Zhang, H. (2006). Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34, 2272–2297.
Meier, L., Van der Geer, S., Bühlmann, P. (2009). High-dimensional additive modeling. The Annals of Statistics, 37, 3779–3821.
Storlie, C. B., Bondell, H. D., Reich, B. J., Zhang, H. H. (2011). Surface estimation, variable selection, and the nonparametric oracle property. Statistica Sinica, 21, 679–705.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Wahba, G. (1990). Spline models for observational data, vol. 59. SIAM. CBMS-NSF Regional Conference Series in Applied Mathematics.
Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49–67.
Zhao, P., Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7, 2541–2563.
Zhu, L. P., Wang, T., Zhu, L. X., Ferré, L. (2010). Sufficient dimension reduction through discretization-expectation estimation. Biometrika, 97, 295–304.
Zhu, L. P., Li, L. X., Li, R., Zhu, L. X. (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106, 1464–1474.
Zhu, L. X., Miao, B. Q., Peng, H. (2006). On sliced inverse regression with high dimensional covariates. Journal of the American Statistical Association, 101, 630–643.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.

Acknowledgements

The authors thank Dr. Yang Feng at Columbia University for providing their NIS code. Dr. Lixing Zhu’s work was supported by a grant from the Research Grants Council of Hong Kong, a Faculty Research Grant (FRG) from Hong Kong Baptist University, and a grant (NSFC11671042) from the National Natural Science Foundation of China. Dr. Zhenghui Feng’s work was supported by the Natural Science Foundation of Fujian Province of China, Grant No. 2017J01006, and by the German Research Foundation (DFG) via the International Research Training Group 1792 “High Dimensional Nonstationary Time Series,” Humboldt-Universität zu Berlin. The authors thank the editor, associate editors, and referees for their constructive suggestions and comments, which led to a significant improvement over an earlier version of the manuscript.

Author information

Authors and Affiliations

School of Economics, and the Wang Yanan Institute for Studies in Economics, Xiamen University, Xiamen, 361005, Fujian, China
Zhenghui Feng
Zhongtai Securities Institute for Financial Studies, Shandong University, Jinan, 250100, Shandong, China
Lu Lin
School of Statistics, Qufu Normal University, Qufu, 273165, Shandong, China
Lu Lin
Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, IL, 61820, USA
Ruoqing Zhu
Department of Mathematics, Hong Kong Baptist University, Kowloon Tong, Kowloon, Hong Kong, China
Lixing Zhu
School of Statistics, Beijing Normal University, Beijing, 100875, China
Lixing Zhu

Corresponding author

Correspondence to Lixing Zhu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Proof of Theorem 1

Recall the definitions of ${\eta }$ and $\mathbf {Z}$, and the identity ${\mathbf{A}_d }^{T}\mathbf {X}={\eta }^{T}\mathbf {Z}$. Here ${A}_1$ is a p-dimensional vector whose first d elements are 1 and whose remaining elements are 0. We have

$$\begin{aligned} \varvec{\Sigma }^{-1}\mathrm {E}(\mathbf {X} h(Y))= & {} \varvec{\Sigma }^{-1/2}({\mathbf{B}}_1,\eta _1/\Vert \eta _1\Vert )({\mathbf{B}}_1, \eta _1/\Vert \eta _1\Vert )^{\top } \mathrm {E}\left( \mathbf {Z} h(Y)\right) \\= & {} \varvec{\Sigma }^{-1/2}{\mathbf{B}}_1{\mathbf{B}}_1^{\top }\mathrm {E}\left( \mathbf {Z} h(Y)\right) + \varvec{\Sigma }^{-1/2}\eta _1 \eta _1^{\top } \mathrm {E}\left( \mathbf {Z} h(Y)\right) /\Vert \eta _1\Vert ^2\\= & {} \varvec{\Sigma }^{-1/2}{\mathbf{B}}_1{\mathbf{B}}_1^{\top }\mathrm {E}\left( \mathbf {Z} h(Y)\right) +{A}_1 {A}_1^{\top } \mathrm {E}\left( \mathbf {X} h(Y)\right) /\Vert \eta _1\Vert ^2\\=: & {} \varvec{\Sigma }^{-1/2}{\mathbf{B}}_1\mathrm {E}\left( \mathrm {E} ({\mathbf{B}}_1^{\top }\mathbf {Z}|Y) h(Y)\right) +c_h{ A}_1. \end{aligned} \tag{24}$$

The first term is clearly zero when the condition $\mathrm {E}({\mathbf{B}}_1^{\top }\mathbf {Z}|Y)=0$ holds almost surely. Thus, (3) implies (4). Conversely, when (4) holds, the first term in (24) is zero. Since $\varvec{\Sigma }^{-1/2}{\mathbf{B}}_1\mathrm {E}\big (\mathrm {E} ({\mathbf{B}}_1^{\top }\mathbf {Z}|Y) h(Y)\big ) =0$ for any transformation $h(\cdot )$, it follows that $\mathrm {E} ({\mathbf{B}}_1^{\top }\mathbf {Z}|Y) =0 $ almost surely. Hence, (4) implies (3). When the distribution of $\mathbf {Z}$ is elliptically symmetric, Eq. (4) can be proved by an argument similar to that in Li and Duan (1989). $\square $
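The proportionality in (24) can be checked numerically in a simple special case. The sketch below is illustrative only and does not reproduce the paper's setting: it assumes a single-index model $Y=g({A}_1^{\top }\mathbf {X})+\varepsilon $ with Gaussian $\mathbf {X}$, $g=\tanh $, and $h$ the identity, all of which are choices made here for the example. The function name `ols_direction` is hypothetical.

```python
import numpy as np

def ols_direction(X, hY):
    """Sample version of Sigma^{-1} E(X h(Y)) with centred X and h(Y)."""
    Xc = X - X.mean(axis=0)
    hc = hY - hY.mean()
    Sigma_hat = Xc.T @ Xc / len(X)
    return np.linalg.solve(Sigma_hat, Xc.T @ hc / len(X))

rng = np.random.default_rng(0)
n, p, d = 200_000, 6, 2

# Assumed design: Gaussian X with AR(1)-type covariance, Sigma_jk = 0.5^{|j-k|}.
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# A_1: first d elements equal to 1, the rest 0, as in the proof.
A1 = np.r_[np.ones(d), np.zeros(p - d)]

# Assumed single-index response with a monotone link and additive noise.
Y = np.tanh(X @ A1) + 0.5 * rng.standard_normal(n)

b = ols_direction(X, Y)  # h is the identity here
print(np.round(b, 3))    # active entries roughly equal, inactive entries near 0
```

Under these assumptions the estimated vector should be close to a multiple of ${A}_1$, so entries outside the first $d$ positions are near zero even though the covariates are correlated.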

Proof of Theorem 3

Here we sketch the proof of Theorem 3, which is Theorem 1 in Lin et al. (2009); for details, please refer to that paper.

To prove Theorem 3, two lemmas are needed.

Lemma 1

Suppose the conditions of Theorem 3 hold and denote

$$\begin{aligned} \hat{f}_{1M}(x_1) = f_{1M}(x_1) \frac{\sum _{i=1}^n \{Y_i - \mu - \sum _{j=2}^d {f}_{jM}(x_{ij})\}\int _{s_{i-1}}^{s_i}K\left( \frac{t-x_1}{h}\right) f_{1M}(t)\mathrm{d}t}{\int _0^1 K\left( \frac{t-x_1}{h}\right) (f_{1M}(t))^2\mathrm{d}t}, \end{aligned}$$

where the $f_{jM}(x_j)$ are defined in Sect. 3.1. The points $s_i, i=0,\ldots ,n$, are defined as $s_0=0$, $s_i=(x_{i1}+x_{(i+1)1})/2, i=1,\ldots ,n-1$, and $s_n=1$, where $0\le x_{11}<x_{21}<\cdots <x_{n1}\le 1$ are the ordered observations of the first covariate. Then, as $h\rightarrow 0$ and $n\rightarrow \infty $, the bias and variance of $\hat{f}_{1M}(x_1)$ can be expressed, respectively, as

$$\begin{aligned} \hbox {bias}(\hat{f}_{1M}(x_1))= & {} \frac{1}{2}h^2 \sigma _K^2M^{-\gamma _{12}}e_{21}(x_1)+ O(n^{-1})+ o(h^2M^{-\gamma _{12}}) + O(M^{-\gamma _{0}}),\\ \hbox {var}(\hat{f}_{1M}(x_1))= & {} \frac{\sigma ^2J_K}{nhp_1(x_1)}+ O(n^{-1})+O(n^{-2}h^{-2}), \end{aligned}$$

where $\gamma _0 = \min \{\gamma _{j0}; j=1,2,\ldots ,d\}$, and $e_{21}(x_1)$ is defined in condition C2 and satisfies $\lim _{M\rightarrow \infty }M^{\gamma _{j2}}r_{jM}''(x_j)=e_{j2}(x_j),j=1,\ldots ,d$.
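To make the display defining $\hat{f}_{1M}(x_1)$ concrete, the following is a minimal numerical sketch of that formula alone, not of the full two-stage procedure. The Epanechnikov kernel, the trapezoidal integration grid, the function names, and passing the first-stage fit $f_{1M}$ as a callable `pilot` are all assumptions made for illustration; `resid` stands for $Y_i-\mu -\sum _{j=2}^d f_{jM}(x_{ij})$.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel (an illustrative choice of K)."""
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def integrate(values, t):
    """Trapezoidal rule for values sampled on the grid t."""
    return float(np.sum((values[1:] + values[:-1]) / 2.0 * np.diff(t)))

def f1_hat_multiplicative(x0, x1, resid, pilot, h, grid_size=4001):
    """Evaluate the Lemma 1 display at x0 for covariate values x1 in [0, 1]."""
    order = np.argsort(x1)
    x1, resid = x1[order], resid[order]
    # s_0 = 0, s_i = midpoints of consecutive ordered observations, s_n = 1.
    s = np.concatenate(([0.0], (x1[:-1] + x1[1:]) / 2.0, [1.0]))
    t = np.linspace(0.0, 1.0, grid_size)
    kf = epanechnikov((t - x0) / h) * pilot(t)
    # Cumulative integral of K((t - x0)/h) * pilot(t), then one value per cell.
    cum = np.concatenate(([0.0], np.cumsum((kf[1:] + kf[:-1]) / 2.0 * np.diff(t))))
    cell = np.diff(np.interp(s, t, cum))
    numerator = np.sum(resid * cell)
    denominator = integrate(kf * pilot(t), t)
    return pilot(x0) * numerator / denominator

# Noise-free check: if the residuals equal 1.2 * pilot(x), the multiplicative
# correction should rescale the pilot by roughly 1.2.
rng = np.random.default_rng(1)
x1 = rng.uniform(0.0, 1.0, 500)
pilot = lambda t: 1.0 + t
resid = 1.2 * pilot(x1)
est = f1_hat_multiplicative(0.5, x1, resid, pilot, h=0.2)
print(est, 1.2 * pilot(0.5))
```

The check uses a pilot that is bounded away from zero, since the ratio form divides by a weighted integral of the squared pilot.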

Lemma 2

Let the conditions of Theorem 3 hold and let

$$\begin{aligned} \check{f}_{1M}(x_1)= & {} f_{1M}(x_1) \\&+ \frac{\sum _{i=1}^n \left\{ Y_i - \mu - \sum _{j=2}^d {f}_{jM}(x_{ij})\right\} \int _{s_{i-1}}^{s_i}K\left( \frac{t-x_1}{h}\right) \mathrm{d}t-\int _0^1K \left( \frac{t-x_1}{h}\right) f_{1M}(t)\mathrm{d}t}{\int _0^1 K\left( \frac{t-x_1}{h}\right) \mathrm{d}t}. \end{aligned}$$

Then as $h\rightarrow 0$ and $n\rightarrow \infty $, the bias and variance of $\check{f}_{1M}(x_1)$ can be expressed, respectively, as

$$\begin{aligned} \hbox {bias}(\check{f}_{1M}(x_1))= & {} \frac{1}{2}h^2 \sigma _K^2M^{-\gamma _{12}}e_{21}(x_1)+ O(n^{-1})+ o(h^2M^{-\gamma _{12}}) + O(M^{-\gamma _{0}}),\\ \hbox {var}(\check{f}_{1M}(x_1))= & {} \frac{\sigma ^2J_K}{nhp_1(x_1)}+ O(n^{-1})+O(n^{-2}h^{-2}), \end{aligned}$$

where $\gamma _0 = \min \{\gamma _{j0}; j=1,2,\ldots ,d\}$, and $e_{21}(x_1)$ is defined in condition C2 and satisfies $\lim _{M\rightarrow \infty }M^{\gamma _{j2}}r_{jM}''(x_j)=e_{j2}(x_j),j=1,\ldots ,d$.
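As a rough reading of the expansions in Lemmas 1 and 2, and not a result stated in the source, one can balance the leading bias and variance terms. Assuming that the remainder terms and the $O(M^{-\gamma _0})$ approximation bias are negligible and that $e_{21}(x_1)\ne 0$, the leading mean squared error (written $\mathrm {AMSE}$ here, a label introduced only for this illustration) and its minimiser over $h$ are

$$\begin{aligned} \mathrm {AMSE}(h)= & {} \frac{1}{4}h^4\sigma _K^4M^{-2\gamma _{12}}e_{21}^2(x_1)+\frac{\sigma ^2J_K}{nhp_1(x_1)},\\ h_{\mathrm {opt}}= & {} \left\{ \frac{\sigma ^2J_KM^{2\gamma _{12}}}{\sigma _K^4e_{21}^2(x_1)p_1(x_1)}\right\} ^{1/5}n^{-1/5}, \end{aligned}$$

so that $\mathrm {AMSE}(h_{\mathrm {opt}})$ is of order $n^{-4/5}M^{-2\gamma _{12}/5}$. This heuristic requires $M^{2\gamma _{12}}=o(n)$ so that $h_{\mathrm {opt}}\rightarrow 0$, and it suggests that a better first-stage approximation (larger $\gamma _{12}$ or larger $M$) lowers the leading error of the second stage.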

To prove Theorem 3, recall that in Sect. 3.2 we defined $r_{jM}(x_j)=f_j(x_j)-f_{jM}(x_j)$; here, we further define

$$\begin{aligned} R_M(x)=\sum _{j=1}^d f_j(x_j)-\sum _{j=1}^d \sum _{l=1}^M \beta ^0_{jl}q_l(x_j). \end{aligned}$$

Then, the first stage estimators of $f_j(x_j)$ can be expressed as

$$\begin{aligned} \tilde{f}_{j}(x_j) = f_{jM}(x_j) + \frac{1}{n}\sum _{i=1}^n PA(x_i)(\varepsilon _i + R_M(x_i)), j= 1,\ldots ,d, \end{aligned}$$

where $PA(x_i)$ denotes the summation component in equation (A1) of Lin et al. (2009). Using a Taylor expansion, $\hat{f}_1(x_1)-\hat{f}_{1M}(x_1)$ can be expanded at $f_{1M}(x_1)$, which gives

$$\begin{aligned} \hat{f}_1(x_1)-\hat{f}_{1M}(x_1)= & {} B_1(x_1)(\tilde{f}_1(x_1)-f_{1M}(x_1))\\&+ B_2(x_1)\left\{ \tilde{\mu }-\mu _0 +\sum _{j=2}^d(\tilde{f}_j(x_j)-f_{jM}(x_j))\right\} + o_p(hn^{-1}M), \end{aligned}$$

where $B_1(x_1)=\eta _1/\eta _2+f_1(x_1)\eta _3/\eta _2-2f_1(x_1)\eta _1/\eta _4^2$ and $B_2(x_1)=-f_1(x_1)\eta _5/\eta _2$, with

$$\begin{aligned} \eta _1= & {} \sum _{i=1}^n\left\{ Y_i-\mu ^0-\sum _{j=2}^d f_{jM}(x_{ij})\right\} \int _{s_{i-1}}^{s_i}K\left( \frac{t-x_1}{h}\right) f_{1M}(t)\mathrm{d}t,\\ \eta _2= & {} \int _0^1 K\left( \frac{t-x_1}{h}\right) f^2_{1M}(t)\mathrm{d}t,\\ \eta _3= & {} \sum _{i=1}^n\left\{ Y_i-\mu ^0-\sum _{j=2}^d f_{jM}(x_{ij})\right\} \int _{s_{i-1}}^{s_i}K\left( \frac{t-x_1}{h}\right) \mathrm{d}t,\\ \eta _4= & {} \int _0^1 K\left( \frac{t-x_1}{h}\right) f_{1M}(t)\mathrm{d}t,\\ \eta _5= & {} \sum _{i=1}^n Y_i\int _{s_{i-1}}^{s_i}K\left( \frac{t-x_1}{h}\right) f_{1M}(t)\mathrm{d}t. \end{aligned}$$

From the results above, we have

$$\begin{aligned} E\{\eta _1(\tilde{f}_1(x_1)-f_{1M}(x_1))\} = O(hM^{-\gamma _0+1})+O(hn^{-1}M). \end{aligned}$$

Similarly, $E\{\eta _3(\tilde{f}_1(x_1)-f_{1M}(x_1))\} = O(hM^{-\gamma _0+1})+O(hn^{-1}M)$. Therefore,

$$\begin{aligned} E\{B_1(x_1)(\tilde{f}_1(x_1)-f_{1M}(x_1))\}=O(M^{-\gamma _0+1})+O(n^{-1}M). \end{aligned}$$

Likewise, $E\{B_2(x_1)[\tilde{\mu }-\mu +\sum _{j=2}^d(\tilde{f}_j(x_j)-f_{jM}(x_j))]\} =O(M^{-\gamma _0+1})+O(n^{-1}M)$. Combining these results with Lemma 1 and conditions C1 and C2 leads to the final expressions for $\hbox {bias}(\hat{f}_{1}(x_1))$ and $\hbox {var}(\hat{f}_{1}(x_1))$ in Theorem 3.

The second part of the proof is similar, so we omit it here. $\square $

About this article

Cite this article

Feng, Z., Lin, L., Zhu, R. et al. Nonparametric variable selection and its application to additive models. Ann Inst Stat Math 72, 827–854 (2020). https://doi.org/10.1007/s10463-019-00711-9

Received: 25 February 2016
Revised: 24 January 2019
Published: 29 March 2019
Issue Date: June 2020
DOI: https://doi.org/10.1007/s10463-019-00711-9
