Abstract
Variable selection for multivariate nonparametric regression models usually involves a parameterized approximation of the nonparametric functions in the objective function. However, this approximation often increases the number of parameters significantly, leading to the “curse of dimensionality” and inaccurate estimation. In this paper, we propose a novel and easily implemented approach to variable selection in nonparametric models that requires no parameterized approximation and achieves selection consistency. The proposed method is applied to variable selection for additive models. A two-stage procedure with selection and adaptive estimation is proposed, and its properties are investigated. This two-stage algorithm is adaptive to the smoothness of the underlying components, and the estimator can attain a parametric convergence rate if the underlying model is in fact parametric. Simulation studies are conducted to examine the performance of the proposed method, and a real data example is analyzed for illustration.
References
Candès, E., Tao, T. (2007). The Dantzig selector: Statistical estimation when $p$ is much larger than $n$ (with discussion). The Annals of Statistics, 35, 2313–2404.
Chaudhuri, P., Huang, M.-C., Loh, W.-Y., Yao, R. (1994). Piecewise-polynomial regression trees. Statistica Sinica, 4, 143–167.
Cook, R. D. (1998). Regression graphics: Ideas for studying regressions through graphics. New York: Wiley.
Cook, R. D., Weisberg, S. (1991). Discussion of ‘Sliced inverse regression for dimension reduction’. Journal of the American Statistical Association, 86, 28–33.
Cui, X., Peng, H., Wen, S. Q., Zhu, L. X. (2013). Component selection in the additive regression model. Scandinavian Journal of Statistics, 40(3), 491–510.
Fan, J., Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, 1348–1360.
Fan, J., Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70, 849–911.
Fan, J., Feng, Y., Song, R. (2011). Nonparametric independence screening in sparse ultra-high-dimensional additive models. Journal of the American Statistical Association, 106, 544–557.
Guan, Y., Xie, C., Zhu, L. (2017). Sufficient dimension reduction with mixture multivariate skew elliptical distributions. Statistica Sinica, 27(1), 335–355.
Hall, P., Li, K.-C. (1993). On almost linearity of low dimensional projections from high dimensional data. The Annals of Statistics, 21, 867–889.
Härdle, W. (1990). Applied nonparametric regression, econometric society monograph series, 19. Cambridge: Cambridge University Press.
Härdle, W., Marron, J. S. (1985). Optimal bandwidth selection in nonparametric regression function estimation. The Annals of Statistics, 13, 1465–1481.
Hastie, T., Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1, 297–318.
Li, B., Wang, S. (2007). On directional regression for dimension reduction. Journal of the American Statistical Association, 102, 997–1008.
Li, K.-C. (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86, 316–327.
Li, K.-C., Duan, N. (1989). Regression analysis under link violation. The Annals of Statistics, 17(3), 1009–1052.
Li, K.-C., Lue, H. H., Chen, C. H. (2000). Interactive tree-structured regression via principal Hessian directions. Journal of the American Statistical Association, 95, 547–560.
Li, R., Zhong, W., Zhu, L. P. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107, 1129–1139.
Lin, L., Cui, X., Zhu, L. X. (2009). An adaptive two-stage estimation method for additive models. Scandinavian Journal of Statistics, 36, 248–269.
Lin, L., Sun, J., Zhu, L. X. (2013). Nonparametric feature screening. Computational Statistics and Data Analysis, 36, 162–174.
Lin, Y., Zhang, H. (2006). Component selection and smoothing in multivariate nonparametric regression. The Annals of Statistics, 34, 2272–2297.
Meier, L., Van der Geer, S., Bühlmann, P. (2009). High-dimensional additive modeling. The Annals of Statistics, 37, 3779–3821.
Storlie, C. B., Bondell, H. D., Reich, B. J., Zhang, H. H. (2011). Surface estimation, variable selection, and the nonparametric oracle property. Statistica Sinica, 21, 679–705.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267–288.
Wahba, G. (1990). Spline models for observational data, vol. 59. SIAM. CBMS-NSF Regional Conference Series in Applied Mathematics.
Yuan, M., Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68, 49–67.
Zhao, P., Yu, B. (2006). On model selection consistency of Lasso. Journal of Machine Learning Research, 7, 2541–2563.
Zhu, L. P., Wang, T., Zhu, L. X., Ferré, L. (2010). Sufficient dimension reduction through discretization-expectation estimation. Biometrika, 97, 295–304.
Zhu, L. P., Li, L. X., Li, R., Zhu, L. X. (2011). Model-free feature screening for ultrahigh-dimensional data. Journal of the American Statistical Association, 106, 1464–1474.
Zhu, L. X., Miao, B. Q., Peng, H. (2006). On sliced inverse regression with high dimensional covariates. Journal of the American Statistical Association, 101, 630–643.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
Acknowledgements
The authors thank Dr. Yang Feng at Columbia University for providing the NIS code. Dr. Lixing Zhu’s work was supported by a grant from the Research Grants Council of Hong Kong, a Faculty Research Grant (FRG) from Hong Kong Baptist University, and a grant (NSFC11671042) from the National Natural Science Foundation of China. Dr. Zhenghui Feng’s work was supported by the Natural Science Foundation of Fujian Province of China, Grant No. 2017J01006, and by the German Research Foundation (DFG) via the International Research Training Group 1792 “High Dimensional Nonstationary Time Series,” Humboldt-Universität zu Berlin. The authors thank the editor, associate editors, and referees for their constructive suggestions and comments, which led to a significant improvement over an earlier version of the manuscript.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix
Proof of Theorem 1
Recall the definitions of \({\eta }\) and \(\mathbf {Z}\), and that \({\mathbf{A}_d }^{T}\mathbf {X}={\eta }^{T}\mathbf {Z}\). Here \({A}_1\) is a \(p\)-dimensional vector whose first \(d\) elements are 1 and whose remaining elements are 0. We have
The first term is clearly zero when \(\mathrm {E}({\mathbf{B}}_1^{\top }\mathbf {Z}|Y)=0\) almost surely; thus, (3) implies (4). Conversely, when (4) holds, the first term in (24) is zero. Since \(\varvec{\Sigma }^{-1/2}{\mathbf{B}}_1\mathrm {E}\big (\mathrm {E} ({\mathbf{B}}_1^{\top }\mathbf {Z}|Y) h(Y)\big ) =0\) then holds for any transformation \(h(\cdot )\), it follows that \(\mathrm {E} ({\mathbf{B}}_1^{\top }\mathbf {Z}|Y) =0 \) almost surely; hence, (4) implies (3). When the distribution of \(\mathbf {Z}\) is elliptically symmetric, Eq. (4) can be proved along the lines of Li and Duan (1989). \(\square \)
Proof of Theorem 3
Here, we sketch the proof of Theorem 3, which is Theorem 1 in Lin et al. (2009); for details, please refer to that paper.
To prove Theorem 3, two lemmas are needed.
Lemma 1
Suppose the conditions of Theorem 3 hold and denote
where \(f_{1M}(x_j)\) are defined in Sect. 3.1, and the points \(s_i\), \(i=0,\ldots ,n\), are defined by \(s_0=0\), \(s_i=(x_{i1}+x_{(i+1)1})/2\) for \(i=1,\ldots ,n-1\), and \(s_n=1\), with the design points ordered as \(0\le x_{11}<x_{21}<\cdots <x_{n1}\le 1\). Then, as \(h\rightarrow 0\) and \(n\rightarrow \infty \), the bias and variance of \(\hat{f}_{1M}(x_1)\) can be expressed, respectively, as
where \(\gamma _0 = \min \{\gamma _{j0}; j=1,2,\ldots ,d\}\), \(e_{21}(x_1)\) is defined in condition C2, and the functions \(e_{j2}(x_j)\) satisfy \(\lim _{M\rightarrow \infty }M^{\gamma _{j2}}r_{jM}''(x_j)=e_{j2}(x_j),j=1,\ldots ,d\).
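The partition points \(s_i\) in Lemma 1 and the kernel integrals over \([s_{i-1}, s_i]\) that appear throughout this appendix can be illustrated numerically. The following is only a minimal sketch: the Epanechnikov kernel is an assumption made for illustration (the kernel \(K\) is not specified in this appendix), and the function names `partition_points`, `epanechnikov_cdf`, and `kernel_integrals` are hypothetical.

```python
def partition_points(x_sorted):
    """Return [s_0, ..., s_n] with s_0 = 0, s_n = 1, and interior points
    equal to the midpoints of adjacent ordered design points."""
    n = len(x_sorted)
    if n == 0:
        raise ValueError("at least one design point is required")
    if x_sorted[0] < 0.0 or x_sorted[-1] > 1.0:
        raise ValueError("design points must lie in [0, 1]")
    if any(b <= a for a, b in zip(x_sorted, x_sorted[1:])):
        raise ValueError("design points must be strictly increasing")
    mids = [(x_sorted[i] + x_sorted[i + 1]) / 2.0 for i in range(n - 1)]
    return [0.0] + mids + [1.0]


def epanechnikov_cdf(u):
    """Integral of K(v) = 0.75 * (1 - v^2) over [-1, u] (assumed kernel)."""
    if u <= -1.0:
        return 0.0
    if u >= 1.0:
        return 1.0
    return 0.5 + 0.75 * u - 0.25 * u ** 3


def kernel_integrals(x_sorted, x0, h):
    """Return the integral of K((t - x0) / h) over [s_{i-1}, s_i], i = 1..n.

    The substitution u = (t - x0) / h turns each integral into
    h * (F(b) - F(a)), where F is the kernel's cumulative integral.
    """
    s = partition_points(x_sorted)
    return [
        h * (epanechnikov_cdf((s[i] - x0) / h)
             - epanechnikov_cdf((s[i - 1] - x0) / h))
        for i in range(1, len(s))
    ]
```

For example, `kernel_integrals([0.1, 0.3, 0.5, 0.7, 0.9], 0.5, 0.2)` returns one integral per design point. Because the kernel window \([x_0-h, x_0+h]\) lies inside \([0,1]\) in this case, the integrals sum to \(h\).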
Lemma 2
Let the conditions of Theorem 3 hold and let
Then as \(h\rightarrow 0\) and \(n\rightarrow \infty \), the bias and variance of \(\check{f}_{1M}(x_1)\) can be expressed, respectively, as
where \(\gamma _0 = \min \{\gamma _{j0}; j=1,2,\ldots ,d\}\), \(e_{21}(x_1)\) is defined in condition C2, and the functions \(e_{j2}(x_j)\) satisfy \(\lim _{M\rightarrow \infty }M^{\gamma _{j2}}r_{jM}''(x_j)=e_{j2}(x_j),j=1,\ldots ,d\).
To prove Theorem 3, recall that in Sect. 3.2 we defined \(r_{jM}(x_j)=f_j(x_j)-f_{jM}(x_j)\); here, we define
Then, the first stage estimators of \(f_j(x_j)\) can be expressed as
where \(PA(x_i)\) denotes the summation components in equation (A1) of Lin et al. (2009). Using a Taylor expansion, \(\hat{f}_1(x_1)-\hat{f}_{1M}(x_1)\) can be expanded at \(f_{1M}(x_1)\), which gives
where \(B_1(x_1)=\eta _1/\eta _2+f_1(x_1)\eta _3/\eta _2-2f_1(x_1)\eta _1/\eta _4^2\), and \(B_2(x_1)=-f_1(x_1)\eta _5/\eta _2\) with \(\eta _1= \sum _{i=1}^n\{Y_i-\mu ^0-\sum _{j=2}^d f_{jM}(x_{ij})\}\int _{s_{i-1}}^{s_i}K(\frac{t-x_1}{h})f_{1M}(t)\mathrm{d}t\), \(\eta _2 = \int _0^1 K(\frac{t-x_1}{h})f^2_{1M}(t)\mathrm{d}t\), \(\eta _3=\sum _{i=1}^n\{Y_i-\mu ^0-\sum _{j=2}^d f_{jM}(x_{ij})\}\int _{s_{i-1}}^{s_i}K(\frac{t-x_1}{h})\mathrm{d}t\), \(\eta _4 = \int _0^1 K(\frac{t-x_1}{h})f_{1M}(t)\mathrm{d}t\), \(\eta _5=\sum _{i=1}^n Y_i\int _{s_{i-1}}^{s_i}K(\frac{t-x_1}{h})f_{1M}(t)\mathrm{d}t\).
From the results above, we have
Similarly, \(E\{\eta _3(\tilde{f}_1(x_1)-f_{1M}(x_1))\} = O(hM^{-\gamma _0+1})+O(hn^{-1}M)\). So,
Moreover, \(E\{B_2(x_1)[\tilde{\mu }-\mu +\sum _{j=2}^d(\tilde{f}_j(x_j)-f_{jM}(x_j))]\} =O(M^{-\gamma _0+1})+O(n^{-1}M)\). Combining these results with Lemma 1 and conditions C1 and C2 leads to the final expressions for \(\hbox {bias}(\hat{f}_{1}(x_1))\) and \(\hbox {var}(\hat{f}_{1}(x_1))\) in Theorem 3.
The second part of the proof is similar, so we omit it here. \(\square \)
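The quantities \(\eta _1,\ldots ,\eta _5\) used in the proof of Theorem 3 are sums and integrals that can be evaluated numerically once their ingredients are supplied. The sketch below is illustrative only and is not the authors' implementation: it assumes an Epanechnikov kernel and a composite midpoint rule, and it takes the first-stage fit `f1M`, the partial residuals \(Y_i-\mu ^0-\sum _{j=2}^d f_{jM}(x_{ij})\), and the partition points \(s_0,\ldots ,s_n\) as user-supplied inputs. All function and argument names are hypothetical.

```python
def epanechnikov(u):
    """Assumed kernel K(u) = 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def integrate(g, a, b, steps=400):
    """Composite midpoint rule for the integral of g over [a, b]."""
    width = (b - a) / steps
    return width * sum(g(a + (k + 0.5) * width) for k in range(steps))


def eta_terms(x1, h, s, y, partial_resid, f1M):
    """Return (eta_1, ..., eta_5) evaluated at the point x1.

    s             : partition points [s_0, ..., s_n]
    y             : responses Y_1, ..., Y_n
    partial_resid : Y_i - mu^0 - sum_{j>=2} f_jM(x_ij), i = 1..n
    f1M           : callable standing in for the first component fit
    """
    n = len(y)
    if len(s) != n + 1 or len(partial_resid) != n:
        raise ValueError("need n responses, n residuals, n + 1 partition points")

    def k(t):
        return epanechnikov((t - x1) / h)

    def kf(t):
        return k(t) * f1M(t)

    def kf2(t):
        return k(t) * f1M(t) ** 2

    # Cell-wise integrals over [s_{i-1}, s_i], shared by several terms.
    cell_kf = [integrate(kf, s[i], s[i + 1]) for i in range(n)]
    cell_k = [integrate(k, s[i], s[i + 1]) for i in range(n)]

    eta1 = sum(r * c for r, c in zip(partial_resid, cell_kf))
    eta2 = integrate(kf2, 0.0, 1.0, steps=400 * n)
    eta3 = sum(r * c for r, c in zip(partial_resid, cell_k))
    eta4 = integrate(kf, 0.0, 1.0, steps=400 * n)
    eta5 = sum(yi * c for yi, c in zip(y, cell_kf))
    return eta1, eta2, eta3, eta4, eta5
```

As a sanity check, taking `f1M` identically equal to 1 and a kernel window inside \([0,1]\) gives \(\eta _2=\eta _4=h\) up to quadrature error, and \(\eta _1=\eta _3\).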
About this article
Cite this article
Feng, Z., Lin, L., Zhu, R. et al. Nonparametric variable selection and its application to additive models. Ann Inst Stat Math 72, 827–854 (2020). https://doi.org/10.1007/s10463-019-00711-9
DOI: https://doi.org/10.1007/s10463-019-00711-9