Abstract
We consider the variable selection problem in a sparse logistic regression model. Inspired by the square-root Lasso, we develop a weighted score Lasso for logistic regression. The new method yields an \({\ell }_1\) estimation error bound under assumptions similar to those introduced in Bach et al. (Electron J Stat 4:384–414, 2010). Compared to the standard Lasso, the weighted score Lasso provides a direct choice for the tuning parameter. Both theoretical and simulation results confirm the satisfactory performance of the proposed method. We illustrate our methodology with a real microarray data set.
References
Bach F et al (2010) Self-concordant analysis for logistic regression. Electron J Stat 4:384–414
Belloni A, Chernozhukov V, Wang L (2011) Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika 98(4):791–806
Bickel PJ, Ritov Y, Tsybakov AB (2009) Simultaneous analysis of lasso and dantzig selector. Ann Stat 37:1705–1732
Blazere M, Loubes J-M, Gamboa F (2014) Oracle inequalities for a group lasso procedure applied to generalized linear models in high dimension. IEEE Trans Inf Theory 60(4):2303–2318
Bühlmann P, Van De Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, Berlin
Bunea F et al (2008) Honest variable selection in linear and logistic regression models via \(\ell _1\) and \(\ell _1+ \ell _2\) penalization. Electron J Stat 2:1153–1194
Candes E, Tao T (2007) The Dantzig selector: statistical estimation when \(p\) is much larger than \(n\). Ann Stat 35:2313–2351
Dettling M (2004) Bagboosting for tumor classification with gene expression data. Bioinformatics 20(18):3583–3593
Dobson AJ, Barnett A (2008) An introduction to generalized linear models. CRC Press, Boca Raton
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360
Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA et al (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537
Huang Y, Wang C (2001) Consistent functional methods for logistic regression with errors in covariates. J Am Stat Assoc 96(456):1469–1482
Huang J, Ma S, Zhang C-H (2008) The iterated Lasso for high-dimensional logistic regression. Technical report, Department of Statistics and Actuarial Science, The University of Iowa
Kwemou M (2016) Non-asymptotic oracle inequalities for the lasso and group lasso in high dimensional logistic model. ESAIM Probab Stat 20:309–331
Lee SI, Lee H, Abbeel P, Ng AY (2014) Efficient \(l_1\) regularized logistic regression. In: National conference on artificial intelligence
Loh P-L, Wainwright MJ (2013) Regularized \(m\)-estimators with nonconvexity: statistical and algorithmic theory for local optima. In: Advances in neural information processing systems, pp 476–484
Meier L, Van De Geer S, Bühlmann P (2008) The group lasso for logistic regression. J R Stat Soc Ser B (Stat Methodol) 70(1):53–71
Negahban S, Ravikumar P, Wainwright MJ, Yu B (2011) A unified framework for high-dimensional analysis of \(m\)-estimators with decomposable regularizers. In: Advances in neural information processing systems (NIPS)
Negahban SN, Ravikumar P, Wainwright MJ, Yu B et al (2012) A unified framework for high-dimensional analysis of \( m \)-estimators with decomposable regularizers. Stat Sci 27(4):538–557
Sakhanenko AI (1991) Berry-Esseen type estimates for large deviation probabilities. Sib Math J 32(4):647–656
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol) 58:267–288
Van de Geer S (2007) The deterministic lasso. Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich
Van de Geer SA (2008) High-dimensional generalized linear models and the lasso. Ann Stat 36:614–645
Yuan M, Lin Y (2006) Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B (Stat Methodol) 68(1):49–67
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Stat Methodol) 67(2):301–320
Acknowledgements
We thank the anonymous referees and an associate editor for their helpful and constructive comments, which greatly improved this manuscript. This work was supported by GJJ160927.
Appendix
Proof of Theorem 1
Let \(\delta ={\hat{\beta }}-\beta ^0\). Recall that \(I=\{j: \beta ^0_j \ne 0\}\). For convenience, we write \({\ell }_{\omega }\left( \beta \right) \) for \({\ell }_{\omega }\left( \beta ; X, Y\right) \), and similarly \(\nabla {\ell }_{\omega }\left( \beta \right) =\nabla {\ell }_{\omega }\left( \beta ; X, Y\right) \) and \(\nabla ^2{\ell }_{\omega }\left( \beta \right) =\nabla ^2{\ell }_{\omega }\left( \beta ; X, Y\right) \). By the definition of the estimator \({\hat{\beta }}\), we have
Since \({\ell }_{\omega }\left( \beta \right) \) is a convex function, we obtain
Define the event
Combining (14) and (15), on the event A we have
Therefore, on the event A we have \(\delta \in \triangle _{\alpha }\), for \(\alpha =\frac{1+c}{1-c}\).
Define the function \(g(t)={\ell }_{\omega }(\beta ^0+t\delta )\). Following assumptions (A1) and (A4), we have
and invoke the condition \(\delta \in \triangle _{\alpha }\) to obtain
Denote \({\bar{R}}=c_0R(\alpha +1)\sqrt{s}\), then, by Proposition 1 of Bach et al. (2010), we have
Combining (14) and (18), on the event A we obtain
Adding \((1-c)\lambda \Vert \delta \Vert _1\) to both sides of the above inequality and invoking the restricted eigenvalue condition (A5), we also have
Using the fact \(\Vert \delta _I\Vert _1\le \sqrt{s}\Vert \delta _I\Vert _2\), we have
With a short calculation, we obtain, for all \(t\in [0, 1)\), that
Set \(t={\bar{R}}\Vert \delta _I\Vert _2/\left( 2+{\bar{R}}\Vert \delta _I\Vert _2\right) \), then we have
This implies using (20) that
If \(\lambda s < \frac{c(1-c)\rho }{4c_0R}\), we have \({\bar{R}}\Vert \delta _I\Vert _2\le \frac{2c}{1-c}\) and consequently
Combining (19) and (21), we obtain
Using the inequality \(2uv \le au^2+v^2/a\), for any \(a>1\), we further obtain
By taking \(a=\frac{2}{(1-c)\rho }\), then we obtain, on the event A, that
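For completeness, the elementary inequality invoked above is a one-line weighted AM–GM step:

```latex
% For any a > 0 and real u, v:
0 \le \left(\sqrt{a}\,u - \frac{v}{\sqrt{a}}\right)^{2}
   = a u^{2} - 2uv + \frac{v^{2}}{a}
\quad\Longrightarrow\quad
2uv \le a u^{2} + \frac{v^{2}}{a}.
```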
To conclude the proof we determine now \(\lambda =(\sqrt{n}c)^{-1}C(\beta ^0)\Phi ^{-1}\left( 1-\frac{\epsilon }{2p}\right) \) such that \({\mathbb {P}}(A^c)\le \epsilon (1+o(1))\).
Denote \(\xi _{ij}=\omega (x^T_i\beta ^0)\left[ F(x_i^T\beta ^0)-Y_i\right] x_{ij}\), then \({\mathbb {E}}(\xi _{ij})=0\) and \({\mathbb {E}}(\xi _{ij}^2)=\omega ^2(x^T_i\beta ^0)F(x_i^T\beta ^0)\left( 1-F(x_i^T\beta ^0)\right) x_{ij}^2\). Denote \(b=\Phi ^{-1}\left( 1-\frac{\epsilon }{2p}\right) \), then, by (16), we have
We now use Lemma 6 to estimate \({\mathbb {P}}\left\{ \left| \sum _{i=1}^{n}\xi _{ij}\right| > \sqrt{n}C(\beta ^0)b\right\} \).
Lemma 6
(Sakhanenko type moderate deviation theorem (Sakhanenko 1991)) Let \(Z_1,\cdots , Z_n\) be independent random variables with \({\mathbb {E}}(Z_i)=0\) and \(\mid Z_i\mid <1\) for all \(1\le i \le n\). Denote \(B_n^2=\sum _{i=1}^{n}{\mathbb {E}}(Z_i^2)\) and \(L_n=\sum _{i=1}^{n}{\mathbb {E}}( \mid Z_i\mid ^3)/B_n^3\). Then there exists a positive constant A such that for all \(x\in [1, \frac{1}{A}\min \{B_n, L_n^{-1/3}\}]\)
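In the moderate deviation regime, Lemma 6 says the tail probability \({\mathbb {P}}\left( \left| \sum _{i=1}^{n}Z_i\right| > xB_n\right) \) behaves like the Gaussian tail \(2(1-\Phi (x))\). As an illustrative aside (not part of the proof), this can be sanity-checked by a seeded Monte Carlo experiment; the uniform distribution and all constants below are arbitrary choices for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, trials, x = 100, 50_000, 1.5

# Z_i ~ Uniform[-1, 1]: mean zero, |Z_i| <= 1, E(Z_i^2) = 1/3
Z = rng.uniform(-1.0, 1.0, size=(trials, n))
B_n = np.sqrt(n / 3.0)  # B_n^2 = sum_i E(Z_i^2) = n/3

# Empirical tail probability vs. the Gaussian tail 2(1 - Phi(x))
p_hat = np.mean(np.abs(Z.sum(axis=1)) > x * B_n)
p_gauss = 2.0 * (1.0 - norm.cdf(x))

print(p_hat, p_gauss)  # the two values should be close
```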
Since \(Y_i\in \{0, 1\}\) and \({\mathbb {P}}(Y_i=1)=F(x_i^{T}\beta ^0)\le 1\), then, with assumption A1 we have
with a positive constant \(K_1=\underset{1\le i\le n}{\sup }\omega (x^T_i\beta ^0)\).
Let \(Z_{ij}=\xi _{ij}/(RK_1)\), then we have \({\mathbb {E}}Z_{ij}=0\), \(\mid Z_{ij}\mid \le 1\). Furthermore, with assumption A6 we have
Then, \(B_{nj}=O(\sqrt{n})\) and \(L_{nj}=O(1/\sqrt{n})\). By Lemma 6, we have
where the second to last step follows because \(b=O(\sqrt{\log (2p/\epsilon )})\) (see the proof of Theorem 4 for details). As \(n, p\rightarrow \infty \) with \(n\le p =o(e^{n^{1/3}})\), we have
This concludes the proof. \(\square \)
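As an illustrative sketch (not part of the proof), the tuning parameter \(\lambda =(\sqrt{n}c)^{-1}C(\beta ^0)\Phi ^{-1}\left( 1-\frac{\epsilon }{2p}\right) \) determined above can be computed directly; the values of \(c\), \(\epsilon \), and the stand-in for the constant \(C(\beta ^0)\) below are placeholder assumptions, not quantities derived from data:

```python
import numpy as np
from scipy.stats import norm

def score_lasso_lambda(n, p, eps=0.05, c=0.5, C_beta0=1.0):
    """lambda = (sqrt(n) c)^{-1} C(beta^0) Phi^{-1}(1 - eps/(2p)).

    C_beta0 is a placeholder for the (unspecified) constant C(beta^0).
    """
    return C_beta0 * norm.ppf(1.0 - eps / (2.0 * p)) / (np.sqrt(n) * c)

lam = score_lasso_lambda(n=200, p=1000)
print(lam)  # lambda grows like sqrt(log(p)/n)
```

Note that \(\lambda \) increases (slowly, logarithmically) in \(p\) and decreases at rate \(n^{-1/2}\) in \(n\), consistent with the condition on \(\lambda s\) used in the proof.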
Proof of Theorem 2
By the proof of Theorem 1, it suffices to verify that the chosen weight function (10) makes the loss function \(\ell _{\omega }\) satisfy Assumptions (A3) and (A4). Substituting the weight function (10) into the weighted score function (3), we have
and then, the loss function \(\ell _{\omega }\) is given by
Because \(e^{t}\) and \(e^{-t}\) are both convex and three times differentiable, the above loss function \(\ell _{\omega }(\beta ; X, Y)\) satisfies Assumption (A3).
For \(u, v\in \mathbb {R}^{p}\), define \(g(t)=\ell _{\omega }(u+tv; X, Y)\); then we have
Then, for all \(u, v\in \mathbb {R}^{p}\) and for all \(t\in \mathbb {R}\), we have
Therefore, the loss function \(\ell _{\omega }(\beta ; X, Y)\) also satisfies Assumption (A4). \(\square \)
Proof of Theorem 4
It suffices to show that, when \(s(\sqrt{n})^{-1}\sqrt{\log (2p/\epsilon )}\rightarrow 0\), the \(\lambda \) chosen by (8) or (11) satisfies \(\lambda s \le \frac{c(1-c)\rho }{4c_0R}\). Note that, for any \(t > 0\), we have
where \(\phi (\cdot )\) is the density function of the standard normal distribution. Setting \(t=\Phi ^{-1}\left( 1-\frac{\epsilon }{2p}\right) \), the above inequality becomes
If \(p/\epsilon > 2\), then \(t> \Phi ^{-1}(3/4)>1/\sqrt{2\pi }\). So it is easy to obtain
and then \(t < \sqrt{2\log (2p/\epsilon )}\). Hence, \(\Phi ^{-1}(1-\frac{\epsilon }{2p})=O(\sqrt{\log (2p/\epsilon )})\) and
If \(s(\sqrt{n})^{-1}\sqrt{\log (2p/\epsilon )}\rightarrow 0\), then \(\lambda s \le \frac{c(1-c)\rho }{4c_0R}\). \(\square \)
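The bound used above, \(\Phi ^{-1}\left( 1-\frac{\epsilon }{2p}\right) < \sqrt{2\log (2p/\epsilon )}\) for \(p/\epsilon > 2\), is easy to verify numerically; a quick sketch (the values of \(p\) and \(\epsilon \) are arbitrary):

```python
import math
from scipy.stats import norm

def check_quantile_bound(p, eps):
    """Return (Phi^{-1}(1 - eps/(2p)), sqrt(2 log(2p/eps)))."""
    t = norm.ppf(1.0 - eps / (2.0 * p))
    bound = math.sqrt(2.0 * math.log(2.0 * p / eps))
    return t, bound

for p in (10, 100, 10_000, 10**6):
    t, bound = check_quantile_bound(p, eps=0.05)
    print(p, t, bound)  # t stays below the bound at every p
```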
Yin, Z. Variable selection for sparse logistic regression. Metrika 83, 821–836 (2020). https://doi.org/10.1007/s00184-020-00764-4