GSDAR: a fast Newton algorithm for \(\ell _0\) regularized generalized linear models with statistical guarantee

Abstract

We propose a fast Newton algorithm for \(\ell _0\) regularized high-dimensional generalized linear models based on support detection and root finding. We refer to the proposed method as GSDAR. GSDAR is developed based on the KKT conditions for \(\ell _0\)-penalized maximum likelihood estimators and generates a sequence of solutions of the KKT system iteratively. We show that GSDAR can be equivalently formulated as a generalized Newton algorithm. Under a restricted invertibility condition on the likelihood function and a sparsity condition on the regression coefficient, we establish an explicit upper bound on the estimation errors of the solution sequence generated by GSDAR in supremum norm and show that it achieves the optimal order in finite iterations with high probability. Moreover, we show that the oracle estimator can be recovered with high probability if the target signal is above the detectable level. These results directly concern the solution sequence generated from the GSDAR algorithm, instead of a theoretically defined global solution. We conduct simulations and real data analysis to illustrate the effectiveness of the proposed method.
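
For concreteness, the following is a minimal sketch of the support-detection-and-root-finding iteration in the logistic case, written in Python/NumPy. It is an illustration, not the authors' implementation: the function name gsdar_logistic, the damping constant ridge, and the fixed inner Newton loop are our own choices, and the exact root-finding step of Algorithm 1 is approximated by a few Newton iterations on the active set.

```python
import numpy as np

def gsdar_logistic(X, y, T, max_iter=50, newton_steps=20, ridge=1e-8):
    """Sketch of GSDAR for logistic regression: alternate support detection
    (top-T entries of beta + d) with root finding ([grad L]_A = 0 on the
    active set A) until the active set stabilizes."""
    n, p = X.shape
    beta = np.zeros(p)
    d = X.T @ (y - 0.5) / n            # d = -grad L(0) for the logistic loss
    A_prev = np.array([], dtype=int)
    for _ in range(max_iter):
        A = np.sort(np.argsort(-np.abs(beta + d))[:T])   # support detection
        if np.array_equal(A, A_prev):                    # active set stabilized
            break
        bA = np.zeros(T)
        for _ in range(newton_steps):                    # root finding on A
            mu = 1.0 / (1.0 + np.exp(-(X[:, A] @ bA)))   # fitted probabilities
            g = X[:, A].T @ (mu - y) / n                 # [grad L]_A
            W = mu * (1.0 - mu)
            H = X[:, A].T @ (X[:, A] * W[:, None]) / n + ridge * np.eye(T)
            bA -= np.linalg.solve(H, g)
        beta = np.zeros(p)
        beta[A] = bA
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        d = X.T @ (y - mu) / n                           # d = -grad L(beta)
        d[A] = 0.0                                       # gradient is zero on A
        A_prev = A
    return beta
```

Here \(T\) plays the role of the sparsity level; the theory below assumes \(K\le T\), where \(K=\Vert \varvec{\beta }^{*}\Vert _{0}\).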

References

  • Abramovich F, Benjamini Y, Donoho DL, Johnstone IM (2006) Adapting to unknown sparsity by controlling the false discovery rate. Ann Stat 34(2):584–653

  • Birgé L, Massart P (2001) Gaussian model selection. J Eur Math Soc 3(3):203–268

  • Bogdan M, Ghosh JK, Doerge R (2004) Modifying the Schwarz Bayesian information criterion to locate multiple interacting quantitative trait loci. Genetics 167(2):989–999

  • Bogdan M, Ghosh JK, Żak-Szatkowska M (2008) Selecting explanatory variables with the modified version of the Bayesian information criterion. Qual Reliab Eng Int 24(6):627–641

  • Breheny P, Huang J (2011) Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann Appl Stat 5(1):232

  • Breheny P, Huang J (2015) Group descent algorithms for nonconvex penalized linear and logistic regression models with grouped predictors. Stat Comput 25(2):173–187

  • Chen J, Chen Z (2008) Extended Bayesian information criteria for model selection with large model spaces. Biometrika 95(3):759–771

  • Chen J, Chen Z (2012) Extended BIC for small-n-large-P sparse GLM. Stat Sinica 22:555–574

  • Chen X, Nashed Z, Qi L (2000) Smoothing methods and semismooth methods for nondifferentiable operator equations. SIAM J Numer Anal 38(4):1200–1216

  • Chen X, Ge D, Wang Z, Ye Y (2014) Complexity of unconstrained \(L_2\)-\(L_p\) minimization. Math Program 143(1–2):371–383

  • Dolejsi E, Bodenstorfer B, Frommlet F (2014) Analyzing genome-wide association studies with an FDR controlling modification of the Bayesian information criterion. PLoS ONE 9(7):e103322

  • Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96(456):1348–1360

  • Fan J, Lv J (2008) Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc: Ser B (Stat Methodol) 70(5):849–911

  • Friedman J, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1

  • Frommlet F, Nuel G (2016) An adaptive ridge procedure for \(\ell _0\) regularization. PLoS ONE 11(2):e0148620

  • Frommlet F, Ruhaltinger F, Twaróg P, Bogdan M (2012) Modified versions of Bayesian information criterion for genome-wide association studies. Comput Stat Data Anal 56(5):1038–1051

  • Frommlet F, Bogdan M, Ramsey D (2016) Phenotypes and genotypes. Springer, Berlin

  • Ge J, Li X, Jiang H, Liu H, Zhang T, Wang M, Zhao T (2019) Picasso: A sparse learning library for high dimensional data analysis in R and Python. J Mach Learn Res 20(44):1–5

  • Huang J, Jiao Y, Liu Y, Lu X (2018) A constructive approach to \(\ell _0\) penalized regression. J Mach Learn Res 19(1):403–439

  • Huang J, Jiao Y, Jin B, Liu J, Lu X, Yang C (2021) A unified primal dual active set algorithm for nonconvex sparse recovery. Stat Sci (to appear). arXiv:1310.1147

  • Jiao Y, Jin B, Lu X (2017) Group sparse recovery via the \(\ell ^0(\ell ^2)\) penalty: theory and algorithm. IEEE Trans Signal Process 65(4):998–1012

  • Li X, Yang L, Ge J, Haupt J, Zhang T, Zhao T (2017) On quadratic convergence of DC proximal Newton algorithm in nonconvex sparse learning. In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30, Curran Associates, Inc

  • Loh P-L, Wainwright MJ (2015) Regularized m-estimators with nonconvexity: statistical and algorithmic theory for local optima. J Mach Learn Res 16:559–616

  • Louizos C, Welling M, Kingma DP (2018) Learning sparse neural networks through \(L_0\) regularization. In: International conference on learning representations, pp 1–13. URL https://openreview.net/forum?id=H1Y8hhg0b

  • Luo S, Chen Z (2014) Sequential lasso cum EBIC for feature selection with ultra-high dimensional feature space. J Am Stat Assoc 109(507):1229–1240

  • Ma R, Miao J, Niu L, Zhang P (2019) Transformed \(\ell _1\) regularization for learning sparse deep neural networks. URL https://arxiv.org/abs/1901.01021

  • McCullagh P, Nelder JA (2019) Generalized linear models. Routledge, Oxfordshire

  • Meier L, Van De Geer S, Bühlmann P (2008) The group lasso for logistic regression. J R Stat Soc: Ser B (Stat Methodol) 70(1):53–71

  • Natarajan BK (1995) Sparse approximate solutions to linear systems. SIAM J Comput 24(2):227–234

  • Nelder JA, Wedderburn RW (1972) Generalized linear models. J R Stat Soc: Ser A (General) 135(3):370–384

  • Nesterov Y (2013) Gradient methods for minimizing composite functions. Math Program 140(1):125–161

  • Park MY, Hastie T (2007) L1-regularization path algorithm for generalized linear models. J R Stat Soc: Ser B (Stat Methodol) 69(4):659–677

  • Qi L (1993) Convergence analysis of some algorithms for solving nonsmooth equations. Math Oper Res 18(1):227–244

  • Qi L, Sun J (1993) A nonsmooth version of Newton’s method. Math Program 58(1–3):353–367

  • Rockafellar RT, Wets RJ-B (2009) Variational analysis. Springer Science & Business Media, Berlin

  • Scardapane S, Comminiello D, Hussain A, Uncini A (2017) Group sparse regularization for deep neural networks. Neurocomputing 241:81–89

  • Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464

  • Shen J, Li P (2017) On the iteration complexity of support recovery via hard thresholding pursuit. In: Proceedings of the 34th international conference on machine learning-volume 70, pp 3115–3124

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc: Ser B (Methodological) 58(1):267–288

  • Van de Geer SA (2008) High-dimensional generalized linear models and the lasso. Ann Stat 36(2):614–645

  • Wang L, Kim Y, Li R (2013) Calibrating non-convex penalized regression in ultra-high dimension. Ann Stat 41(5):2505

  • Wang R, Xiu N, Zhou S (2019) Fast Newton method for sparse logistic regression. arXiv preprint arXiv:1901.02768

  • Wang Z, Liu H, Zhang T (2014) Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. Ann Stat 42(6):2164

  • Ye F, Zhang C-H (2010) Rate minimaxity of the Lasso and Dantzig selector for the \(\ell _q\) loss in \(\ell _r\) balls. J Mach Learn Res 11(Dec):3519–3540

  • Yuan X-T, Li P, Zhang T (2017) Gradient hard thresholding pursuit. J Mach Learn Res 18:166

  • Żak-Szatkowska M, Bogdan M (2011) Modified versions of the Bayesian information criterion for sparse generalized linear models. Comput Stat Data Anal 55(11):2908–2924

  • Zhang C-H (2010) Nearly unbiased variable selection under minimax concave penalty. Ann Stat 38(2):894–942

  • Zhang C-H, Zhang T (2012) A general theory of concave regularization for high-dimensional sparse estimation problems. Stat Sci 27(4):576–593

  • Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc: Ser B (Stat Methodol) 67(2):301–320

Acknowledgements

We wish to thank two anonymous reviewers for their constructive and helpful comments that led to significant improvements in the paper. The work of J. Huang is supported in part by the U.S. National Science Foundation grant DMS-1916199. The work of Y. Jiao is supported in part by the National Science Foundation of China grant 11871474 and by the research fund of KLATASDSMOE of China. The research of J. Liu is supported by Duke-NUS Graduate Medical School WBS: R-913-200-098-263 and MOE2016-T2-2-029 from the Ministry of Education, Singapore. The work of Y. Liu is supported in part by the National Science Foundation of China grant 11971362. The work of X. Lu is supported by National Science Foundation of China grants 11471253 and 91630313.

Author information

Correspondence to Jian Huang or Yanyan Liu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

In this appendix, we prove Lemma 1, Proposition 1, and Theorems 1 and 2.

1.1 Proof of Lemma 1

Proof

Let \(L_\lambda (\varvec{\beta })={\mathcal {L}}(\varvec{\beta })+\lambda \Vert \varvec{\beta }\Vert _0\). Assume \(\widehat{\varvec{\beta }}\) is a global minimizer of \(L_\lambda (\varvec{\beta })\). Then by Theorem 10.1 in Rockafellar and Wets (2009), we have

$$\begin{aligned} \mathbf{0} \in \nabla {\mathcal {L}} (\widehat{\varvec{\beta }}) + \lambda \partial \Vert \widehat{\varvec{\beta }}\Vert _{0}, \end{aligned}$$
(7)

where \(\partial \Vert \widehat{\varvec{\beta }}\Vert _{0}\) denotes the limiting subdifferential (Definition 8.3 in Rockafellar and Wets (2009)) of \( \Vert \cdot \Vert _{0}\) at \(\widehat{\varvec{\beta }}\). Let \(\widehat{\mathbf{d}}= -\nabla {\mathcal {L}}(\widehat{\varvec{\beta }})\) and define \(G(\varvec{\beta }) = \frac{1}{2}\Vert \varvec{\beta }- (\widehat{\varvec{\beta }}+\widehat{\mathbf{d}})\Vert ^2 + \lambda \Vert \varvec{\beta }\Vert _0\). We recall that, by the definition of the limiting subdifferential, \(\partial \Vert \widehat{\varvec{\beta }}\Vert _{0}\) satisfies \(\Vert \varvec{\beta }\Vert _0\ge \Vert \widehat{\varvec{\beta }}\Vert _0+\langle \partial \Vert \widehat{\varvec{\beta }}\Vert _{0}, \varvec{\beta }-\widehat{\varvec{\beta }}\rangle +o(\Vert \varvec{\beta }-\widehat{\varvec{\beta }}\Vert )\) for any \(\varvec{\beta }\in {\mathbb {R}}^p\). Hence (7) is equivalent to

$$\begin{aligned} \mathbf{0} \in \widehat{\varvec{\beta }}-(\widehat{\varvec{\beta }}+\widehat{\mathbf{d}}) + \lambda \partial \Vert \widehat{\varvec{\beta }}\Vert _{0}. \end{aligned}$$

Moreover, \({\widetilde{\varvec{\beta }}}\) being a minimizer of \(G(\varvec{\beta })\) is equivalent to \(\mathbf{0} \in \partial G({\widetilde{\varvec{\beta }}})\). Obviously, \(\widehat{\varvec{\beta }}\) satisfies \(\mathbf{0} \in \partial G(\widehat{\varvec{\beta }})\), so \(\widehat{\varvec{\beta }}\) is a KKT point of \( G(\varvec{\beta })\). Then \(\widehat{\varvec{\beta }}=H_{\lambda }(\widehat{\varvec{\beta }}+\widehat{\mathbf{d}})\) follows from the fact that the KKT points of \(G\) coincide with its coordinate-wise minimizers (Huang et al. 2021). Conversely, suppose \(\widehat{\varvec{\beta }}\) and \(\widehat{\mathbf{d}}\) satisfy (2); we show that \(\widehat{\varvec{\beta }}\) is then a local minimizer of \(L_{\lambda }(\varvec{\beta })\). Take \(\text{ h }\) small enough with \(\Vert \text{ h }\Vert _{\infty }<\sqrt{2\lambda }\). We show \(L_{\lambda }(\widehat{\varvec{\beta }}+\text{ h})\ge L_{\lambda }(\widehat{\varvec{\beta }})\) in two cases.

First, we denote

$$\begin{aligned} {\widehat{A}}=\{i:|{\widehat{\beta }}_{i}+{\widehat{d}}_{i}|\ge \sqrt{2\lambda }\},\quad {\widehat{I}}=\{i:|{\widehat{\beta }}_{i}+{\widehat{d}}_{i}|<\sqrt{2\lambda }\}. \end{aligned}$$

By the definition of \(H_{\lambda }(\cdot )\) and (2), we conclude that \(|{\widehat{\beta }}_{i}|\ge \sqrt{2\lambda }\) for \(i \in {\widehat{A}}\) and \(\widehat{\varvec{\beta }}_{{\widehat{I}}}=0\). It follows that \(\text{ supp }(\widehat{\varvec{\beta }})={\widehat{A}}\). Moreover, we also have \(\widehat{\mathbf{d}}_{{\widehat{A}}}=[-\nabla {\mathcal {L}}(\widehat{\varvec{\beta }})]_{{\widehat{A}}}=0\), which is equivalent to \(\widehat{\varvec{\beta }}_{{\widehat{A}}}\in \underset{\varvec{\beta }_{{\widehat{A}}}}{\text{ argmin }}~\widetilde{{\mathcal {L}}}(\varvec{\beta }_{{\widehat{A}}})\).

Case 1: \(\text{ h}_{{\widehat{I}}}\ne 0\).

$$\begin{aligned}&\displaystyle \Vert \widehat{\varvec{\beta }}+\text{ h }\Vert _{0}=\Vert \widehat{\varvec{\beta }}_{{\widehat{A}}}+\text{ h}_{{\widehat{A}}}\Vert _{0}+\Vert \text{ h}_{{\widehat{I}}}\Vert _{0}, \\&\displaystyle \lambda \Vert \widehat{\varvec{\beta }}+\text{ h }\Vert _{0}-\lambda \Vert \widehat{\varvec{\beta }}\Vert _{0}=\lambda \Vert \widehat{\varvec{\beta }}_{{\widehat{A}}}+\text{ h}_{{\widehat{A}}}\Vert _{0} +\lambda \Vert \text{ h}_{{\widehat{I}}}\Vert _{0}-\lambda \Vert \widehat{\varvec{\beta }}_{{\widehat{A}}}\Vert _{0}. \end{aligned}$$

Because \(|{\widehat{\beta }}_{i}|\ge \sqrt{2\lambda }\) for \(i\in {{\widehat{A}}}\), \(\Vert \text{ h }\Vert _{\infty }<\sqrt{2\lambda }\), and \(\Vert \text{ h}_{{\widehat{I}}}\Vert _{0}\ge 1\) (since \(\text{ h}_{{\widehat{I}}}\ne 0\)), we have

$$\begin{aligned}&\lambda \Vert \widehat{\varvec{\beta }}_{{\widehat{A}}}+\text{ h}_{{\widehat{A}}}\Vert _{0}-\lambda \Vert \widehat{\varvec{\beta }}_{{\widehat{A}}}\Vert _{0}=0, \\&\lambda \Vert \widehat{\varvec{\beta }}+\text{ h }\Vert _{0}-\lambda \Vert \widehat{\varvec{\beta }}\Vert _{0}=\lambda \Vert \text{ h}_{{\widehat{I}}}\Vert _{0}\ge \lambda . \end{aligned}$$

Therefore, we get

$$\begin{aligned}&L_{\lambda }(\widehat{\varvec{\beta }}+\text{ h})-L_{\lambda }(\widehat{\varvec{\beta }})\\&\quad =\sum _{i=1}^{n}[c(\text{ x}_{i}^{T}(\widehat{\varvec{\beta }}+\text{ h}))-c(\text{ x}_{i}^{T}\widehat{\varvec{\beta }})]-\text{ y}^T\mathbf{X}\text{ h }+\lambda \Vert \text{ h}_{{\widehat{I}}}\Vert _{0}\\&\quad \ge \sum _{i=1}^{n}[c(\text{ x}_{i}^{T}(\widehat{\varvec{\beta }}+\text{ h}))-c(\text{ x}_{i}^{T}\widehat{\varvec{\beta }})]-\text{ y}^T\mathbf{X}\text{ h }+\lambda \\&\quad >0. \end{aligned}$$

Let \(m(\text{ h})=\sum _{i=1}^{n}[c(\text{ x}_{i}^{T}(\widehat{\varvec{\beta }}+\text{ h}))-c(\text{ x}_{i}^{T}\widehat{\varvec{\beta }})]-\text{ y}^{T}\mathbf{X}\text{ h }\). Then \(m\) is a continuous function of \(\text{ h }\) with \(m(\mathbf{0})=0\), so \(m(\text{ h})+\lambda >0\) for \(\text{ h }\) small enough. Thus the last inequality holds.

Case 2: \(\text{ h}_{{\widehat{I}}}=0\).

$$\begin{aligned} \lambda \Vert \widehat{\varvec{\beta }}+\text{ h }\Vert _{0}-\lambda \Vert \widehat{\varvec{\beta }}\Vert _{0}=\lambda \Vert \widehat{\varvec{\beta }}_{{\widehat{A}}}+\text{ h}_{{\widehat{A}}}\Vert _{0}-\lambda \Vert \widehat{\varvec{\beta }}_{{\widehat{A}}}\Vert _{0}. \end{aligned}$$

Since \(|{\widehat{\beta }}_{i}|\ge \sqrt{2\lambda }\) for \(i\in {{\widehat{A}}}\) and \(\Vert \text{ h}_{{\widehat{A}}}\Vert _{\infty }<\sqrt{2\lambda }\), we have

$$\begin{aligned} \lambda \Vert \widehat{\varvec{\beta }}+\text{ h }\Vert _{0}-\lambda \Vert \widehat{\varvec{\beta }}\Vert _{0}=\lambda \Vert \widehat{\varvec{\beta }}_{{\widehat{A}}}+\text{ h}_{{\widehat{A}}}\Vert _{0}-\lambda \Vert \widehat{\varvec{\beta }}_{{\widehat{A}}}\Vert _{0}=0, \end{aligned}$$

and

$$\begin{aligned}&L_{\lambda }(\widehat{\varvec{\beta }}+\text{ h})-L_{\lambda }(\widehat{\varvec{\beta }})\\&\quad =\sum _{i=1}^{n}[c(\text{ x}_{i}^{T}(\widehat{\varvec{\beta }}+\text{ h}))-c(\text{ x}_{i}^{T}\widehat{\varvec{\beta }})]-\text{ y}^T\mathbf{X}\text{ h }\\&\quad =\sum _{i=1}^{n}[c(\text{ x}_{i({\widehat{A}})}^{T}(\widehat{\varvec{\beta }}_{{\widehat{A}}}+\text{ h}_{{\widehat{A}}}))-c(\text{ x}_{i({\widehat{A}})}^{T}\widehat{\varvec{\beta }}_{{\widehat{A}}})]-\text{ y}^T\mathbf{X}_{{\widehat{A}}}\text{ h}_{{\widehat{A}}}\\&\quad =\sum _{i=1}^{n}[c(\text{ x}_{i({\widehat{A}})}^{T}(\widehat{\varvec{\beta }}_{{\widehat{A}}}+\text{ h}_{{\widehat{A}}}))]-\text{ y}^T\mathbf{X}_{{\widehat{A}}}(\widehat{\varvec{\beta }}_{{\widehat{A}}}+ \text{ h}_{{\widehat{A}}})\\&\qquad -\sum _{i=1}^{n}[c(\text{ x}_{i({\widehat{A}})}^{T}\widehat{\varvec{\beta }}_{{\widehat{A}}})]+\text{ y}^T\mathbf{X}_{{\widehat{A}}}\widehat{\varvec{\beta }}_{{\widehat{A}}}\\&\quad \ge 0. \end{aligned}$$

Since \(\widehat{\varvec{\beta }}_{{\widehat{A}}}\in \underset{\varvec{\beta }_{{\widehat{A}}}}{\text{ argmin }}~\widetilde{{\mathcal {L}}}(\varvec{\beta }_{{\widehat{A}}})\), the last inequality holds. In summary, \(\widehat{\varvec{\beta }}\) is a local minimizer of \(L_{\lambda }(\varvec{\beta })\). \(\square \)
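
To make the role of the hard-thresholding operator concrete, here is a small numerical sketch of \(H_{\lambda }\) and of the fixed-point (KKT) check used in Lemma 1; the helper names hard_threshold and kkt_residual and the gradient oracle grad_L are our own.

```python
import numpy as np

def hard_threshold(z, lam):
    """H_lambda(z): keep entries with |z_i| >= sqrt(2*lam); zero the rest."""
    return np.where(np.abs(z) >= np.sqrt(2.0 * lam), z, 0.0)

def kkt_residual(beta_hat, grad_L, lam):
    """Sup-norm gap in the fixed-point form of (2):
    beta_hat = H_lambda(beta_hat + d), where d = -grad_L(beta_hat)."""
    d = -grad_L(beta_hat)
    return np.max(np.abs(beta_hat - hard_threshold(beta_hat + d, lam)))
```

A zero residual certifies that \((\widehat{\varvec{\beta }}, \widehat{\mathbf{d}})\) satisfies the KKT system (2).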

1.2 Proof of Proposition 1

Proof

Denote \(D^{k}=-\left( {\mathcal {H}}^{k}\right) ^{-1} F\left( \text{ w}^{k}\right) \). Then

$$\begin{aligned} \text{ w}^{k+1}=\text{ w}^{k}-\left( {\mathcal {H}}^{k}\right) ^{-1} F\left( \text{ w}^{k}\right) \end{aligned}$$

can be recast as

$$\begin{aligned}&{\mathcal {H}}^{k} D^{k}=-F\left( \text{ w}^{k}\right) , \end{aligned}$$
(8)
$$\begin{aligned}&\text{ w}^{k+1}=\text{ w}^{k}+D^{k}. \end{aligned}$$
(9)

Partition \(\text{ w}^{k}, D^{k}\) and \(F\left( \text{ w}^{k}\right) \) according to \(A^{k}\) and \(I^{k}\) such that

$$\begin{aligned}&\text{ w}^{k}=\left( \begin{array}{c} \varvec{\beta }_{A^{k}}^{k} \\ \varvec{\beta }_{I^{k}}^{k} \\ \mathbf{d}_{A^{k}}^{k} \\ \mathbf{d}_{I^{k}}^{k} \end{array}\right) , \quad D^{k}=\left( \begin{array}{c} D_{A^{k}}^{\varvec{\beta }} \\ D_{I^{k}}^{\varvec{\beta }} \\ D_{A^{k}}^{\mathbf{d}} \\ D_{I^{k}}^{\mathbf{d}} \end{array}\right) , \end{aligned}$$
(10)
$$\begin{aligned}&F\left( \text{ w}^{k}\right) = \left[ \begin{array}{c} -\mathbf{d}_{A^{k}}^{k} \\ \varvec{\beta }_{I^{k}}^{k} \\ n[\nabla {\mathcal {L}}(\varvec{\beta }^k)]_{A^k}+n\mathbf{d}^k_{A^k} \\ n[\nabla {\mathcal {L}}(\varvec{\beta }^k)]_{I^k}+n\mathbf{d}^k_{I^k} \end{array} \right] . \end{aligned}$$
(11)

Substituting (10), (11) and \({\mathcal {H}}^{k}\) into (8), we have

$$\begin{aligned}&\displaystyle \left( \mathbf{d}_{A^{k}}^{k}+D_{A^{k}}^{\mathbf{d}}\right) ={\mathbf {0}}_{A^{k}}, \end{aligned}$$
(12)
$$\begin{aligned}&\displaystyle \varvec{\beta }_{I^{k}}^{k}+D_{I^{k}}^{\varvec{\beta }} ={\mathbf {0}}_{I^{k}}, \end{aligned}$$
(13)
$$\begin{aligned}&\displaystyle n\nabla ^2{\mathcal {L}}(\varvec{\beta }^k)\left( \begin{array}{c} D_{A^{k}}^{\varvec{\beta }} \\ D_{I^{k}}^{\varvec{\beta }} \end{array}\right) +n\left( \begin{array}{c} D_{A^{k}}^{\mathbf{d}} \\ D_{I^{k}}^{\mathbf{d}} \end{array}\right) =n[\nabla {\mathcal {L}}(\varvec{\beta }^k)]+n\mathbf{d}^k. \end{aligned}$$
(14)

It follows from (9) that

$$\begin{aligned} \left( \begin{array}{c} \varvec{\beta }_{A^{k}}^{k+1} \\ \varvec{\beta }_{I^{k}}^{k+1} \\ \mathbf{d}_{A^{k}}^{k+1} \\ \mathbf{d}_{I^{k}}^{k+1} \end{array}\right) =\left( \begin{array}{c} \varvec{\beta }_{A^{k}}^{k}+D_{A^{k}}^{\varvec{\beta }} \\ \varvec{\beta }_{I^{k}}^{k}+D_{I^{k}}^{\varvec{\beta }} \\ \mathbf{d}_{A^{k}}^{k}+D_{A^{k}}^{\mathbf{d}} \\ \mathbf{d}_{I^{k}}^{k}+D_{I^{k}}^{\mathbf{d}} \end{array}\right) . \end{aligned}$$
(15)

Substituting (15) into (12)–(14), we get (4) of Algorithm 1. This completes the proof. \(\square \)
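
The partitioned update can be read off directly from (12)–(14): one linear solve determines \(\varvec{\beta }^{k+1}\) on \(A^{k}\) (it vanishes on \(I^{k}\)), and the remaining rows determine \(\mathbf{d}^{k+1}\) on \(I^{k}\). The sketch below implements one such step; grad_L and hess_L are assumed gradient and Hessian oracles, and the sign conventions follow the rearrangement \(\nabla ^2{\mathcal {L}}(\varvec{\beta }^k)(\varvec{\beta }^{k+1}-\varvec{\beta }^k)+\mathbf{d}^{k+1}=-\nabla {\mathcal {L}}(\varvec{\beta }^k)\), which may differ from \(F\) and \({\mathcal {H}}^{k}\) above by an overall sign.

```python
import numpy as np

def partitioned_newton_step(beta, d, grad_L, hess_L, A):
    """One generalized Newton step in the partitioned form (12)-(15):
    beta^{k+1} vanishes on I^k, a linear solve gives it on A^k, and the
    rows indexed by I^k define d^{k+1} there (d^{k+1} vanishes on A^k)."""
    p = beta.size
    I = np.setdiff1d(np.arange(p), A)        # inactive set I^k
    g, G = grad_L(beta), hess_L(beta)        # gradient and Hessian at beta^k
    beta_new = np.zeros(p)
    # Rows in A^k of G(beta_new - beta) + d_new = -g, with d_new zero on A^k:
    rhs = G[np.ix_(A, A)] @ beta[A] + G[np.ix_(A, I)] @ beta[I] - g[A]
    beta_new[A] = np.linalg.solve(G[np.ix_(A, A)], rhs)
    # Rows in I^k recover the new dual variable on the inactive set:
    d_new = np.zeros(p)
    d_new[I] = -g[I] - G[I, :] @ (beta_new - beta)
    return beta_new, d_new
```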

1.3 Preparatory lemmas

The proofs of Theorems 1 and 2 are built on the following lemmas.

Lemma 2

Assume (C1) holds and \(\Vert \varvec{\beta }^{*}\Vert _{0}=K\le T\). Denote \(B^k = A^{k}\backslash A^{k-1}\). Then,

$$\begin{aligned} \Vert \nabla _{B^k}{\mathcal {L}}(\varvec{\beta }^{k})\Vert _1\Vert \nabla _{B^k}{\mathcal {L}}(\varvec{\beta }^{k})\Vert _{\infty }\ge 2L\zeta [{\mathcal {L}}(\varvec{\beta }^k)-{\mathcal {L}}(\varvec{\beta }^*)], \end{aligned}$$

where \(\zeta =\frac{|B^k|}{|B^k|+|A^*\backslash A^{k-1}|}\).

Proof

Obviously, the lemma holds if \(A^{k}=A^{k-1}\) or \({\mathcal {L}}(\varvec{\beta }^k)\le {\mathcal {L}}(\varvec{\beta }^*)\), so we prove it under the assumptions \(A^{k}\ne A^{k-1}\) and \({\mathcal {L}}(\varvec{\beta }^k)>{\mathcal {L}}(\varvec{\beta }^*)\). Condition (C1) implies

$$\begin{aligned}&{\mathcal {L}}(\varvec{\beta }^{*})-{\mathcal {L}}(\varvec{\beta }^k)-\langle \nabla {\mathcal {L}}(\varvec{\beta }^k),{\varvec{\beta }^*-\varvec{\beta }^k}\rangle \\&\quad \ge \frac{L}{2}\big \Vert \varvec{\beta }^*-\varvec{\beta }^k\big \Vert _1\big \Vert \varvec{\beta }^*-\varvec{\beta }^k\big \Vert _{\infty }. \end{aligned}$$

Hence,

$$\begin{aligned}&\langle -\nabla {\mathcal {L}}(\varvec{\beta }^k),{\varvec{\beta }^*-\varvec{\beta }^k}\rangle \\&\quad =\langle \nabla {\mathcal {L}}(\varvec{\beta }^k),-\varvec{\beta }^*\rangle \\&\quad \ge \frac{L}{2}\big \Vert \varvec{\beta }^*-\varvec{\beta }^k\big \Vert _1\big \Vert \varvec{\beta }^*-\varvec{\beta }^k\big \Vert _{\infty }+{\mathcal {L}}(\varvec{\beta }^k)-{\mathcal {L}}(\varvec{\beta }^{*})\\&\quad \ge \sqrt{2L}\sqrt{\big \Vert \varvec{\beta }^*-\varvec{\beta }^k\big \Vert _1\big \Vert \varvec{\beta }^*-\varvec{\beta }^k\big \Vert _{\infty }}\sqrt{{\mathcal {L}}(\varvec{\beta }^k)-{\mathcal {L}}(\varvec{\beta }^{*})}.\\ \end{aligned}$$

From the definitions of \(A^{k}\) and \(A^*\), \(B^k\) consists of the \(|B^k|\) largest elements (in absolute value) of \(\nabla {\mathcal {L}}(\varvec{\beta }^k)\), and \(\text{ supp }(\nabla {\mathcal {L}}(\varvec{\beta }^k))\bigcap \text{ supp }(\varvec{\beta }^*)=A^{*}\backslash A^{k-1}\). Thus, we have

$$\begin{aligned} \langle \nabla {\mathcal {L}}(\varvec{\beta }^k),-\varvec{\beta }^*\rangle\le & {} \frac{1}{\sqrt{\zeta }}\Vert \nabla _{B^k}{\mathcal {L}}(\varvec{\beta }^k)\Vert _2\Vert \varvec{\beta }_{A^*\backslash A^{k-1}}^{*}\Vert _2\\= & {} \frac{1}{\sqrt{\zeta }}\Vert \nabla _{B^k}{\mathcal {L}}(\varvec{\beta }^k)\Vert _2\Vert (\varvec{\beta }^{*}-\varvec{\beta }^k)_{A^*\backslash A^{k-1}}\Vert _2\\\le & {} \frac{1}{\sqrt{\zeta }}\Vert \nabla _{B^k}{\mathcal {L}}(\varvec{\beta }^k)\Vert _2\Vert \varvec{\beta }^{*}-\varvec{\beta }^k\Vert _2\\\le & {} \frac{1}{\sqrt{\zeta }}\sqrt{\Vert \nabla _{B^k}{\mathcal {L}}(\varvec{\beta }^k)\Vert _1\Vert \nabla _{B^k}{\mathcal {L}}(\varvec{\beta }^k)\Vert _{\infty }}\\&\times \sqrt{\Vert \varvec{\beta }^{*}-\varvec{\beta }^k\Vert _1\Vert \varvec{\beta }^{*}-\varvec{\beta }^k\Vert _{\infty }}. \end{aligned}$$

Therefore,

$$\begin{aligned}&{\sqrt{2L}\sqrt{{\mathcal {L}}(\varvec{\beta }^k)-{\mathcal {L}}(\varvec{\beta }^{*})}}\\&\quad \le \frac{1}{\sqrt{\zeta }} \sqrt{\Vert \nabla _{B^k}{\mathcal {L}}(\varvec{\beta }^k)\Vert _1\Vert \nabla _{B^{k}} {\mathcal {L}}(\varvec{\beta }^k)\Vert _{\infty }}. \end{aligned}$$

In summary,

$$\begin{aligned} \Vert \nabla _{B^k}{\mathcal {L}}(\varvec{\beta }^{k})\Vert _1\Vert \nabla _{B^k}{\mathcal {L}}(\varvec{\beta }^{k})\Vert _{\infty }\ge 2L\zeta [{\mathcal {L}}(\varvec{\beta }^k)-{\mathcal {L}}(\varvec{\beta }^*)]. \end{aligned}$$

\(\square \)

Lemma 3

Assume (C1) holds with \(0<U<\frac{1}{T}\), and \(K\le T\) in Algorithm 1. Then before Algorithm 1 terminates,

$$\begin{aligned} {\mathcal {L}}(\varvec{\beta }^{k+1})-{\mathcal {L}}(\varvec{\beta }^*)\le \xi [{\mathcal {L}}(\varvec{\beta }^k)-{\mathcal {L}}(\varvec{\beta }^*)], \end{aligned}$$

where \(\xi =1-\frac{2L(1-TU)}{T(1+K)}\in (0,1)\).

Proof

Let \(\Delta ^{k}=\varvec{\beta }^k-\nabla {\mathcal {L}}(\varvec{\beta }^{k})\). Condition (C1) implies

$$\begin{aligned}&{\mathcal {L}}(\Delta ^{k+1}|_{A^{k+1}})-{\mathcal {L}}(\varvec{\beta }^{k+1})\le \langle \nabla {\mathcal {L}}(\varvec{\beta }^{k+1}),\Delta ^{k+1}|_{A^{k+1}}-\varvec{\beta }^{k+1}\rangle \\&\quad \quad \quad \quad +\frac{U}{2}\big \Vert \Delta ^{k+1}|_{A^{k+1}}-\varvec{\beta }^{k+1}\big \Vert _1\big \Vert \Delta ^{k+1}|_{A^{k+1}}-\varvec{\beta }^{k+1}\big \Vert _{\infty }. \end{aligned}$$

On the one hand, by the definition of \(\varvec{\beta }^{k+1}\) and \(\nabla {\mathcal {L}}(\varvec{\beta }^{k+1})\), we have

$$\begin{aligned}&\langle \nabla {\mathcal {L}}(\varvec{\beta }^{k+1}), \Delta ^{k+1}|_{A^{k+1}}-\varvec{\beta }^{k+1}\rangle \\&\quad =\langle \nabla {\mathcal {L}}(\varvec{\beta }^{k+1}), \Delta ^{k+1}|_{A^{k+1}}\rangle \\&\quad =\langle \nabla _{A^{k+1}}{\mathcal {L}}(\varvec{\beta }^{k+1}), \Delta _{A^{k+1}}^{k+1}\rangle \\&\quad =\langle \nabla _{A^{k+1}\backslash A^{k}}{\mathcal {L}}(\varvec{\beta }^{k+1}), \Delta _{A^{k+1}\backslash A^{k}}^{k+1}\rangle . \end{aligned}$$

Further, we also have

$$\begin{aligned}&\big \Vert \Delta ^{k+1}|_{A^{k+1}}-\varvec{\beta }^{k+1}\big \Vert _1\\&\quad =\big \Vert \Delta ^{k+1}|_{A^{k+1}\backslash A^{k}}+\Delta ^{k+1}|_{A^{k+1}\bigcap A^{k}}\\&\quad \quad -\varvec{\beta }^{k+1}|_{A^{k+1}\bigcap A^{k}}-\varvec{\beta }^{k+1}|_{A^{k}\backslash A^{k+1}}\big \Vert _1\\&\quad =\big \Vert \Delta _{A^{k+1}\backslash A^{k}}^{k+1}\big \Vert _1+\big \Vert \Delta _{A^{k+1}\bigcap A^{k}}^{k+1}-\varvec{\beta }_{A^{k+1}\bigcap A^{k}}^{k+1}\big \Vert _1\\&\quad \quad +\big \Vert \varvec{\beta }_{A^{k}\backslash A^{k+1}}^{k+1}\big \Vert _1\\&\quad =\big \Vert \Delta _{A^{k+1}\backslash A^{k}}^{k+1}\big \Vert _1+\big \Vert \varvec{\beta }_{A^{k}\backslash A^{k+1}}^{k+1}\big \Vert _1, \end{aligned}$$

and

$$\begin{aligned}&\big \Vert \Delta ^{k+1}|_{A^{k+1}}-\varvec{\beta }^{k+1}\big \Vert _{\infty }\\&\quad =\big \Vert \Delta ^{k+1}|_{A^{k+1}\backslash A^{k}}+\Delta ^{k+1}|_{A^{k+1}\bigcap A^{k}}\\&\quad \quad -\varvec{\beta }^{k+1}|_{A^{k+1}\bigcap A^{k}}-\varvec{\beta }^{k+1}|_{A^{k}\backslash A^{k+1}}\big \Vert _{\infty }\\&\quad =\big \Vert \Delta _{A^{k+1}\backslash A^{k}}^{k+1}\big \Vert _{\infty }\bigvee \big \Vert \varvec{\beta }_{A^{k}\backslash A^{k+1}}^{k+1}\big \Vert _{\infty }, \end{aligned}$$

where \(a \bigvee b=\max \{a,b\}\). By the definition of \(A^k\), \(A^{k+1}\) and \(\varvec{\beta }^{k+1}\), we know that

$$\begin{aligned} |A^{k}\backslash A^{k+1}|=|A^{k+1}\backslash A^{k}|,\quad \Delta _{A^{k}\backslash A^{k+1}}^{k+1}=\varvec{\beta }_{A^{k}\backslash A^{k+1}}^{k+1}. \end{aligned}$$

By the definition of \(A^{k+1}\), we can conclude that

$$\begin{aligned}&\Vert \Delta _{A^{k}\backslash A^{k+1}}^{k+1}\Vert _1=\Vert \varvec{\beta }_{A^{k}\backslash A^{k+1}}^{k+1}\Vert _1\le \Vert \Delta _{A^{k+1}\backslash A^{k}}^{k+1}\Vert _1, \\&\Vert \Delta _{A^{k+1}\backslash A^{k}}^{k+1}\Vert _{\infty }\bigvee \Vert \varvec{\beta }_{A^{k}\backslash A^{k+1}}^{k+1}\Vert _{\infty }=\Vert \Delta _{A^{k+1}\backslash A^{k}}^{k+1}\Vert _{\infty }. \end{aligned}$$

Since \(-\nabla _{A^{k+1}\backslash A^{k}}{\mathcal {L}}(\varvec{\beta }^{k+1})=\Delta _{A^{k+1}\backslash A^{k}}^{k+1}\) and \(U<\frac{1}{T}\), we deduce that

$$\begin{aligned}&{{\mathcal {L}}(\Delta ^{k+1}|_{A^{k+1}})-{\mathcal {L}}(\varvec{\beta }^{k+1})}\\&\quad \le \langle \nabla _{A^{k+1}\backslash A^{k}}{\mathcal {L}}(\varvec{\beta }^{k+1}),\Delta _{A^{k+1}\backslash A^{k}}^{k+1}\rangle +U\big \Vert \Delta _{A^{k+1}\backslash A^{k}}^{k+1}\big \Vert _1\big \Vert \Delta _{A^{k+1}\backslash A^{k}}^{k+1}\big \Vert _{\infty }\\&\quad \le -(1/T-U)\big \Vert \nabla _{A^{k+1}\backslash A^{k}}{\mathcal {L}}(\varvec{\beta }^{k+1})\big \Vert _1 \times \big \Vert \nabla _{A^{k+1}\backslash A^{k}}{\mathcal {L}}(\varvec{\beta }^{k+1})\big \Vert _{\infty }. \end{aligned}$$

By the definition of \(\varvec{\beta }^{k+1}\), we have

$$\begin{aligned} {\mathcal {L}}(\varvec{\beta }^{k+1})-{\mathcal {L}}(\varvec{\beta }^{k})\le & {} {\mathcal {L}}(\Delta ^{k}|_{A^{k}})-{\mathcal {L}}(\varvec{\beta }^{k})\\\le & {} -(1/T-U)\big \Vert \nabla _{A^{k+1}\backslash A^{k}}{\mathcal {L}}(\varvec{\beta }^{k+1})\big \Vert _1 \times \big \Vert \nabla _{A^{k+1}\backslash A^{k}}{\mathcal {L}}(\varvec{\beta }^{k+1})\big \Vert _{\infty }. \end{aligned}$$

Moreover, \(\frac{|A^*\backslash A^{k-1}|}{|B^k|}\le K\). By Lemma 2, we have

$$\begin{aligned} {\mathcal {L}}(\varvec{\beta }^{k+1})-{\mathcal {L}}(\varvec{\beta }^{k})\le -\frac{2L(1-TU)}{T(1+K)}[{\mathcal {L}}(\varvec{\beta }^{k})-{\mathcal {L}}(\varvec{\beta }^{*})]. \end{aligned}$$

Therefore, we have

$$\begin{aligned} {\mathcal {L}}(\varvec{\beta }^{k+1})-{\mathcal {L}}(\varvec{\beta }^*)\le \xi [{\mathcal {L}}(\varvec{\beta }^k)-{\mathcal {L}}(\varvec{\beta }^*)], \end{aligned}$$

where \(\xi =1-\frac{2L(1-TU)}{T(1+K)}\in (0,1)\). \(\square \)

Lemma 4

Assume \({\mathcal {L}}\) satisfies (C1) and

$$\begin{aligned} {\mathcal {L}}(\varvec{\beta }^{k+1})-{\mathcal {L}}(\varvec{\beta }^{*})\le \xi [{\mathcal {L}}(\varvec{\beta }^{k})-{\mathcal {L}}(\varvec{\beta }^{*})] \end{aligned}$$

for all \(k\ge 0\). Then,

$$\begin{aligned} \begin{aligned} \Vert \varvec{\beta }^{k}-\varvec{\beta }^{*}\Vert _{\infty }&\le \sqrt{(K+T)(1+\frac{U}{L})}(\sqrt{\xi })^k\Vert \varvec{\beta }^0-\varvec{\beta }^*\Vert _{\infty }\\&\quad +\frac{2}{L}\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }. \end{aligned} \end{aligned}$$
(16)

Proof

If \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }< \frac{2\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }}{L}\), then (16) holds trivially, so we only consider the case \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }\ge \frac{2\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }}{L}\). On the one hand, since \({\mathcal {L}}\) satisfies (C1),

$$\begin{aligned}&{\mathcal {L}}(\varvec{\beta }^k)-{\mathcal {L}}(\varvec{\beta }^*)\\&\quad \ge \langle \nabla {\mathcal {L}}(\varvec{\beta }^*),\varvec{\beta }^k-\varvec{\beta }^*\rangle +\frac{L}{2}\big \Vert \varvec{\beta }^k-\varvec{\beta }^*\big \Vert _{1}\big \Vert \varvec{\beta }^k-\varvec{\beta }^*\big \Vert _{\infty }\\&\quad \ge -\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _1+\frac{L}{2}\big \Vert \varvec{\beta }^k-\varvec{\beta }^*\big \Vert _1\big \Vert \varvec{\beta }^k-\varvec{\beta }^*\big \Vert _{\infty }. \end{aligned}$$

Since \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }\ge \frac{2\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }}{L}\) and \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _1\ge \Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }\), we have

$$\begin{aligned} (\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _1-\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty })\left( \frac{L}{2}\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }-\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }\right) \ge 0. \end{aligned}$$

Further, we can get

$$\begin{aligned} \frac{L}{2}\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }^2-\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }-[{\mathcal {L}}(\varvec{\beta }^k)-{\mathcal {L}}(\varvec{\beta }^*)]\le 0, \end{aligned}$$

which is a univariate quadratic inequality in \(\Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }\). Solving it for the positive root and using \(\sqrt{a+b}\le \sqrt{a}+\sqrt{b}\), we get

$$\begin{aligned} \Vert \varvec{\beta }^k-\varvec{\beta }^*\Vert _{\infty }\le \sqrt{\frac{2\max \{{\mathcal {L}}(\varvec{\beta }^k)-{\mathcal {L}}(\varvec{\beta }^*),0\}}{L}}+\frac{2\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }}{L}. \end{aligned}$$
(17)

On the other hand, since \({\mathcal {L}}\) satisfies (C1),

$$\begin{aligned}&{\mathcal {L}}(\varvec{\beta }^0)-{\mathcal {L}}(\varvec{\beta }^*)\\&\quad \le \langle \nabla {\mathcal {L}}(\varvec{\beta }^*),\varvec{\beta }^0-\varvec{\beta }^*\rangle +\frac{U}{2}\big \Vert \varvec{\beta }^0-\varvec{\beta }^*\big \Vert _1\big \Vert \varvec{\beta }^0-\varvec{\beta }^*\big \Vert _{\infty }\\&\quad \le \big \Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\big \Vert _{\infty }\big \Vert \varvec{\beta }^0-\varvec{\beta }^*\big \Vert _1+\frac{U}{2}\big \Vert \varvec{\beta }^0-\varvec{\beta }^*\big \Vert _1\big \Vert \varvec{\beta }^0-\varvec{\beta }^*\big \Vert _{\infty }\\&\quad \le (K+T)\big \Vert \varvec{\beta }^0-\varvec{\beta }^{*}\big \Vert _{\infty }(\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\big \Vert _{\infty }+\frac{U}{2}\big \Vert \varvec{\beta }^0-\varvec{\beta }^*\big \Vert _{\infty }). \end{aligned}$$

Then, we can get

$$\begin{aligned} {\mathcal {L}}(\varvec{\beta }^{k})-{\mathcal {L}}(\varvec{\beta }^{*})&\le \xi [{\mathcal {L}}(\varvec{\beta }^{k-1})-{\mathcal {L}}(\varvec{\beta }^{*})]\\&\le \xi ^{k}[{\mathcal {L}}(\varvec{\beta }^{0})-{\mathcal {L}}(\varvec{\beta }^{*})]\\&\le \xi ^k(K+T)\big \Vert \varvec{\beta }^0-\varvec{\beta }^{*}\big \Vert _{\infty }\\&\quad \times (\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\big \Vert _{\infty }+\frac{U}{2}\big \Vert \varvec{\beta }^0-\varvec{\beta }^*\big \Vert _{\infty })\\&\le \frac{\xi ^k (L+U)(K+T)}{2}\big \Vert \varvec{\beta }^0-\varvec{\beta }^{*}\big \Vert _{\infty }^2. \end{aligned}$$

Hence, by (17), we have

$$\begin{aligned} \Vert \varvec{\beta }^{k}-\varvec{\beta }^{*}\Vert _{\infty }&\le \sqrt{(K+T)(1+\frac{U}{L})}(\sqrt{\xi })^k\Vert \varvec{\beta }^0-\varvec{\beta }^*\Vert _{\infty }\\&\quad +\frac{2}{L}\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }. \end{aligned}$$

\(\square \)

Lemma 5

(See the proof of Corollary 2 in Loh and Wainwright (2015).) Assume the \(x_{ij}\)'s are sub-Gaussian and \(n \gtrsim \log (p)\). Then there exist universal constants \((c_1,c_2,c_3)\) with \(0<c_i<\infty \), \(i=1,2,3\), such that

$$\begin{aligned} P\left( \Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }\ge c_1\sqrt{\frac{\log (p)}{n}}\right) \le c_2\exp (-c_3\log (p)). \end{aligned}$$
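
Lemma 5 can also be checked empirically. The following Monte Carlo sketch, with simulation settings of our own choosing, illustrates that \(\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }\) for a logistic model with Gaussian design is of the order \(\sqrt{\log (p)/n}\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, reps = 1000, 500, 100
beta_star = np.zeros(p)
beta_star[:5] = 1.0                                  # K = 5 true signals
vals = []
for _ in range(reps):
    X = rng.standard_normal((n, p))                  # sub-Gaussian design
    mu = 1.0 / (1.0 + np.exp(-(X @ beta_star)))
    y = rng.binomial(1, mu)
    vals.append(np.max(np.abs(X.T @ (mu - y) / n)))  # ||grad L(beta*)||_inf
print(np.mean(vals), np.sqrt(np.log(p) / n))         # comparable magnitudes
```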

1.4 Proof of Theorem 1

Proof

By Lemma 3, we have

$$\begin{aligned} {\mathcal {L}}(\varvec{\beta }^{k+1})-{\mathcal {L}}(\varvec{\beta }^*)\le \xi [{\mathcal {L}}(\varvec{\beta }^k)-{\mathcal {L}}(\varvec{\beta }^*)], \end{aligned}$$

where

$$\begin{aligned} \xi =1-\frac{2L(1-TU)}{T(1+K)}\in (0,1). \end{aligned}$$

So the conditions of Lemma 4 are satisfied. Taking \(\varvec{\beta }^0 = \mathbf{0} \), we get

$$\begin{aligned}&\Vert \varvec{\beta }^{k}-\varvec{\beta }^{*}\Vert _{\infty }\\&\quad \le \sqrt{(K+T)(1+\frac{U}{L})}(\sqrt{\xi })^k\Vert \varvec{\beta }^*\Vert _{\infty } +\frac{2}{L}\Vert \nabla {\mathcal {L}}(\varvec{\beta }^*)\Vert _{\infty }. \end{aligned}$$

By Lemma 5, with the universal constants \((c_1,c_2,c_3)\) given there, the following bound holds with probability at least \(1-c_2\exp (-c_3\log (p))\):

$$\begin{aligned}&\Vert \varvec{\beta }^{k}-\varvec{\beta }^{*}\Vert _{\infty }\nonumber \\&\quad \le \sqrt{(K+T)(1+\frac{U}{L})}(\sqrt{\xi })^k\Vert \varvec{\beta }^*\Vert _{\infty } +\frac{2c_1}{L} \sqrt{\frac{\log (p)}{n}}. \end{aligned}$$
(18)

Taking \(k \ge {\mathcal {O}}(\log _{\frac{1}{\xi }} \frac{n}{\log (p)} )\) in (18), some algebra shows that

$$\begin{aligned} \Vert \varvec{\beta }^{k}-\varvec{\beta }^{*}\Vert _{\infty }\le {\mathcal {O}}(\sqrt{\frac{\log (p)}{n}}) \end{aligned}$$

This completes the proof. \(\square \)
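
As a back-of-envelope illustration of this iteration bound, the snippet below computes \(\xi \) and a \(k\) of order \(\log _{1/\xi }(n/\log (p))\) for made-up values of \((L, U, T, K, n, p)\) satisfying \(0<U<1/T\); the numbers are for illustration only.

```python
import numpy as np

L_, U_, T_, K_ = 0.5, 0.05, 10, 5                  # made-up, with 0 < U < 1/T
xi = 1 - 2 * L_ * (1 - T_ * U_) / (T_ * (1 + K_))  # contraction factor in (0,1)
n, p = 2000, 500
k = int(np.ceil(np.log(n / np.log(p)) / np.log(1 / xi)))
print(xi, k)   # after k iterations the statistical error term dominates (18)
```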

1.5 Proof of Theorem 2

Proof

Combining (18) with assumption (C2), some algebra shows that

$$\begin{aligned}&\Vert \varvec{\beta }^{k}-\varvec{\beta }^{*}\Vert _{\infty }\\&\le \sqrt{(K+T)(1+\frac{U}{L})}(\sqrt{\xi })^k\Vert \varvec{\beta }^*\Vert _{\infty } +\frac{2}{3}\Vert \varvec{\beta }^*_{A^*}\Vert _{\min }\\&< \Vert \varvec{\beta }^*_{A^*}\Vert _{\min }, \end{aligned}$$

if \(k>\log _{\frac{1}{\xi }} 9 (T+K)(1+\frac{U}{L})r^2.\) This implies that \(A^* \subseteq A^k\). \(\square \)

About this article

Cite this article

Huang, J., Jiao, Y., Kang, L. et al. GSDAR: a fast Newton algorithm for \(\ell _0\) regularized generalized linear models with statistical guarantee. Comput Stat 37, 507–533 (2022). https://doi.org/10.1007/s00180-021-01098-z
