Majorization minimization by coordinate descent for concave penalized generalized linear models

Abstract

Recent studies have demonstrated the theoretical attractiveness of a class of concave penalties in variable selection, including the smoothly clipped absolute deviation (SCAD) and minimax concave (MCP) penalties. The computation of concave penalized solutions in high-dimensional models, however, is a difficult task. We propose a majorization minimization by coordinate descent (MMCD) algorithm for computing the concave penalized solutions in generalized linear models. In contrast to existing algorithms that use local quadratic or local linear approximations to the penalty function, the MMCD seeks to majorize the negative log-likelihood by a quadratic loss, but does not use any approximation to the penalty. This strategy makes it possible to avoid the computation of a scaling factor in each update of the solutions, which improves the efficiency of coordinate descent. Under certain regularity conditions, we establish the theoretical convergence property of the MMCD. We implement this algorithm for a penalized logistic regression model using the SCAD and MCP penalties. Simulation studies and a data example demonstrate that the MMCD works sufficiently fast for penalized logistic regression in high-dimensional settings where the number of covariates is much larger than the sample size.
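
As a rough illustration of the strategy described above, here is a minimal Python sketch of an MMCD-style cycle for MCP-penalized logistic regression. It is not the authors' implementation; it assumes the loss is the average negative log-likelihood with standardized covariates, so that the coordinate-wise curvature can be majorized by the constant M = 1/4 (a Böhning–Lindsay-type bound), and the helpers mcp_threshold and mmcd_logistic_mcp are hypothetical names.

```python
import numpy as np


def mcp_threshold(z, lam, gam, M):
    """Exact minimizer of 0.5*M*(u - z)**2 + MCP(u; lam, gam) over u.

    Valid when gam * M > 1 (cf. condition (b) of the MMCD algorithm)."""
    if abs(z) <= gam * lam:
        # soft-threshold z at lam/M, then rescale for the concave part of the MCP
        return np.sign(z) * max(abs(z) - lam / M, 0.0) / (1.0 - 1.0 / (gam * M))
    return z  # beyond gam*lam the MCP is flat, so the update is unpenalized


def mmcd_logistic_mcp(X, y, lam, gam=8.0, max_iter=200, tol=1e-6):
    """Sketch of an MMCD-style cycle for MCP-penalized logistic regression.

    Assumes columns of X are standardized (mean 0, mean square 1) and the loss
    is the average negative log-likelihood, whose coordinate-wise second
    derivative is then majorized by M = 1/4. Here gam must exceed 1/M = 4."""
    n, p = X.shape
    M = 0.25  # fixed majorizing constant for the standardized logistic loss
    beta0, beta = 0.0, np.zeros(p)
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(p):
            eta = beta0 + X @ beta                   # linear predictor
            pi = 1.0 / (1.0 + np.exp(-eta))          # fitted probabilities
            grad_j = np.mean((pi - y) * X[:, j])     # d(loss)/d(beta_j)
            z = beta[j] - grad_j / M                 # majorized one-step target
            beta[j] = mcp_threshold(z, lam, gam, M)  # exact penalized minimizer
        # unpenalized intercept: one majorized (quadratic upper bound) step
        eta = beta0 + X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))
        beta0 = beta0 - np.mean(pi - y) / M
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta0, beta
```

Because the majorizing constant M is fixed, no coordinate-wise scaling factor has to be recomputed at each update, which is the efficiency gain referred to in the abstract; a serious implementation would also update the linear predictor incrementally and compute a whole solution path over a grid of penalty levels.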

References

  • Böhning, D., Lindsay, B.: Monotonicity of quadratic-approximation algorithms. Ann. Inst. Stat. Math. 40(4), 641–663 (1988)

  • Breheny, P., Huang, J.: Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat. 5(1), 232–253 (2011)

  • Donoho, D.L., Johnstone, I.M.: Ideal spatial adaptation by wavelet shrinkage. Biometrika 81(3), 425–455 (1994)

  • Efron, B., Hastie, T., Johnstone, I., Tibshirani, R.: Least angle regression. Ann. Stat. 32(2), 407–451 (2004)

  • Fan, J., Li, R.: Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 96(456), 1348–1360 (2001)

  • Friedman, J., Hastie, T., Höfling, H., Tibshirani, R.: Pathwise coordinate optimization. Ann. Appl. Stat. 1(2), 302–332 (2007)

  • Friedman, J., Hastie, T., Tibshirani, R.: Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33(1), 1–22 (2010)

  • Hunter, D.R., Lange, K.: A tutorial on MM algorithms. Am. Stat. 58(1), 30–37 (2004)

  • Hunter, D.R., Li, R.: Variable selection using MM algorithms. Ann. Stat. 33(4), 1617–1642 (2005)

  • Jiang, D., Huang, J., Zhang, Y.: The cross-validated AUC for MCP-logistic regression with high-dimensional data. Stat. Methods Med. Res. (2011, accepted). doi:10.1177/0962280211428385

  • Lange, K.: Optimization. Springer, New York (2004)

  • Lange, K., Hunter, D., Yang, I.: Optimization transfer using surrogate objective functions (with discussion). J. Comput. Graph. Stat. 9(1), 1–59 (2000)

  • Mazumder, R., Friedman, J., Hastie, T.: SparseNet: coordinate descent with non-convex penalties. J. Am. Stat. Assoc. 106(495), 1125–1138 (2011)

  • Ortega, J.M., Rheinboldt, W.C.: Iterative Solution of Nonlinear Equations in Several Variables. Academic Press, New York (1970)

  • Osborne, M.R., Presnell, B., Turlach, B.A.: A new approach to variable selection in least squares problems. IMA J. Numer. Anal. 20(3), 389–403 (2000)

  • R Development Core Team: R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org

  • Schifano, E.D., Strawderman, R.L., Wells, M.T.: Majorization-minimization algorithms for non-smoothly penalized objective functions. Electron. J. Stat. 4, 1258–1299 (2010)

  • Tibshirani, R.: Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)

  • Tseng, P.: Convergence of a block coordinate descent method for non-differentiable minimization. J. Optim. Theory Appl. 109(3), 475–494 (2001)

  • van de Vijver, M.J., He, Y.D., van’t Veer, L.J., et al.: A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med. 347(25), 1999–2009 (2002)

  • van’t Veer, L.J., Dai, H., van de Vijver, M.J., et al.: Gene expression profiling predicts clinical outcome of breast cancer. Nature 415, 530–536 (2002)

  • Warga, J.: Minimizing certain convex functions. SIAM J. Appl. Math. 11(3), 588–593 (1963)

  • Wu, T.T., Lange, K.: Coordinate descent algorithms for Lasso penalized regression. Ann. Appl. Stat. 2(1), 224–244 (2008)

  • Zhang, C.H.: Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 38(2), 894–942 (2010)

  • Zou, H., Li, R.: One-step sparse estimates in nonconcave penalized likelihood models. Ann. Stat. 36(4), 1509–1533 (2008)

Acknowledgements

The authors thank the reviewers and the associate editor for their helpful comments that led to considerable improvements of the paper. The research of Huang is supported in part by NIH grants R01CA120988, R01CA142774 and NSF grant DMS 1208225.

Author information

Corresponding author

Correspondence to Dingfeng Jiang.

Appendix

In this appendix, we prove Theorem 1. The proof follows the basic idea of Mazumder et al. (2011), but there are some important differences. In particular, we need to handle the intercept in Lemma 1 and Theorem 1, as well as the quadratic approximation to the loss function and the coordinate-wise majorization in Theorem 1.

Lemma 1

Suppose the data (y,X) lies on a compact set and the following conditions hold:

  1. The loss function ℓ(β) is (totally) differentiable w.r.t. β for any \(\boldsymbol{\beta} \in\mathbb{R}^{p+1}\).

  2. The penalty function ρ(t) is symmetric around 0 and differentiable on t≥0; ρ′(|t|) is non-negative, continuous and uniformly bounded, where ρ′(|t|) is the derivative of ρ(|t|) w.r.t. |t| and, at t=0, the right-sided derivative.

  3. The sequence {β^k} is bounded.

  4. For every convergent subsequence \(\{\boldsymbol{\beta}^{n_{k}}\} \subset\{\boldsymbol{\beta}^{k}\}\), the successive differences converge to zero: \(\boldsymbol{\beta}^{n_{k}} - \boldsymbol{\beta }^{n_{k}-1} \to0\).

Then, if β^∞ is any limit point of the sequence {β^k}, β^∞ is a minimum of the function Q(β); i.e.

$$ \liminf_{\alpha\downarrow0+}\biggl\{ \frac{Q(\boldsymbol {\beta}^{\infty} + \alpha\boldsymbol{\delta}) - Q(\boldsymbol{\beta }^{\infty}) }{\alpha} \biggr\} \geq0, $$
(22)

for any \(\boldsymbol{\delta}=(\delta_{0},\ldots,\delta_{p}) \in\mathbb {R}^{p+1}\).

Proof

For any \(\boldsymbol{\beta}=(\beta_{0},\ldots,\beta_{p})^{T}\) and \(\boldsymbol{\delta}_{j}=(0,\ldots,\delta_{j},\ldots,0) \in\mathbb {R}^{p+1}\), we have

(23)

for j∈{1,…,p}, with

$$ \partial\rho(\beta_{j}; \delta_{j})= \left\{ \begin{array}{l@{\quad}l} \rho^{\prime}(|\beta_{j}|) \operatorname {sgn}(\beta_{j})\delta_{j}, & |\beta_{j}| >0; \\\rho^{\prime}(0) |\delta_{j}|, & |\beta_{j}| =0. \end{array} \right. $$
(24)

Assume \(\boldsymbol{\beta}^{n_{k}} \to\boldsymbol{\beta}^{\infty }=(\beta_{0}^{\infty},\ldots,\beta_{p}^{\infty})\), and by assumption 4, as k→∞

(25)

By (24) and (25), we have the results below for j∈{1,…,p}.

$$ \begin{aligned} &\partial\rho\bigl(\beta_{j}^{n_{k}}; \delta_{j}\bigr) \to\partial\rho\bigl(\beta_{j}^{\infty}; \delta_{j}\bigr), \quad \mbox{if}\ \beta_{j}^{\infty} \neq 0; \\&\partial\rho\bigl(\beta_{j}^{\infty};\delta_{j} \bigr) \geq\liminf_{k} \partial\rho\bigl( \beta_{j}^{n_{k}};\delta_{j}\bigr), \quad \mbox{if}\ \beta_{j}^{\infty} = 0. \end{aligned} $$
(26)

By the coordinate-wise minimality at the jth coordinate, j∈{1,…,p}, we have

$$ \nabla_{j}\ell\bigl(\boldsymbol{ \beta}_{j}^{n_{k}-1}\bigr) \delta_{j} + \partial\rho \bigl(\beta_{j}^{n_{k}};\delta_{j}\bigr) \geq0, \quad \mbox{for\ all}\ k. $$
(27)

Thus (26) and (27) imply that, for all j∈{1,…,p},

(28)

By (23), (28), for j∈{1,…,p}, we have

$$ \liminf_{\alpha\downarrow0+} \biggl\{ \frac{Q(\boldsymbol{\beta}^{\infty} + \alpha\boldsymbol{\delta}_{j}) - Q(\boldsymbol{\beta}^{\infty}) }{\alpha} \biggr\} \geq0. $$
(29)

Following the above arguments, it is easy to see that for j=0

$$ \nabla_{0}\ell\bigl(\boldsymbol{ \beta}^{\infty}\bigr) \delta_{0} \geq0. $$
(30)

Hence for \(\boldsymbol{\delta}=(\delta_{0},\ldots,\delta_{p}) \in \mathbb{R}^{p+1}\), by the differentiability of ℓ(β), we have

(31)

by (29), (30). This completes the proof. □

Proof of Theorem 1

To ease notation, write \(\chi_{\beta_{0},\ldots,\beta_{j-1},\beta _{j+1},\ldots,\beta_{p}}^{j}\equiv\chi(u)\) for Q(β) viewed as a function of the jth coordinate with β_l, l≠j, held fixed. In the following arguments we first deal with the coordinates j∈{1,…,p}, and then with the intercept (the 0th coordinate).

For the jth coordinate, j∈{1,…,p}, observe that

(32)
(33)

with \(|u^{*}|\) being some number between |u+δ| and |u|. Here \(\nabla_{j}\ell(\beta_{0},\ldots,\beta_{j-1},u,\beta_{j+1},\ldots,\beta_{p})\) and \(\nabla_{j}^{2}\ell(\beta_{0},\ldots,\beta_{j-1},u,\beta_{j+1},\ldots ,\beta_{p})\) denote the first and second derivatives of the loss w.r.t. the jth coordinate (which exist by condition 1).

We re-write the RHS of (33) as follows:

(34)

On the other hand, the update of the jth coordinate (j∈{1,…,p}) minimizes the following function,

(35)

By majorization, we bound \(\nabla_{j}^{2}\ell(\boldsymbol{\beta})\) by a constant M for standardized variables. So the actual function being minimized is

$$ \tilde{Q}_{j}(u|\boldsymbol{\beta})= \ell(\boldsymbol{\beta})+ \nabla_{j}\ell(\boldsymbol{\beta}) (u- \beta_{j}) + \frac{1}{2}M(u-\beta_{j})^{2} + \rho\bigl(|u|\bigr). $$
(36)
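
As an illustrative aside, consider the MCP, ρ(|u|;λ,γ)=λ|u|−u²/(2γ) for |u|≤γλ and γλ²/2 otherwise. Completing the square in (36) then gives the coordinate-wise minimizer in closed form (assuming γM>1):

$$ u = \begin{cases} \frac{\operatorname{sgn}(z)(|z|-\lambda/M)_{+}}{1-1/(\gamma M)}, & |z| \leq\gamma\lambda, \\ z, & |z| > \gamma\lambda, \end{cases} \qquad z=\beta_{j}-\nabla_{j}\ell(\boldsymbol{\beta})/M, $$

where (x)_+ = max(x,0).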

Since u minimizes (36), we have, for the jth (j∈{1,…,p}) coordinate,

$$ \nabla_{j}\ell(\boldsymbol{\beta}) + M(u-\beta_{j}) + \rho'\bigl(|u|\bigr) \operatorname {sgn}(u)=0. $$
(37)

Because χ(u) is minimized at u_0, by (37) we have

(38)

If u_0=0, the above holds for some value of \(\operatorname {sgn}(u_{0}) \in[-1,1]\).

Observe that ρ′(|x|)≥0, then

(39)

Therefore, using (38) and (39) in (34) at u_0, we have, for j∈{1,…,p},

(40)

By condition (b) of the MMCD algorithm, \(\inf_{t}\rho''(|t|;\lambda,\gamma) > -M\), and \((|u+\delta|-|u|)^{2} \leq\delta^{2}\). Hence there exists \(\theta_{2}=\frac{1}{2}(M + \mbox{inf}_{x}\rho ^{\prime\prime}(|x|) + o(1)) >0\) such that, for the jth coordinate, j∈{1,…,p},

$$ \chi(u_{0}+\delta)- \chi(u_{0}) \geq\theta_{2} \delta^2. $$
(41)
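
As an illustration of condition (b): for the MCP, ρ(t;λ,γ)=λt−t²/(2γ) on 0≤t≤γλ and constant thereafter, so

$$ \inf_{t}\rho''\bigl(|t|;\lambda,\gamma\bigr)=-\frac{1}{\gamma} > -M \quad\Longleftrightarrow\quad \gamma>\frac{1}{M}. $$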

Now consider β_0 and observe that

(42)

By similar arguments to (38), we have

(43)

Therefore, by (42) and (43), for the intercept coordinate of β

(44)

Hence there exists \(\theta_{1}=\frac{1}{2} (M+ o(1)) >0\) such that, for the intercept coordinate of β,

$$ \chi(u_{0}+\delta)-\chi(u_{0}) \geq\theta_{1}\delta^{2}. $$
(45)

Let θ=min(θ_1,θ_2). Using (41) and (45), we have, for all coordinates of β,

$$ \chi(u_{0}+\delta)-\chi(u_{0}) \geq\theta \delta^2. $$
(46)

By (46) we have

(47)

where \(\boldsymbol{\beta}_{j}^{m-1}=(\beta_{1}^{m},\ldots,\beta _{j}^{m},\beta_{j+1}^{m-1},\ldots,\beta_{p}^{m-1})\). Inequality (47) establishes the boundedness of the sequence {β^m} for every m>1, since the starting point \(\boldsymbol{\beta}^{1} \in\mathbb{R}^{p+1}\) is fixed.

Applying (47) over all the coordinates, we have for all m

$$ Q\bigl(\boldsymbol{\beta}^{m}\bigr)-Q\bigl(\boldsymbol{ \beta}^{m+1}\bigr) \geq\theta\bigl\|\boldsymbol{ \beta}^{m+1} - \boldsymbol{\beta}^{m} \bigr\| _{2}^{2}. $$
(48)

Since the (decreasing) sequence Q(β^m) converges, (48) shows that the sequence {β^m} has a unique limit point. This completes the proof of the convergence of {β^m}.

Assumptions 3 and 4 of Lemma 1 hold by (48). Hence, by Lemma 1, the limit point of {β^m} is a minimum of Q(β). This completes the proof of the theorem. □


About this article

Cite this article

Jiang, D., Huang, J. Majorization minimization by coordinate descent for concave penalized generalized linear models. Stat Comput 24, 871–883 (2014). https://doi.org/10.1007/s11222-013-9407-3
