
On the superiority of PGMs to PDCAs in nonsmooth nonconvex sparse regression


Abstract

This paper conducts a comparative study of proximal gradient methods (PGMs) and proximal DC algorithms (PDCAs) for sparse regression problems that can be cast as difference-of-two-convex-functions (DC) optimization problems. It has been shown that for DC optimization problems both the General Iterative Shrinkage and Thresholding algorithm (GIST), a modified version of the PGM, and the PDCA converge to critical points. Recently, enhanced versions of PDCAs have been shown to converge to d-stationary points, which satisfy a stronger necessary condition for local optimality than critical points. In this paper we show that, without any modification, PGMs converge to d-stationary points not only for DC problems but also for more general nonsmooth nonconvex problems under some technical assumptions. While convergence to d-stationary points is known for the case where the step size is sufficiently small, our finding also holds for extended versions such as GIST and its alternating optimization variant, which is developed in this paper. Numerical results show that, among several algorithms in the two categories, modified versions of PGM perform best not only in solution quality but also in computation time.
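To fix ideas, here is a minimal, hypothetical sketch of the kind of modified PGM discussed above (a GIST-style iteration with a Barzilai–Borwein step size and a nonmonotone acceptance test), applied to the simple instance \(f(x)=\frac{1}{2}\Vert Ax-b\Vert ^2\), \(g(x)=\lambda \Vert x\Vert _1\). All function names and parameter values are illustrative choices, not the settings used in the paper.

import numpy as np

def soft_threshold(v, tau):
    # proximal operator of tau * ||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def gist_l1(A, b, lam, max_iter=500, r=5, sigma=1e-4, eta_min=1e-8, eta_max=1e8):
    # GIST-style proximal gradient method: Barzilai-Borwein initial step size
    # plus a nonmonotone acceptance test (illustrative settings, not the paper's).
    def F(u):
        return 0.5 * np.linalg.norm(A @ u - b) ** 2 + lam * np.abs(u).sum()
    x = np.zeros(A.shape[1])
    grad = A.T @ (A @ x - b)
    history = [F(x)]
    eta = 1.0
    for _ in range(max_iter):
        x_old, grad_old = x, grad
        while True:
            x_new = soft_threshold(x_old - grad_old / eta, lam / eta)
            # accept if F drops below the max of the last r values by a quadratic margin
            if F(x_new) <= max(history[-r:]) - 0.5 * sigma * eta * np.linalg.norm(x_new - x_old) ** 2:
                break
            eta *= 2.0
        x = x_new
        grad = A.T @ (A @ x - b)
        history.append(F(x))
        s, y = x - x_old, grad - grad_old
        if np.linalg.norm(s) <= 1e-10:
            break
        # Barzilai-Borwein initialization for the next iteration, kept in [eta_min, eta_max]
        eta = min(max((s @ y) / (s @ s), eta_min), eta_max)
    return x

The only regularizer-specific ingredient is the proximal operator; replacing soft_threshold with the proximal mapping of a nonconvex penalty would give a GIST-type method of the kind compared in the paper.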



Acknowledgements

S. Nakayama is supported in part by JSPS KAKENHI Grants 20K11698 and 20K14986. J. Gotoh is supported in part by JSPS KAKENHI Grants 19H02379, 19H00808, and 20H00285. The authors would like to thank the reviewers for their comments, which improved the quality of the manuscript.

Author information


Corresponding author

Correspondence to Jun-ya Gotoh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A Appendix

A.1 Proof of Lemma 1

Let

$$\begin{aligned} q(x;x_t) := \frac{\eta _t}{2}\left\| x-\left( x_t-\frac{1}{\eta _t}\nabla f(x_t)\right) \right\| ^2. \end{aligned}$$

Since \(x_{t+1}\) is a minimizer of \(Q(x;x_t) := g(x) + q(x;x_t)\), we have \(0 \in \hat{\partial }Q(x_{t+1};x_t),\) which yields

$$\begin{aligned} 0&\le \liminf _{y\ne x_{t+1}, y \rightarrow x_{t+1}}\frac{ Q(y;x_t ) - Q(x_{t+1};x_t )}{\Vert y-x_{t+1}\Vert }\\&\le \lim _{\tau \rightarrow +0}\frac{ Q(x_{t+1}+\tau d;x_t ) - Q(x_{t+1};x_t )}{\tau \Vert d\Vert }\\&\le \lim _{\tau \rightarrow +0}\frac{ g(x_{t+1}+\tau d) - g(x_{t+1})}{\tau \Vert d\Vert } +\lim _{\tau \rightarrow +0}\frac{ q(x_{t+1}+\tau d;x_t ) - q(x_{t+1};x_t )}{\tau \Vert d\Vert } \end{aligned}$$

for all \(d \in \mathbb {R}^p\). Therefore, we have

$$\begin{aligned} 0\le g'(x_{t+1};d) + \langle \eta _t(x_{t+1} - x_t ) +\nabla f(x_t),d\rangle , \end{aligned}$$

which, with \(\epsilon _t = \eta _t(x_{t+1}-x_t)+\nabla f(x_t)-\nabla f(x_{t+1})\), implies

$$\begin{aligned} - \langle \epsilon _t,d\rangle \le g'(x_{t+1};d) + \langle \nabla f(x_{t+1}),d\rangle = F'(x_{t+1};d). \end{aligned}$$

Therefore, by the Cauchy–Schwarz inequality, \( -\Vert \epsilon _t\Vert \Vert d\Vert \le F'(x_{t+1};d).\) \(\square \)
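As an illustration of Lemma 1, the following hypothetical snippet numerically verifies the inequality \(-\Vert \epsilon _t\Vert \Vert d\Vert \le F'(x_{t+1};d)\) for one proximal gradient step on the instance \(f(x)=\frac{1}{2}\Vert Ax-b\Vert ^2\), \(g(x)=\lambda \Vert x\Vert _1\), for which \(g'(x;d)=\lambda \sum _{i:x_i\ne 0}\mathrm{sign}(x_i)d_i+\lambda \sum _{i:x_i=0}|d_i|\). The data, the step size, and the sampling of directions are arbitrary choices made only for this check.

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8))
b = rng.standard_normal(20)
lam = 0.5
eta = np.linalg.norm(A, 2) ** 2  # any eta > 0 works for the check; here the Lipschitz constant of grad f

def grad_f(u):
    return A.T @ (A @ u - b)

def g_dir(u, d):
    # directional derivative of lam*||.||_1 at u in direction d
    nz = u != 0
    return lam * (np.sign(u[nz]) @ d[nz] + np.abs(d[~nz]).sum())

x_old = rng.standard_normal(8)
# one proximal gradient step: x_new minimizes g(x) + (eta/2)*||x - (x_old - grad_f(x_old)/eta)||^2
v = x_old - grad_f(x_old) / eta
x_new = np.sign(v) * np.maximum(np.abs(v) - lam / eta, 0.0)

eps = eta * (x_new - x_old) + grad_f(x_old) - grad_f(x_new)  # epsilon_t as used in the proof
for _ in range(1000):
    d = rng.standard_normal(8)
    F_dir = grad_f(x_new) @ d + g_dir(x_new, d)  # F'(x_new; d)
    assert -np.linalg.norm(eps) * np.linalg.norm(d) <= F_dir + 1e-9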

A.2 Proof of Theorem 2

Because of the nonnegativity of the functions \(T_K\) and \(T_\kappa \), it holds that \(f(x,z)+\lambda _1T_K(x)+\lambda _2T_\kappa (z)\ge f(x,z)+\underline{\lambda }_1T_K(x)+\underline{\lambda }_2T_\kappa (z)\) for any \((x,z)\) whenever \(\lambda _1>\underline{\lambda }_1\) and \(\lambda _2>\underline{\lambda }_2\). Accordingly, we have \(S(\lambda _1,\lambda _2)\subset S(\underline{\lambda }_1,\underline{\lambda }_2)\), and hence \(\Vert x^*\Vert \le C_x\) and \(\Vert z^*\Vert \le C_z\).

In the rest of this proof, we show only the condition for \(\lambda _1\), since that for \(\lambda _2\) can be shown in the same manner. We prove the statement by contradiction. Suppose that \(\Vert x^*\Vert _0>K\) and \(x^*_{(K+1)}>0\). It follows from Assumption 2 that for any \(x_1, x_2\in \mathbb {R}^p\) and \(\tilde{z}\in \mathbb {R}^N\)

$$\begin{aligned} \Vert \nabla _x f(x_1,\tilde{z}) - \nabla _xf(x_2,\tilde{z})\Vert \le \left\| \begin{matrix} \nabla _x f(x_1,\tilde{z}) -\nabla _x f(x_2,\tilde{z})\\ \nabla _z f(x_1,\tilde{z})-\nabla _z f(x_2,\tilde{z}) \end{matrix}\right\| \le M \left\| x_1-x_2 \right\| , \end{aligned}$$

which means

$$\begin{aligned} f(x_1,\tilde{z}) \le f(x_2,\tilde{z}) + \nabla _{x}f(x_2,\tilde{z})^\top (x_1-x_2)+\frac{M}{2}\Vert x_2-x_1\Vert ^2. \end{aligned}$$

Let \(\tilde{x}:=x^*-x_i^*e_i\), then the above inequality yields

$$\begin{aligned} f(\tilde{x},z^*) \le f(x^*,z^*) - \nabla _{x}f(x^*,z^*)^\top (x_i^*e_i)+\frac{M}{2}(x_i^*)^2, \end{aligned}$$

where \(i\) denotes the index corresponding to \(x^*_{(K+1)}\). Therefore, we obtain

$$\begin{aligned} F(x^*,z^*)-F(\tilde{x},z^*)&=f(x^*,z^*)+\lambda _1T_{K}(x^*)+\lambda _2T_{\kappa }(z^*)\\&\quad - \left( f(\tilde{x},z^*)+\lambda _1T_{K}(\tilde{x})+\lambda _2T_{\kappa }(z^*)\right) \\&\ge \nabla _{x}f(x^*,z^*)^\top (x_i^*e_i)-\frac{M}{2}(x_i^*)^2+\lambda _1|x^*_i|\\&\ge |x^*_i|\left( \lambda _1-\Vert \nabla _{x}f(x^*,z^*)\Vert -\frac{MC_x}{2}\right) . \end{aligned}$$

Noting that

$$\begin{aligned} \Vert \nabla _{x}f(x^*,z^*)\Vert&\le \Vert \nabla _{x}f(0,0)\Vert +\Vert \nabla _{x}f(x^*,z^*)-\nabla _{x}f(0,0)\Vert \\&\le \Vert \nabla _{x}f(0,0)\Vert +M(\Vert x^*\Vert +\Vert z^*\Vert )\\&\le \Vert \nabla _{x}f(0,0)\Vert +M(C_x+C_z), \end{aligned}$$

and (25), we have

$$\begin{aligned} F(x^*,z^*)-F(\tilde{x},z^*) \ge |x^*_i|\left[ \lambda _1-\Vert \nabla _{x}f(0,0)\Vert -M\left( \frac{3}{2}C_x+C_z\right) \right] >0, \end{aligned}$$

which contradicts the optimality of \((x^*,z^*)\). The condition for \(\lambda _2\) can be derived in the same manner.

\(\square \)
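To illustrate the threshold that makes the bracketed factor in the last display positive, suppose (purely as a numerical example; these constants are hypothetical and not taken from the paper) that \(\Vert \nabla _{x}f(0,0)\Vert =2\), \(M=1\), \(C_x=2\), and \(C_z=1\). Then the contradiction argument above applies whenever

$$\begin{aligned} \lambda _1 > \Vert \nabla _{x}f(0,0)\Vert + M\left( \frac{3}{2}C_x+C_z\right) = 2 + 1\cdot \left( \frac{3}{2}\cdot 2+1\right) = 6. \end{aligned}$$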

A.3 Proof of Lemma 2

Let

$$\begin{aligned} Q_x(x|x_t,z_t)&:=g(x)+q_x(x|x_t,z_t),\\ Q_z(z|x_{t+1},z_t)&:=h(z)+q_z(z|x_{t+1},z_t), \end{aligned}$$

with

$$\begin{aligned} q_x(x|x_t,z_t)&:=\frac{\eta ^x_t}{2}\left\| x-\Big (x_t-\frac{1}{\eta _t^x}\nabla _{x}f(x_t,z_t)\Big )\right\| ^2,\\ q_z(z|x_{t+1},z_t)&:=\frac{\eta ^z_t}{2}\left\| z-\Big (z_t-\frac{1}{\eta _t^z}\nabla _{z}f(x_{t+1},z_t)\Big )\right\| ^2. \end{aligned}$$

From (28) and (29), \(x_{t+1}\) and \(z_{t+1}\) minimize \(Q_x(x|x_t,z_t)\) and \(Q_z(z|x_{t+1},z_t)\), respectively, and accordingly we have

$$\begin{aligned} 0\in \hat{\partial }_x Q_x(x_{t+1}|x_t,z_t), \quad 0\in \hat{\partial }_z Q_z(z_{t+1}|x_{t+1},z_t). \end{aligned}$$

Similarly to the proof of Lemma 1, we have

$$\begin{aligned} 0&\le g'(x_{t+1};d_x) + \langle \eta ^x_t(x_{t+1} - x_t) +\nabla _x f(x_t,z_t),d_x\rangle ,\\ 0&\le h'(z_{t+1};d_z) + \langle \eta ^z_t(z_{t+1} - z_t) +\nabla _z f(x_{t+1},z_t),d_z\rangle , \end{aligned}$$

which imply

$$\begin{aligned} -\Vert \epsilon ^x_t\Vert \Vert d_x\Vert \le g'(x_{t+1};d_x) + \langle \nabla _{x} f(x_{t+1},z_{t+1}), d_x\rangle ,\\ -\Vert \epsilon ^z_t\Vert \Vert d_z\Vert \le h'(z_{t+1};d_z) + \langle \nabla _{z} f(x_{t+1},z_{t+1}), d_z\rangle . \end{aligned}$$

\(\square \)
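In code, the two updates analyzed in Lemma 2 correspond to alternating proximal gradient steps. The following is a minimal sketch under the illustrative choices \(f(x,z)=\frac{1}{2}\Vert Ax+Bz-b\Vert ^2\) and \(\ell _1\) regularizers for g and h; the paper's actual choices of g, h, and the step sizes \(\eta _t^x,\eta _t^z\) in (28)–(29) may differ.

import numpy as np

def prox_l1(v, tau):
    # proximal operator of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def alternating_pg_step(x, z, A, B, b, lam1, lam2, eta_x, eta_z):
    # x-update: x_{t+1} minimizes g(x) + (eta_x/2)*||x - (x_t - grad_x f(x_t,z_t)/eta_x)||^2
    grad_x = A.T @ (A @ x + B @ z - b)
    x_new = prox_l1(x - grad_x / eta_x, lam1 / eta_x)
    # z-update uses the already updated x_{t+1}, as in the proof of Lemma 2
    grad_z = B.T @ (A @ x_new + B @ z - b)
    z_new = prox_l1(z - grad_z / eta_z, lam2 / eta_z)
    return x_new, z_new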

A.4 Proof of Lemma 3

Denoting

$$\begin{aligned} \phi (t) = \underset{j=\max \{0,t-r+1\},...,t}{\mathrm{argmax}}F(x_j,z_j), \end{aligned}$$

we can rewrite (30) as

$$\begin{aligned} F(x_{t+1},z_{t+1}) \le F(x_{\phi (t)},z_{\phi (t)}) -\frac{\sigma _1}{2}\eta _t^x\Vert x_{t+1} - x_t\Vert ^2 - \frac{\sigma _2}{2}\eta _t^z\Vert z_{t+1}-z_t\Vert ^2. \end{aligned}$$
(32)

Then we have

$$\begin{aligned} F(x_{\phi (t+1)},z_{\phi (t+1)})&= \max _{j=0,1,...,\min \{r-1,t+1\}} F(x_{t+1-j},z_{t+1-j}) \\&= \max \left\{ \max _{j=1,...,\min \{r-1,t+1\}} F(x_{t+1-j},z_{t+1-j}),F(x_{t+1},z_{t+1})\right\} \\&\le \max \Big \{F(x_{\phi (t)},z_{\phi (t)}),F(x_{\phi (t)},z_{\phi (t)}) \\&\quad -\frac{\sigma _1}{2}\eta _t^x\Vert x_{t+1} - x_t\Vert ^2 - \frac{\sigma _2}{2}\eta _t^z\Vert z_{t+1}-z_t\Vert ^2\Big \}\\&= F(x_{\phi (t)},z_{\phi (t)}), \end{aligned}$$

which implies that the sequence \(\{ F(x_{\phi (t)},z_{\phi (t)}):t\ge 0\}\) is nonincreasing. Therefore, since F is bounded below, there exists \(\bar{F}\) such that

$$\begin{aligned} \lim _{t\rightarrow \infty }F(x_{\phi (t)},z_{\phi (t)})=\bar{F}. \end{aligned}$$
(33)

By applying (32) with t replaced by \(\phi (t)-1\), we obtain

$$\begin{aligned} F(x_{\phi (t)},z_{\phi (t)})&\le F(x_{\phi (\phi (t)-1)},z_{\phi (\phi (t)-1)}) -\frac{\sigma _1}{2}\eta _{\phi (t)-1}^x\Vert x_{\phi (t)} - x_{\phi (t)-1}\Vert ^2 \\&\quad - \frac{\sigma _2}{2}\eta _{\phi (t)-1}^z\Vert z_{\phi (t)}-z_{\phi (t)-1}\Vert ^2. \end{aligned}$$

Therefore, noting that \(\phi (t)-1\rightarrow \infty \) as \(t\rightarrow \infty \), it follows from (33) that

$$\begin{aligned} \lim _{t\rightarrow \infty } \sigma _1\eta _{\phi (t)-1}^x\Vert x_{\phi (t)} - x_{\phi (t)-1}\Vert ^2 + \sigma _2\eta _{\phi (t)-1}^z\Vert z_{\phi (t)}-z_{\phi (t)-1}\Vert ^2= 0. \end{aligned}$$

Since \(\eta _{\phi (t)-1}^x\), \(\eta _{\phi (t)-1}^z\ge \underline{\eta }\), we have

$$\begin{aligned} \lim _{t\rightarrow \infty } \Vert x_{\phi (t)} - x_{\phi (t)-1}\Vert ^2 =0 \quad \text {and}\quad \lim _{t\rightarrow \infty } \Vert z_{\phi (t)}-z_{\phi (t)-1}\Vert ^2= 0. \end{aligned}$$
(34)

With (33), (34), the boundedness of the sequence, and the continuity of F, we have

$$\begin{aligned} \bar{F}&= \lim _{t\rightarrow \infty }F(x_{\phi (t)},z_{\phi (t)})\\&= \lim _{t\rightarrow \infty }F(x_{\phi (t)-1} + (x_{\phi (t)} - x_{\phi (t)-1}) ,z_{\phi (t)-1}+(z_{\phi (t)}-z_{\phi (t)-1}))\\&=\lim _{t\rightarrow \infty }F(x_{\phi (t)-1},z_{\phi (t)-1}). \end{aligned}$$

Next we prove, by induction, that the following hold for all \(j\ge 1\):

$$\begin{aligned}&\lim _{t\rightarrow \infty } \Vert x_{\phi (t)} - x_{\phi (t)-j}\Vert ^2 =0 \quad \text {and}\quad \lim _{t\rightarrow \infty } \Vert z_{\phi (t)}-z_{\phi (t)-j}\Vert ^2= 0, \end{aligned}$$
(35)
$$\begin{aligned}&\quad \lim _{t\rightarrow \infty }F(x_{\phi (t)-j},z_{\phi (t)-j})=\bar{F}. \end{aligned}$$
(36)

We have already observed that the results hold for \(j=1\). Suppose that (35) and (36) hold for j. From (32) with t replaced by \(\phi (t)-j-1\), we get

$$\begin{aligned} F(x_{\phi (t)-j},z_{\phi (t)-j})&\le F(x_{\phi (\phi (t)-j-1)},z_{\phi (\phi (t)-j-1)})\\&\quad -\frac{\sigma _1}{2}\eta _{\phi (t)-j-1}^x\Vert x_{\phi (t)-j} - x_{\phi (t)-j-1}\Vert ^2 \\&\quad - \frac{\sigma _2}{2}\eta _{\phi (t)-j-1}^z\Vert z_{\phi (t)-j}-z_{\phi (t)-j-1}\Vert ^2, \end{aligned}$$

Since both \(F(x_{\phi (t)-j},z_{\phi (t)-j})\) and \(F(x_{\phi (\phi (t)-j-1)},z_{\phi (\phi (t)-j-1)})\) converge to \(\bar{F}\), the quadratic terms on the right-hand side vanish as \(t\rightarrow \infty \). Together with \(\eta _{l}^x,\eta _{l}^z\ge \underline{\eta }\) for all \(l\), this yields (35) for \(j+1\), and the continuity of F then gives (36) for \(j+1\). Hence, we have (31). \(\square \)
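The nonmonotone acceptance rule behind (30)–(32) can be summarized in a few lines of hypothetical code: a candidate pair \((x_{t+1},z_{t+1})\) is accepted only if its objective value falls below the largest of the last r accepted values by the prescribed quadratic margin. Names and parameter values below are illustrative only.

from collections import deque

def accept(F_new, recent_F, dx_sq, dz_sq, eta_x, eta_z, sigma1, sigma2):
    # nonmonotone test (32): compare against the max of the last r accepted values
    threshold = max(recent_F) - 0.5 * sigma1 * eta_x * dx_sq - 0.5 * sigma2 * eta_z * dz_sq
    return F_new <= threshold

# usage: keep the last r accepted objective values in a bounded deque (here r = 5)
recent_F = deque([10.0], maxlen=5)  # dummy initial objective value
if accept(9.2, recent_F, dx_sq=0.4, dz_sq=0.1, eta_x=1.0, eta_z=1.0, sigma1=1e-4, sigma2=1e-4):
    recent_F.append(9.2)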


About this article


Cite this article

Nakayama, S., Gotoh, Jy. On the superiority of PGMs to PDCAs in nonsmooth nonconvex sparse regression. Optim Lett 15, 2831–2860 (2021). https://doi.org/10.1007/s11590-021-01716-1
