Abstract
This paper presents a comparative study of proximal gradient methods (PGMs) and proximal DC algorithms (PDCAs) for sparse regression problems, which can be cast as difference-of-convex (DC) optimization problems. It has been shown that for DC optimization problems, both the General Iterative Shrinkage and Thresholding algorithm (GIST), a modified version of PGM, and the PDCA converge to critical points. Recently, enhanced versions of PDCAs have been shown to converge to d-stationary points, which satisfy a stronger necessary condition for local optimality than critical points. In this paper we show that, without any modification, PGMs converge to d-stationary points not only for DC problems but also for more general nonsmooth nonconvex problems under some technical assumptions. While convergence to d-stationary points is known for the case where the step size is sufficiently small, the finding of this paper also holds for extended versions such as GIST and its alternating optimization variant, which is developed in this paper. Numerical results show that among several algorithms in the two categories, the modified versions of PGM perform best not only in solution quality but also in computation time.
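As a point of reference for the class of methods compared above, a basic proximal gradient step for \(\ell _1\)-regularized least squares can be sketched as follows. This is a minimal ISTA-style illustration with a fixed step size, not the GIST variant studied in the paper; all names and the fixed-step choice are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_lasso(A, b, lam, step, iters=500):
    # Basic proximal gradient method (ISTA) for
    #   min_x 0.5 * ||A x - b||^2 + lam * ||x||_1,
    # alternating a gradient step on the smooth part with the
    # proximal step on the nonsmooth l1 penalty.
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)            # gradient of the smooth part
        x = soft_threshold(x - step * grad, step * lam)
    return x
```

GIST-type methods replace the fixed step by a Barzilai–Borwein-style step with a nonmonotone line search, and PDCAs replace the gradient step by a linearization of the concave part of a DC decomposition; the proximal structure of the update is the same.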
Acknowledgements
S. Nakayama is supported in part by JSPS KAKENHI Grants 20K11698 and 20K14986. J. Gotoh is supported in part by JSPS KAKENHI Grants 19H02379, 19H00808, and 20H00285. The authors would like to thank the reviewers for their comments, which improved the quality of the manuscript.
A Appendix
1.1 A.1 Proof of Lemma 1
Let
Since \(x_{t+1}\) is a minimizer of \(Q(x;x_t) := g(x) + q(x;x_t)\), we have \(0 \in \hat{\partial }Q(x_{t+1};x_t),\) which yields
for all \(d \in \mathbb {R}^p\). Therefore, we have
which implies
Therefore, \( -\Vert \epsilon _t\Vert \Vert d\Vert \le F'(x_{t+1};d).\) \(\square \)
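The bound above is what yields d-stationarity in the limit. Assuming \(\epsilon _t\rightarrow 0\) along a subsequence with \(x_{t+1}\rightarrow x^*\), and assuming \(F'(\cdot ;d)\) is suitably lower semicontinuous along this subsequence (stated here only for illustration), taking limits gives

\[ F'(x^*;d) \;\ge \; \lim _{t\rightarrow \infty }\bigl ( -\Vert \epsilon _t\Vert \,\Vert d\Vert \bigr ) \;=\; 0 \qquad \text {for all } d\in \mathbb {R}^p, \]

which is precisely the definition of a d-stationary point of \(F\).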
1.2 A.2 Proof of Theorem 2
Because of the nonnegativity of the functions \(T_K\) and \(T_\kappa \), we have \(f(x,z)+\lambda _1T_K(x)+\lambda _2T_\kappa (z)\ge f(x,z)+\underline{\lambda }_1T_K(x)+\underline{\lambda }_2T_\kappa (z)\) for any \((x,z)\) whenever \(\lambda _1>\underline{\lambda }_1\) and \(\lambda _2>\underline{\lambda }_2\). Accordingly, we have \(S(\lambda _1,\lambda _2)\subset S(\underline{\lambda }_1,\underline{\lambda }_2)\), and \(\Vert x^*\Vert \le C_x\) and \(\Vert z^*\Vert \le C_z\).
In the rest of this proof, we only show the condition for \(\lambda _1\), since the condition for \(\lambda _2\) can be shown in the same manner. We prove the statement by contradiction. Suppose that \(\Vert x^*\Vert _0>K\) and \(x^*_{(K+1)}>0\). It follows from Assumption 2 that for any \(x_1, x_2\in \mathbb {R}^p\) and \(\tilde{z}\in \mathbb {R}^N\),
which means
Let \(\tilde{x}:=x^*-x_i^*e_i\), then the above inequality yields
where \(i=(K+1)\). Therefore, we obtain
Noting that
and (25), we have
which contradicts the optimality of \(x^*\). Similarly, we can derive the condition for \(\lambda _2\).
\(\square \)
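To illustrate why the penalty \(T_K\) enforces sparsity, assume \(T_K\) is the trimmed \(\ell _1\) penalty common in the DC sparse-optimization literature, i.e., \(\Vert x\Vert _1\) minus the sum of the \(K\) largest absolute entries (an assumption made here for illustration); it then vanishes exactly when \(\Vert x\Vert _0\le K\):

```python
import numpy as np

def trimmed_l1(x, K):
    # Trimmed l1 penalty: ||x||_1 minus the sum of the K largest
    # absolute entries. It equals zero exactly when x has at most
    # K nonzero components, which is why adding lam * trimmed_l1(x, K)
    # with a large enough lam forces K-sparsity at optimality.
    a = np.sort(np.abs(x))[::-1]   # absolute values, descending
    return a[K:].sum()             # sum of all but the K largest
```

The contradiction argument in the proof works because a solution with \(\Vert x^*\Vert _0>K\) pays a strictly positive penalty \(\lambda _1 T_K(x^*)\), which outweighs any gain in \(f\) once \(\lambda _1\) exceeds the stated threshold.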
1.3 A.3 Proof of Lemma 2
Let
with
From (28) and (29), \(x_{t+1}\) and \(z_{t+1}\) minimize \(q_x(x|x_t,z_t)\) and \(q_z(z|x_{t+1},z_t)\), respectively, and accordingly we have
Similarly to the proof of Lemma 1, we have
which implies
\(\square \)
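The alternating scheme referenced here, in which \(x\) is updated by one proximal gradient step with \(z\) fixed and then \(z\) is updated with the fresh \(x\), can be sketched generically as follows. This is a minimal illustration under assumed function handles; the fixed step sizes stand in for \(\eta _t^x, \eta _t^z\), and all names are illustrative.

```python
import numpy as np

def alternating_prox_grad(grad_x, grad_z, prox_g1, prox_g2,
                          x0, z0, eta_x, eta_z, iters=100):
    # Gauss-Seidel proximal gradient sketch: each block is updated by
    # one gradient step on the smooth coupling term followed by the
    # proximal step of its own nonsmooth regularizer, and the z-update
    # already uses the freshly updated x.
    x, z = x0.copy(), z0.copy()
    for _ in range(iters):
        x = prox_g1(x - eta_x * grad_x(x, z), eta_x)
        z = prox_g2(z - eta_z * grad_z(x, z), eta_z)
    return x, z
```

With separable blocks each step solves its own subproblem exactly, so the iteration stabilizes quickly; in general the step sizes must be chosen against the blockwise Lipschitz constants, as in the lemma.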
1.4 A.4 Proof of Lemma 3
Denoting
we can rewrite (30) as
Then we have
which implies that the sequence \(\{ F(x_{\phi (t)},z_{\phi (t)}):t\ge 0\}\) is monotonically decreasing. Therefore, since F is bounded below, there exists \(\bar{F}\) such that
By applying (32) with t replaced by \(\phi (t)-1\), we obtain
Therefore, it follows from (33) that
Since \(\eta _{\phi (t)-1}^x\), \(\eta _{\phi (t)-1}^z\ge \underline{\eta }\), we have
With (33), (34), the boundedness of the sequence, and continuity of F, we have
Next we prove by induction that the following hold for all \(j\ge 1\):
We have already observed that the results hold for \(j=1\). Suppose that (35) and (36) hold for j. From (32) with t replaced by \(\phi (t)-j-1\), we get
which ensures (36). Furthermore, it follows from \(\eta _{l}^x, \eta _{l}^z\ge \underline{\eta }\) for all \(l\) that (35) holds. Hence, we have (31). \(\square \)
Nakayama, S., Gotoh, Jy. On the superiority of PGMs to PDCAs in nonsmooth nonconvex sparse regression. Optim Lett 15, 2831–2860 (2021). https://doi.org/10.1007/s11590-021-01716-1