Abstract
This paper presents a comparative study of proximal gradient methods (PGMs) and proximal DC algorithms (PDCAs) for sparse regression problems, which can be cast as difference-of-convex (DC) optimization problems. It has been shown that for DC optimization problems, both the General Iterative Shrinkage and Thresholding algorithm (GIST), a modified version of PGM, and the PDCA converge to critical points. Recently, enhanced versions of PDCAs have been shown to converge to d-stationary points, which satisfy a stronger necessary condition for local optimality than critical points. In this paper we show that, without any modification, PGMs converge to d-stationary points not only for DC problems but also for more general nonsmooth nonconvex problems under some technical assumptions. While convergence to d-stationary points is known for the case where the step size is sufficiently small, the finding of this paper also holds for extended versions such as GIST and its alternating optimization variant, which is developed in this paper. Numerical results show that among several algorithms in the two categories, the modified versions of PGM perform best not only in solution quality but also in computation time.
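As a point of reference for the class of methods compared above, a basic proximal gradient step for \(\ell _1\)-regularized least squares can be sketched as follows. This is a minimal ISTA-style illustration with a fixed step size, not the GIST variant studied in the paper; all names and the fixed-step choice are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    # Proximal operator of tau * ||.||_1 (soft-thresholding).
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_lasso(A, b, lam, step, iters=500):
    # Basic proximal gradient method (ISTA) for
    #   min_x 0.5 * ||A x - b||^2 + lam * ||x||_1,
    # alternating a gradient step on the smooth part with the
    # proximal step on the nonsmooth l1 penalty.
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)            # gradient of the smooth part
        x = soft_threshold(x - step * grad, step * lam)
    return x
```

GIST-type methods replace the fixed step by a Barzilai–Borwein-style step with a nonmonotone line search, and PDCAs replace the gradient step by a linearization of the concave part of a DC decomposition; the proximal structure of the update is the same.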
Acknowledgements
S. Nakayama is supported in part by JSPS KAKENHI Grants 20K11698 and 20K14986. J. Gotoh is supported in part by JSPS KAKENHI Grants 19H02379, 19H00808, and 20H00285. The authors would like to thank the reviewers for their comments, which improved the quality of the manuscript.
A Appendix
1.1 A.1 Proof of Lemma 1
Let
Since \(x_{t+1}\) is a minimizer of \(Q(x;x_t) := g(x) + q(x;x_t)\), we have \(0 \in \hat{\partial }Q(x_{t+1};x_t),\) which yields
for all \(d \in \mathbb {R}^p\). Therefore, we have
which implies
Therefore, \( -\Vert \epsilon _t\Vert \Vert d\Vert \le F'(x_{t+1};d).\) \(\square \)
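The bound above is what yields d-stationarity in the limit. Assuming \(\epsilon _t\rightarrow 0\) along a subsequence with \(x_{t+1}\rightarrow x^*\), and assuming \(F'(\cdot ;d)\) is suitably lower semicontinuous along this subsequence (stated here only for illustration), taking limits gives

\[ F'(x^*;d) \;\ge \; \lim _{t\rightarrow \infty }\bigl ( -\Vert \epsilon _t\Vert \,\Vert d\Vert \bigr ) \;=\; 0 \qquad \text {for all } d\in \mathbb {R}^p, \]

which is precisely the definition of a d-stationary point of \(F\).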
1.2 A.2 Proof of Theorem 2
Because of the nonnegativity of the functions \(T_K\) and \(T_\kappa \), we have \(f(x,z)+\lambda _1T_K(x)+\lambda _2T_\kappa (z)\ge f(x,z)+\underline{\lambda }_1T_K(x)+\underline{\lambda }_2T_\kappa (z)\) for any \((x,z)\) whenever \(\lambda _1>\underline{\lambda }_1\) and \(\lambda _2>\underline{\lambda }_2\). Accordingly, we have \(S(\lambda _1,\lambda _2)\subset S(\underline{\lambda }_1,\underline{\lambda }_2)\), and \(\Vert x^*\Vert \le C_x\) and \(\Vert z^*\Vert \le C_z\).
In the rest of this proof, we only show the condition for \(\lambda _1\), since the condition for \(\lambda _2\) can be shown in the same manner. We prove the statement by contradiction. Suppose that \(\Vert x^*\Vert _0>K\) and \(x^*_{(K+1)}>0\). It follows from Assumption 2 that for any \(x_1, x_2\in \mathbb {R}^p\) and \(\tilde{z}\in \mathbb {R}^N\),
which means
Let \(\tilde{x}:=x^*-x_i^*e_i\), then the above inequality yields
where \(i=(K+1)\). Therefore, we obtain
Noting that
and (25), we have
which contradicts the optimality of \(x^*\). Similarly, we can derive the condition for \(\lambda _2\).
\(\square \)
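To illustrate why the penalty \(T_K\) enforces sparsity, assume \(T_K\) is the trimmed \(\ell _1\) penalty common in the DC sparse-optimization literature, i.e., \(\Vert x\Vert _1\) minus the sum of the \(K\) largest absolute entries (an assumption made here for illustration); it then vanishes exactly when \(\Vert x\Vert _0\le K\):

```python
import numpy as np

def trimmed_l1(x, K):
    # Trimmed l1 penalty: ||x||_1 minus the sum of the K largest
    # absolute entries. It equals zero exactly when x has at most
    # K nonzero components, which is why adding lam * trimmed_l1(x, K)
    # with a large enough lam forces K-sparsity at optimality.
    a = np.sort(np.abs(x))[::-1]   # absolute values, descending
    return a[K:].sum()             # sum of all but the K largest
```

The contradiction argument in the proof works because a solution with \(\Vert x^*\Vert _0>K\) pays a strictly positive penalty \(\lambda _1 T_K(x^*)\), which outweighs any gain in \(f\) once \(\lambda _1\) exceeds the stated threshold.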
1.3 A.3 Proof of Lemma 2
Let
with
From (28) and (29), \(x_{t+1}\) and \(z_{t+1}\) minimize \(q_x(x|x_t,z_t)\) and \(q_z(z|x_{t+1},z_t)\), respectively, and accordingly we have
Similarly to the proof of Lemma 1, we have
which implies
\(\square \)
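The alternating scheme referenced here, in which \(x\) is updated by one proximal gradient step with \(z\) fixed and then \(z\) is updated with the fresh \(x\), can be sketched generically as follows. This is a minimal illustration under assumed function handles; the fixed step sizes stand in for \(\eta _t^x, \eta _t^z\), and all names are illustrative.

```python
import numpy as np

def alternating_prox_grad(grad_x, grad_z, prox_g1, prox_g2,
                          x0, z0, eta_x, eta_z, iters=100):
    # Gauss-Seidel proximal gradient sketch: each block is updated by
    # one gradient step on the smooth coupling term followed by the
    # proximal step of its own nonsmooth regularizer, and the z-update
    # already uses the freshly updated x.
    x, z = x0.copy(), z0.copy()
    for _ in range(iters):
        x = prox_g1(x - eta_x * grad_x(x, z), eta_x)
        z = prox_g2(z - eta_z * grad_z(x, z), eta_z)
    return x, z
```

With separable blocks each step solves its own subproblem exactly, so the iteration stabilizes quickly; in general the step sizes must be chosen against the blockwise Lipschitz constants, as in the lemma.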
1.4 A.4 Proof of Lemma 3
Denoting
we can rewrite (30) as
Then we have
which implies that the sequence \(\{ F(x_{\phi (t)},z_{\phi (t)}):t\ge 0\}\) is monotonically decreasing. Therefore, since F is bounded below, there exists \(\bar{F}\) such that
By applying (32) with t replaced by \(\phi (t)-1\), we obtain
Therefore, it follows from (33) that
Since \(\eta _{\phi (t)-1}^x\), \(\eta _{\phi (t)-1}^z\ge \underline{\eta }\), we have
With (33), (34), the boundedness of the sequence, and continuity of F, we have
Next we prove by induction that the following hold for all \(j\ge 1\):
We have already observed that the results hold for \(j=1\). Suppose that (35) and (36) hold for j. From (32) with t replaced by \(\phi (t)-j-1\), we get
which ensures (36). Furthermore, it follows from \(\eta _{l}^x, \eta _{l}^z\ge \underline{\eta }\) for all \(l\) that (35) holds. Hence, we have (31). \(\square \)
Nakayama, S., Gotoh, Jy. On the superiority of PGMs to PDCAs in nonsmooth nonconvex sparse regression. Optim Lett 15, 2831–2860 (2021). https://doi.org/10.1007/s11590-021-01716-1