Abstract
Nonsmoothness is often a curse for optimization, but it can also be a blessing, in particular for applications in machine learning. In this paper, we present the specific structure of nonsmooth optimization problems arising in machine learning and illustrate how to leverage this structure in practice, for compression, acceleration, or dimension reduction. We pay special attention to the presentation, keeping it concise and easily accessible, with both simple examples and general results.
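As a concrete illustration of this "blessing", consider the lasso: the proximal gradient method applies soft thresholding at every iteration, so the iterates become exactly sparse and, after finitely many iterations, identify the support of the solution. The following minimal Python sketch shows this behavior; the data, regularization level, and step size are illustrative choices, not taken from the paper.

```python
import numpy as np

def soft_threshold(x, tau):
    # Proximal operator of tau * ||.||_1: shrinks each coordinate toward
    # zero and sets small ones exactly to zero (the source of sparsity).
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[:5] = rng.standard_normal(5)        # sparse ground truth
b = A @ x_true

lam = 0.1 * np.max(np.abs(A.T @ b))        # regularization strength
step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1/L, L = ||A||_2^2 (Lipschitz const.)

# Proximal gradient (ISTA) on min_x 0.5*||Ax - b||^2 + lam*||x||_1
x = np.zeros(100)
for _ in range(500):
    grad = A.T @ (A @ x - b)
    x = soft_threshold(x - step * grad, step * lam)

# The iterate is exactly sparse: its support identifies the active structure.
print("nonzero coordinates:", np.count_nonzero(x))
```

Once the support is identified, the iterates live on a low-dimensional subspace, which is precisely what can be exploited for compression, acceleration, or dimension reduction.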
Acknowledgments
The authors warmly thank the whole DAO team, and especially our PhD students Gilles Bareilles, Mathias Chastan, Sélim Chraibi, Dmitry Grishchenko, Yu-Guan Hsieh, and Yassine Laguel. FI benefited from the support of the ANR JCJC project STROLL (ANR-19-CE23-0008). This work was partially supported by MIAI@Grenoble Alpes (ANR-19-P3IA-0003).
Cite this article
Iutzeler, F., Malick, J. Nonsmoothness in Machine Learning: Specific Structure, Proximal Identification, and Applications. Set-Valued Var. Anal 28, 661–678 (2020). https://doi.org/10.1007/s11228-020-00561-1