Abstract
Minibatch decomposition methods for empirical risk minimization are commonly analyzed in a stochastic approximation setting, also known as sampling with replacement. On the other hand, modern implementations of such techniques are incremental: they rely on sampling without replacement, for which available analyses are much scarcer. We provide convergence guarantees for the latter variant by analyzing a versatile incremental gradient scheme. For this scheme, we consider constant, decreasing, and adaptive step sizes. In the smooth setting, we obtain explicit complexity estimates in terms of the epoch counter. In the nonsmooth setting, we prove that the sequence is attracted by solutions of the optimality conditions of the problem.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: Tensorflow: a system for large-scale machine learning. In: Symposium on Operating Systems Design and Implementation (2016)
Aubin, J.P., Cellina, A.: Differential Inclusions: Set-valued Maps and Viability Theory, vol. 264. Springer, Berlin (1984)
Barakat, A., Bianchi, P.: Convergence and dynamical behavior of the Adam algorithm for non convex stochastic optimization (2018). arXiv preprint arXiv:1810.02263
Baydin, A., Pearlmutter, B., Radul, A., Siskind, J.: Automatic differentiation in machine learning: a survey. J. Mach. Learn. Res. 18(153), 1–43 (2018)
Benaïm, M., Hirsch, M.W.: Asymptotic pseudotrajectories and chain recurrent flows, with applications. J. Dyn. Differ. Equ. 8(1), 141–176 (1996)
Benaïm, M.: Dynamics of stochastic approximation algorithms. In: Séminaire de probabilités XXXIII (pp. 1–68). Springer, Berlin (1999)
Benaïm, M., Hofbauer, J., Sorin, S.: Stochastic approximations and differential inclusions. SIAM J. Control Optim. 44(1), 328–348 (2005)
Bertsekas, D.P.: A new class of incremental gradient methods for least squares problems. SIAM J. Optim. 7(4), 913–926 (1997)
Bertsekas, D.P., Tsitsiklis, J.N.: Gradient convergence in gradient methods with errors. SIAM J. Optim. 10(3), 627–642 (2000)
Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. In: Optimization for Machine Learning. MIT Press, Cambridge (2011)
Bertsekas, D.P.: Convex Optimization Algorithms. Athena Scientific, Belmont (2015)
Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556–572 (2007)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)
Bolte, J., Pauwels, E.: Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning. Math. Program. (2020). https://doi.org/10.1007/s10107-020-01501-5
Bolte, J., Pauwels, E.: A mathematical model for automatic differentiation in machine learning. In: Conference on Neural Information Processing Systems, vol. 33, pp. 10809–10819 (2020)
Borkar, V.: Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48. Springer, Berlin (2009)
Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Advances in Neural Information Processing Systems, vol. 20, pp. 161–168 (2008)
Bottou L.: Curiously fast convergence of some stochastic gradient descent algorithms. In: Proceedings of the Symposium on Learning and Data Science, Paris, vol. 8, pp. 2624–2633 (2009)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Castera, C., Bolte, J., Févotte, C., Pauwels E.: An inertial Newton algorithm for deep learning (2019). arXiv preprint arXiv:1905.12278
Clarke, F.H.: Optimization and Nonsmooth Analysis. SIAM, Philadelphia (1983)
Davis, D., Drusvyatskiy, D., Kakade, S., Lee, J.D.: Stochastic subgradient method converges on tame functions. Found. Comput. Math. 20, 119–154 (2018)
Defazio, A., Jelassi, S.: Adaptivity without compromise: a momentumized, adaptive, dual averaged gradient method for stochastic optimization (2021). arXiv preprint arXiv:2101.11075
Défossez, A., Bottou, L., Bach, F., Usunier, N.: On the convergence of Adam and Adagrad (2020). arXiv preprint arXiv:2003.02395
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Ekeland, I., Temam, R.: Convex Analysis and Variational Problems. SIAM, Philadelphia (1976)
Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: On the convergence rate of incremental aggregated gradient algorithms. SIAM J. Optim. 27(2), 1035–1048 (2017)
Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. 186, 49–84 (2019)
Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)
Griewank, A., Walther, A.: Evaluating Derivatives: Principles And Techniques of Algorithmic Differentiation, vol. 105. SIAM, Philadelphia (2008)
Kakade, S.M., Lee, J.D.: Provably correct automatic sub-differentiation for qualified programs. In: Advances in Neural Information Processing Systems, vol. 31, pp. 7125–7135 (2018)
Kushner, H., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. Springer, Berlin (2003)
Lan, G., Lee, S., Zhou, Y.: Communication-efficient algorithms for decentralized and stochastic optimization. Math. Program. 180, 237–284 (2018)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Li, X., Orabona, F.: On the convergence of stochastic gradient descent with adaptive stepsizes. In: International Conference on Artificial Intelligence and Statistics (2019)
Ljung, L.: Analysis of recursive stochastic algorithms. IEEE Trans. Autom. Control 22(4), 551–575 (1977)
Mania, H., Pan, X., Papailiopoulos, D., Recht, B., Ramchandran, K., Jordan, M.I.: Perturbed iterate analysis for asynchronous stochastic optimization. SIAM J. Optim. 27(4), 2202–2229 (2017)
McMahan, B., Moore, E., Ramage, D., Hampson, S., Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics, pp. 1273–1282 (2017)
Mishchenko, K., Iutzeler, F., Malick, J., Amini, M.R.: A delay-tolerant proximal-gradient algorithm for distributed learning. In: International Conference on Machine Learning, pp. 3587–3595 (2018)
Mishchenko, K., Khaled Ragab Bayoumi, A., Richtárik, P.: Random reshuffling: simple analysis with vast improvements. In: Advances in Neural Information Processing Systems, vol. 33 (2020)
Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems, vol. 24, pp. 451–459 (2011)
Nedic, A., Bertsekas, D.P.: Incremental subgradient methods for nondifferentiable optimization. SIAM J. Optim. 12(1), 109–138 (2001)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, Berlin (2004)
Nguyen, L.M., Tran-Dinh, Q., Phan, D.T., Nguyen, P.H., van Dijk, M.: A unified convergence analysis for shuffling-type gradient methods (2020). arXiv preprint arXiv:2002.08246
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in Pytorch. In: NIPS workshops (2017)
Pu, S., Nedic, A.: Distributed stochastic gradient tracking methods. Math. Program. 1–49 (2020)
Rajput, S., Gupta, A., Papailiopoulos, D.: Closing the convergence gap of SGD without replacement. In: International Conference on Machine Learning, pp. 7964–7973. PMLR (2020)
Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Fast incremental method for smooth nonconvex optimization. In: IEEE Conference on Decision and Control (CDC), pp. 1971–1977 (2016). https://doi.org/10.1109/CDC.2016.7798553
Recht, B., Ré, C.: Toward a noncommutative arithmetic–geometric mean inequality: conjectures, case-studies, and consequences. In: Conference on Learning Theory. JMLR Workshop and Conference Proceedings (2012)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Royden, H.L., Fitzpatrick, P.: Real Analysis. Macmillan, New York (1988)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
Safran, I., Shamir, O.: How good is SGD with random shuffling? In: Conference on Learning Theory, pp. 3250–3284. PMLR (2020)
Ward, R., Wu, X., Bottou, L.: Adagrad stepsizes: sharp convergence over nonconvex landscapes, from any initialization. In: International Conference on Machine Learning (2019)
Ying, B., Yuan, K., Vlaski, S., Sayed, A.H.: Stochastic learning under random reshuffling with constant step-sizes. IEEE Trans. Signal Process. 67(2), 474–489 (2018)
Zhang, J., Lin, H., Sra, S., Jadbabaie, A.: On complexity of finding stationary points of nonsmooth nonconvex functions (2020). arXiv preprint arXiv:2002.04130
Acknowledgements
The author acknowledges the support of ANR-3IA Artificial and Natural Intelligence Toulouse Institute, Air Force Office of Scientific Research, Air Force Material Command, USAF, under Grant Nos. FA9550-19-1-7026, FA9550-18-1-0226 and ANR MasDol. The author would like to thank anonymous referees for their comments which helped improve the relevance of the paper.
Communicated by Gabriel Peyré
Appendix
This is the appendix for “Incremental Without Replacement Sampling in Nonconvex Optimization.” We begin with the proof of the first claim of the paper.
Proof of Claim 1
We have for all \(K \in \mathbb{N}\) and \(i = 1, \ldots, n\), using the recursion in Algorithm 1,
Using Lemma A.1, we obtain
Taking \(i = n\), we obtain the second inequality. The result follows for \(\hat{z}_{K, i-1}\) because it is in \(\mathrm {conv}(z_{K,j})_{j=0}^{i-1}\) and
where the equality in the middle follows because the maximum of a convex function over a polyhedron is achieved at a vertex. \(\square \)
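For completeness, the vertex argument is the standard convexity bound. This is a sketch with \(x\) a generic reference point (the displayed inequality it supports is not reproduced above): writing \(\hat{z}_{K,i-1} = \sum_{j=0}^{i-1} \lambda_j z_{K,j}\) with \(\lambda_j \ge 0\) and \(\sum_{j=0}^{i-1} \lambda_j = 1\),

```latex
\bigl\Vert \hat{z}_{K,i-1} - x \bigr\Vert
= \Bigl\Vert \sum_{j=0}^{i-1} \lambda_j \left( z_{K,j} - x \right) \Bigr\Vert
\le \sum_{j=0}^{i-1} \lambda_j \bigl\Vert z_{K,j} - x \bigr\Vert
\le \max_{0 \le j \le i-1} \bigl\Vert z_{K,j} - x \bigr\Vert ,
```

where the first inequality is convexity of the norm and the second holds because the \(\lambda_j\) sum to one.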
A Proofs for the Smooth Setting
For all \(K \in \mathbb {N}\), we let \(\alpha _K = \alpha _{K-1,n}\), with \(\alpha _0 = \delta ^{-1/3} \ge \alpha _{0,1}\).
1.1 A.1 Analysis for Both Step Size Strategies
Claim A.1
We have for all \(K \in \mathbb {N}\),
Proof of Claim A.1
Fix \(K \in \mathbb{N}\); we have
Recall that \(\nabla F(x_K) = \frac{1}{n} \sum _{i=1}^n d_i(x_K)\); combining this with (20), we deduce the following
where the first two equalities are properties of the scalar product, the third equality uses (20) to drop canceling terms and the last inequality uses \(\Vert a + b \Vert ^2 \le 2(\Vert a\Vert ^2 + \Vert b\Vert ^2)\). We bound each term separately, first,
where the first step uses the triangle inequality, the second step uses the \(L_i\)-Lipschitz continuity of \(d_i\), the third step is the Hölder inequality, and the fourth step uses Claim 1. Furthermore, using the triangle inequality and the Cauchy–Schwarz inequality, we have
Combining (21), (22) and (23), we obtain,
which is (19). \(\square \)
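The elementary bound invoked in the last inequality of the proof of Claim A.1 follows from expanding the square and using Young's inequality \(2\langle a, b\rangle \le \Vert a\Vert ^2 + \Vert b\Vert ^2\):

```latex
\Vert a + b \Vert^2
= \Vert a \Vert^2 + 2 \langle a, b \rangle + \Vert b \Vert^2
\le 2 \left( \Vert a \Vert^2 + \Vert b \Vert^2 \right).
```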
Claim A.2
F has an \(L\)-Lipschitz gradient.
Proof
For any x, y, we have
where we used the triangle inequality and the \(L_i\)-Lipschitz continuity of \(d_i\). \(\square \)
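A sketch of the computation behind Claim A.2, assuming, as the proof suggests, that \(L\) denotes the average of the constants \(L_i\) (any upper bound on this average would also do):

```latex
\Vert \nabla F(x) - \nabla F(y) \Vert
= \Bigl\Vert \frac{1}{n} \sum_{i=1}^n \left( d_i(x) - d_i(y) \right) \Bigr\Vert
\le \frac{1}{n} \sum_{i=1}^n L_i \, \Vert x - y \Vert
= L \, \Vert x - y \Vert ,
\qquad L := \frac{1}{n} \sum_{i=1}^n L_i .
```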
Proof of Claim 2
Using the smoothness of F established in Claim A.2, the descent lemma [43, Lemma 1.2.3] gives, for all \(x,y \in \mathbb{R}^p\),
Choosing \(y = x_{K+1}\) and \(x = x_K\) in (24), using Claims A.1 and 1, we obtain
Since \(\alpha _{K,i} \le \alpha _K\) for all \(K \in \mathbb{N}\) and \(i = 1, \ldots, n\), we have \(0 \le \alpha _{K,i} / \alpha _K \le 1\), and using \((t-1)^2 \le 1 - t^2\) for all \(t \in [0,1]\),
and the result follows. \(\square \)
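The elementary inequality used above can be checked directly:

```latex
(t-1)^2 - \left( 1 - t^2 \right)
= t^2 - 2t + 1 - 1 + t^2
= 2t(t-1) \le 0
\qquad \text{for all } t \in [0,1].
```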
B Proofs for the Nonsmooth Setting
Proof of Theorem 1
Fix \(T > 0\); we consider, for each \(k \in \mathbb{N}\), the sequence of functions
From Assumption 4 and Definition 5, all functions in the sequence are \(M\)-Lipschitz. Since the sequence \((x_k)_{k \in \mathbb{N}}\) is bounded, \((\mathbf {w}_k)_{k\in \mathbb{N}}\) is also uniformly bounded; hence, by the Arzelà–Ascoli theorem [51, Chapter 10, Lemma 2], some subsequence converges uniformly. Let \(\mathbf {z}:[0,T] \rightarrow \mathbb{R}^p\) be any such uniform limit. By discarding terms, we may assume that \(\mathbf {w}_k \rightarrow \mathbf {z}\) as \(k \rightarrow \infty \), uniformly on [0, T]. Note that we have, for all \(t \in [0,1]\) and all \(\gamma >0\),
For all \(k \in \mathbb {N}\), we set \(\mathbf {v}_k \in L^2([0,T], \mathbb {R}^p)\) such that \(\mathbf {v}_k = \mathbf {w}_k'\) at points where \(\mathbf {w}_k\) is differentiable (almost everywhere since it is piecewise affine). We have for all \(k \in \mathbb {N}\) and all \(s \in [0,T]\)
and from Definition 5, we have for almost all \(t \in [0,T]\),
Hence, by Assumption 4, the functions \(\mathbf {v}_k\) are uniformly bounded, so the sequence \((\mathbf {v}_k)_{k\in \mathbb{N}}\) is bounded in \(L^2([0,T], \mathbb{R}^p)\) and, by the Banach–Alaoglu theorem [51, Section 15.1], it has a weak cluster point. Denote by \(\mathbf {v}\) a weak limit of \(\left( \mathbf {v}_k \right) _{k \in \mathbb{N}}\) in \(L^2([0,T], \mathbb{R}^p)\). Discarding terms, we may assume that \(\mathbf {v}_k \rightarrow \mathbf {v}\) weakly in \(L^2([0,T], \mathbb{R}^p)\) as \(k \rightarrow \infty \) and hence, passing to the limit in (26), for all \(s \in [0,T]\),
By Mazur's lemma (see, for example, [26]), there exist a sequence \((N_k)_{k \in \mathbb{N}}\) with \(N_k \ge k\) and a sequence \((\tilde{\mathbf {v}}_k)_{k \in \mathbb{N}}\) with \(\tilde{\mathbf {v}}_k \in \mathrm {conv}\left( \mathbf {v}_k,\ldots , \mathbf {v}_{N_k} \right) \) for each \(k \in \mathbb{N}\), such that \(\tilde{\mathbf {v}}_k\) converges strongly in \(L^2([0,T], \mathbb{R}^p)\), hence pointwise almost everywhere on [0, T]. Using (27) and the fact that a countable intersection of full-measure sets has full measure, we have for almost all \(t \in [0,T]\)
where we have used (25), the fact that \(\lim _{\gamma \rightarrow 0} D^\gamma = \frac{1}{n} \sum _{i=1}^n D_i\) pointwise (since each \(D_i\) has closed graph), and the definition of D. Using (28), this shows that for almost all \(t \in [0,T]\),
Using [7, Theorem 4.1], this shows that \(\mathbf {w}\) is an asymptotic pseudotrajectory. \(\square \)
C Lemmas and Additional Proofs
Lemma A.1
Let \(a_1,\ldots , a_m\) be vectors in \(\mathbb{R}^p\); then \(\bigl\Vert \sum _{i=1}^m a_i \bigr\Vert ^2 \le m \sum _{i=1}^m \Vert a_i \Vert ^2\).
Proof
From the triangle inequality, we have
Hence, it suffices to prove the claim for \(p=1\). Consider the quadratic form on \(\mathbb {R}^m\)
We have
where \(e \in \mathbb{R}^m\) is the unit-norm vector with all entries equal to \(1/\sqrt{m}\). The corresponding matrix is \(m(I - e e^T)\), which is positive semidefinite. This proves the result. \(\square \)
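Spelling out the quadratic form referred to in this proof, under the natural reading that it compares \(m\) times the sum of squares with the square of the sum (the displayed equation is missing above):

```latex
q(t) = m \sum_{i=1}^m t_i^2 - \Bigl( \sum_{i=1}^m t_i \Bigr)^2
= t^\top \left( m I - \mathbf{1}\mathbf{1}^\top \right) t
= m \, t^\top \left( I - e e^\top \right) t \ge 0 ,
```

since \(I - ee^\top \) is an orthogonal projection, hence positive semidefinite. Applied coordinatewise, this yields the scalar case of Lemma A.1.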
Lemma A.2
Let \((a_k)_{k \in \mathbb {N}}\) be a sequence of positive numbers, and \(b,c>0\). Then, for all \(m \in \mathbb {N}\)
Proof
We have
where the last inequality follows from Lemma 6.2 in [24]. \(\square \)
Pauwels, E. Incremental Without Replacement Sampling in Nonconvex Optimization. J Optim Theory Appl 190, 274–299 (2021). https://doi.org/10.1007/s10957-021-01883-2