Abstract
Minibatch decomposition methods for empirical risk minimization are commonly analyzed in a stochastic approximation setting, also known as sampling with replacement. On the other hand, modern implementations of such techniques are incremental: they rely on sampling without replacement, for which available analyses are much scarcer. We provide convergence guarantees for the latter variant by analyzing a versatile incremental gradient scheme. For this scheme, we consider constant, decreasing, and adaptive step sizes. In the smooth setting, we obtain explicit complexity estimates in terms of the epoch counter. In the nonsmooth setting, we prove that the sequence is attracted by solutions of the optimality conditions of the problem.
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: Tensorflow: a system for large-scale machine learning. In: Symposium on Operating Systems Design and Implementation (2016)
Aubin, J.P., Cellina, A.: Differential Inclusions: Set-valued Maps and Viability Theory, vol. 264. Springer, Berlin (1984)
Barakat, A., Bianchi, P.: Convergence and dynamical behavior of the Adam algorithm for non convex stochastic optimization (2018). arXiv preprint arXiv:1810.02263
Baydin, A., Pearlmutter, B., Radul, A., Siskind, J.: Automatic differentiation in machine learning: a survey. J. Mach. Learn. Res. 18(153), 1–43 (2018)
Benaïm, M., Hirsch, M.W.: Asymptotic pseudotrajectories and chain recurrent flows, with applications. J. Dyn. Differ. Equ. 8(1), 141–176 (1996)
Benaïm, M.: Dynamics of stochastic approximation algorithms. In: Séminaire de probabilités XXXIII (pp. 1–68). Springer, Berlin (1999)
Benaïm, M., Hofbauer, J., Sorin, S.: Stochastic approximations and differential inclusions. SIAM J. Control Optim. 44(1), 328–348 (2005)
Bertsekas, D.P.: A new class of incremental gradient methods for least squares problems. SIAM J. Optim. 7(4), 913–926 (1997)
Bertsekas, D.P., Tsitsiklis, J.N.: Gradient convergence in gradient methods with errors. SIAM J. Optim. 10(3), 627–642 (2000)
Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for convex optimization: a survey. In: Optimization for Machine Learning. MIT Press, Cambridge (2011)
Bertsekas, D.P.: Convex Optimization Algorithms. Athena Scientific, Belmont (2015)
Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556–572 (2007)
Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)
Bolte, J., Pauwels, E.: Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning. Math. Program. (2020). https://doi.org/10.1007/s10107-020-01501-5
Bolte, J., Pauwels, E.: A mathematical model for automatic differentiation in machine learning. In: Conference on Neural Information Processing Systems, vol. 33, pp. 10809–10819 (2020)
Borkar, V.: Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48. Springer, Berlin (2009)
Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Advances in Neural Information Processing Systems, vol. 20, pp. 161–168 (2008)
Bottou L.: Curiously fast convergence of some stochastic gradient descent algorithms. In: Proceedings of the Symposium on Learning and Data Science, Paris, vol. 8, pp. 2624–2633 (2009)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)
Castera, C., Bolte, J., Févotte, C., Pauwels E.: An inertial Newton algorithm for deep learning (2019). arXiv preprint arXiv:1905.12278
Clarke, F.H.: Optimization and Nonsmooth Analysis. SIAM, Philadelphia (1983)
Davis, D., Drusvyatskiy, D., Kakade, S., Lee, J.D.: Stochastic subgradient method converges on tame functions. Found. Comput. Math. 20, 119–154 (2018)
Defazio, A., Jelassi, S.: Adaptivity without compromise: a momentumized, adaptive, dual averaged gradient method for stochastic optimization (2021). arXiv preprint arXiv:2101.11075
Défossez, A., Bottou, L., Bach, F., Usunier, N.: On the convergence of Adam and Adagrad (2020). arXiv preprint arXiv:2003.02395
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
Ekeland, I., Temam, R.: Convex Analysis and Variational Problems. SIAM, Philadelphia (1976)
Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: On the convergence rate of incremental aggregated gradient algorithms. SIAM J. Optim. 27(2), 1035–1048 (2017)
Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. 186, 49–84 (2019)
Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)
Griewank, A., Walther, A.: Evaluating Derivatives: Principles And Techniques of Algorithmic Differentiation, vol. 105. SIAM, Philadelphia (2008)
Kakade, S.M., Lee, J.D.: Provably correct automatic sub-differentiation for qualified programs. In: Advances in Neural Information Processing Systems, vol. 31, pp. 7125–7135 (2018)
Kushner, H., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. Springer, Berlin (2003)
Lan, G., Lee, S., Zhou, Y.: Communication-efficient algorithms for decentralized and stochastic optimization. Math. Program. 180, 237–284 (2018)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)
Li, X., Orabona, F.: On the convergence of stochastic gradient descent with adaptive stepsizes. In: International Conference on Artificial Intelligence and Statistics (2019)
Ljung, L.: Analysis of recursive stochastic algorithms. IEEE Trans. Autom. Control 22(4), 551–575 (1977)
Mania, H., Pan, X., Papailiopoulos, D., Recht, B., Ramchandran, K., Jordan, M.I.: Perturbed iterate analysis for asynchronous stochastic optimization. SIAM J. Optim. 27(4), 2202–2229 (2017)
McMahan, B., Moore, E., Ramage, D., Hampson, S., Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics, pp. 1273–1282 (2017)
Mishchenko, K., Iutzeler, F., Malick, J., Amini, M.R.: A delay-tolerant proximal-gradient algorithm for distributed learning. In: International Conference on Machine Learning, pp. 3587–3595 (2018)
Mishchenko, K., Khaled Ragab Bayoumi, A., Richtárik, P.: Random reshuffling: simple analysis with vast improvements. In: Advances in Neural Information Processing Systems, vol. 33 (2020)
Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems, vol. 24, pp. 451–459 (2011)
Nedic, A., Bertsekas, D.P.: Incremental subgradient methods for nondifferentiable optimization. SIAM J. Optim. 12(1), 109–138 (2001)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, Berlin (2004)
Nguyen, L.M., Tran-Dinh, Q., Phan, D.T., Nguyen, P.H., van Dijk, M.: A unified convergence analysis for shuffling-type gradient methods (2020). arXiv preprint arXiv:2002.08246
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in Pytorch. In: NIPS workshops (2017)
Pu, S., Nedic, A.: Distributed stochastic gradient tracking methods. Math. Program. 1–49 (2020)
Rajput, S., Gupta, A., Papailiopoulos, D.: Closing the convergence gap of SGD without replacement. In: International Conference on Machine Learning, pp. 7964–7973. PMLR (2020)
Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Fast incremental method for smooth nonconvex optimization. In: IEEE Conference on Decision and Control (CDC), pp. 1971–1977 (2016). https://doi.org/10.1109/CDC.2016.7798553
Recht, B., Ré, C.: Toward a noncommutative arithmetic–geometric mean inequality: conjectures, case-studies, and consequences. In: Conference on Learning Theory. JMLR Workshop and Conference Proceedings (2012)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Royden, H.L., Fitzpatrick, P.: Real Analysis. Macmillan, New York (1988)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)
Safran, I., Shamir, O.: How good is SGD with random shuffling? In: Conference on Learning Theory, pp. 3250–3284. PMLR (2020)
Ward, R., Wu, X., Bottou, L.: Adagrad stepsizes: sharp convergence over nonconvex landscapes, from any initialization. In: International Conference on Machine Learning (2019)
Ying, B., Yuan, K., Vlaski, S., Sayed, A.H.: Stochastic learning under random reshuffling with constant step-sizes. IEEE Trans. Signal Process. 67(2), 474–489 (2018)
Zhang, J., Lin, H., Sra, S., Jadbabaie, A.: On complexity of finding stationary points of nonsmooth nonconvex functions (2020). arXiv preprint arXiv:2002.04130
Acknowledgements
The author acknowledges the support of ANR-3IA Artificial and Natural Intelligence Toulouse Institute, Air Force Office of Scientific Research, Air Force Material Command, USAF, under Grant Nos. FA9550-19-1-7026, FA9550-18-1-0226 and ANR MasDol. The author would like to thank anonymous referees for their comments which helped improve the relevance of the paper.
Communicated by Gabriel Peyré
Appendix
This is the appendix for “Incremental Without Replacement Sampling in Nonconvex Optimization.” We begin with the proof of the first claim of the paper.
Proof of Claim 1
We have for all \(K \in \mathbb{N}\) and \(i = 1, \ldots, n\), using the recursion in Algorithm 1,
Using Lemma A.1, we obtain
Taking \(i = n\), we obtain the second inequality. The result follows for \(\hat{z}_{K, i-1}\) because it is in \(\mathrm {conv}(z_{K,j})_{j=0}^{i-1}\) and
where the equality in the middle follows because the maximum of a convex function over a polyhedron is achieved at a vertex. \(\square \)
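For completeness, the vertex argument is the standard convexity bound. This is a sketch with \(x\) a generic reference point (the displayed inequality it supports is not reproduced above): writing \(\hat{z}_{K,i-1} = \sum_{j=0}^{i-1} \lambda_j z_{K,j}\) with \(\lambda_j \ge 0\) and \(\sum_{j=0}^{i-1} \lambda_j = 1\),

```latex
\bigl\Vert \hat{z}_{K,i-1} - x \bigr\Vert
= \Bigl\Vert \sum_{j=0}^{i-1} \lambda_j \left( z_{K,j} - x \right) \Bigr\Vert
\le \sum_{j=0}^{i-1} \lambda_j \bigl\Vert z_{K,j} - x \bigr\Vert
\le \max_{0 \le j \le i-1} \bigl\Vert z_{K,j} - x \bigr\Vert ,
```

where the first inequality is convexity of the norm and the second holds because the \(\lambda_j\) sum to one.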
A Proofs for the Smooth Setting
For all \(K \in \mathbb {N}\), we let \(\alpha _K = \alpha _{K-1,n}\), with \(\alpha _0 = \delta ^{-1/3} \ge \alpha _{0,1}\).
1.1 A.1 Analysis for Both Step Size Strategies
Claim A.1
We have for all \(K \in \mathbb {N}\),
Proof of Claim A.1
Fix \(K \in \mathbb{N}\); we have
Recall that \(\nabla F(x_K) = \frac{1}{n} \sum _{i=1}^n d_i(x_K)\); combining this with (20), we deduce the following
where the first two equalities are properties of the scalar product, the third equality uses (20) to drop canceling terms and the last inequality uses \(\Vert a + b \Vert ^2 \le 2(\Vert a\Vert ^2 + \Vert b\Vert ^2)\). We bound each term separately, first,
where the first step uses the triangle inequality, the second step uses the \(L_i\)-Lipschitz continuity of \(d_i\), the third step is the Hölder inequality, and the fourth step uses Claim 1. Furthermore, using the triangle inequality and the Cauchy–Schwarz inequality, we have
Combining (21), (22) and (23), we obtain,
which is (19). \(\square \)
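The elementary bound invoked in the last inequality of the proof of Claim A.1 follows from expanding the square and using Young's inequality \(2\langle a, b\rangle \le \Vert a\Vert ^2 + \Vert b\Vert ^2\):

```latex
\Vert a + b \Vert^2
= \Vert a \Vert^2 + 2 \langle a, b \rangle + \Vert b \Vert^2
\le 2 \left( \Vert a \Vert^2 + \Vert b \Vert^2 \right).
```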
Claim A.2
F has an \(L\)-Lipschitz gradient.
Proof
For any x, y, we have
where we used the triangle inequality and the \(L_i\)-Lipschitz continuity of \(d_i\). \(\square \)
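A sketch of the computation behind Claim A.2, assuming, as the proof suggests, that \(L\) denotes the average of the constants \(L_i\) (any upper bound on this average would also do):

```latex
\Vert \nabla F(x) - \nabla F(y) \Vert
= \Bigl\Vert \frac{1}{n} \sum_{i=1}^n \left( d_i(x) - d_i(y) \right) \Bigr\Vert
\le \frac{1}{n} \sum_{i=1}^n L_i \, \Vert x - y \Vert
= L \, \Vert x - y \Vert ,
\qquad L := \frac{1}{n} \sum_{i=1}^n L_i .
```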
Proof of Claim 2
Using the smoothness of F established in Claim A.2, the descent lemma [43, Lemma 1.2.3] gives, for all \(x,y \in \mathbb{R}^p\),
Choosing \(y = x_{K+1}\) and \(x = x_K\) in (24), using Claims A.1 and 1, we obtain
Since \(\alpha _{K,i} \le \alpha _K\) for all \(K \in \mathbb{N}\) and \(i = 1, \ldots, n\), we have \(0 \le \alpha _{K,i} / \alpha _K \le 1\), and using \((t-1)^2 \le 1 - t^2\) for all \(t \in [0,1]\),
and the result follows. \(\square \)
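The elementary inequality used above can be checked directly:

```latex
(t-1)^2 - \left( 1 - t^2 \right)
= t^2 - 2t + 1 - 1 + t^2
= 2t(t-1) \le 0
\qquad \text{for all } t \in [0,1].
```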
B Proofs for the Nonsmooth Setting
Proof of Theorem 1
Fix \(T > 0\); we consider, for each \(k \in \mathbb{N}\), the sequence of functions
From Assumption 4 and Definition 5, all functions in the sequence are \(M\)-Lipschitz. Since the sequence \((x_k)_{k \in \mathbb{N}}\) is bounded, \((\mathbf {w}_k)_{k\in \mathbb{N}}\) is also uniformly bounded; hence, by the Arzelà–Ascoli theorem [51, Chapter 10, Lemma 2], some subsequence converges uniformly. Let \(\mathbf {z}:[0,T] \rightarrow \mathbb{R}^p\) be any such uniform limit. By discarding terms, we may assume that \(\mathbf {w}_k \rightarrow \mathbf {z}\) as \(k \rightarrow \infty \), uniformly on [0, T]. Note that we have, for all \(t \in [0,1]\) and all \(\gamma >0\),
For all \(k \in \mathbb {N}\), we set \(\mathbf {v}_k \in L^2([0,T], \mathbb {R}^p)\) such that \(\mathbf {v}_k = \mathbf {w}_k'\) at points where \(\mathbf {w}_k\) is differentiable (almost everywhere since it is piecewise affine). We have for all \(k \in \mathbb {N}\) and all \(s \in [0,T]\)
and from Definition 5, we have for almost all \(t \in [0,T]\),
Hence, by Assumption 4, the functions \(\mathbf {v}_k\) are uniformly bounded, so the sequence \((\mathbf {v}_k)_{k\in \mathbb{N}}\) is bounded in \(L^2([0,T], \mathbb{R}^p)\) and, by the Banach–Alaoglu theorem [51, Section 15.1], it has a weak cluster point. Denote by \(\mathbf {v}\) a weak limit of \(\left( \mathbf {v}_k \right) _{k \in \mathbb{N}}\) in \(L^2([0,T], \mathbb{R}^p)\). Discarding terms, we may assume that \(\mathbf {v}_k \rightarrow \mathbf {v}\) weakly in \(L^2([0,T], \mathbb{R}^p)\) as \(k \rightarrow \infty \) and hence, passing to the limit in (26), for all \(s \in [0,T]\),
By Mazur's lemma (see, for example, [26]), there exist a sequence \((N_k)_{k \in \mathbb{N}}\) with \(N_k \ge k\) and a sequence \((\tilde{\mathbf {v}}_k)_{k \in \mathbb{N}}\) with \(\tilde{\mathbf {v}}_k \in \mathrm {conv}\left( \mathbf {v}_k,\ldots , \mathbf {v}_{N_k} \right) \) for each \(k \in \mathbb{N}\), such that \(\tilde{\mathbf {v}}_k\) converges strongly in \(L^2([0,T], \mathbb{R}^p)\), hence pointwise almost everywhere on [0, T]. Using (27) and the fact that a countable intersection of full-measure sets has full measure, we have for almost all \(t \in [0,T]\)
where we have used (25), the fact that \(\lim _{\gamma \rightarrow 0} D^\gamma = \frac{1}{n} \sum _{i=1}^n D_i\) pointwise (since each \(D_i\) has closed graph), and the definition of D. Using (28), this shows that for almost all \(t \in [0,T]\),
Using [7, Theorem 4.1], this shows that \(\mathbf {w}\) is an asymptotic pseudotrajectory. \(\square \)
C Lemmas and Additional Proofs
Lemma A.1
Let \(a_1,\ldots , a_m\) be vectors in \(\mathbb{R}^p\); then \(\bigl\Vert \sum _{i=1}^m a_i \bigr\Vert ^2 \le m \sum _{i=1}^m \Vert a_i \Vert ^2\).
Proof
From the triangle inequality, we have
Hence, it suffices to prove the claim for \(p=1\). Consider the quadratic form on \(\mathbb {R}^m\)
We have
where \(e \in \mathbb{R}^m\) is the unit-norm vector with all entries equal to \(1/\sqrt{m}\). The corresponding matrix is \(m(I - e e^T)\), which is positive semidefinite. This proves the result. \(\square \)
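Spelling out the quadratic form referred to in this proof, under the natural reading that it compares \(m\) times the sum of squares with the square of the sum (the displayed equation is missing above):

```latex
q(t) = m \sum_{i=1}^m t_i^2 - \Bigl( \sum_{i=1}^m t_i \Bigr)^2
= t^\top \left( m I - \mathbf{1}\mathbf{1}^\top \right) t
= m \, t^\top \left( I - e e^\top \right) t \ge 0 ,
```

since \(I - ee^\top \) is an orthogonal projection, hence positive semidefinite. Applied coordinatewise, this yields the scalar case of Lemma A.1.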
Lemma A.2
Let \((a_k)_{k \in \mathbb {N}}\) be a sequence of positive numbers, and \(b,c>0\). Then, for all \(m \in \mathbb {N}\)
Proof
We have
where the last inequality follows from Lemma 6.2 in [24]. \(\square \)
Pauwels, E. Incremental Without Replacement Sampling in Nonconvex Optimization. J Optim Theory Appl 190, 274–299 (2021). https://doi.org/10.1007/s10957-021-01883-2