
Incremental Without Replacement Sampling in Nonconvex Optimization


Abstract

Minibatch decomposition methods for empirical risk minimization are commonly analyzed in a stochastic approximation setting, also known as sampling with replacement. On the other hand, modern implementations of such techniques are incremental: they rely on sampling without replacement, for which available analysis is much scarcer. We provide convergence guarantees for the latter variant by analyzing a versatile incremental gradient scheme. For this scheme, we consider constant, decreasing, or adaptive step sizes. In the smooth setting, we obtain explicit complexity estimates in terms of the epoch counter. In the nonsmooth setting, we prove that the sequence is attracted by solutions of the optimality conditions of the problem.
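The scheme analyzed in the paper is incremental in exactly this sense: each epoch visits every term of the finite sum once, in some order, rather than drawing indices independently. Since Algorithm 1 is not reproduced on this page, the following Python sketch only illustrates the general without-replacement (reshuffling) pattern; the function shuffled_incremental_gradient, its arguments, the decreasing step-size rule, and the toy least-squares example are placeholders, not the paper's exact scheme.

import numpy as np

def shuffled_incremental_gradient(grads, x0, n_epochs=50, step0=0.1):
    # grads[i](x) returns the gradient of the i-th loss term at x.
    # Each epoch visits every index exactly once in a fresh random order
    # (sampling without replacement), unlike i.i.d. stochastic gradient.
    x = np.asarray(x0, dtype=float)
    n = len(grads)
    for epoch in range(n_epochs):
        step = step0 / (epoch + 1) ** 0.5  # placeholder decreasing step size
        for i in np.random.permutation(n):
            x = x - step * grads[i](x)
    return x

# Toy usage: least squares decomposed into per-sample terms.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(50, 5)), rng.normal(size=50)
grads = [lambda x, a=A[i], bi=b[i]: (a @ x - bi) * a for i in range(50)]
x_hat = shuffled_incremental_gradient(grads, np.zeros(5))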


References

  1. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., Zheng, X.: Tensorflow: a system for large-scale machine learning. In: Symposium on Operating Systems Design and Implementation (2016)

  2. Aubin, J.P., Cellina, A.: Differential Inclusions: Set-valued Maps and Viability Theory, vol. 264. Springer, Berlin (1984)


  3. Barakat, A., Bianchi, P.: Convergence and dynamical behavior of the Adam algorithm for non convex stochastic optimization (2018). arXiv preprint arXiv:1810.02263

  4. Baydin, A., Pearlmutter, B., Radul, A., Siskind, J.: Automatic differentiation in machine learning: a survey. J. Mach. Learn. Res. 18(153), 1–43 (2018)

  5. Benaïm, M., Hirsch, M.W.: Asymptotic pseudotrajectories and chain recurrent flows, with applications. J. Dyn. Differ. Equ. 8(1), 141–176 (1996)


  6. Benaïm, M.: Dynamics of stochastic approximation algorithms. In: Séminaire de probabilités XXXIII (pp. 1–68). Springer, Berlin (1999)

  7. Benaïm, M., Hofbauer, J., Sorin, S.: Stochastic approximations and differential inclusions. SIAM J. Control Optim. 44(1), 328–348 (2005)


  8. Bertsekas, D.P.: A new class of incremental gradient methods for least squares problems. SIAM J. Optim. 7(4), 913–926 (1997)


  9. Bertsekas, D.P., Tsitsiklis, J.N.: Gradient convergence in gradient methods with errors. SIAM J. Optim. 10(3), 627–642 (2000)


  10. Bertsekas, D.P.: Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optim. Mach. Learn. 2010(1–38), 3 (2011)


  11. Bertsekas, D.P.: Convex Optimization Algorithms. Athena Scientific, Belmont (2015)


  12. Bolte, J., Daniilidis, A., Lewis, A., Shiota, M.: Clarke subgradients of stratifiable functions. SIAM J. Optim. 18(2), 556–572 (2007)


  13. Bolte, J., Sabach, S., Teboulle, M.: Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Math. Program. 146(1–2), 459–494 (2014)


  14. Bolte, J., Pauwels, E.: Conservative set valued fields, automatic differentiation, stochastic gradient methods and deep learning. Math. Program. (2020). https://doi.org/10.1007/s10107-020-01501-5

  15. Bolte, J., Pauwels, E.: A mathematical model for automatic differentiation in machine learning. In: Conference on Neural Information Processing Systems, vol. 33, pp. 10809–10819 (2020)

  16. Borkar, V.: Stochastic Approximation: A Dynamical Systems Viewpoint, vol. 48. Springer, Berlin (2009)


  17. Bottou, L., Bousquet, O.: The tradeoffs of large scale learning. In: Advances in Neural Information Processing Systems, vol. 20, pp. 161–168 (2008)

  18. Bottou, L.: Curiously fast convergence of some stochastic gradient descent algorithms. In: Proceedings of the Symposium on Learning and Data Science, Paris, vol. 8, pp. 2624–2633 (2009)

  19. Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)


  20. Castera, C., Bolte, J., Févotte, C., Pauwels E.: An inertial Newton algorithm for deep learning (2019). arXiv preprint arXiv:1905.12278

  21. Clarke, F.H.: Optimization and Nonsmooth Analysis. SIAM, Philadelphia (1983)


  22. Davis, D., Drusvyatskiy, D., Kakade, S., Lee, J.D.: Stochastic subgradient method converges on tame functions. Found. Comput. Math. 20, 119–154 (2018)


  23. Defazio, A., Jelassi, S.: Adaptivity without compromise: a momentumized, adaptive, dual averaged gradient method for stochastic optimization (2021). arXiv preprint arXiv:2101.11075

  24. Défossez, A., Bottou, L., Bach, F., Usunier, N.: On the convergence of Adam and Adagrad (2020). arXiv preprint arXiv:2003.02395

  25. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)


  26. Ekeland, I., Temam, R.: Convex Analysis and Variational Problems. SIAM, Philadelphia (1976)


  27. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: On the convergence rate of incremental aggregated gradient algorithms. SIAM J. Optim. 27(2), 1035–1048 (2017)


  28. Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. 186, 49–84 (2019)

  29. Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)


  30. Griewank, A., Walther, A.: Evaluating Derivatives: Principles And Techniques of Algorithmic Differentiation, vol. 105. SIAM, Philadelphia (2008)


  31. Kakade, S.M., Lee, J.D.: Provably correct automatic sub-differentiation for qualified programs. In: Advances in Neural Information Processing Systems, vol.31, pp. 7125–7135 (2018)

  32. Kushner, H., Yin, G.G.: Stochastic Approximation and Recursive Algorithms and Applications, vol. 35. Springer, Berlin (2003)


  33. Lan, G., Lee, S., Zhou, Y.: Communication-efficient algorithms for decentralized and stochastic optimization. Math. Program. 180, 237–284 (2018)

  34. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436–444 (2015)


  35. Li, X., Orabona, F.: On the convergence of stochastic gradient descent with adaptive stepsizes. In: International Conference on Artificial Intelligence and Statistics (2019)

  36. Ljung, L.: Analysis of recursive stochastic algorithms. IEEE Trans. Autom. Control 22(4), 551–575 (1977)


  37. Mania, H., Pan, X., Papailiopoulos, D., Recht, B., Ramchandran, K., Jordan, M.I.: Perturbed iterate analysis for asynchronous stochastic optimization. SIAM J. Optim. 27(4), 2202–2229 (2017)


  38. McMahan, B., Moore, E., Ramage, D., Hampson, S., Arcas, B.A.: Communication-efficient learning of deep networks from decentralized data. In: Artificial Intelligence and Statistics, pp. 1273–1282 (2017)

  39. Mishchenko, K., Iutzeler, F., Malick, J., Amini, M.R.: A delay-tolerant proximal-gradient algorithm for distributed learning. In: International Conference on Machine Learning, pp. 3587–3595 (2018)

  40. Mishchenko, K., Khaled Ragab Bayoumi, A., Richtárik, P.: Random reshuffling: simple analysis with vast improvements. In: Advances in Neural Information Processing Systems, vol. 33, p. 33 (2020)

  41. Moulines, E., Bach, F.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems, vol. 24, pp. 451–459 (2011)

  42. Nedic, A., Bertsekas, D.P.: Incremental subgradient methods for nondifferentiable optimization. SIAM J. Optim. 12(1), 109–138 (2001)


  43. Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Springer, Berlin (2004)


  44. Nguyen, L.M., Tran-Dinh, Q., Phan, D.T., Nguyen, P.H., van Dijk, M.: A unified convergence analysis for shuffling-type gradient methods (2020). arXiv preprint arXiv:2002.08246

  45. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in Pytorch. In: NIPS workshops (2017)

  46. Pu, S., Nedic, A.: Distributed stochastic gradient tracking methods. Math. Program. 1–49 (2020)

  47. Rajput, S., Gupta, A., Papailiopoulos, D.: Closing the convergence gap of SGD without replacement. In: International Conference on Machine Learning, pp. 7964–7973. PMLR (2020)

  48. Reddi, S.J., Sra, S., Poczos, B., Smola, A.J.: Fast incremental method for smooth nonconvex optimization. In: IEEE Conference on Decision and Control (CDC), pp. 1971–1977 (2016). https://doi.org/10.1109/CDC.2016.7798553

  49. Recht, B., Ré, C.: Toward a noncommutative arithmetic–geometric mean inequality: conjectures, case-studies, and consequences. In: Conference on Learning Theory (pp. 11-1). JMLR Workshop and Conference Proceedings (2012)

  50. Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)

  51. Royden, H.L., Fitzpatrick, P.: Real Analysis. Macmillan, New York (1988)


  52. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323, 533–536 (1986)


  53. Safran, I., Shamir, O.: How good is SGD with random shuffling? In: Conference on Learning Theory, pp. 3250–3284. PMLR (2020)

  54. Ward, R., Wu, X., Bottou, L.: Adagrad stepsizes: sharp convergence over nonconvex landscapes, from any initialization. In: International Conference on Machine Learning (2019)

  55. Ying, B., Yuan, K., Vlaski, S., Sayed, A.H.: Stochastic learning under random reshuffling with constant step-sizes. IEEE Trans. Signal Process. 67(2), 474–489 (2018)


  56. Zhang, J., Lin, H., Sra, S., Jadbabaie, A.: On complexity of finding stationary points of nonsmooth nonconvex functions (2020). arXiv preprint arXiv:2002.04130


Acknowledgements

The author acknowledges the support of ANR-3IA Artificial and Natural Intelligence Toulouse Institute, Air Force Office of Scientific Research, Air Force Material Command, USAF, under Grant Nos. FA9550-19-1-7026, FA9550-18-1-0226 and ANR MasDol. The author would like to thank the anonymous referees for their comments, which helped improve the relevance of the paper.

Author information


Corresponding author

Correspondence to Edouard Pauwels.

Additional information

Communicated by Gabriel Peyré

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

This is the appendix for “Incremental Without Replacement Sampling in Nonconvex Optimization.” We begin with the proof of the first claim of the paper.

Proof of Claim 1

Using the recursion in Algorithm 1, we have, for all \(K \in \mathbb {N}\) and \(i = 1, \ldots , n\),

$$\begin{aligned} z_{K,i} - x_K = -\sum _{j=1}^i \alpha _{K,j} d_j\left( \hat{z}_{K,j-1} \right) . \end{aligned}$$

Using Lemma A.1, we obtain

$$\begin{aligned} \Vert z_{K,i} - x_K\Vert ^2 \le i \sum _{j=1}^i \alpha _{K,j}^2 \Vert d_j\left( \hat{z}_{K,j-1} \right) \Vert ^2 \le n \sum _{j=1}^n \alpha _{K,j}^2 \Vert d_j\left( \hat{z}_{K,j-1} \right) \Vert ^2. \end{aligned}$$

Taking \(i = n\), we obtain the second inequality. The result follows for \(\hat{z}_{K, i-1}\) because it is in \(\mathrm {conv}(z_{K,j})_{j=0}^{i-1}\) and

$$\begin{aligned}&\Vert \hat{z}_{K,i-1} - x_K\Vert ^2 \le \max _{z \in \mathrm {conv}(z_{K,j})_{j=0}^{i-1}} \Vert z - x_K\Vert ^2 \\&\quad = \max _{j = 0, \ldots ,i-1} \Vert z_{K,j} - x_K\Vert ^2 \le n \sum _{j=1}^n \alpha _{K,j}^2 \Vert d_j\left( \hat{z}_{K,j-1} \right) \Vert ^2, \end{aligned}$$

where the equality in the middle holds because the maximum of a convex function over a polytope is attained at a vertex. \(\square \)
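For completeness, the vertex argument can be written out: any \(z \in \mathrm {conv}(z_{K,j})_{j=0}^{i-1}\) is of the form \(z = \sum _{j=0}^{i-1} \lambda _j z_{K,j}\) with \(\lambda _j \ge 0\) and \(\sum _{j=0}^{i-1} \lambda _j = 1\), so that, by convexity of \(\Vert \cdot - x_K \Vert ^2\),

$$\begin{aligned} \Vert z - x_K \Vert ^2 \le \sum _{j=0}^{i-1} \lambda _j \Vert z_{K,j} - x_K \Vert ^2 \le \max _{j=0,\ldots ,i-1} \Vert z_{K,j} - x_K \Vert ^2 . \end{aligned}$$

Taking the maximum over \(z\) gives the stated equality, the reverse inequality being clear since each \(z_{K,j}\) belongs to the polytope.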

A Proofs for the Smooth Setting

For all \(K \in \mathbb {N}\), we let \(\alpha _K = \alpha _{K-1,n}\), with \(\alpha _0 = \delta ^{-1/3} \ge \alpha _{0,1}\).

A.1 Analysis for Both Step Size Strategies

Claim A.1

We have for all \(K \in \mathbb {N}\),

$$\begin{aligned}&\left\langle \nabla F(x_K), x_{K+1} - x_{K} \right\rangle + \frac{1}{2n \alpha _K} \Vert x_{K+1} - x_K\Vert ^2\nonumber \\&\quad \le - \frac{n \alpha _K}{2} \Vert \nabla F(x_K) \Vert ^2 + \alpha _K L^2n^2 \sum _{j=1}^n \alpha _{K,j}^2\Vert d_j(\hat{z}_{K,j-1})\Vert ^2 \nonumber \\&\qquad + \alpha _K M^2 \sum _{i=1}^n \left( \frac{\alpha _{K,i}}{\alpha _{K}} - 1 \right) ^2. \end{aligned}$$
(19)

Proof of Claim A.1

Fix \(K \in \mathbb {N}\); we have

$$\begin{aligned} x_{K+1} - x_K = - \sum _{i=1}^n \alpha _{K,i} d_i(\hat{z}_{K,i-1}) = - \alpha _K \sum _{i=1}^n \frac{\alpha _{K,i}}{\alpha _{K}} d_i(\hat{z}_{K,i-1}) \end{aligned}$$
(20)

Recall that \(\nabla F(x_K) = \frac{1}{n} \sum _{i=1}^n d_i(x_K)\); combining this with (20), we deduce the following:

$$\begin{aligned}&\left\langle \nabla F(x_K), x_{K+1} - x_{K} \right\rangle + \frac{1}{2n \alpha _K} \Vert x_{K+1} - x_K\Vert ^2 \nonumber \\&\quad =\frac{-\alpha _K}{n} \left\langle \sum _{i=1}^n d_i(x_K), \sum _{i=1}^n \frac{\alpha _{K,i}}{\alpha _{K}} d_i(\hat{z}_{K,i-1}) \right\rangle + \frac{1}{2n \alpha _K} \Vert x_{K+1} - x_K\Vert ^2\nonumber \\&\quad = \frac{\alpha _K}{2n} \left( \left\| \sum _{i=1}^n d_i(x_K) - \sum _{i=1}^n \frac{\alpha _{K,i}}{\alpha _{K}} d_i(\hat{z}_{K,i-1}) \right\| ^2 - \left\| \sum _{i=1}^n d_i(x_K)\right\| ^2 - \left\| \sum _{i=1}^n \frac{\alpha _{K,i}}{\alpha _{K}} d_i(\hat{z}_{K,i-1}) \right\| ^2\right) \nonumber \\&\qquad + \frac{1}{2n \alpha _K} \Vert x_{K+1} - x_K\Vert ^2\nonumber \\&\quad = - \frac{n \alpha _K}{2} \Vert \nabla F(x_K) \Vert ^2 + \frac{\alpha _K}{2n} \left\| \sum _{i=1}^n d_i(x_K) - \sum _{i=1}^n \frac{\alpha _{K,i}}{\alpha _{K}} d_i(\hat{z}_{K,i-1}) \right\| ^2 \nonumber \\&\quad \le - \frac{n \alpha _K}{2} \Vert \nabla F(x_K) \Vert ^2 + \frac{\alpha _K}{n} \left( \left\| \sum _{i=1}^n d_i(x_K) - \sum _{i=1}^n d_i(\hat{z}_{K,i-1})\right\| ^2\right. \nonumber \\&\qquad + \left. \left\| \sum _{i=1}^n d_i(\hat{z}_{K,i-1}) - \sum _{i=1}^n \frac{\alpha _{K,i}}{\alpha _{K}} d_i(\hat{z}_{K,i-1})\right\| ^2\right) , \end{aligned}$$
(21)

where the first two equalities are properties of the scalar product (the second one uses \(-\left\langle u, v \right\rangle = \frac{1}{2}\left( \Vert u - v\Vert ^2 - \Vert u\Vert ^2 - \Vert v\Vert ^2 \right) \) applied to \(u = \sum _{i=1}^n d_i(x_K)\) and \(v = \sum _{i=1}^n \frac{\alpha _{K,i}}{\alpha _{K}} d_i(\hat{z}_{K,i-1})\)), the third equality uses (20) to drop canceling terms, and the last inequality uses \(\Vert a + b \Vert ^2 \le 2(\Vert a\Vert ^2 + \Vert b\Vert ^2)\). We bound each term separately. First,

$$\begin{aligned} \left\| \sum _{i=1}^n d_i(x_K) - \sum _{i=1}^n d_i(\hat{z}_{K,i-1}) \right\| ^2&\le \left( \sum _{i=1}^n \Vert d_i(x_K) -d_i(\hat{z}_{K,i-1}) \Vert \right) ^2 \nonumber \\&\le \left( \sum _{i=1}^n L_i\Vert x_K -\hat{z}_{K,i-1}\Vert \right) ^2 \nonumber \\&\le \max _{i=1,\ldots , n}\Vert x_K -\hat{z}_{K,i-1}\Vert ^2 \left( \sum _{i=1}^n L_i \right) ^2 \nonumber \\&\le L^2n^3 \sum _{j=1}^n \alpha _{K,j}^2 \Vert d_j(\hat{z}_{K,j-1})\Vert ^2. \end{aligned}$$
(22)

where the first step uses the triangle inequality, the second step uses the \(L_i\)-Lipschitz continuity of \(d_i\), the third step is Hölder's inequality, and the fourth step uses Claim 1. Furthermore, using the triangle inequality and the Cauchy–Schwarz inequality, we have

$$\begin{aligned} \left\| \sum _{i=1}^n d_i(\hat{z}_{K,i-1}) - \sum _{i=1}^n \frac{\alpha _{K,i}}{\alpha _{K}} d_i(\hat{z}_{K,i-1})\right\| ^2&\le \left( \sum _{i=1}^n \left( \frac{\alpha _{K,i}}{\alpha _{K}} - 1 \right) \Vert d_i(\hat{z}_{K,i-1})\Vert \right) ^2 \nonumber \\&\le \sum _{i=1}^n \left( \frac{\alpha _{K,i}}{\alpha _{K}} - 1 \right) ^2 \sum _{i=1}^n \Vert d_i(\hat{z}_{K,i-1})\Vert ^2\nonumber \\&\le \sum _{i=1}^n \left( \frac{\alpha _{K,i}}{\alpha _{K}} - 1 \right) ^2 \sum _{i=1}^n M_i^2 \nonumber \\&= nM^2 \sum _{i=1}^n \left( \frac{\alpha _{K,i}}{\alpha _{K}} - 1 \right) ^2. \end{aligned}$$
(23)

Combining (21), (22) and (23), we obtain,

$$\begin{aligned}&\left\langle \nabla F(x_K), x_{K+1} - x_{K} \right\rangle + \frac{1}{2n \alpha _K} \Vert x_{K+1} - x_K\Vert ^2\nonumber \\&\quad \le - \frac{n \alpha _K}{2} \Vert \nabla F(x_K) \Vert ^2 + \alpha _K L^2n^2 \sum _{j=1}^n \alpha _{K,j}^2\Vert d_j (\hat{z}_{K,j-1})\Vert ^2 + \alpha _K M^2 \sum _{i=1}^n \left( \frac{\alpha _{K,i}}{\alpha _{K}} - 1 \right) ^2, \end{aligned}$$

which is (19). \(\square \)

Claim A.2

The function F has an L-Lipschitz gradient.

Proof

For any \(x, y\), we have

$$\begin{aligned} \Vert \nabla F(x) - \nabla F(y)\Vert&= \frac{1}{n} \left\| \sum _{i=1}^n d_i(x) - d_i(y) \right\| \le \frac{1}{n} \sum _{i=1}^n \left\| d_i(x) - d_i(y) \right\| \le \frac{1}{n} \sum _{i=1}^n L_i\left\| x - y \right\| \\&= L \Vert x - y \Vert \end{aligned}$$

where we used the triangle inequality and the \(L_i\)-Lipschitz continuity of \(d_i\). \(\square \)

Proof of Claim 2

Using the smoothness of F established in Claim A.2, the descent lemma [43, Lemma 1.2.3] gives, for all \(x,y \in \mathbb {R}^p\),

$$\begin{aligned} F(y) \le F(x) + \left\langle \nabla F(x), y - x \right\rangle + \frac{L}{2} \Vert y-x \Vert ^2. \end{aligned}$$
(24)

Choosing \(y = x_{K+1}\) and \(x = x_K\) in (24) and using Claims A.1 and 1, we obtain

$$\begin{aligned} F(x_{K+1})&\le F(x_K) + \left\langle \nabla F(x_K), x_{K+1} - x_K \right\rangle + \frac{L}{2} \Vert x_{K+1} - x_K\Vert ^2\\&\le F(x_K) - \frac{n \alpha _K}{2} \Vert \nabla F(x_K) \Vert ^2 + \alpha _K L^2n^2\sum _{j=1}^n \alpha _{K,j}^2\Vert d_j(\hat{z}_{K,j-1})\Vert ^2 \\&\quad + \alpha _K M^2 \sum _{i=1}^n \left( \frac{\alpha _{K,i}}{\alpha _{K}} - 1 \right) ^2 + \left( \frac{L}{2} - \frac{1}{2n\alpha _K}\right) \Vert x_{K+1} - x_K\Vert ^2 \\&\le F(x_K) - \frac{n \alpha _K}{2} \Vert \nabla F(x_K) \Vert ^2 + \left( \alpha _K L^2n^2+ \frac{Ln}{2} - \frac{1}{2 \alpha _K}\right) \sum _{j=1}^n \alpha _{K,j}^2\Vert d_j(\hat{z}_{K,j-1})\Vert ^2 \\&\quad + \alpha _K M^2 \sum _{i=1}^n \left( \frac{\alpha _{K,i}}{\alpha _{K}} - 1 \right) ^2. \end{aligned}$$

Since \(\alpha _{K,i} \le \alpha _K\) for all \(K \in \mathbb {N}\) and \(i = 1, \ldots , n\), we have \(0 \le \alpha _{K,i} / \alpha _K \le 1\), and using \((t-1)^2 \le 1 - t^2\) for all \(t \in [0,1]\),

$$\begin{aligned} \left( \frac{\alpha _{K,i}}{\alpha _{K}} - 1 \right) ^2&\le 1 - \frac{\alpha _{K,i}^2}{\alpha _K^2}\le 1 - \frac{\alpha _{K,i}^3}{\alpha _K^3}, \end{aligned}$$

and the result follows. \(\square \)
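As a side note, the elementary inequality used in the last display can be verified directly: for \(t \in [0,1]\),

$$\begin{aligned} (t-1)^2 - \left( 1 - t^2 \right) = 2t^2 - 2t = 2t(t-1) \le 0 . \end{aligned}$$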

B Proofs for the Nonsmooth Setting

Proof of Theorem 1

Fix \(T > 0\). We consider the sequence of functions defined, for each \(k \in \mathbb {N}\), by

$$\begin{aligned} \mathbf {w}_k :[0,T]&\rightarrow \mathbb {R}^p \\ t&\mapsto \mathbf {w}(\tau _k + t). \end{aligned}$$

From Assumption 4 and Definition 5, it is clear that all functions in the sequence are M-Lipschitz. Since the sequence \((x_k)_{k \in \mathbb {N}}\) is bounded, \((\mathbf {w}_k)_{k\in \mathbb {N}}\) is also uniformly bounded; hence, by the Arzelà–Ascoli theorem [51, Chapter 10, Lemma 2], there is a uniformly converging subsequence. Let \(\mathbf {z}:[0,T] \rightarrow \mathbb {R}^p\) be any such uniform limit. By discarding terms, we may actually assume that \(\mathbf {w}_k \rightarrow \mathbf {z}\) as \(k \rightarrow \infty \), uniformly on [0, T]. Note that we have, for all \(t \in [0,T]\) and all \(\gamma >0\),

$$\begin{aligned} D^{\gamma }(\mathbf {w}_k(t)) \subset D^{\gamma + \Vert \mathbf {w}_k - \mathbf {z}\Vert _\infty }(\mathbf {z}(t)). \end{aligned}$$
(25)

For all \(k \in \mathbb {N}\), we set \(\mathbf {v}_k \in L^2([0,T], \mathbb {R}^p)\) such that \(\mathbf {v}_k = \mathbf {w}_k'\) at points where \(\mathbf {w}_k\) is differentiable (almost everywhere since it is piecewise affine). We have for all \(k \in \mathbb {N}\) and all \(s \in [0,T]\)

$$\begin{aligned} \mathbf {w}_k(s) - \mathbf {w}_k(0) = \int _{t=0}^{t=s} \mathbf {v}_k(t)\mathrm{d}t, \end{aligned}$$
(26)

and from Definition 5, we have for almost all \(t \in [0,T]\),

$$\begin{aligned} \mathbf {v}_k(t) \in - D^{\gamma (\tau _k + t)}(\mathbf {w}_k(t)). \end{aligned}$$
(27)

The functions \(\mathbf {v}_k\) are uniformly bounded thanks to Assumption 4, so the sequence \((\mathbf {v}_k)_{k\in \mathbb {N}}\) is bounded in \(L^2([0,T], \mathbb {R}^p)\) and, by the Banach–Alaoglu theorem [51, Section 15.1], it has a weak cluster point. Denote by \(\mathbf {v}\) a weak limit of \(\left( \mathbf {v}_k \right) _{k \in \mathbb {N}}\) in \(L^2([0,T], \mathbb {R}^p)\). Discarding terms, we may assume that \(\mathbf {v}_k \rightarrow \mathbf {v}\) weakly in \(L^2([0,T], \mathbb {R}^p)\) as \(k \rightarrow \infty \) and hence, passing to the limit in (26), for all \(s \in [0,T]\),

$$\begin{aligned} \mathbf {z}(s) - \mathbf {z}(0) = \int _{t=0}^{t=s} \mathbf {v}(t)\mathrm{d}t. \end{aligned}$$
(28)

By Mazur's lemma (see, for example, [26]), there exist a sequence \((N_k)_{k \in \mathbb {N}}\) with \(N_k \ge k\) and a sequence \((\tilde{\mathbf {v}}_k)_{k \in \mathbb {N}}\) with \(\tilde{\mathbf {v}}_k \in \mathrm {conv}\left( \mathbf {v}_k,\ldots , \mathbf {v}_{N_k} \right) \) for each \(k \in \mathbb {N}\), such that \(\tilde{\mathbf {v}}_k\) converges strongly to \(\mathbf {v}\) in \(L^2([0,T], \mathbb {R}^p)\), hence (up to a further subsequence) pointwise almost everywhere on [0, T]. Using (27) and the fact that a countable intersection of full-measure sets has full measure, we have for almost all \(t \in [0,T]\)

$$\begin{aligned} \mathbf {v}(t) = \lim _{k \rightarrow \infty } \tilde{\mathbf {v}}_k(t)&\in \lim _{k \rightarrow \infty } -\mathrm {conv}\left( \cup _{j=k}^{N_k} D^{\gamma (\tau _j + t)}(\mathbf {w}_j(t)) \right) \\&\subset \lim _{k \rightarrow \infty } -\mathrm {conv}\left( \cup _{j=k}^{N_k} D^{\gamma (\tau _j + t) + \Vert \mathbf {w}_j - \mathbf {z}\Vert _\infty }(\mathbf {z}(t)) \right) \\&= - \mathrm {conv}\left( \frac{1}{n} \sum _{i=1}^n D_i(\mathbf {z}(t)) \right) = -D(\mathbf {z}(t)). \end{aligned}$$

where we have used (25), the fact that \(\lim _{\gamma \rightarrow 0} D^\gamma = \frac{1}{n} \sum _{i=1}^n D_i\) pointwise (since each \(D_i\) has closed graph), and the definition of D. Using (28), this shows that, for almost all \(t \in [0,T]\),

$$\begin{aligned} \dot{\mathbf {z}}( t ) = \mathbf {v}(t) \in - D(\mathbf {z}(t)). \end{aligned}$$

Using [7, Theorem 4.1], this shows that \(\mathbf {w}\) is an asymptotic pseudotrajectory. \(\square \)

C Lemmas and Additional Proofs

Lemma A.1

Let \(a_1,\ldots , a_m\) be vectors in \(\mathbb {R}^p\). Then

$$\begin{aligned} \left\| \sum _{i=1}^m a_i \right\| ^2 \le m \sum _{i=1}^m \Vert a_i\Vert ^2. \end{aligned}$$

Proof

From the triangle inequality, we have

$$\begin{aligned} \left\| \sum _{i=1}^m a_i \right\| ^2 \le \left( \sum _{i=1}^m \Vert a_i\Vert \right) ^2. \end{aligned}$$

Hence, it suffices to prove the claim for \(p=1\). Consider the quadratic form on \(\mathbb {R}^m\)

$$\begin{aligned} Q :x \mapsto m \sum _{i=1}^m x_i^2 - \left( \sum _{i=1}^m x_i \right) ^2. \end{aligned}$$

We have

$$\begin{aligned} Q(x) = m (\Vert x\Vert ^2 - \left( x^T e \right) ^2), \end{aligned}$$

where \(e \in \mathbb {R}^m\) has unit norm and all entries equal to \(1/\sqrt{m}\). The corresponding matrix is \(m(I - e e^T)\), which is positive semidefinite since \(ee^T\) has eigenvalues in \(\{0,1\}\). This proves the result. \(\square \)
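As a quick numerical illustration (not part of the argument), the inequality of Lemma A.1 can be checked on random vectors; the snippet below is only a sanity check, with arbitrarily chosen dimensions.

import numpy as np

rng = np.random.default_rng(1)
m, p = 7, 3
a = rng.normal(size=(m, p))                       # a_1, ..., a_m in R^p
lhs = np.linalg.norm(a.sum(axis=0)) ** 2          # || sum_i a_i ||^2
rhs = m * np.sum(np.linalg.norm(a, axis=1) ** 2)  # m * sum_i ||a_i||^2
assert lhs <= rhs + 1e-12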

Lemma A.2

Let \((a_k)_{k \in \mathbb {N}}\) be a sequence of positive numbers, and let \(b,c>0\). Then, for all \(m \in \mathbb {N}\),

$$\begin{aligned} \sum _{k=0}^m \frac{a_k}{b + c \sum _{j=0}^k a_j} \le \frac{1}{c}\log \left( 1 + \frac{c}{b} \sum _{k=0}^m a_k \right) . \end{aligned}$$

Proof

We have

$$\begin{aligned} \sum _{k=0}^m \frac{a_k}{b + c \sum _{j=0}^k a_j}&= \frac{1}{ c} \sum _{k=0}^m \frac{a_k}{\frac{b}{c} + \sum _{j=0}^k a_j} \\&\le \frac{1}{ c} \log \left( 1 + \frac{c}{b} \sum _{k=0}^m a_k \right) \end{aligned}$$

where the last inequality follows from Lemma 6.2 in [24]. \(\square \)
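Again purely as a sanity check (with arbitrary positive constants), the bound of Lemma A.2 can be verified numerically:

import numpy as np

rng = np.random.default_rng(2)
a = rng.uniform(0.1, 2.0, size=20)      # positive sequence a_0, ..., a_m
b, c = 0.5, 1.7
partial = np.cumsum(a)                  # partial sums a_0 + ... + a_k
lhs = np.sum(a / (b + c * partial))
rhs = np.log1p(c * partial[-1] / b) / c
assert lhs <= rhs + 1e-12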


Cite this article

Pauwels, E. Incremental Without Replacement Sampling in Nonconvex Optimization. J Optim Theory Appl 190, 274–299 (2021). https://doi.org/10.1007/s10957-021-01883-2
