Stochastic Gradient Descent with Polyak’s Learning Rate

Prazeres, Mariana; Oberman, Adam M.

doi:10.1007/s10915-021-01628-3

Stochastic Gradient Descent with Polyak’s Learning Rate

Published: 08 September 2021

Volume 89, article number 25, (2021)
Cite this article

Journal of Scientific Computing Aims and scope Submit manuscript

682 Accesses
9 Citations
1 Altmetric
Explore all metrics

Abstract

Stochastic gradient descent (SGD) for strongly convex functions converges at the rate \(\mathcal {O}(1/k)\). However, achieving good results in practice requires tuning the parameters (for example the learning rate) of the algorithm. In this paper we propose a generalization of the Polyak step size, used for subgradient methods, to stochastic gradient descent. We prove a non-asymptotic convergence at the rate \(\mathcal {O}(1/k)\) with a rate constant which can be better than the corresponding rate constant for optimally scheduled SGD. We demonstrate that the method is effective in practice, and on convex optimization problems and on training deep neural networks, and compare to the theoretical rate.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Recent Theoretical Advances in Non-Convex Optimization

Online and Stochastic Universal Gradient Methods for Minimizing Regularized Hölder Continuous Finite Sums in Machine Learning

Minimizing finite sums with the stochastic average gradient

Article 14 June 2016

Mark Schmidt, Nicolas Le Roux & Francis Bach

Availability of Data and Materials

The datasets analysed during the current study are available in https://www.cs.toronto.edu/~kriz/cifar.html.

Code Availability

Available at https://github.com/marianapraz/polyakSGD.

Notes

https://github.com/marianapraz/polyakSGD.

References

Agarwal, A., Wainwright, M.J., Bartlett, P.L., Ravikumar, P.K.: Information–theoretic lower bounds on the oracle complexity of convex optimization. In: Advances in Neural Information Processing Systems, pp. 1–9 (2009)
Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838 (2016)
Beck, A.: First-Order Methods in Optimization, vol. 25. SIAM, Philadelphia (2017)
Book Google Scholar
Boyd, S., Mutapcic, A.: Subgradient methods. Lecture notes of EE364b, Stanford University, Winter Quarter, 2013 (2013)
Bottou, L.: Stochastic gradient learning in neural networks. Proc. Neuro-Nımes 91(8), 12 (1991)
Google Scholar
Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011)
MathSciNet MATH Google Scholar
Hardt, M., Recht, B., Singer, Y.: Train faster, generalize better: stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240 (2015)
Hinton, G., Srivastava, N., Swersky, K.: RmsProp: divide the gradient by a running average of its recent magnitude. Neural networks for machine learning, Coursera lecture 6e (2012)
Johnson, R., Zhang, T.: Accelerating stochastic gradient descent using predictive variance reduction. In: Advances in Neural Information Processing Systems, pp. 315–323 (2013)
Kim, S., Ahn, H., Cho, S.-C.: Variable target value subgradient method. Math. Program. 49(1–3), 359–369 (1990)
Article MathSciNet Google Scholar
Kalman, R.E.: A new approach to linear filtering and prediction problems. J. Basic Eng. 82(1), 35–45 (1960)
Article MathSciNet Google Scholar
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lacoste-Julien, S., Schmidt, M., Bach, F.: A simpler approach to obtaining an o(1/t) convergence rate for the projected stochastic subgradient method. arXiv preprint arXiv:1212.2002 (2012)
Li, X., Orabona, F.: On the convergence of stochastic gradient descent with adaptive stepsizes. In: Chaudhuri, K., Sugiyama, M. (eds.) Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, Volume 89 of Proceedings of Machine Learning Research, pp. 983–992. PMLR, 16–18 April 2019
Laborde, M., Oberman, A.: A Lyapunov analysis for accelerated gradient methods: from deterministic to stochastic case. In: International Conference on Artificial Intelligence and Statistics, pp. 602–612 (2020)
Laborde, M., Oberman, A.M.: Nesterov’s method with decreasing learning rate leads to accelerated stochastic gradient descent. arXiv preprint arXiv:1908.07861 (2019)
Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course, vol. 87. Springer, Berlin (2013)
MATH Google Scholar
Osher, S., Wang, B., Yin, P., Luo, X., Barekat, F., Pham, M., Lin, A.: Laplacian smoothing gradient descent. arXiv preprint arXiv:1806.06317 (2018)
Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Comput. Math. Math. Phys. 4(5), 1–17 (1964)
Article Google Scholar
Polyak, R.A.: Introduction to Continuous Optimization, Springer Optimization and Its Applications Series, vol. 172 (2021)
Qian, X., Richtarik, P., Gower, R., Sailanbayev, A., Loizou, N., Shulgin, E.: SGD with arbitrary sampling: general analysis and improved rates. In: International Conference on Machine Learning, pp. 5200–5209 (2019)
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22(3), 400–407 (1951)
Article MathSciNet Google Scholar
Rolinek, M., Martius, G.: L4: practical loss-based stepsize adaptation for deep learning. In: Advances in Neural Information Processing Systems, pp. 6434–6444 (2018)
Raginsky, M., Rakhlin, A.: Information-based complexity, feedback and dynamics in convex programming. IEEE Trans. Inf. Theory 57(10), 7036–7056 (2011)
Article MathSciNet Google Scholar
Rakhlin, A., Shamir, O., Sridharan, K.: Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647 (2011)
Springenberg, J.T., Dosovitskiy, A., Brox, T., Riedmiller, M.: Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806 (2014)
Shor, N.Z.: Minimization Methods for Non-differentiable Functions, vol. 3. Springer, Berlin (2012)
Google Scholar
Wilson, A.C., Roelofs, R., Stern, M., Srebro, N., Recht, B.: The marginal value of adaptive gradient methods in machine learning. In: Advances in Neural Information Processing Systems, pp. 4148–4158 (2017)
Wu, X., Ward, R., Bottou, L.: Wngrad: learn the learning rate in gradient descent. arXiv preprint arXiv:1803.02865 (2018)
Wang, B., Zou, D., Gu, Q., Osher, S.J.: Laplacian smoothing stochastic gradient Markov chain Monte Carlo. SIAM J. Sci. Comput. 43(1), A26–A53 (2021)
Article MathSciNet Google Scholar
Yan, Y., Yang, Ti., Li, Z., Lin, Q., Yi, Y.: A unified analysis of stochastic momentum methods for deep learning. In: Proceedings of the 27th International Joint Conference on Artificial Intelligence, IJCAI18, pp. 2955–2961. AAAI Press (2018)
Zhou, D., Chen, J., Cao, Y., Tang, Y., Yang, Z., Gu, Q.: On the convergence of adaptive gradient methods for nonconvex optimization (2020)
Zeiler, M.D.: Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701 (2012)
Zhang, L., Mahdavi, M., Jin, R.: Linear convergence with condition number independent access of full gradients. In: Advances in Neural Information Processing Systems, pp. 980–988 (2013)

Download references

Funding

Adam Oberman was supported by the Air Force Office of Scientific Research under award number FA9550-18-1-0167 and by IVADO. Mariana Prazeres was supported by the FCT scholarship SFRH/BD/145075/2019 (M.P.).

Author information

Authors and Affiliations

Department of Mathematics and Statistics, McGill University, Montreal, Canada
Mariana Prazeres & Adam M. Oberman

Authors

Mariana Prazeres
View author publications
You can also search for this author in PubMed Google Scholar
Adam M. Oberman
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Mariana Prazeres.

Ethics declarations

Conflict of interest

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Prazeres, M., Oberman, A.M. Stochastic Gradient Descent with Polyak’s Learning Rate. J Sci Comput 89, 25 (2021). https://doi.org/10.1007/s10915-021-01628-3

Download citation

Received: 03 January 2021
Revised: 15 May 2021
Accepted: 09 August 2021
Published: 08 September 2021
DOI: https://doi.org/10.1007/s10915-021-01628-3

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Stochastic Gradient Descent with Polyak’s Learning Rate

Abstract

Access this article

Similar content being viewed by others

Recent Theoretical Advances in Non-Convex Optimization

Online and Stochastic Universal Gradient Methods for Minimizing Regularized Hölder Continuous Finite Sums in Machine Learning

Minimizing finite sums with the stochastic average gradient

Availability of Data and Materials

Code Availability

Notes

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Abstract

Access this article

Similar content being viewed by others

Recent Theoretical Advances in Non-Convex Optimization

Online and Stochastic Universal Gradient Methods for Minimizing Regularized Hölder Continuous Finite Sums in Machine Learning

Minimizing finite sums with the stochastic average gradient

Availability of Data and Materials

Code Availability

Notes

References

Funding

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation