Adam revisited: a weighted past gradients perspective

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

Adaptive learning rate methods have been successfully applied in many fields, especially in training deep neural networks. Recent results have shown that adaptive methods that place exponentially increasing weights on squared past gradients (i.e., ADAM, RMSPROP) may fail to converge to the optimal solution. Although many algorithms, such as AMSGRAD and ADAMNC, have been proposed to fix this non-convergence issue, achieving a data-dependent regret bound similar to or better than that of ADAGRAD remains a challenge for these methods. In this paper, we propose a novel adaptive method, the weighted adaptive algorithm (WADA), to tackle the non-convergence issue. Unlike AMSGRAD and ADAMNC, we consider a milder weighting strategy on squared past gradients, in which the weights grow linearly. Based on this idea, we propose the weighted adaptive gradient method framework (WAGMF) and implement WADA within this framework. Moreover, we prove that WADA achieves a weighted data-dependent regret bound, which can be better than the original regret bound of ADAGRAD when the gradients decrease rapidly. This bound may partially explain the good performance of ADAM in practice. Finally, extensive experiments demonstrate the effectiveness of WADA and its variants in comparison with several variants of ADAM on training convex problems and deep neural networks.
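The abstract states the key idea (linearly growing weights on squared past gradients) but not the full update rule. As a rough illustration only, the sketch below shows a minimal adaptive update in which the squared gradient at step t is weighted by t; ADAGRAD corresponds to uniform weights and ADAM to exponentially increasing ones. The momentum term, the 1/√t step-size decay, and all names here are assumptions made for illustration, not the published WADA algorithm.

```python
import numpy as np

def linearly_weighted_adaptive_update(grad_fn, x0, lr=0.1, beta1=0.9,
                                      eps=1e-8, num_steps=100):
    """Sketch of an adaptive method with linearly growing weights on
    squared past gradients (an assumption-based illustration, not the
    paper's exact WADA update)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)       # momentum (first-moment) accumulator
    v = np.zeros_like(x)       # linearly weighted sum of squared gradients
    weight_sum = 0.0           # running sum of the weights 1 + 2 + ... + t
    for t in range(1, num_steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        v += t * g * g                      # weight step t's squared gradient by t
        weight_sum += t
        v_hat = v / weight_sum              # normalized second-moment estimate
        x -= (lr / np.sqrt(t)) * m / (np.sqrt(v_hat) + eps)
    return x

# Usage: minimize f(x) = ||x||^2, whose gradient is 2x.
x_opt = linearly_weighted_adaptive_update(lambda x: 2.0 * x, x0=np.ones(5))
```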



Acknowledgements

We thank the anonymous reviewers for their insightful comments and discussions. This research was partially supported by grants from the National Key Research and Development Program of China (2018YFB1004300) and the National Natural Science Foundation of China (Grant Nos. 61703386, 61727809, and U1605251).

Author information


Corresponding author

Correspondence to Enhong Chen.

Additional information

Hui Zhong received the BS degree in Computer Science and Technology in 2016 from the University of Science and Technology of China (USTC), China. He is currently an ME student in the School of Computer Science and Technology at USTC, China. His major research interests include machine learning and optimization.

Zaiyi Chen received the PhD degree from University of Science and Technology of China (USTC), China in 2018. His major research interests include machine learning, optimization and sampling. He has published several papers in refereed conference proceedings, such as ICML’18, ICDM’16, SDM’15.

Chuan Qin received the BS degree in Computer Science and Technology from the University of Science and Technology of China (USTC), China in 2015. He is currently working toward the PhD degree in the School of Computer Science and Technology at USTC, China. His current research interests include natural language processing and recommender systems.

Zai Huang received the BS degree in Computer Science and Technology from the University of Science and Technology of China (USTC), China in 2016. He is currently pursuing the MS degree in Computer Application Technology at USTC, China. His current research interests include data mining and machine learning.

Vincent W. Zheng is an Adjunct Senior Research Scientist at the Advanced Digital Sciences Center (ADSC), Singapore. He received his PhD degree from the Hong Kong University of Science and Technology, China in 2011. His research interests focus on mining heterogeneous and structured data. He is an associate editor of Cognitive Computation. He has served on the program committees of many leading data mining and artificial intelligence conferences, such as KDD, IJCAI, AAAI, WWW, and WSDM. He has published over 60 papers in refereed conferences, journals, and book chapters. He is a member of AAAI and ACM.

Tong Xu received the PhD degree from the University of Science and Technology of China (USTC), China in 2016. He is currently working as an associate researcher at the Anhui Province Key Laboratory of Big Data Analysis and Application, USTC, China. He has authored more than 20 journal and conference papers on social network and social media analysis, in venues including KDD, AAAI, ICDM, and SDM.

Enhong Chen is a professor and vice dean of the School of Computer Science at USTC, China. He received the PhD degree from USTC, China. His general research areas include data mining and machine learning, social network analysis, and recommender systems. He has published more than 100 papers in refereed conferences and journals, including IEEE Trans. KDE, IEEE Trans. MC, KDD, ICDM, NIPS, and CIKM. He has served on the program committees of numerous conferences, including KDD, ICDM, and SDM. His research is supported by the National Science Foundation for Distinguished Young Scholars of China. He is a senior member of the IEEE.



About this article


Cite this article

Zhong, H., Chen, Z., Qin, C. et al. Adam revisited: a weighted past gradients perspective. Front. Comput. Sci. 14, 145309 (2020). https://doi.org/10.1007/s11704-019-8457-x
