Adam revisited: a weighted past gradients perspective

  • Research Article
  • Published in Frontiers of Computer Science

Abstract

Adaptive learning rate methods have been successfully applied in many fields, especially in training deep neural networks. Recent results have shown that adaptive methods that place exponentially increasing weights on squared past gradients (i.e., ADAM, RMSPROP) may fail to converge to the optimal solution. Although many algorithms, such as AMSGRAD and ADAMNC, have been proposed to fix this non-convergence issue, achieving a data-dependent regret bound similar to or better than that of ADAGRAD remains a challenge for these methods. In this paper, we propose a novel adaptive method, the weighted adaptive algorithm (WADA), to tackle the non-convergence issue. Unlike AMSGRAD and ADAMNC, we consider a milder weighting strategy on squared past gradients, in which the weights grow linearly. Based on this idea, we propose the weighted adaptive gradient method framework (WAGMF) and implement WADA within this framework. Moreover, we prove that WADA achieves a weighted data-dependent regret bound, which can be better than the original regret bound of ADAGRAD when the gradients decrease rapidly. This bound may partially explain the good performance of ADAM in practice. Finally, extensive experiments demonstrate the effectiveness of WADA and its variants in comparison with several variants of ADAM on training convex problems and deep neural networks.
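The abstract states the key idea (linearly growing weights on squared past gradients) but not the full update rule. As a rough illustration only, the sketch below shows a minimal adaptive update in which the squared gradient at step t is weighted by t; ADAGRAD corresponds to uniform weights and ADAM to exponentially increasing ones. The momentum term, the 1/√t step-size decay, and all names here are assumptions made for illustration, not the published WADA algorithm.

```python
import numpy as np

def linearly_weighted_adaptive_update(grad_fn, x0, lr=0.1, beta1=0.9,
                                      eps=1e-8, num_steps=100):
    """Sketch of an adaptive method with linearly growing weights on
    squared past gradients (an assumption-based illustration, not the
    paper's exact WADA update)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)       # momentum (first-moment) accumulator
    v = np.zeros_like(x)       # linearly weighted sum of squared gradients
    weight_sum = 0.0           # running sum of the weights 1 + 2 + ... + t
    for t in range(1, num_steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g
        v += t * g * g                      # weight step t's squared gradient by t
        weight_sum += t
        v_hat = v / weight_sum              # normalized second-moment estimate
        x -= (lr / np.sqrt(t)) * m / (np.sqrt(v_hat) + eps)
    return x

# Usage: minimize f(x) = ||x||^2, whose gradient is 2x.
x_opt = linearly_weighted_adaptive_update(lambda x: 2.0 * x, x0=np.ones(5))
```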



Acknowledgements

We thank the anonymous reviewers for their insightful comments and discussions. This research was partially supported by grants from the National Key Research and Development Program of China (2018YFB1004300) and the National Natural Science Foundation of China (Grant Nos. 61703386, 61727809, and U1605251).

Author information


Corresponding author

Correspondence to Enhong Chen.

Additional information

Hui Zhong received the BS degree in Computer Science and Technology in 2016 from the University of Science and Technology of China (USTC), China. He is currently an ME student in the School of Computer Science and Technology at USTC, China. His major research interests include machine learning and optimization.

Zaiyi Chen received the PhD degree from University of Science and Technology of China (USTC), China in 2018. His major research interests include machine learning, optimization and sampling. He has published several papers in refereed conference proceedings, such as ICML’18, ICDM’16, SDM’15.

Chuan Qin received the BS degree in Computer Science and Technology from the University of Science and Technology of China (USTC), China in 2015. He is currently working toward the PhD degree in the School of Computer Science and Technology at USTC, China. His current research interests include natural language processing and recommender systems.

Zai Huang received the BS degree in Computer Science and Technology from the University of Science and Technology of China (USTC), China in 2016. He is currently pursuing the MS degree in Computer Application Technology at USTC, China. His current research interests include data mining and machine learning.

Vincent W. Zheng is an Adjunct Senior Research Scientist at the Advanced Digital Sciences Center (ADSC), Singapore. He received his PhD degree from the Hong Kong University of Science and Technology, China in 2011. His research interests focus on mining heterogeneous and structured data. He is an associate editor of Cognitive Computation. He has served on the program committees of many leading data mining and artificial intelligence conferences, such as KDD, IJCAI, AAAI, WWW, and WSDM. He has published over 60 papers in refereed conferences, journals, and book chapters. He is a member of AAAI and ACM.

Tong Xu received the PhD degree from the University of Science and Technology of China (USTC), China in 2016. He is currently working as an associate researcher at the Anhui Province Key Laboratory of Big Data Analysis and Application, USTC, China. He has authored more than 20 journal and conference papers on social network and social media analysis, in venues including KDD, AAAI, ICDM, and SDM.

Enhong Chen is a professor and vice dean of the School of Computer Science at USTC, China. He received the PhD degree from USTC, China. His general research areas include data mining and machine learning, social network analysis, and recommender systems. He has published more than 100 papers in refereed conferences and journals, including IEEE Trans. KDE, IEEE Trans. MC, KDD, ICDM, NIPS, and CIKM. He has served on the program committees of numerous conferences, including KDD, ICDM, and SDM. His research is supported by the National Science Foundation for Distinguished Young Scholars of China. He is a senior member of the IEEE.



About this article


Cite this article

Zhong, H., Chen, Z., Qin, C. et al. Adam revisited: a weighted past gradients perspective. Front. Comput. Sci. 14, 145309 (2020). https://doi.org/10.1007/s11704-019-8457-x
