Abstract
Adaptive learning rate methods have been successfully applied in many fields, especially in training deep neural networks. Recent results have shown that adaptive methods that place exponentially increasing weights on squared past gradients (e.g., ADAM, RMSPROP) may fail to converge to the optimal solution. Although many algorithms, such as AMSGRAD and ADAMNC, have been proposed to fix this non-convergence issue, achieving a data-dependent regret bound similar to or better than that of ADAGRAD remains a challenge for these methods. In this paper, we propose a novel adaptive method, the weighted adaptive algorithm (WADA), to tackle the non-convergence issue. Unlike AMSGRAD and ADAMNC, we adopt a milder weighting strategy on squared past gradients, in which the weights grow linearly. Based on this idea, we propose the weighted adaptive gradient method framework (WAGMF) and implement WADA within this framework. Moreover, we prove that WADA achieves a weighted data-dependent regret bound, which can be tighter than the original regret bound of ADAGRAD when the gradients decrease rapidly. This bound may partially explain the good performance of ADAM in practice. Finally, extensive experiments demonstrate the effectiveness of WADA and its variants in comparison with several variants of ADAM on convex problems and on training deep neural networks.
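As a rough illustration of the contrast the abstract draws, the sketch below compares an exponentially weighted estimate of the squared-gradient statistic (as in RMSPROP/ADAM) with a linearly weighted one in the spirit of WADA. The function names and the normalization are illustrative assumptions based only on the abstract, not the paper's exact update rules.

```python
def exp_weighted_sq(grads, beta2=0.999):
    """RMSProp/Adam-style statistic: exponentially decaying weights on
    squared past gradients, so old gradients are forgotten quickly."""
    v = 0.0
    for g in grads:
        v = beta2 * v + (1 - beta2) * g ** 2
    return v

def linearly_weighted_sq(grads):
    """WADA-style statistic (per the abstract): the weight on the t-th
    squared gradient grows linearly with t, a milder emphasis on recent
    gradients than exponential weighting. Normalizing by the weight sum
    is an illustrative choice, not taken from the paper."""
    v, weight_sum = 0.0, 0.0
    for t, g in enumerate(grads, start=1):
        v += t * g ** 2   # linear weight t on the t-th squared gradient
        weight_sum += t
    return v / weight_sum
```

With linear weights, a gradient seen later still contributes more than an earlier one of the same magnitude (e.g., `linearly_weighted_sq([0.0, 1.0]) > linearly_weighted_sq([1.0, 0.0])`), but past gradients are never discounted exponentially fast.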
Acknowledgements
We thank the anonymous reviewers for their insightful comments and discussions. This research was partially supported by grants from the National Key Research and Development Program of China (2018YFB1004300) and the National Natural Science Foundation of China (Grant Nos. 61703386, 61727809, and U1605251).
Author information
Hui Zhong received the BS degree in Computer Science and Technology in 2016 from the University of Science and Technology of China (USTC), China. He is currently an ME student in the School of Computer Science and Technology at USTC, China. His major research interests include machine learning and optimization.
Zaiyi Chen received the PhD degree from the University of Science and Technology of China (USTC), China, in 2018. His major research interests include machine learning, optimization, and sampling. He has published several papers in refereed conference proceedings, such as ICML'18, ICDM'16, and SDM'15.
Chuan Qin received the BS degree in Computer Science and Technology from the University of Science and Technology of China (USTC), China, in 2015. He is currently working toward the PhD degree in the School of Computer Science and Technology at USTC, China. His current research interests include natural language processing and recommender systems.
Zai Huang received the BS degree in Computer Science and Technology from the University of Science and Technology of China (USTC), China, in 2016. He is currently pursuing the MS degree in Computer Application Technology at USTC, China. His current research interests include data mining and machine learning.
Vincent W. Zheng is an Adjunct Senior Research Scientist at the Advanced Digital Sciences Center (ADSC), Singapore. He received the PhD degree from the Hong Kong University of Science and Technology, China, in 2011. His research interests focus on mining heterogeneous and structured data. He is an associate editor of Cognitive Computation. He has served on the program committees of many leading data mining and artificial intelligence conferences, such as KDD, IJCAI, AAAI, WWW, and WSDM. He has published over 60 papers in refereed conferences, journals, and book chapters. He is a member of AAAI and ACM.
Tong Xu received the PhD degree from the University of Science and Technology of China (USTC), China, in 2016. He is currently an associate researcher at the Anhui Province Key Laboratory of Big Data Analysis and Application, USTC, China. He has authored 20+ journal and conference papers in the fields of social network and social media analysis, including at KDD, AAAI, ICDM, and SDM.
Enhong Chen is a professor and vice dean of the School of Computer Science at USTC, China. He received the PhD degree from USTC, China. His general areas of research include data mining and machine learning, social network analysis, and recommender systems. He has published more than 100 papers in refereed conferences and journals, including IEEE Trans. KDE, IEEE Trans. MC, KDD, ICDM, NIPS, and CIKM. He has served on the program committees of numerous conferences, including KDD, ICDM, and SDM. His research is supported by the National Science Foundation for Distinguished Young Scholars of China. He is a senior member of the IEEE.
Cite this article
Zhong, H., Chen, Z., Qin, C. et al. Adam revisited: a weighted past gradients perspective. Front. Comput. Sci. 14, 145309 (2020). https://doi.org/10.1007/s11704-019-8457-x