On the convergence and improvement of stochastic normalized gradient descent
Science China Information Sciences ( IF 7.3 ) Pub Date : 2021-02-08 , DOI: 10.1007/s11432-020-3023-7
Shen-Yi Zhao , Yin-Peng Xie , Wu-Jun Li

Non-convex models, such as deep neural networks, are widely used in machine learning applications. Training non-convex models is difficult owing to the saddle points in their loss landscapes. Recently, stochastic normalized gradient descent (SNGD), which updates the model parameters with a normalized gradient in each iteration, has attracted much attention. Existing results show that SNGD can escape saddle points more effectively than classical training methods such as stochastic gradient descent (SGD). However, no existing study provides a theoretical proof of the convergence of SNGD for non-convex problems. In this paper, we first prove the convergence of SNGD for non-convex problems. In particular, we prove that SNGD achieves the same computational complexity as SGD. Furthermore, based on our convergence proof, we find that SNGD must adopt a small constant learning rate to guarantee convergence, which makes it perform poorly when training large non-convex models in practice. Hence, we propose a new method, called stagewise SNGD (S-SNGD), to improve the performance of SNGD. Unlike SNGD, which requires a small constant learning rate for its convergence guarantee, S-SNGD can adopt a large initial learning rate and reduce it stage by stage. The convergence of S-SNGD can also be proved theoretically for non-convex problems. Empirical results on deep neural networks show that S-SNGD outperforms SNGD in terms of both training loss and test accuracy.
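The abstract describes the two procedures at a high level: SNGD takes a step of fixed length (the learning rate) along the normalized stochastic gradient, while S-SNGD runs SNGD in stages, starting from a large learning rate and shrinking it after each stage. The following is a minimal Python sketch of both ideas on a toy non-convex objective; the toy data, the decay factor, and the stage lengths are illustrative assumptions, not the exact algorithmic details or parameter settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-convex objective: mean squared error of a sigmoid model.
X = rng.normal(size=(1000, 10))
y = (X @ rng.normal(size=10) > 0).astype(float)

def stochastic_grad(w, batch_size=32):
    """Mini-batch gradient of the squared loss through a sigmoid."""
    idx = rng.integers(0, len(X), size=batch_size)
    z = X[idx] @ w
    p = 1.0 / (1.0 + np.exp(-z))
    residual = p - y[idx]
    return X[idx].T @ (2.0 * residual * p * (1.0 - p)) / batch_size

def sngd(w, lr, n_iters):
    """SNGD sketch: move a fixed distance lr along the normalized gradient."""
    for _ in range(n_iters):
        g = stochastic_grad(w)
        norm = np.linalg.norm(g)
        if norm > 0:
            w = w - lr * g / norm
    return w

def stagewise_sngd(w, lr0, n_stages, iters_per_stage, decay=0.1):
    """S-SNGD sketch: large initial learning rate, reduced stage by stage."""
    lr = lr0
    for _ in range(n_stages):
        w = sngd(w, lr, iters_per_stage)
        lr *= decay  # assumed geometric decay; the paper's schedule may differ
    return w

w0 = np.zeros(10)
w_sngd = sngd(w0.copy(), lr=1e-3, n_iters=3000)            # small constant lr
w_ssngd = stagewise_sngd(w0.copy(), lr0=1e-1,
                         n_stages=3, iters_per_stage=1000)  # decayed lr
```

The sketch only illustrates the update rules: normalizing the gradient keeps the step length fixed regardless of gradient magnitude, and the stagewise wrapper is what allows a large initial learning rate to be used without sacrificing the convergence guarantee.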




Updated: 2021-02-15