Non-convergence of stochastic gradient descent in the training of deep neural networks
Journal of Complexity (IF 1.8), Pub Date: 2020-11-27, DOI: 10.1016/j.jco.2020.101540
Patrick Cheridito, Arnulf Jentzen, Florian Rossmannek

Deep neural networks have been trained successfully with stochastic gradient descent in various application areas. However, there exists no rigorous mathematical explanation of why this works so well. The training of neural networks with stochastic gradient descent has four different discretization parameters: (i) the network architecture; (ii) the amount of training data; (iii) the number of gradient steps; and (iv) the number of randomly initialized gradient trajectories. While it can be shown that the approximation error converges to zero if all four parameters are sent to infinity in the right order, we demonstrate in this paper that stochastic gradient descent fails to converge for ReLU networks if their depth is much larger than their width and the number of random initializations does not increase to infinity fast enough.
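
The phenomenon described in the abstract can be observed numerically. The following is a minimal sketch, not taken from the article: it trains a deep, narrow ReLU network on a toy one-dimensional regression problem with plain SGD from a single random initialization. The depth, width, learning rate, and target function are illustrative assumptions. When the depth far exceeds the width, the randomly initialized network typically realizes a (nearly) constant function, the gradients vanish, and the loss does not decrease.

```python
# Minimal sketch (illustrative, not from the paper): a deep, narrow ReLU
# network trained with plain SGD from one random initialization on a toy
# 1-D regression task. Depth, width, learning rate, and target are assumed.
import torch
import torch.nn as nn

torch.manual_seed(0)

depth, width = 50, 5              # depth much larger than width
layers, in_dim = [], 1
for _ in range(depth):
    layers += [nn.Linear(in_dim, width), nn.ReLU()]
    in_dim = width
layers.append(nn.Linear(in_dim, 1))
net = nn.Sequential(*layers)

x = torch.rand(256, 1)            # training inputs in [0, 1]
y = x ** 2                        # toy target function

opt = torch.optim.SGD(net.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for step in range(2001):
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  loss {loss.item():.6f}")

# With depth >> width, the ReLU activations typically die in some hidden
# layer at initialization, the network output is constant in x, all
# gradients vanish, and the printed loss stays flat.
```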




Updated: 2020-11-27