Gradient descent optimizes over-parameterized deep ReLU networks
Machine Learning (IF 7.5), Pub Date: 2019-10-23, DOI: 10.1007/s10994-019-05839-6
Difan Zou, Yuan Cao, Dongruo Zhou, Quanquan Gu

We study the problem of training deep fully connected neural networks with the Rectified Linear Unit (ReLU) activation function and the cross-entropy loss function for binary classification using gradient descent. We show that with proper random weight initialization, gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under certain assumptions on the training data. The key idea of our proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU network enjoys nice local curvature properties that ensure the global convergence of gradient descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for gradient descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work along this line (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient descent finds global minima of deep neural networks, 2018a), our result relies on a milder over-parameterization condition on the neural network width, and enjoys a faster global convergence rate of gradient descent for training deep neural networks.
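A minimal numerical sketch (not the authors' code) of the setting described above: a wide, fully connected ReLU network with Gaussian random initialization, trained on a synthetic binary-classification task with the cross-entropy loss and full-batch gradient descent, while tracking how far the weights move from their initialization. The width, depth, learning rate, and data below are illustrative assumptions, not values from the paper.

```python
# Sketch: gradient descent on an over-parameterized deep ReLU network,
# monitoring the distance of the iterates from the Gaussian initialization.
import torch
import torch.nn as nn

torch.manual_seed(0)

n, d, width, depth = 64, 10, 2048, 3            # samples, input dim, hidden width, hidden layers
X = torch.randn(n, d)
y = (X[:, 0] > 0).float()                       # synthetic binary labels in {0, 1}

layers, in_dim = [], d
for _ in range(depth):
    layers += [nn.Linear(in_dim, width), nn.ReLU()]
    in_dim = width
layers += [nn.Linear(in_dim, 1)]
net = nn.Sequential(*layers)

# Gaussian random initialization (He-style scaling for ReLU layers).
for m in net.modules():
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, std=(2.0 / m.in_features) ** 0.5)
        nn.init.zeros_(m.bias)

w0 = torch.cat([p.detach().flatten() for p in net.parameters()])
loss_fn = nn.BCEWithLogitsLoss()                # cross-entropy loss for binary classification
opt = torch.optim.SGD(net.parameters(), lr=0.1)  # plain full-batch gradient descent

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(net(X).squeeze(-1), y)
    loss.backward()
    opt.step()
    if step % 50 == 0:
        wt = torch.cat([p.detach().flatten() for p in net.parameters()])
        rel_move = ((wt - w0).norm() / w0.norm()).item()
        # For sufficiently wide networks the iterates stay close to initialization,
        # matching the "small perturbation region" picture in the abstract.
        print(f"step {step:3d}  loss {loss.item():.4f}  ||w_t - w_0|| / ||w_0|| = {rel_move:.4f}")
```

Increasing `width` in this sketch should drive the relative movement ‖w_t − w_0‖/‖w_0‖ down while the training loss still decreases, which is the empirical counterpart of the perturbation-region argument; the sketch only illustrates the phenomenon and does not reproduce the paper's over-parameterization condition or convergence rate.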

Updated: 2019-10-23