Universal scaling laws in the gradient descent training of neural networks
arXiv - CS - Neural and Evolutionary Computing Pub Date : 2021-05-02 , DOI: arxiv-2105.00507 Maksim Velikanov, Dmitry Yarotsky
Current theoretical results on optimization trajectories of neural networks
trained by gradient descent typically have the form of rigorous but potentially
loose bounds on the loss values. In the present work we take a different
approach and show that the learning trajectory can be characterized by an
explicit asymptotic at large training times. Specifically, the leading term in
the asymptotic expansion of the loss behaves as a power law $L(t) \sim
t^{-\xi}$ with exponent $\xi$ expressed only through the data dimension, the
smoothness of the activation function, and the class of function being
approximated. Our results are based on spectral analysis of the integral
operator representing the linearized evolution of a large network trained on
the expected loss. Importantly, the techniques we employ do not require a
specific form of the data distribution (for example, Gaussian), which makes our
findings sufficiently universal.
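The power law $L(t) \sim t^{-\xi}$ described above can be illustrated with a toy model of linearized training: under gradient flow on a quadratic loss, each eigenmode of the evolution operator decays independently, so $L(t) = \sum_k c_k^2 e^{-2\lambda_k t}$. The sketch below assumes hypothetical power-law spectra $\lambda_k \sim k^{-\nu}$ and $c_k^2 \sim k^{-\kappa}$ (stand-ins for the data-dimension-, smoothness-, and target-class-dependent quantities in the paper, not the paper's actual exponents), for which a standard Laplace-type estimate gives $\xi = (\kappa - 1)/\nu$.

```python
import numpy as np

# Toy linearized (NTK-style) training dynamics: each eigenmode k decays
# independently under gradient flow, so the expected loss is
#   L(t) = sum_k c_k^2 * exp(-2 * lam_k * t).
# The power-law spectra below are illustrative assumptions, not taken
# from the paper; for them the predicted exponent is xi = (kappa - 1) / nu.

nu, kappa = 1.0, 2.0                 # lam_k ~ k^{-nu}, c_k^2 ~ k^{-kappa}
k = np.arange(1, 100_001, dtype=float)
lam = k ** -nu
c2 = k ** -kappa

ts = np.logspace(1, 3, 30)           # large-time window t in [10, 1000]
L = np.array([(c2 * np.exp(-2.0 * lam * t)).sum() for t in ts])

# Fit the decay exponent on a log-log scale and compare with theory.
slope, _ = np.polyfit(np.log(ts), np.log(L), 1)
xi_theory = (kappa - 1.0) / nu
print(f"fitted xi = {-slope:.3f}, predicted xi = {xi_theory:.3f}")
```

With these parameters the fitted exponent comes out close to the predicted $\xi = 1$; truncating the mode sum at a finite $k$ introduces only a small bias at the largest times.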
Updated: 2021-05-04