Universal scaling laws in the gradient descent training of neural networks
arXiv - CS - Neural and Evolutionary Computing. Pub Date: 2021-05-02, DOI: arxiv-2105.00507
Maksim Velikanov, Dmitry Yarotsky

Current theoretical results on optimization trajectories of neural networks trained by gradient descent typically have the form of rigorous but potentially loose bounds on the loss values. In the present work we take a different approach and show that the learning trajectory can be characterized by an explicit asymptotic form at large training times. Specifically, the leading term in the asymptotic expansion of the loss behaves as a power law $L(t) \sim t^{-\xi}$ with exponent $\xi$ expressed only through the data dimension, the smoothness of the activation function, and the class of functions being approximated. Our results are based on spectral analysis of the integral operator representing the linearized evolution of a large network trained on the expected loss. Importantly, the techniques we employ do not require a specific form of the data distribution, for example Gaussian, thus making our findings sufficiently universal.
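To make the power-law claim concrete, here is a minimal numerical sketch, not the paper's construction: under a linearized (kernel-style) picture, gradient-flow training decomposes over eigenmodes of an integral operator, and the expected loss is a sum of exponentially decaying mode contributions. The power-law spectra chosen below (exponents `nu`, `kappa`) are illustrative assumptions, not values derived in the paper; the snippet simply shows how a loss of this form exhibits $L(t) \sim t^{-\xi}$ and how the exponent can be estimated from a log-log fit.

```python
import numpy as np

# Toy spectral model (illustrative assumption, not the paper's setup):
#   L(t) = sum_k c_k * exp(-2 * lam_k * t),
# with eigenvalues lam_k ~ k^{-nu} and mode weights c_k ~ k^{-kappa}.
# For such spectra the sum decays as a power law t^{-xi}; for this
# particular toy choice one expects xi = (kappa - 1) / nu = 0.5.

K = 200_000                       # number of eigenmodes kept in the truncation
k = np.arange(1, K + 1)
nu, kappa = 2.0, 2.0              # assumed spectral decay exponents
lam = k ** (-nu)                  # operator eigenvalues
c = k ** (-kappa)                 # squared target coefficients per mode

t = np.logspace(1, 5, 50)         # training times on a logarithmic grid
L = (c[None, :] * np.exp(-2.0 * lam[None, :] * t[:, None])).sum(axis=1)

# Estimate the exponent xi as minus the slope of log L versus log t.
slope, _ = np.polyfit(np.log(t), np.log(L), 1)
print(f"estimated exponent xi ~= {-slope:.3f}")
```

Running the sketch prints an estimate close to 0.5, matching the expected exponent for this toy spectrum; changing `nu` and `kappa` changes the measured $\xi$ accordingly.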

Updated: 2021-05-04