Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup

This article is an updated version of: Goldt S, Advani M S, Saxe A M, Krzakala F and Zdeborova L 2019 Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup, Advances in Neural Information Processing Systems pp 6981–91.
Journal of Statistical Mechanics: Theory and Experiment (IF 2.2), Pub Date: 2020-12-22, DOI: 10.1088/1742-5468/abc61e
Sebastian Goldt, Madhu S Advani, Andrew M Saxe, Florent Krzakala, Lenka Zdeborová

Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher–student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this description is asymptotically exact in the limit of large inputs. Using this framework, we calculate the final generalisation error of student networks that have more parameters than their teachers. We find that the final generalisation error of the student increases with network size when training only the first layer, but stays constant or even decreases with size when training both layers. We show that these different behaviours have their root in the different solutions SGD finds for different activation functions. Our results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.
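The setup described in the abstract lends itself to a small numerical experiment. Below is a minimal, illustrative sketch (not the authors' code) of online SGD in the teacher–student setup: a fixed two-layer teacher labels i.i.d. Gaussian inputs, an over-parameterised student is trained on one fresh sample per step, and the generalisation error is estimated by Monte Carlo. All hyper-parameters (N, M, K, lr, steps, the tanh activation and the layer-wise learning-rate scalings) are assumptions chosen for demonstration, not values taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    N, M, K = 500, 2, 4        # input dimension, teacher width, student width (K > M: over-parameterised)
    lr, steps = 0.2, 200_000   # learning rate and number of online SGD steps (illustrative)
    g = np.tanh                # activation function (illustrative choice)

    # Fixed teacher: first-layer weights w_star (M x N), second-layer weights v_star (M,)
    w_star = rng.standard_normal((M, N))
    v_star = np.ones(M)

    # Student parameters; here both layers are trained
    w = 0.1 * rng.standard_normal((K, N))
    v = 0.1 * rng.standard_normal(K)

    def gen_error(n_test=10_000):
        """Monte-Carlo estimate of the generalisation error 0.5 * E[(student - teacher)^2]."""
        X = rng.standard_normal((n_test, N))
        y_teacher = g(X @ w_star.T / np.sqrt(N)) @ v_star
        y_student = g(X @ w.T / np.sqrt(N)) @ v
        return 0.5 * np.mean((y_student - y_teacher) ** 2)

    for t in range(steps):
        x = rng.standard_normal(N)                          # fresh sample each step (online SGD)
        y = v_star @ g(w_star @ x / np.sqrt(N))             # teacher label
        h = w @ x / np.sqrt(N)                              # student pre-activations
        err = v @ g(h) - y                                  # residual of the squared loss 0.5 * err^2
        grad_v = err * g(h)                                 # gradient w.r.t. second layer
        grad_w = np.outer(err * v * (1.0 - g(h) ** 2), x)   # gradient w.r.t. first layer (tanh derivative)
        v -= lr / N * grad_v                                # the 1/N and 1/sqrt(N) learning-rate scalings
        w -= lr / np.sqrt(N) * grad_w                       # are an assumption of this sketch
        if t % 50_000 == 0:
            print(f"step {t:7d}   generalisation error ~ {gen_error():.4f}")

    print(f"final generalisation error ~ {gen_error():.4f}")

Varying K relative to M (and choosing whether the second layer v is trained or kept fixed) is the kind of comparison the abstract refers to; tracking gen_error over training gives the learning curves whose asymptotics the paper computes analytically.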




Updated: 2020-12-22