Disentangling feature and lazy training in deep neural networks
Journal of Statistical Mechanics: Theory and Experiment (IF 2.2), Pub Date: 2020-11-27, DOI: 10.1088/1742-5468/abc4de
Mario Geiger, Stefano Spigler, Arthur Jacot, Matthieu Wyart

Two distinct limits for deep learning have been derived as the network width $h\rightarrow \infty$, depending on how the weights of the last layer scale with $h$. In the Neural Tangent Kernel (NTK) limit, the dynamics becomes linear in the weights and is described by a frozen kernel $\Theta$. By contrast, in the Mean-Field limit, the dynamics can be expressed in terms of the distribution of the parameters associated with a neuron, which follows a partial differential equation. In this work we consider deep networks whose last-layer weights scale as $\alpha h^{-1/2}$ at initialization. By varying $\alpha$ and $h$, we probe the crossover between the two limits. We observe the previously identified regimes of lazy training and feature training. In the lazy-training regime, the dynamics is almost linear and the NTK barely changes after initialization. The feature-training regime includes the mean-field formulation as a limiting case and is characterized by a kernel that evolves in time and learns features. We perform numerical experiments on MNIST, Fashion-MNIST, EMNIST and CIFAR10 and consider various architectures. We find that (i) the two regimes are separated by an $\alpha^*$ that scales as $h^{-1/2}$. (ii) Network architecture and data structure play an important role in determining which regime is better: in our tests, fully-connected networks generally perform better in the lazy-training regime, unlike convolutional networks. (iii) In both regimes, the fluctuations $\delta F$ induced on the learned function by the initial conditions decay as $\delta F\sim 1/\sqrt{h}$, leading to a performance that improves with $h$. The same improvement can also be obtained at an intermediate width by ensemble-averaging several networks. (iv) In the feature-training regime we identify a time scale $t_1\sim\sqrt{h}\,\alpha$, such that for $t\ll t_1$ the dynamics is linear.
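To make the setup concrete, the following is a minimal NumPy sketch (not the authors' code) of the scaling described in the abstract: a one-hidden-layer network whose output weights carry the factor $\alpha h^{-1/2}$, with the empirical tangent kernel $\Theta$ measured before and after gradient descent. The relative change of the kernel is small when $\alpha\sqrt{h}$ is large (lazy training) and of order one when it is small (feature training). All names (`f`, `tangent_kernel`, the toy data) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
h, d, n, alpha, lr, steps = 512, 10, 20, 1.0, 0.1, 200

W = rng.standard_normal((h, d))          # hidden weights, O(1) entries
a = rng.standard_normal(h)               # output weights, rescaled by alpha/sqrt(h) below
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])                     # toy binary targets

relu = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z > 0).astype(float)

def f(X, W, a):
    # output = (alpha / sqrt(h)) * a . relu(W x), the last-layer scaling probed in the paper
    return (alpha / np.sqrt(h)) * relu(X @ W.T) @ a

def tangent_kernel(X, W, a):
    # Theta(x, x') = grad_theta f(x) . grad_theta f(x'), summed over both weight matrices
    pre = X @ W.T                                    # (n, h) pre-activations
    g_a = (alpha / np.sqrt(h)) * relu(pre)           # (n, h) gradients wrt a
    s = (alpha / np.sqrt(h)) * a * drelu(pre)        # (n, h) per-neuron factors for grads wrt W
    return g_a @ g_a.T + (s @ s.T) * (X @ X.T)

Theta0 = tangent_kernel(X, W, a)
for _ in range(steps):                               # plain gradient descent on the MSE loss
    pre = X @ W.T
    err = f(X, W, a) - y                             # (n,)
    a -= lr * (alpha / np.sqrt(h)) * relu(pre).T @ err / n
    W -= lr * (alpha / np.sqrt(h)) * ((err[:, None] * a * drelu(pre)).T @ X) / n
Theta1 = tangent_kernel(X, W, a)

# Relative kernel movement: nearly zero in the lazy regime (large alpha * sqrt(h)),
# order one in the feature-training regime (small alpha * sqrt(h)).
print(np.linalg.norm(Theta1 - Theta0) / np.linalg.norm(Theta0))
```

Rerunning the sketch with different values of `alpha` and `h` gives a rough, qualitative picture of the crossover; the paper's quantitative statements (the $\alpha^*\sim h^{-1/2}$ boundary, the $1/\sqrt{h}$ fluctuations, the time scale $t_1$) come from its controlled experiments, not from this toy example.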
