Feature Learning in Infinite-Width Neural Networks
arXiv - CS - Machine Learning. Pub Date: 2020-11-30, DOI: arxiv-2011.14522
Greg Yang, Edward J. Hu

As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, an ability crucial for pretraining and transfer learning, e.g. with BERT. We propose simple modifications to the standard parametrization to allow feature learning in the limit. Using the *Tensor Programs* technique, we derive explicit formulas for such limits. We compute these limits exactly on Word2Vec and on few-shot learning on Omniglot via MAML, two canonical tasks that rely crucially on feature learning. We find that they outperform both NTK baselines and finite-width networks, with the latter approaching the infinite-width feature-learning performance as width increases. More generally, we classify a natural space of neural network parametrizations that generalizes the standard, NTK, and Mean Field parametrizations. We show that 1) any parametrization in this space either admits feature learning or has infinite-width training dynamics given by kernel gradient descent, but not both; and 2) any such infinite-width limit can be computed using the Tensor Programs technique.
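The distinction between the standard and NTK parametrizations mentioned above comes down to how initialization variances and forward-pass multipliers scale with the width n. The sketch below is a minimal NumPy illustration of the two schemes for a one-hidden-layer network; the specific width, dimensions, and tanh nonlinearity are illustrative assumptions, and the closing comment paraphrases the abstract's claim rather than reproducing the paper's exact proposed parametrization.

```python
import numpy as np

n = 1024            # hidden width (illustrative choice)
d_in, d_out = 10, 1
rng = np.random.default_rng(0)
x = rng.normal(size=d_in)

# Standard parametrization: weights initialized with variance 1/fan_in,
# no explicit width-dependent multiplier in the forward pass.
W1_std = rng.normal(scale=1.0 / np.sqrt(d_in), size=(n, d_in))
W2_std = rng.normal(scale=1.0 / np.sqrt(n), size=(d_out, n))
y_std = W2_std @ np.tanh(W1_std @ x)

# NTK parametrization: weights initialized with unit variance,
# and each layer's pre-activation is multiplied by 1/sqrt(fan_in).
W1_ntk = rng.normal(size=(n, d_in))
W2_ntk = rng.normal(size=(d_out, n))
y_ntk = (W2_ntk @ np.tanh((W1_ntk @ x) / np.sqrt(d_in))) / np.sqrt(n)

# Both give outputs of the same order at initialization, but per the abstract
# neither admits an infinite-width limit in which hidden features evolve;
# the proposed modification rescales layers and learning rates so that the
# hidden representations change nontrivially as n -> infinity (hypothetical
# summary here; see the paper for the exact parametrization).
print(y_std, y_ntk)
```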

Updated: 2020-12-01