Backward Feature Correction: How Deep Learning Performs Deep Learning, arXiv - CS - Neural and Evolutionary Computing


arXiv - CS - Neural and Evolutionary Computing. Pub Date: 2020-01-13, DOI: arxiv-2001.04413
Zeyuan Allen-Zhu and Yuanzhi Li

How does a 110-layer ResNet learn a high-complexity classifier using relatively few training examples and short training time? We present a theory towards explaining this in terms of hierarchical learning. By hierarchical learning we mean that the learner learns to represent a complicated target function by decomposing it into a sequence of simpler functions, which reduces sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning efficiently and automatically by applying SGD. On the conceptual side, we present, to the best of our knowledge, the FIRST theory result indicating how deep neural networks can be sample and time efficient on certain hierarchical learning tasks, when NO KNOWN non-hierarchical algorithms (such as kernel methods, linear regression over feature mappings, tensor decomposition, sparse coding, and their simple combinations) are efficient. We establish a principle called "backward feature correction", where training higher layers in the network can improve the features of lower-level ones. We believe this is the key to understanding the deep learning process in multi-layer neural networks. On the technical side, we show that for every input dimension $d > 0$, there is a concept class consisting of degree $\omega(1)$ multi-variate polynomials so that, using $\omega(1)$-layer neural networks as learners, SGD can learn any target function from this class in $\mathsf{poly}(d)$ time using $\mathsf{poly}(d)$ samples to any $\frac{1}{\mathsf{poly}(d)}$ error, by learning to represent it as a composition of $\omega(1)$ layers of quadratic functions. In contrast, we present lower bounds stating that several non-hierarchical learners, including any kernel methods and neural tangent kernels, must suffer from $d^{\omega(1)}$ sample or time complexity to learn this concept class even to $d^{-0.01}$ error.
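To make "a composition of layers of quadratic functions" concrete, the following is a minimal sketch, not the paper's construction: it builds a toy target by stacking randomly chosen homogeneous quadratic layers and checks numerically that stacking `depth` such layers gives a polynomial of degree `2 ** depth`. The names `quadratic_layer`, `random_layer`, and `make_target`, the random weights, and the chosen sizes are all assumptions made for illustration; the actual concept class in the paper carries additional structure that the abstract does not spell out.

```python
import random


def quadratic_layer(weights, x):
    """Apply one homogeneous quadratic layer.

    weights[k] is an n-by-n matrix; output entry k is the quadratic form
    sum over i, j of weights[k][i][j] * x[i] * x[j].
    """
    n = len(x)
    return [
        sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        for w in weights
    ]


def random_layer(rng, in_dim, out_dim):
    """Hypothetical helper: draw random coefficients for one quadratic layer."""
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(in_dim)]
        for _ in range(out_dim)
    ]


def make_target(rng, d, width, depth):
    """Build a scalar toy target: `depth` quadratic layers applied in sequence."""
    dims = [d] + [width] * (depth - 1) + [1]
    layers = [random_layer(rng, dims[k], dims[k + 1]) for k in range(depth)]

    def target(x):
        h = list(x)
        for w in layers:
            h = quadratic_layer(w, h)
        return h[0]

    return target


rng = random.Random(0)
depth = 3
target = make_target(rng, d=4, width=3, depth=depth)

# Each homogeneous quadratic layer doubles the degree, so scaling the input
# by 2 should scale the output by 2 ** (2 ** depth).
x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
ratio = target([2.0 * v for v in x]) / target(x)
print(ratio, 2 ** (2 ** depth))
```

The doubling is why a network with only a modest number of quadratic layers can represent a polynomial whose degree grows very quickly with depth, which is the kind of target a shallow or kernel-based learner would have to fit without the intermediate decomposition.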


Updated: 2020-09-11
