Weak and Strong Gradient Directions: Explaining Memorization, Generalization, and Hardness of Examples at Scale
arXiv - CS - Machine Learning. Pub Date: 2020-03-16, DOI: arxiv-2003.07422
Piotr Zielinski, Shankar Krishnan, Satrajit Chatterjee

The Coherent Gradients hypothesis (CGH) was recently proposed to explain why over-parameterized neural networks trained with gradient descent generalize well even though they have sufficient capacity to memorize the training set. The key insight of CGH is that, since the overall gradient for a single step of SGD is the sum of the per-example gradients, it is strongest in directions that reduce the loss on multiple examples, if such directions exist. In this paper, we validate CGH on ResNet, Inception, and VGG models on ImageNet. Since the techniques presented in the original paper do not scale beyond toy models and datasets, we propose new methods. By posing the problem of suppressing weak gradient directions as one of robust mean estimation, we develop a coordinate-based median-of-means approach. We present two versions of this algorithm: M3, which partitions a mini-batch into 3 groups and computes the coordinate-wise median of the group means, and a more efficient version, RM3, which reuses the gradients from the previous two time steps to compute the median. Since they suppress weak gradient directions without requiring per-example gradients, they can be used to train models at scale. Experimentally, we find that they indeed greatly reduce overfitting (and memorization) and thus provide the first convincing evidence that CGH holds at scale. We also propose a new test of CGH that does not depend on adding noise to training labels or on suppressing weak gradient directions. Using the intuition behind CGH, we posit that the examples learned early in training (i.e., the "easy" examples) are precisely those that have more in common with other training examples. Therefore, as per CGH, the easy examples should generalize better amongst themselves than the hard examples do amongst themselves. We validate this hypothesis with detailed experiments, and believe it provides further, orthogonal evidence for CGH.
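To make the M3/RM3 description above concrete, here is a minimal NumPy sketch of the coordinate-wise median-of-means aggregation rule, assuming gradients flattened into 1-D arrays. The names `m3_gradient` and `RM3`, and the overall structure, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def m3_gradient(per_example_grads):
    """M3 (as described in the abstract): split the mini-batch into 3 groups,
    average the gradient within each group, then take the coordinate-wise
    median of the 3 group means, which suppresses directions supported by
    only a few examples (weak gradient directions)."""
    groups = np.array_split(per_example_grads, 3, axis=0)
    group_means = np.stack([g.mean(axis=0) for g in groups])  # shape (3, dim)
    return np.median(group_means, axis=0)

class RM3:
    """RM3 (as described in the abstract): instead of splitting the current
    mini-batch, reuse the mini-batch gradients from the previous two time
    steps and take the coordinate-wise median of the three, avoiding the
    extra per-step cost of M3."""
    def __init__(self):
        self.prev = []  # mini-batch gradients from up to two previous steps

    def step(self, grad):
        # Fall back to the raw gradient until two past gradients are available.
        if len(self.prev) < 2:
            update = grad
        else:
            update = np.median(np.stack([self.prev[-2], self.prev[-1], grad]), axis=0)
        self.prev = (self.prev + [grad])[-2:]
        return update

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy per-example gradients: 32 examples, 10 parameters.
    per_example = rng.normal(size=(32, 10))
    print("M3 step:", m3_gradient(per_example))

    rm3 = RM3()
    for _ in range(3):
        print("RM3 step:", rm3.step(rng.normal(size=10)))
```

In a real training loop the same coordinate-wise median would be applied per parameter tensor inside the optimizer step; the sketch only illustrates the aggregation rule itself.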

Updated: 2020-07-22