AdaSGD: Bridging the gap between SGD and Adam
arXiv - CS - Machine Learning, Pub Date: 2020-06-30, DOI: arxiv-2006.16541
Jiaxuan Wang, Jenna Wiens

In the context of stochastic gradient descent (SGD) and adaptive moment estimation (Adam), researchers have recently proposed optimization techniques that transition from Adam to SGD with the goal of improving both convergence and generalization performance. However, precisely how each approach trades off early progress and generalization is not well understood; thus, it is unclear when, or even if, one should transition from one approach to the other. In this work, by first studying the convex setting, we identify potential contributors to observed differences in performance between SGD and Adam. In particular, we provide theoretical insights for when and why Adam outperforms SGD and vice versa. We address the performance gap by adapting a single global learning rate for SGD, which we refer to as AdaSGD. We justify this proposed approach with empirical analyses in non-convex settings. On several datasets that span three different domains, we demonstrate how AdaSGD combines the benefits of both SGD and Adam, eliminating the need for approaches that transition from Adam to SGD.
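As a rough illustration of what "adapting a single global learning rate for SGD" could look like, the sketch below keeps Adam-style moment estimates but collapses the second moment to one scalar shared by all parameters. The abstract does not state the update rule, so everything here is an assumption made for illustration: the names `adasgd_step` and `init_state`, the default hyperparameters, the bias correction, and the choice to average squared gradients over coordinates. The paper itself should be consulted for the authors' definition.

```python
import math


def init_state(n_params):
    """Optimizer state: step count, per-parameter momentum, one scalar second moment."""
    return {"t": 0, "m": [0.0] * n_params, "v": 0.0}


def adasgd_step(params, grads, state, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One update with a single, globally adapted step size (illustrative sketch).

    Assumption: the second-moment estimate is the running mean of the squared
    gradient averaged over ALL coordinates, so every parameter is scaled by the
    same scalar. Adam, by contrast, keeps one such estimate per coordinate.
    """
    state["t"] += 1
    t = state["t"]

    # Scalar second moment: mean of squared gradient entries.
    g_sq_mean = sum(g * g for g in grads) / len(grads)
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * g_sq_mean
    v_hat = state["v"] / (1.0 - beta2 ** t)

    # Per-parameter first moment (momentum), as in Adam.
    state["m"] = [beta1 * m + (1.0 - beta1) * g for m, g in zip(state["m"], grads)]
    bias1 = 1.0 - beta1 ** t

    # One global step size for all parameters.
    scale = lr / (math.sqrt(v_hat) + eps)
    return [p - scale * (m / bias1) for p, m in zip(params, state["m"])]


def loss(params):
    """Toy convex objective: 0.5 * (x0^2 + 10 * x1^2)."""
    return 0.5 * (params[0] ** 2 + 10.0 * params[1] ** 2)


def run(steps=500):
    """Minimize the toy objective from (1, 1) and return the final parameters."""
    params = [1.0, 1.0]
    state = init_state(len(params))
    for _ in range(steps):
        grads = [params[0], 10.0 * params[1]]  # exact gradient of `loss`
        params = adasgd_step(params, grads, state)
    return params
```

Because the scaling factor is a scalar, the update direction stays parallel to the (momentum-averaged) gradient, as in SGD, while the step length is normalized by the gradient magnitude, as in Adam. Under the assumptions above, that is the sense in which the sketch sits between the two methods.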

Updated: 2020-07-01
