Communication-efficient SGD: From Local SGD to One-Shot Averaging
arXiv - CS - Distributed, Parallel, and Cluster Computing | Pub Date: 2021-06-09 | DOI: arxiv-2106.04759
Artin Spiridonoff, Alex Olshevsky, Ioannis Ch. Paschalidis

We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers. We assume the same data set is shared among $N$ workers, who can take SGD steps and coordinate with a central server. While it is possible to obtain a linear reduction in the variance by averaging all the stochastic gradients at every step, this requires a lot of communication between the workers and the server, which can dramatically reduce the gains from parallelism. The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications. While the initial analysis of Local SGD showed it needs $\Omega ( \sqrt{T} )$ communications for $T$ local gradient steps in order for the error to scale proportionately to $1/(NT)$, this has been successively improved in a string of papers, with the state-of-the-art requiring $\Omega \left( N \left( \mbox{ polynomial in log } (T) \right) \right)$ communications. In this paper, we suggest a Local SGD scheme that communicates less overall by communicating less frequently as the number of iterations grows. Our analysis shows that this can achieve an error that scales as $1/(NT)$ with a number of communications that is completely independent of $T$. In particular, we show that $\Omega(N)$ communications are sufficient. Empirical evidence suggests this bound is close to tight as we further show that $\sqrt{N}$ or $N^{3/4}$ communications fail to achieve linear speed-up in simulations. Moreover, we show that under mild assumptions, the main one being twice differentiability in a neighborhood of the optimal solution, one-shot averaging, which uses only a single round of communication, can also achieve the optimal convergence rate asymptotically.
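To make the communication pattern concrete, below is a minimal simulation sketch, not the authors' exact algorithm, schedule, or constants: $N$ workers run local SGD on a shared least-squares objective, average their iterates at a number of synchronization rounds that grows with $N$ but not with $T$, with the gaps between averagings growing over time, and output a one-shot average at the end. The objective, step-size rule, noise model, and geometric spacing of the synchronization rounds are all illustrative assumptions.

```python
# Minimal sketch of Local SGD with infrequent averaging plus a final one-shot
# average. All problem data, step sizes, and the synchronization schedule are
# illustrative assumptions, not the paper's exact scheme.
import numpy as np

rng = np.random.default_rng(0)

N = 8          # number of workers
T = 20_000     # local gradient steps per worker
d = 10         # dimension

# Shared strongly convex objective f(x) = 0.5 * ||A x - b||^2 (assumed for illustration)
A = rng.standard_normal((50, d))
b = rng.standard_normal(50)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

def stochastic_grad(x):
    # Full gradient plus additive noise, a stand-in for sampling a data point.
    return A.T @ (A @ x - b) + rng.standard_normal(d)

# O(N) communication rounds, spaced geometrically so that communication becomes
# less frequent as the iteration count grows (the exact spacing is an assumption).
num_comms = 4 * N
comm_steps = set(np.unique(np.geomspace(1, T, num_comms).astype(int)))

x = np.zeros((N, d))                      # one iterate per worker
for t in range(1, T + 1):
    eta = 1.0 / (0.1 * t + 100)           # decaying step size (illustrative)
    for i in range(N):
        x[i] -= eta * stochastic_grad(x[i])
    if t in comm_steps:
        x[:] = x.mean(axis=0)             # server averages iterates and broadcasts

x_hat = x.mean(axis=0)                    # one-shot average at the end
print("communication rounds used:", len(comm_steps))
print("distance to optimum:", np.linalg.norm(x_hat - x_star))
```

Replacing `num_comms = 4 * N` with, say, `int(np.sqrt(N))` gives a rough way to probe the regime in which the abstract reports that too few communications fail to achieve linear speed-up.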

Updated: 2021-06-10