DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2019-05-15, DOI: arxiv-1905.05957
Hanlin Tang, Xiangru Lian, Chen Yu, Tong Zhang, Ji Liu

A standard approach in large-scale machine learning is distributed stochastic gradient training, which requires computing aggregated stochastic gradients over multiple nodes on a network. Communication is a major bottleneck in such applications, and in recent years compressed stochastic gradient methods such as QSGD (quantized SGD) and sparse SGD have been proposed to reduce communication. It has also been shown that error compensation can be combined with compression to achieve better convergence in a scheme where each node compresses its local stochastic gradient and broadcasts the result to all other nodes over the network in a single pass. However, such a single-pass broadcast approach is not realistic in many practical implementations. For example, under the popular parameter server model for distributed learning, the worker nodes send their compressed local gradients to the parameter server, which performs the aggregation. The parameter server then has to compress the aggregated stochastic gradient again before sending it back to the worker nodes. In this work, we provide a detailed analysis of this two-pass communication model and its asynchronous parallel variant, with error-compensated compression both on the worker nodes and on the parameter server. We show that the error-compensated stochastic gradient algorithm admits three very nice properties: 1) it is compatible with an \emph{arbitrary} compression technique; 2) it achieves a better convergence rate than non-error-compensated stochastic gradient methods such as QSGD and sparse SGD; 3) it admits linear speedup with respect to the number of workers. An empirical study is also conducted to validate our theoretical results.
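For concreteness, below is a minimal sketch (not the authors' implementation) of one synchronous round of such a two-pass error-compensated scheme under the parameter server model. Top-k sparsification stands in for the arbitrary compressor described in the abstract, and the gradient oracle, hyperparameters, and class names are illustrative assumptions only.

```python
import numpy as np

def topk_compress(v, k):
    """Keep the k largest-magnitude entries of v; any compressor could be substituted."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

class Worker:
    def __init__(self, dim):
        self.error = np.zeros(dim)              # local error-compensation buffer

    def compressed_gradient(self, grad, k):
        corrected = grad + self.error           # add back previously discarded error
        compressed = topk_compress(corrected, k)
        self.error = corrected - compressed     # remember what compression discarded
        return compressed

class ParameterServer:
    def __init__(self, dim):
        self.error = np.zeros(dim)              # server-side error-compensation buffer

    def aggregate_and_compress(self, worker_msgs, k):
        avg = np.mean(worker_msgs, axis=0)      # aggregate compressed worker gradients
        corrected = avg + self.error
        compressed = topk_compress(corrected, k)
        self.error = corrected - compressed     # second compensation buffer (the "double" pass)
        return compressed

# One synchronous round under the two-pass communication model (illustrative setup).
dim, k, n_workers, lr = 1000, 50, 4, 0.1
x = np.random.randn(dim)                        # shared model parameters
workers = [Worker(dim) for _ in range(n_workers)]
server = ParameterServer(dim)

def stochastic_gradient(x):
    # Placeholder gradient oracle: gradient of 0.5*||x||^2 plus noise (hypothetical objective).
    return x + 0.01 * np.random.randn(dim)

msgs = [w.compressed_gradient(stochastic_gradient(x), k) for w in workers]  # pass 1: workers -> server
update = server.aggregate_and_compress(msgs, k)                             # pass 2: server -> workers
x -= lr * update                                                            # every worker applies the same update
```

The key point captured here is that compression error is not thrown away at either hop: both the workers and the server fold the residual from the previous round back into the next message before compressing, which is what allows convergence guarantees independent of the particular compressor.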

Updated: 2020-03-24