Sparse Communication for Training Deep Networks
arXiv - CS - Distributed, Parallel, and Cluster Computing · Pub Date: 2020-09-19 · DOI: arxiv-2009.09271 · Negar Foroutan Eghlidi and Martin Jaggi
Synchronous stochastic gradient descent (SGD) is the most common method used
for distributed training of deep learning models. In this algorithm, each
worker shares its local gradients with the others and updates the parameters
using the average of all workers' gradients. Although distributed training
reduces computation time, the communication overhead associated with the
gradient exchange forms a scalability bottleneck for the algorithm. Many
compression techniques have been proposed to reduce the number of gradients
that need to be communicated. However, compressing the gradients introduces
yet another overhead. In this work, we study several compression schemes and
identify how three key parameters affect performance. We also provide a set
of insights on how to increase performance and introduce a simple
sparsification scheme, random-block sparsification, that reduces communication
while keeping performance close to that of standard SGD.
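The abstract does not spell out the details of random-block sparsification. The following is a minimal NumPy sketch of one plausible reading: each worker keeps a random subset of contiguous gradient blocks and only those blocks are averaged across workers. The block_size and keep_ratio parameters, and the uniform block selection, are illustrative assumptions, not values from the paper.

```python
import numpy as np

def random_block_sparsify(grad, block_size=256, keep_ratio=0.1, rng=None):
    """Zero out all but a randomly chosen subset of contiguous blocks.

    Hypothetical sketch: block_size, keep_ratio, and the uniform block
    selection are illustrative choices, not taken from the paper.
    """
    if rng is None:
        rng = np.random.default_rng()
    flat = grad.ravel()
    n_blocks = int(np.ceil(flat.size / block_size))
    n_keep = max(1, int(n_blocks * keep_ratio))
    kept = rng.choice(n_blocks, size=n_keep, replace=False)

    sparse = np.zeros_like(flat)
    for b in kept:
        lo, hi = b * block_size, min((b + 1) * block_size, flat.size)
        sparse[lo:hi] = flat[lo:hi]
    return sparse.reshape(grad.shape)

# Simulated synchronous step: each worker sparsifies its local gradient,
# then the surviving blocks are averaged across all workers, as in
# synchronous SGD with compressed gradient exchange.
workers = [np.random.randn(4, 1024) for _ in range(4)]
sparsified = [random_block_sparsify(g, rng=np.random.default_rng(i))
              for i, g in enumerate(workers)]
avg_update = np.mean(sparsified, axis=0)
```

Because whole contiguous blocks are communicated rather than scattered individual entries, the selection itself is cheap to encode (one index per block), which is the kind of compression overhead trade-off the abstract alludes to.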
Updated: 2020-09-22