Sparse Communication for Training Deep Networks
arXiv - CS - Distributed, Parallel, and Cluster Computing · Pub Date: 2020-09-19 · DOI: arxiv-2009.09271 · Negar Foroutan Eghlidi and Martin Jaggi
Synchronous stochastic gradient descent (SGD) is the most common method used
for distributed training of deep learning models. In this algorithm, each
worker shares its local gradients with the others and updates the parameters
using the average of all workers' gradients. Although distributed training
reduces computation time, the communication overhead associated with the
gradient exchange forms a scalability bottleneck for the algorithm. Many
compression techniques have been proposed to reduce the number of gradients
that need to be communicated. However, compressing the gradients introduces
yet another overhead. In this work, we study several compression schemes and
identify how three key parameters affect performance. We also provide a set
of insights on how to increase performance and introduce a simple
sparsification scheme, random-block sparsification, that reduces communication
while keeping performance close to that of standard SGD.
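The abstract does not spell out the details of random-block sparsification. The following is a minimal NumPy sketch of one plausible reading: each worker keeps a random subset of contiguous gradient blocks and only those blocks are averaged across workers. The block_size and keep_ratio parameters, and the uniform block selection, are illustrative assumptions, not values from the paper.

```python
import numpy as np

def random_block_sparsify(grad, block_size=256, keep_ratio=0.1, rng=None):
    """Zero out all but a randomly chosen subset of contiguous blocks.

    Hypothetical sketch: block_size, keep_ratio, and the uniform block
    selection are illustrative choices, not taken from the paper.
    """
    if rng is None:
        rng = np.random.default_rng()
    flat = grad.ravel()
    n_blocks = int(np.ceil(flat.size / block_size))
    n_keep = max(1, int(n_blocks * keep_ratio))
    kept = rng.choice(n_blocks, size=n_keep, replace=False)

    sparse = np.zeros_like(flat)
    for b in kept:
        lo, hi = b * block_size, min((b + 1) * block_size, flat.size)
        sparse[lo:hi] = flat[lo:hi]
    return sparse.reshape(grad.shape)

# Simulated synchronous step: each worker sparsifies its local gradient,
# then the surviving blocks are averaged across all workers, as in
# synchronous SGD with compressed gradient exchange.
workers = [np.random.randn(4, 1024) for _ in range(4)]
sparsified = [random_block_sparsify(g, rng=np.random.default_rng(i))
              for i, g in enumerate(workers)]
avg_update = np.mean(sparsified, axis=0)
```

Because whole contiguous blocks are communicated rather than scattered individual entries, the selection itself is cheap to encode (one index per block), which is the kind of compression overhead trade-off the abstract alludes to.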
Updated: 2020-09-22