MG-WFBP: Merging Gradients Wisely for Efficient Communication in Distributed Deep Learning
IEEE Transactions on Parallel and Distributed Systems (IF 5.3). Pub Date: 2021-01-19. DOI: 10.1109/tpds.2021.3052862
Shaohuai Shi, Xiaowen Chu, Bo Li

Distributed synchronous stochastic gradient descent has been widely used to train deep neural networks (DNNs) on computer clusters. As computational power increases, network communication generally becomes the limit on system scalability. Wait-free backpropagation (WFBP) is a popular solution that overlaps communications with computations during the training process. In this article, we observe that many DNNs have a large number of layers with only a small amount of data to be communicated at each layer in distributed training, which can make WFBP inefficient. Based on the fact that merging several short communication tasks into a single one can reduce the overall communication time, we formulate an optimization problem to minimize the training time when pipelining communications and computations. We derive an optimal solution that can be computed efficiently without affecting the training performance. We then apply the solution to propose a distributed training algorithm named merged-gradient WFBP (MG-WFBP) and implement it on two platforms, Caffe and PyTorch. Extensive experiments on three GPU clusters are conducted to verify the effectiveness of MG-WFBP. We further use trace-based simulations of 4 to 2048 GPUs to explore the potential scaling efficiency of MG-WFBP. Experimental results show that MG-WFBP achieves much better scaling performance than existing methods.
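The benefit of merging can be seen with a common latency-bandwidth communication model: sending m bytes costs roughly α + β·m, so k separate per-layer all-reduce calls pay the startup latency α k times, whereas a single merged all-reduce over the same data pays it once. The sketch below is a minimal illustration of that idea, not the authors' MG-WFBP implementation; it assumes a torch.distributed process group has already been initialized, and the function names are hypothetical.

```python
# Minimal sketch (not the authors' code): contrast per-layer all-reduce with a
# merged all-reduce over a single flattened buffer.
import torch
import torch.distributed as dist

def allreduce_per_layer(grads):
    # One all-reduce per layer tensor: pays the startup latency once per call.
    for g in grads:
        dist.all_reduce(g, op=dist.ReduceOp.SUM)

def allreduce_merged(grads):
    # Merge all layer gradients into one flat buffer, all-reduce once,
    # then copy the reduced values back into the original tensors.
    flat = torch.cat([g.view(-1) for g in grads])
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)
    offset = 0
    for g in grads:
        n = g.numel()
        g.copy_(flat[offset:offset + n].view_as(g))
        offset += n
```

Merging is not free: a group's communication cannot start until the last gradient in the group is computed, which delays the overlap with backpropagation. Balancing this delay against the saved startup latencies is the trade-off addressed by the optimization problem formulated in the paper.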

Last updated: 2021-02-23