Overlapping Communication With Computation in Parameter Server for Scalable DL Training
IEEE Transactions on Parallel and Distributed Systems (IF 5.6), Pub Date: 2021-03-01, DOI: 10.1109/tpds.2021.3062721
Shaoqi Wang, Aidi Pi, Xiaobo Zhou, Jun Wang, Cheng-Zhong Xu

Scalability of distributed deep learning (DL) training with the parameter server (PS) architecture is often communication constrained in large clusters. Recent efforts use a layer-by-layer strategy to overlap gradient communication with backward computation and thereby reduce the impact of the communication constraint on scalability. However, these approaches can incur significant overhead in gradient communication, and they cannot be effectively applied to the overlap between parameter communication and forward computation. In this article, we propose and develop iPart, a novel approach that partitions communication and computation at various partition sizes to overlap gradient communication with backward computation and parameter communication with forward computation. iPart formulates the partitioning decision as an optimization problem and solves it with a greedy algorithm to derive the communication and computation partitions. We implement iPart in the open-source DL framework BigDL and evaluate it with various DL workloads. Experimental results show that iPart improves the scalability of a 72-node cluster by up to 94 percent over the default PS and by 52 percent over the layer-by-layer strategy.
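The abstract describes the mechanism only in prose. Below is a minimal, hypothetical Python sketch of the greedy partitioning idea: it models the cost of pushing one gradient partition with an assumed linear latency/bandwidth model and greedily merges adjacent per-layer partitions while the simulated iteration time improves. All names (greedy_partition, makespan, comm_time), the cost model, and the toy numbers are illustrative assumptions, not the authors' implementation (which is in BigDL and not shown here).

```python
"""Hypothetical sketch of greedy communication/computation partitioning.

Assumptions (not from the paper): pushing a partition of n bytes costs
ALPHA + BETA * n; pushes are serialized on the link; a partition can be
pushed only after the backward pass of its last layer completes.
"""

ALPHA = 0.5e-3   # assumed per-message latency (s)
BETA = 1e-8      # assumed per-byte transfer time (s/byte)


def comm_time(nbytes):
    """Assumed linear cost of pushing one gradient partition to the PS."""
    return ALPHA + BETA * nbytes


def makespan(comp, size, cuts):
    """Simulate overlapping backward computation with partition pushes.

    comp[i] -- backward computation time of layer i (in backward order)
    size[i] -- gradient bytes of layer i
    cuts    -- exclusive end index of each partition, e.g. [2, 5, len(comp)]
    """
    t_comp, t_net = 0.0, 0.0
    start = 0
    for end in cuts:
        t_comp += sum(comp[start:end])                   # compute partition
        t_net = max(t_net, t_comp) + comm_time(sum(size[start:end]))
        start = end
    return t_net


def greedy_partition(comp, size):
    """Greedily merge adjacent partitions while the makespan improves.

    Starts from the layer-by-layer extreme (one partition per layer) and
    repeatedly merges the first adjacent pair that reduces the simulated
    iteration time, stopping when no merge helps.
    """
    cuts = list(range(1, len(comp) + 1))                 # one cut per layer
    best = makespan(comp, size, cuts)
    improved = True
    while improved and len(cuts) > 1:
        improved = False
        for i in range(len(cuts) - 1):
            trial = cuts[:i] + cuts[i + 1:]              # merge i and i+1
            t = makespan(comp, size, trial)
            if t < best:
                best, cuts, improved = t, trial, True
                break
    return cuts, best


if __name__ == "__main__":
    # Toy 6-layer model in backward order: late layers are small and fast,
    # early layers are large and slow (a common CNN-like shape).
    comp = [2e-3, 2e-3, 3e-3, 4e-3, 5e-3, 6e-3]
    size = [4e5, 4e5, 8e5, 1.6e6, 3.2e6, 6.4e6]
    cuts, t = greedy_partition(comp, size)
    print(f"partition boundaries: {cuts}, estimated iteration time: {t:.4f}s")
```

Starting from the layer-by-layer extreme and merging captures the trade-off the abstract names: smaller partitions start communication earlier but pay more per-message overhead, while larger ones amortize overhead but delay the push. The real iPart also partitions parameter pulls to overlap with forward computation, which this toy model omits.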

Updated: 2021-03-19