Communication optimization strategies for distributed deep neural network training: A survey
Journal of Parallel and Distributed Computing (IF 3.4), Pub Date: 2020-11-17, DOI: 10.1016/j.jpdc.2020.11.005
Shuo Ouyang, Dezun Dong, Yemao Xu, Liquan Xiao

Recent trends in high-performance computing and deep learning have led to a proliferation of studies on large-scale deep neural network training. However, frequent communication among computation nodes drastically slows overall training, creating a bottleneck in distributed training, particularly in clusters with limited network bandwidth. To mitigate these communication costs, researchers have proposed various optimization strategies. In this paper, we provide a comprehensive survey of communication strategies from both an algorithm viewpoint and a computer network perspective. Algorithm optimizations focus on reducing the communication volume of distributed training, while network optimizations focus on accelerating communication between distributed devices. At the algorithm level, we describe how to reduce the number of communication rounds and the number of bits transmitted per round. In addition, we explain how to overlap computation with communication. At the network level, we discuss the effects of network infrastructure, including logical communication schemes and network protocols. Finally, we outline potential future challenges and new research directions for accelerating communication in distributed deep neural network training.
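As a concrete illustration of one algorithm-level strategy named above (reducing the bits transmitted per round), below is a minimal NumPy sketch of top-k gradient sparsification with error feedback. The function names and the parameter k are illustrative assumptions, not notation from the survey.

```python
import numpy as np

def sparsify_topk(grad: np.ndarray, k: int):
    """Keep only the k largest-magnitude gradient entries.

    Returns (indices, values) -- the pair actually transmitted, which is
    far smaller than the dense gradient when k << grad.size.
    """
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]  # top-k by magnitude
    return idx, flat[idx]

def desparsify(idx: np.ndarray, vals: np.ndarray, shape) -> np.ndarray:
    """Rebuild a dense gradient from the transmitted (indices, values)."""
    dense = np.zeros(int(np.prod(shape)))
    dense[idx] = vals
    return dense.reshape(shape)

# Toy round: a worker compresses its gradient before "sending" it.
rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 8))
idx, vals = sparsify_topk(grad, k=4)  # transmit 4 of 32 entries
# Error feedback: the residual (dropped coordinates) is kept locally and
# added to the next round's gradient, so no information is lost forever.
residual = grad - desparsify(idx, vals, grad.shape)
print(f"sent {vals.size}/{grad.size} values; residual norm {np.linalg.norm(residual):.3f}")
```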
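The abstract also mentions overlapping computation with communication. The following sketch simulates that idea with threads and sleep-based stand-ins for per-layer backpropagation and all-reduce; the timings and layer count are made-up assumptions chosen only to show the schedule, not measurements from the paper.

```python
import threading
import time

# Hypothetical per-layer timings (seconds) for a 4-layer model.
COMPUTE_S, COMM_S, LAYERS = 0.05, 0.04, 4

def backward(layer: int) -> None:
    """Stand-in for computing one layer's gradient."""
    time.sleep(COMPUTE_S)

def all_reduce(layer: int) -> None:
    """Stand-in for a collective op (e.g. ring all-reduce) on one layer."""
    time.sleep(COMM_S)

# Overlapped schedule: as soon as layer i's gradient is ready, its
# all-reduce runs in the background while layer i-1 keeps computing.
start, pending = time.perf_counter(), []
for layer in reversed(range(LAYERS)):
    backward(layer)
    t = threading.Thread(target=all_reduce, args=(layer,))
    t.start()
    pending.append(t)
for t in pending:
    t.join()
overlapped = time.perf_counter() - start

# Serial baseline: communicate only after all computation finishes.
start = time.perf_counter()
for layer in reversed(range(LAYERS)):
    backward(layer)
for layer in range(LAYERS):
    all_reduce(layer)
serial = time.perf_counter() - start

print(f"overlapped {overlapped:.2f}s vs serial {serial:.2f}s")
```

Because each layer's communication here is shorter than the next layer's computation, nearly all communication time hides behind computation, which is the effect the surveyed scheduling techniques aim for.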




Updated: 2020-12-02