ACCL: Architecting Highly Scalable Distributed Training Systems With Highly Efficient Collective Communication Library
IEEE Micro (IF 2.8) Pub Date: 2021-06-22, DOI: 10.1109/mm.2021.3091475
Jianbo Dong, Shaochuang Wang, Fei Feng, Zheng Cao, Heng Pan, Lingbo Tang, Pengcheng Li, Hao Li, Qianyuan Ran, Yiqun Guo, Shanyuan Gao, Xin Long, Jie Zhang, Yong Li, Zhisheng Xia, Liuyihan Song, Yingya Zhang, Pan Pan, Guohui Wang, Xiaowei Jiang
Distributed systems have been widely adopted for deep neural network model training. However, the scalability of distributed training systems is largely bounded by communication cost. We design a highly efficient collective communication library, the Alibaba Collective Communication Library (ACCL), to build distributed training systems with linear scalability. ACCL provides optimized algorithms that make full use of heterogeneous interconnects simultaneously, and experimental results show significant performance improvements.
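
The abstract does not show ACCL's programming interface, so the sketch below only illustrates the communication pattern such a library accelerates: data-parallel training in which every worker all-reduces its gradients each step. It uses PyTorch's torch.distributed as a stand-in collective backend (an assumption, not ACCL's API), and the helper name allreduce_gradients is hypothetical.

    import torch
    import torch.distributed as dist

    def allreduce_gradients(model: torch.nn.Module) -> None:
        # Sum each gradient across all workers with one collective call,
        # then divide by the world size to obtain the average. The data
        # volume moved per step grows with model size, which is why
        # collective communication bounds scaling at large worker counts.
        world_size = dist.get_world_size()
        for param in model.parameters():
            if param.grad is not None:
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size

    # Typical per-step usage on each worker, after
    # dist.init_process_group(backend="nccl") has been called once:
    #   loss = model(batch); loss.backward()
    #   allreduce_gradients(model); optimizer.step()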

Updated: 2021-06-22