A Scalable, High-Performance, and Fault-Tolerant Network Architecture for Distributed Machine Learning
IEEE/ACM Transactions on Networking ( IF 3.7 ) Pub Date : 2020-06-19 , DOI: 10.1109/tnet.2020.2999377
Songtao Wang , Dan Li , Yang Cheng , Jinkun Geng , Yanshu Wang , Shuai Wang , Shutao Xia , Jianping Wu

In large-scale distributed machine learning (DML), the network performance between machines significantly impacts the speed of iterative training. In this paper we propose BML, a scalable, high-performance, and fault-tolerant DML network architecture built on Ethernet and commodity devices. BML builds on the BCube topology and runs a fully-distributed gradient synchronization algorithm. Compared with a Fat-Tree network of the same size, a BML network is expected to take much less time for gradient synchronization, owing to both its lower theoretical synchronization time and its benefit to RDMA transport. Under server/link failures, the performance of BML degrades gracefully. Experiments with MNIST and VGG-19 benchmarks on a testbed of 9 dual-GPU servers show that BML reduces the job completion time of DML training by up to 56.4% compared with a Fat-Tree network running the state-of-the-art gradient synchronization algorithm.
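The abstract does not detail BML's synchronization algorithm itself, but the goal of any fully-distributed gradient synchronization step is the same: after the exchange, every worker holds the element-wise average of all workers' gradients. The following minimal Python sketch illustrates that invariant with a simulated all-reduce; the names (`sync_gradients`, `worker_grads`) are hypothetical and the real system would compute this collectively over the network (e.g. via ring or hierarchical all-reduce over the BCube links) rather than in one place.

```python
from typing import List

def sync_gradients(worker_grads: List[List[float]]) -> List[List[float]]:
    """Return, for each worker, the element-wise average of all gradients.

    Hypothetical illustration: a real DML system computes this result
    collectively across the network; here it is centralized for clarity.
    """
    n = len(worker_grads)
    dim = len(worker_grads[0])
    avg = [sum(g[i] for g in worker_grads) / n for i in range(dim)]
    # Every worker ends up with an identical copy of the averaged gradient.
    return [avg[:] for _ in range(n)]

# Three simulated workers, each with a 2-dimensional gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
synced = sync_gradients(grads)
# every worker now holds [3.0, 4.0]
```

The architectural question the paper addresses is how fast this averaging step can complete given the physical topology; BML's claim is that BCube's multiple server ports allow the exchange to finish in less time than on a Fat-Tree of the same size.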

Updated: 2020-08-18