A Scalable, High-Performance, and Fault-Tolerant Network Architecture for Distributed Machine Learning
IEEE/ACM Transactions on Networking ( IF 3.7 ) Pub Date : 2020-06-19 , DOI: 10.1109/tnet.2020.2999377
Songtao Wang , Dan Li , Yang Cheng , Jinkun Geng , Yanshu Wang , Shuai Wang , Shutao Xia , Jianping Wu

In large-scale distributed machine learning (DML), the network performance between machines significantly impacts the speed of iterative training. In this paper we propose BML, a scalable, high-performance, and fault-tolerant DML network architecture built on Ethernet and commodity devices. BML builds on the BCube topology and runs a fully-distributed gradient synchronization algorithm. Compared with a Fat-Tree network of the same size, a BML network is expected to take much less time for gradient synchronization, owing to both its lower theoretical synchronization time and its benefit to RDMA transport. Under server/link failures, the performance of BML degrades gracefully. Experiments with MNIST and VGG-19 benchmarks on a testbed of 9 dual-GPU servers show that BML reduces the job completion time of DML training by up to 56.4% compared with a Fat-Tree network running the state-of-the-art gradient synchronization algorithm.
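The abstract does not detail BML's synchronization algorithm itself, but the goal of any fully-distributed gradient synchronization step is the same: after the exchange, every worker holds the element-wise average of all workers' gradients. The following minimal Python sketch illustrates that invariant with a simulated all-reduce; the names (`sync_gradients`, `worker_grads`) are hypothetical and the real system would compute this collectively over the network (e.g. via ring or hierarchical all-reduce over the BCube links) rather than in one place.

```python
from typing import List

def sync_gradients(worker_grads: List[List[float]]) -> List[List[float]]:
    """Return, for each worker, the element-wise average of all gradients.

    Hypothetical illustration: a real DML system computes this result
    collectively across the network; here it is centralized for clarity.
    """
    n = len(worker_grads)
    dim = len(worker_grads[0])
    avg = [sum(g[i] for g in worker_grads) / n for i in range(dim)]
    # Every worker ends up with an identical copy of the averaged gradient.
    return [avg[:] for _ in range(n)]

# Three simulated workers, each with a 2-dimensional gradient.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
synced = sync_gradients(grads)
# every worker now holds [3.0, 4.0]
```

The architectural question the paper addresses is how fast this averaging step can complete given the physical topology; BML's claim is that BCube's multiple server ports allow the exchange to finish in less time than on a Fat-Tree of the same size.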

Updated: 2020-08-18