Improving the Performance of Distributed MXNet with RDMA
International Journal of Parallel Programming (IF 1.5), Pub Date: 2019-01-01, DOI: 10.1007/s10766-018-00623-w
Mingfan Li, Ke Wen, Han Lin, Xu Jin, Zheng Wu, Hong An, Mengxian Chi

As one of the most influential deep learning frameworks, MXNet has achieved excellent performance and many breakthroughs in academic and industrial fields across various machine learning scenarios. The initial implementation of MXNet uses a proxy-socket interface, which delivers suboptimal performance in distributed environments. In a massively parallel training task, parameters are updated frequently during each training loop, in which case network performance becomes the main factor in overall performance. Over the past decade, high-performance interconnects have employed remote direct memory access (RDMA) technology to provide excellent performance for numerous scientific domains. In this paper, we describe an efficient design that extends the open-source MXNet to make it RDMA-capable via RDMA-based parameter server interfaces. With modest optimizations of memory usage and transmission overhead, RDMA-based MXNet achieves a large performance improvement over the original software. Our experiments reveal that, for the communication subsystem of MXNet, the new design achieves a 16x speedup (up to 21x at peak) over 1 Gigabit Ethernet (1GigE). For two training cases on MXNet, the optimized implementation gains 5x and 9x speedups, respectively. Compared to experiments on the IP-over-InfiniBand (IPoIB) protocol, it achieves a nearly 30% performance improvement, as well as better scalability and alleviation of bottlenecks.
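The abstract describes workers that push gradients to and pull parameters from a parameter server on every training loop, which is why network performance dominates. The sketch below is a minimal, in-process illustration of that push/pull pattern only; the names (`ParamServer`, `push`, `pull`, `train_step`) are hypothetical and not MXNet's actual API, and the real system performs these operations over the network (with RDMA writing buffers directly into remote memory).

```python
# Hypothetical, single-process sketch of the parameter-server pattern the
# paper optimizes. In distributed MXNet, push/pull cross the network; the
# paper's design replaces the socket transport with RDMA.

class ParamServer:
    """In-process stand-in for a remote key-value parameter store."""

    def __init__(self, lr=0.1):
        self.store = {}  # key -> parameter value
        self.lr = lr

    def init(self, key, value):
        self.store[key] = value

    def push(self, key, grad):
        # Apply an SGD-style update. Over RDMA, the gradient buffer would
        # be written directly into the server's registered memory region.
        self.store[key] -= self.lr * grad

    def pull(self, key):
        return self.store[key]


def train_step(server, key, grad):
    server.push(key, grad)   # worker sends its gradient
    return server.pull(key)  # worker fetches the updated parameter


server = ParamServer(lr=0.1)
server.init("w", 1.0)
w = train_step(server, "w", 2.0)  # 1.0 - 0.1 * 2.0 = 0.8
print(w)
```

Because every iteration performs both a push and a pull per parameter key, the per-message latency of the transport multiplies across the whole training run, which is the bottleneck RDMA removes.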

Updated: 2019-01-01