Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training
arXiv - CS - Neural and Evolutionary Computing. Pub Date: 2018-05-21, DOI: arxiv-1805.07891
Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, Arvind Krishnamurthy

Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to communication. This paper characterizes DDNN training to precisely pinpoint these bottlenecks. We found that timely training requires high-performance parameter servers (PSs) with optimized network stacks and gradient processing pipelines, as well as server and network hardware with balanced computation and communication resources. We therefore propose PHub, a high-performance, multi-tenant, rack-scale PS design. PHub co-designs the PS software and hardware to accelerate rack-level and hierarchical cross-rack parameter exchange, with an API compatible with many DDNN training frameworks. PHub provides a performance improvement of up to 2.7x compared to state-of-the-art distributed training techniques for cloud-based ImageNet workloads, with 25% better throughput per dollar.
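The abstract centers on the push/aggregate/pull exchange that a parameter server performs each training iteration. The following is a minimal sketch of that generic pattern, not PHub's actual API or its optimized pipeline; all class and method names here are illustrative.

```python
class ParameterServer:
    """Toy parameter server: workers push gradients, the server
    averages them and applies one SGD step, workers pull back the
    updated parameters. (Illustrative only; not PHub's interface.)"""

    def __init__(self, params, lr=0.5):
        self.params = list(params)
        self.lr = lr
        self._pending = []  # gradients pushed since the last update

    def push(self, grad):
        # Each worker pushes its local gradient after a minibatch.
        self._pending.append(grad)

    def aggregate_and_apply(self):
        # Average the pushed gradients and take one SGD step.
        n = len(self._pending)
        avg = [sum(g[i] for g in self._pending) / n
               for i in range(len(self.params))]
        self.params = [p - self.lr * a for p, a in zip(self.params, avg)]
        self._pending.clear()

    def pull(self):
        # Workers pull the updated parameters for the next iteration.
        return list(self.params)


# Two workers on a toy loss f(w) = ||w||^2 / 2, whose gradient is w.
ps = ParameterServer([2.0, -4.0], lr=0.5)
for _ in range(2):
    w = ps.pull()
    for _worker in range(2):
        ps.push(w)              # grad of ||w||^2 / 2 at w is w
    ps.aggregate_and_apply()
print(ps.params)                # [0.5, -1.0]
```

The communication cost of this exchange, rather than the SGD arithmetic, is the bottleneck the paper targets: PHub accelerates the push/aggregate/pull path with an optimized network stack and rack-scale hardware.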

Updated: 2020-01-22