Benchmarking network fabrics for data distributed training of deep neural networks
arXiv - CS - Performance. Pub Date: 2020-08-18, DOI: arxiv-2008.08057
Siddharth Samsi, Andrew Prout, Michael Jones, Andrew Kirby, Bill Arcand, Bill Bergeron, David Bestor, Chansup Byun, Vijay Gadepally, Michael Houle, Matthew Hubbell, Anna Klein, Peter Michaleas, Lauren Milechin, Julie Mullen, Antonio Rosa, Charles Yee, Albert Reuther, Jeremy Kepner

Artificial Intelligence/Machine Learning applications require the training of complex models on large amounts of labelled data. The large computational requirements for training deep models have necessitated the development of new methods for faster training. One such approach is the data parallel approach, where the training data is distributed across multiple compute nodes. This approach is simple to implement and supported by most of the commonly used machine learning frameworks. The data parallel approach leverages MPI for communicating gradients across all nodes. In this paper, we examine the effects of using different physical hardware interconnects and network-related software primitives for enabling data distributed deep learning. We compare the effect of using GPUDirect and NCCL on Ethernet and OmniPath fabrics. Our results show that using Ethernet-based networking in shared HPC systems does not have a significant effect on the training times for commonly used deep neural network architectures or traditional HPC applications such as Computational Fluid Dynamics.
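The paper does not include code, but the data parallel approach it describes can be illustrated with a short, hypothetical sketch. The example below assumes a PyTorch DistributedDataParallel setup launched with one process per GPU (for instance via torchrun); the model, dataset, and hyperparameters are placeholders rather than the paper's benchmark configuration. Each rank trains on its own shard of the data (via DistributedSampler), and gradients are all-reduced across ranks during backward(). The choice of communication backend ("nccl" versus "mpi" or "gloo") and the underlying fabric (Ethernet or OmniPath) is where the interconnect comparison studied in the paper would apply.

# Illustrative sketch only: assumes a distributed launch (e.g. torchrun) that sets
# RANK, WORLD_SIZE, MASTER_ADDR/PORT, and LOCAL_RANK for each process.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

def main():
    # Backend choice is one of the knobs the paper benchmarks (NCCL vs. MPI-based).
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Toy model and synthetic labelled data stand in for the real workload.
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10)).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 10, (1024,)))
    sampler = DistributedSampler(dataset)  # shards the training data across ranks
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle shards each epoch
        for x, y in loader:
            x, y = x.cuda(local_rank), y.cuda(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()   # gradients are all-reduced across ranks here
            optimizer.step()
        if rank == 0:
            print(f"epoch {epoch} done, last loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()

In this sketch the gradient all-reduce happens inside backward(), so the network fabric and the collective-communication library are exercised on every training step, which is the behaviour the paper measures.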

Updated: 2020-08-19