Using machine learning to model the training scalability of convolutional neural networks on clusters of GPUs
Computing (IF 3.7), Pub Date: 2021-08-30, DOI: 10.1007/s00607-021-00997-9
Sergio Barrachina, Adrián Castelló, Mar Catalán, Manuel F. Dolz, Jose I. Mestre

In this work, we build a general piece-wise model to analyze the data-parallel (DP) training costs of convolutional neural networks (CNNs) on clusters of GPUs. This general model combines i) multi-layer perceptrons (MLPs) that model the NVIDIA cuDNN/cuBLAS library kernels involved in training several state-of-the-art CNNs; and ii) an analytical model of the NVIDIA NCCL Allreduce collective primitive using the Ring algorithm. A CNN training scalability study performed with this model, in combination with the Roofline technique, across varying batch sizes, node (floating-point) arithmetic performance, node memory bandwidth, network link bandwidth, and cluster size unveils several crucial bottlenecks at both the GPU and cluster levels. To support this analysis, we validate the accuracy of the proposed model against a Python library for distributed deep learning training.
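To make the two analytical ingredients named in the abstract concrete, the Python sketch below shows the standard alpha-beta cost model for a Ring Allreduce (2(p-1) steps, each moving a 1/p fraction of the message) and the Roofline performance cap (the minimum of peak compute and memory bandwidth times arithmetic intensity). This is an illustrative sketch, not the authors' implementation; the function names and all numeric constants in the example are assumptions chosen for demonstration.

```python
def ring_allreduce_time(msg_bytes, num_gpus, link_bw_bytes_s, link_latency_s):
    """Alpha-beta cost model for a Ring Allreduce over num_gpus nodes.

    The Ring algorithm runs a reduce-scatter followed by an allgather,
    for 2*(p-1) steps in total, each transferring msg_bytes/p per link.
    """
    p = num_gpus
    steps = 2 * (p - 1)
    return steps * (link_latency_s + (msg_bytes / p) / link_bw_bytes_s)


def roofline_gflops(arithmetic_intensity, peak_gflops, mem_bw_gb_s):
    """Attainable performance under the Roofline model:
    min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, mem_bw_gb_s * arithmetic_intensity)


if __name__ == "__main__":
    # Illustrative numbers only: 100 MB of gradients reduced across
    # 8 GPUs over a 10 GB/s link with 5 us latency per step.
    t = ring_allreduce_time(100e6, 8, 10e9, 5e-6)
    print(f"Allreduce time: {t * 1e3:.2f} ms")

    # A kernel with arithmetic intensity 4 FLOP/byte on a GPU with
    # 15 TFLOP/s peak and 900 GB/s memory bandwidth is memory-bound.
    print(f"Attainable: {roofline_gflops(4.0, 15000.0, 900.0):.0f} GFLOP/s")
```

The cuDNN/cuBLAS kernel times that feed the piece-wise model are learned by MLP regressors in the paper and are not reproduced here.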

Updated: 2021-08-30