Optimizing execution for pipelined‐based distributed deep learning in a heterogeneously networked GPU cluster
Concurrency and Computation: Practice and Experience (IF 2), Pub Date: 2020-07-23, DOI: 10.1002/cpe.5923
Jinghui Zhang, Jun Zhan, Jiange Li, Jiahui Jin, Lei Qian

Training a deep neural network (DNN) requires exorbitant computing and memory resources, so researchers often turn to distributed parallel training to train larger models faster on GPUs. This approach has its drawbacks, though. On one hand, as GPU compute capacity grows, inter-GPU communication becomes an ever larger bottleneck during model training, and multi-GPU systems introduce complex connectivity; workload schedulers must therefore account for hardware topology and workload communication requirements when allocating GPU resources, in order to optimize execution time and improve utilization in a heterogeneous environment. On the other hand, the high memory demands of DNN training make running the training process on GPUs onerous. To contend with both problems, we introduce two execution optimization methods based on pipeline-hybrid parallelism (combining data and model parallelism) in a GPU cluster with heterogeneous networking. First, we propose a model partition algorithm that accelerates pipeline-hybrid parallel training across heterogeneously network-connected GPUs. Second, we introduce a cost-balanced recomputing algorithm that reduces memory usage in the pipelined mode. Experiments show that our solution (Pipe-Torch) achieves an average speedup of 1.4× over data parallelism and reduces the memory footprint while maintaining load-balanced pipelined training.
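The abstract names two techniques. For the first, here is a minimal sketch (my own illustration, not the paper's actual algorithm) of how a heterogeneity-aware model partitioner can be framed: split a linear DNN into contiguous pipeline stages so that the slowest stage or link, the pipeline's steady-state bottleneck, is minimized, given profiled per-layer compute times, activation sizes, and per-link bandwidths. All function and variable names below are hypothetical.

```python
from functools import lru_cache

def partition(compute, act_size, bandwidth):
    """Minimize the pipeline bottleneck when splitting a linear model
    into len(bandwidth) + 1 contiguous stages over a chain of GPUs.

    compute[i]   -- profiled forward+backward time of layer i
    act_size[i]  -- bytes of activations sent if we cut after layer i
    bandwidth[g] -- bytes/s of the link between GPU g and GPU g+1
    """
    n, k = len(compute), len(bandwidth) + 1

    @lru_cache(maxsize=None)
    def best(i, g):
        # Best achievable bottleneck for layers i..n-1 on GPUs g..k-1.
        if g == k - 1:                       # last GPU takes all remaining layers
            return sum(compute[i:])
        result = float("inf")
        for j in range(i, n - (k - g - 1)):  # cut after layer j; leave at
            stage = sum(compute[i:j + 1])    # least one layer per remaining GPU
            comm = act_size[j] / bandwidth[g]
            result = min(result, max(stage, comm, best(j + 1, g + 1)))
        return result

    return best(0, 0)

# Three GPUs: a slow Ethernet-class link followed by a fast one. The
# partitioner prefers cutting where activations are small relative to
# the slow link. (All numbers are made up for illustration.)
bottleneck = partition(compute=[4.0, 3.0, 5.0, 2.0],
                       act_size=[1e8, 4e8, 2e8],
                       bandwidth=[1e9, 1e10])
```

For the second technique, the standard PyTorch mechanism underlying activation recomputation (which Pipe-Torch, presumably built on PyTorch given its name, would schedule in a cost-balanced way) is torch.utils.checkpoint: it discards a segment's intermediate activations and re-runs its forward pass during backward, trading compute for memory. A cost-balanced scheme would choose which segments to wrap based on profiled costs; here every stage is wrapped purely for illustration.

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStage(torch.nn.Module):
    """Wrap a pipeline stage so its activations are recomputed on backward."""
    def __init__(self, stage: torch.nn.Module):
        super().__init__()
        self.stage = stage

    def forward(self, x):
        # use_reentrant=False is the recommended mode in recent PyTorch.
        return checkpoint(self.stage, x, use_reentrant=False)
```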
