Towards an optimized distributed deep learning framework for a heterogeneous multi-GPU cluster
Cluster Computing (IF 3.6) Pub Date: 2020-07-11, DOI: 10.1007/s10586-020-03144-9
Youngrang Kim, Hyeonseong Choi, Jaehwan Lee, Jik-Soo Kim, Hyunseung Jei, Hongchan Roh

This paper presents a novel “Distributed Deep Learning Framework” for a heterogeneous multi-GPU cluster that can effectively improve overall resource utilization without sacrificing training accuracy. Specifically, we employ a hybrid aggregation approach combining parameter-server and all-reduce schemes in order to address potential performance degradation when running deep learning applications on a heterogeneous computing system. In addition, we design and implement an asynchronous large mini-batch training mechanism that maintains training accuracy for asynchronous data-parallel deep learning, with enhanced collective communication based on MPI. We implement our proposed framework on TensorFlow and perform extensive experiments on both homogeneous and heterogeneous computing systems. Evaluation results show that our proposed framework can improve computing performance by reducing I/O bottlenecks and effectively increasing resource utilization in the heterogeneous multi-GPU cluster.
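As a rough illustration (not the authors' implementation) of the all-reduce side of the hybrid aggregation described above, the sketch below averages locally computed gradients across MPI ranks using mpi4py; the gradient size, the random placeholder gradients, and the script name are assumptions made only for this example.

# Minimal all-reduce gradient-averaging sketch with mpi4py (illustrative only).
# Run with, e.g.: mpirun -np 4 python allreduce_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each worker computes a gradient on its local mini-batch (random placeholder here).
local_grad = np.random.rand(1024).astype(np.float32)

# Sum gradients across all workers, then divide by the worker count,
# emulating one large effective mini-batch spread over the cluster.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

if rank == 0:
    print("averaged gradient norm:", np.linalg.norm(global_grad))

In a hybrid scheme of the kind described in the abstract, such a collective would typically be used among GPUs of similar speed, while slower or heterogeneous devices would instead exchange updates asynchronously with a parameter server.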



Updated: 2020-07-13