Journal of Grid Computing (IF 3.6), Pub Date: 2021-02-22, DOI: 10.1007/s10723-021-09550-6
Jie Xu, Jingyu Wang, Qi Qi, Haifeng Sun, Jianxin Liao, Di Yang
Parallel training accelerates Deep Neural Network (DNN) training by running it on parallel GPUs. However, when the GPUs are distributed across different nodes, in-memory data transmission becomes cross-node network transmission, which lengthens training time. Most research addresses this by reducing the amount of data sent over network links; the factor of network distance, however, is ignored. In this paper, we construct a distributed DNN training architecture based on MapReduce. A customized scheduler is designed to place the compute nodes that perform training closer to the nodes that store the data. At the same time, the parallel training models are kept synchronized by adjusting the data transmission time. The experimental results show that the shortened network distance reduces network traffic usage, and the resulting shorter data transmission time decreases training time by at least 50% while guaranteeing synchronization of the parallel training.
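The abstract does not give the scheduler's placement rule in detail; the following is a minimal, hypothetical sketch of the general idea of locality-aware placement, assuming an HDFS-style two-level topology (host, rack) in which network distance is 0 for the same host, 1 for the same rack, and 2 otherwise. All names and structures here are illustrative, not the paper's actual API.

```python
# Hypothetical sketch of data-locality-aware task placement.
# Assumes each node is described by its host and rack; distances
# follow the common HDFS-style convention (0 / 1 / 2).

def network_distance(node_a, node_b):
    """Return 0 if both nodes share a host, 1 if they share a rack,
    and 2 otherwise."""
    if node_a["host"] == node_b["host"]:
        return 0
    if node_a["rack"] == node_b["rack"]:
        return 1
    return 2

def place_task(data_node, gpu_nodes):
    """Pick the free GPU node with the smallest network distance to
    the node that stores the training data shard."""
    free = [n for n in gpu_nodes if n["free"]]
    return min(free, key=lambda n: network_distance(data_node, n))

data = {"host": "h1", "rack": "r1"}
workers = [
    {"host": "h9", "rack": "r3", "free": True},   # different rack
    {"host": "h2", "rack": "r1", "free": True},   # same rack as data
    {"host": "h1", "rack": "r1", "free": False},  # same host, but busy
]
print(place_task(data, workers)["host"])  # -> h2
```

A real scheduler would also weigh GPU load and link bandwidth, but even this greedy distance-first rule captures why moving computation near the data shrinks cross-node traffic.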
Title: Effective Scheduler for Distributed DNN Training Based on MapReduce and GPU Cluster