Journal of Grid Computing (IF 3.6), Pub Date: 2021-02-22, DOI: 10.1007/s10723-021-09550-6
Jie Xu, Jingyu Wang, Qi Qi, Haifeng Sun, Jianxin Liao, Di Yang
Parallel training accelerates Deep Neural Network (DNN) training by running it on parallel GPUs. However, when the GPUs are distributed across different nodes, in-memory data transmission becomes cross-node network transmission, which lengthens training time. Most research addresses this by reducing the amount of data sent over network links; the factor of network distance, however, is ignored. In this paper, we construct a distributed DNN training architecture based on MapReduce. A customized scheduler is designed to place the compute nodes that perform training closer to the nodes that store the data. At the same time, the parallel training models are kept synchronized by adjusting the data transmission time. The experimental results show that the shortened network distance reduces network traffic usage, and the resulting shorter data transmission time decreases training time by at least 50% while guaranteeing synchronization of the parallel training.
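The abstract does not give the scheduler's placement rule in detail; the following is a minimal, hypothetical sketch of the general idea of locality-aware placement, assuming an HDFS-style two-level topology (host, rack) in which network distance is 0 for the same host, 1 for the same rack, and 2 otherwise. All names and structures here are illustrative, not the paper's actual API.

```python
# Hypothetical sketch of data-locality-aware task placement.
# Assumes each node is described by its host and rack; distances
# follow the common HDFS-style convention (0 / 1 / 2).

def network_distance(node_a, node_b):
    """Return 0 if both nodes share a host, 1 if they share a rack,
    and 2 otherwise."""
    if node_a["host"] == node_b["host"]:
        return 0
    if node_a["rack"] == node_b["rack"]:
        return 1
    return 2

def place_task(data_node, gpu_nodes):
    """Pick the free GPU node with the smallest network distance to
    the node that stores the training data shard."""
    free = [n for n in gpu_nodes if n["free"]]
    return min(free, key=lambda n: network_distance(data_node, n))

data = {"host": "h1", "rack": "r1"}
workers = [
    {"host": "h9", "rack": "r3", "free": True},   # different rack
    {"host": "h2", "rack": "r1", "free": True},   # same rack as data
    {"host": "h1", "rack": "r1", "free": False},  # same host, but busy
]
print(place_task(data, workers)["host"])  # -> h2
```

A real scheduler would also weigh GPU load and link bandwidth, but even this greedy distance-first rule captures why moving computation near the data shrinks cross-node traffic.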
Title: Effective Scheduler for Distributed DNN Training Based on MapReduce and GPU Cluster