Journal of Grid Computing (IF 2.095). Pub Date: 2021-02-22. DOI: 10.1007/s10723-021-09550-6. Jie Xu, Jingyu Wang, Qi Qi, Haifeng Sun, Jianxin Liao, Di Yang
Parallel training accelerates Deep Neural Network (DNN) training by using multiple GPUs. However, when the GPUs are distributed across different nodes, in-memory data transmission becomes cross-node network transmission, which prolongs training time. Most research addresses this by reducing the data volume sent over network links, while the factor of network distance is ignored. In this paper, we construct a distributed DNN training architecture based on MapReduce. A customized scheduler is designed to place the compute nodes that perform training closer to the nodes that store the data. At the same time, the parallel training models are synchronized by adjusting the data transmission time. The experimental results show that shortening network distance reduces network traffic usage. The resulting reduction in data transmission time decreases training time by at least 50% and guarantees synchronization of the parallel training.
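The locality-aware placement idea described above can be sketched as a greedy assignment: each training task is mapped to the free compute node with the smallest network distance to the node holding its data shard. This is a minimal illustrative sketch, not the paper's actual scheduler; the node names, the distance table, and the `place_tasks` helper are all assumptions introduced for the example.

```python
# Hypothetical locality-aware placement sketch. The distance table and
# node names are illustrative assumptions, not the paper's scheduler.

# Network distance (e.g., hops) between each compute node and storage node.
NETWORK_DISTANCE = {
    ("gpu-1", "store-a"): 1, ("gpu-1", "store-b"): 3,
    ("gpu-2", "store-a"): 3, ("gpu-2", "store-b"): 1,
}

def place_tasks(shards, compute_nodes):
    """Greedily assign each data shard to the nearest free compute node.

    shards: mapping of shard name -> storage node holding that shard.
    compute_nodes: list of available compute (GPU) nodes.
    Returns a mapping of shard name -> chosen compute node.
    """
    free = set(compute_nodes)
    placement = {}
    for shard, store in shards.items():
        # Pick the free node with the shortest distance to the shard's store.
        best = min(free, key=lambda n: NETWORK_DISTANCE[(n, store)])
        placement[shard] = best
        free.discard(best)  # each node takes one task in this sketch
    return placement

if __name__ == "__main__":
    shards = {"shard-0": "store-a", "shard-1": "store-b"}
    print(place_tasks(shards, ["gpu-1", "gpu-2"]))
    # Each shard lands on the compute node closest to its storage node.
```

A production scheduler would also weigh node load and link bandwidth, but the greedy distance criterion captures the core idea of keeping computation near the data.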