TurboDL: Improving CNN Training on GPU with Fine-grained Multi-streaming Scheduling
IEEE Transactions on Computers (IF 3.7), Pub Date: 2020-01-01, DOI: 10.1109/tc.2020.2990321
Hai Jin, Wenchao Wu, Xuanhua Shi, Ligang He, Bing B. Zhou

Graphics Processing Units (GPUs) have evolved into powerful co-processors for training convolutional neural networks (CNNs). Many new features, such as concurrent kernel execution and Hyper-Q technology, have been introduced into GPUs. Orchestrating concurrency for CNN training on GPUs is challenging, since it can introduce substantial synchronization overhead and poor resource utilization. Unlike previous research, which mainly focuses on single-layer or coarse-grained optimization, we introduce a critical-path based, asynchronous parallelization mechanism and propose an optimization technique for CNN training that jointly considers the global network architecture and GPU resource usage. The proposed methods effectively overlap synchronization with computation in different streams, thereby accelerating the CNN training process. We have integrated our methods into Caffe. The experimental results show that Caffe integrated with our methods achieves a 1.30X speedup on average over Caffe+cuDNN, and even higher speedups can be achieved for deeper, wider, and more complicated networks.
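
The core idea of fine-grained multi-streaming (running independent parts of the network in separate CUDA streams and expressing cross-stream dependencies with events instead of blocking host synchronization) can be illustrated with a minimal sketch. This is only an illustrative example, not TurboDL's actual implementation; the dummyLayer kernel and the buffers are placeholders.

// Minimal sketch: overlap independent kernels in separate CUDA streams and
// enforce a cross-stream dependency with an event (no host-side blocking).
// Not TurboDL's code; dummyLayer stands in for a layer's computation.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyLayer(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * 2.0f + 1.0f;  // placeholder for real layer work
}

int main() {
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaEvent_t done;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    dim3 block(256), grid((n + 255) / 256);

    // Critical-path kernel runs on stream s1; record an event when it finishes.
    dummyLayer<<<grid, block, 0, s1>>>(a, n);
    cudaEventRecord(done, s1);

    // An independent branch of the network proceeds concurrently on stream s2.
    dummyLayer<<<grid, block, 0, s2>>>(b, n);

    // The next kernel on s2 needs s1's output, so s2 waits on the event;
    // the host thread and other streams are not blocked.
    cudaStreamWaitEvent(s2, done, 0);
    dummyLayer<<<grid, block, 0, s2>>>(a, n);

    cudaDeviceSynchronize();
    printf("done\n");

    cudaEventDestroy(done);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}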

Updated: 2020-01-01