Distributed Deep Learning With GPU-FPGA Heterogeneous Computing
IEEE Micro (IF 3.6), Pub Date: 2020-11-24, DOI: 10.1109/mm.2020.3039835
Kenji Tanaka, Yuki Arikawa, Tsuyoshi Ito, Kazutaka Morita, Naru Nemoto, Kazuhiko Terada, Junji Teramoto, Takeshi Sakamoto

In distributed deep learning (DL), collective communication algorithms such as Allreduce, which are used to share training results between graphics processing units (GPUs), are an inevitable bottleneck. We hypothesize that the cache access latency incurred at every Allreduce is a significant bottleneck in current computing systems with high-bandwidth interconnects for distributed DL. To reduce how often this latency is incurred, it is important to aggregate data at the network interfaces. We implement a data aggregation circuit in a field-programmable gate array (FPGA). Using this FPGA, we propose a novel Allreduce architecture and training strategy without accuracy degradation. Measurement results show that the Allreduce latency is reduced to 1/4. Our system can also conceal about 90% of the communication overhead and improve scalability by 20%. The end-to-end training time in distributed DL with ResNet-50 and ImageNet is reduced to 87.3% without any degradation in validation accuracy.
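To make the communication pattern concrete, the sketch below shows where the per-iteration Allreduce calls occur in ordinary host-side data-parallel training, using PyTorch's torch.distributed API. This is only a minimal illustration of the bottleneck the paper targets, not the authors' FPGA-offloaded aggregation; the function names, the single-process gloo demo setup, and the toy model are illustrative assumptions.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn


def allreduce_gradients(model: nn.Module, world_size: int) -> None:
    """Average gradients across all workers after the local backward pass.

    Each all_reduce is a blocking collective issued once per parameter tensor
    per iteration; this per-call latency is what aggregation at the network
    interface is meant to reduce.
    """
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size  # SUM followed by division = average


def train_step(model, optimizer, loss_fn, inputs, targets, world_size):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()                          # local gradients on this worker
    allreduce_gradients(model, world_size)   # share training results between workers
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Single-process demo with the "gloo" backend; in a real job each GPU rank
    # runs this script and RANK/WORLD_SIZE come from the launcher.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = nn.Linear(8, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = train_step(model, optimizer, nn.MSELoss(),
                      torch.randn(4, 8), torch.randn(4, 2),
                      dist.get_world_size())
    print(f"loss: {loss:.4f}")
    dist.destroy_process_group()
```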

Updated: 2021-01-29