Distributed Deep Learning With GPU-FPGA Heterogeneous Computing
IEEE Micro (IF 3.6), Pub Date: 2020-11-24, DOI: 10.1109/mm.2020.3039835
Kenji Tanaka, Yuki Arikawa, Tsuyoshi Ito, Kazutaka Morita, Naru Nemoto, Kazuhiko Terada, Junji Teramoto, Takeshi Sakamoto

In distributed deep learning (DL), collective communication algorithms such as Allreduce, which are used to share training results between graphics processing units (GPUs), are an inevitable bottleneck. We hypothesize that the cache access latency incurred at every Allreduce is a significant bottleneck in current computing systems with high-bandwidth interconnects for distributed DL. To reduce how often this latency is incurred, it is important to aggregate data at the network interfaces. We implement a data aggregation circuit in a field-programmable gate array (FPGA). Using this FPGA, we propose a novel Allreduce architecture and training strategy without accuracy degradation. Measurement results show that the Allreduce latency is reduced to 1/4. Our system can also conceal about 90% of the communication overhead and improve scalability by 20%. The end-to-end training time in distributed DL with ResNet-50 and ImageNet is reduced to 87.3% without any degradation in validation accuracy.
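To make the communication pattern concrete, the sketch below shows where the per-iteration Allreduce calls occur in ordinary host-side data-parallel training, using PyTorch's torch.distributed API. This is only a minimal illustration of the bottleneck the paper targets, not the authors' FPGA-offloaded aggregation; the function names, the single-process gloo demo setup, and the toy model are illustrative assumptions.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn


def allreduce_gradients(model: nn.Module, world_size: int) -> None:
    """Average gradients across all workers after the local backward pass.

    Each all_reduce is a blocking collective issued once per parameter tensor
    per iteration; this per-call latency is what aggregation at the network
    interface is meant to reduce.
    """
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size  # SUM followed by division = average


def train_step(model, optimizer, loss_fn, inputs, targets, world_size):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()                          # local gradients on this worker
    allreduce_gradients(model, world_size)   # share training results between workers
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    # Single-process demo with the "gloo" backend; in a real job each GPU rank
    # runs this script and RANK/WORLD_SIZE come from the launcher.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = nn.Linear(8, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    loss = train_step(model, optimizer, nn.MSELoss(),
                      torch.randn(4, 8), torch.randn(4, 2),
                      dist.get_world_size())
    print(f"loss: {loss:.4f}")
    dist.destroy_process_group()
```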

Updated: 2021-01-29