Specializing FGPU for Persistent Deep Learning
ACM Transactions on Reconfigurable Technology and Systems (IF 3.1), Pub Date: 2021-07-15, DOI: 10.1145/3457886
Rui Ma 1, Jia-Ching Hsu 1, Tian Tan 1, Eriko Nurvitadhi 2, David Sheffield 2, Rob Pelt 2, Martin Langhammer 2, Jaewoong Sim 2, Aravind Dasu 2, Derek Chiou 3

Overlay architectures enable fast development and debugging on FPGAs, at the cost of potentially lower performance than fully customized FPGA designs. When used in concert with hand-tuned FPGA solutions, performant overlay architectures can improve time-to-solution and thus the overall productivity of FPGA solutions. This work tunes and specializes FGPU, an open-source, OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our persistent deep learning (PDL)-FGPU architecture maintains the ease of programming and generality of GPU programming while achieving high performance through specialization for the persistent deep learning domain. We also propose an easy method to specialize for other domains. PDL-FGPU includes new instructions, along with micro-architecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU on a modern high-end Intel Stratix 10 2800 FPGA, in simulation, running persistent DL applications (RNN, GRU, LSTM) as well as non-DL applications to demonstrate generality. PDL-FGPU requires 1.4–3× more ALMs, 4.4–6.4× more M20Ks, and 1–9.5× more DSPs than the baseline, but improves performance by 56–693× for PDL applications, with an average 23.1% degradation on non-PDL applications. We integrated the PDL-FGPU overlay into Intel OPAE to measure real-world performance and power, and demonstrate that PDL-FGPU is only 4.0–10.4× slower than the Nvidia V100.
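For readers unfamiliar with the recurrent workloads named above, the following is a minimal sketch of a vanilla RNN recurrence in plain Python. It is an illustration only, not the paper's kernels or the FGPU programming model: the function names (`matvec`, `rnn_step`, `run_sequence`) are hypothetical, and it assumes that "persistent" refers to keeping the same weights resident and reusing them at every time step, which the abstract itself does not spell out.

```python
import math


def matvec(matrix, vector):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(W, U, b, x, h_prev):
    """One vanilla RNN step: h = tanh(W x + U h_prev + b)."""
    wx = matvec(W, x)
    uh = matvec(U, h_prev)
    return [math.tanh(a + c + d) for a, c, d in zip(wx, uh, b)]


def run_sequence(W, U, b, inputs, h0):
    """Apply the same weights W, U, b to every time step.

    Assumption for illustration: the weights are loaded once and stay
    resident for the whole sequence; only x and h change per step.
    """
    h = h0
    for x in inputs:
        h = rnn_step(W, U, b, x, h)
    return h


if __name__ == "__main__":
    W = [[0.5, -0.25], [0.1, 0.3]]   # input-to-hidden weights (made-up values)
    U = [[0.2, 0.0], [0.0, 0.2]]     # hidden-to-hidden weights (made-up values)
    b = [0.0, 0.1]                   # bias (made-up values)
    sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(run_sequence(W, U, b, sequence, [0.0, 0.0]))
```

Each step is dominated by two matrix-vector products over weights that never change during the sequence. GRU and LSTM cells add gating but follow the same pattern of repeated matrix-vector products against fixed weights.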

Updated: 2021-07-15