Specializing FGPU for Persistent Deep Learning, ACM Transactions on Reconfigurable Technology and Systems


Specializing FGPU for Persistent Deep Learning
ACM Transactions on Reconfigurable Technology and Systems (IF 3.1), Pub Date: 2021-07-15, DOI: 10.1145/3457886
Rui Ma₁, Jia-Ching Hsu₁, Tian Tan₁, Eriko Nurvitadhi₂, David Sheffield₂, Rob Pelt₂, Martin Langhammer₂, Jaewoong Sim₂, Aravind Dasu₂, Derek Chiou₃

Overlay architectures enable fast development and debugging on FPGAs, at the expense of potentially limited performance compared to fully customized FPGA designs. When used in concert with hand-tuned FPGA solutions, performant overlay architectures can improve time-to-solution and thus the overall productivity of FPGA solutions. This work tunes and specializes FGPU, an open-source, OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our persistent deep learning (PDL)-FGPU architecture maintains the ease of programming and generality of GPU programming while achieving high performance through specialization for the persistent deep learning domain. We also propose an easy method to specialize for other domains. PDL-FGPU includes new instructions, along with micro-architecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU in simulation on a modern high-end Intel Stratix 10 2800 FPGA, running persistent DL applications (RNN, GRU, LSTM) as well as non-DL applications to demonstrate generality. PDL-FGPU requires 1.4–3× more ALMs, 4.4–6.4× more M20Ks, and 1–9.5× more DSPs than the baseline, but improves performance by 56–693× for PDL applications, with an average 23.1% degradation on non-PDL applications. We integrated the PDL-FGPU overlay into Intel OPAE to measure real-world performance and power, and demonstrate that PDL-FGPU is only 4.0–10.4× slower than the Nvidia V100.


Updated: 2021-07-15
