High-Performance Acceleration of 2-D and 3-D CNNs on FPGAs Using Static Block Floating Point
IEEE Transactions on Neural Networks and Learning Systems ( IF 10.4 ) Pub Date : 2021-10-13 , DOI: 10.1109/tnnls.2021.3116302
Hongxiang Fan , Shuanglong Liu , Zhiqiang Que , Xinyu Niu , Wayne Luk

Over the past few years, 2-D convolutional neural networks (CNNs) have demonstrated great success in a wide range of 2-D computer vision applications, such as image classification and object detection. At the same time, 3-D CNNs, as a variant of 2-D CNNs, have shown an excellent ability to analyze 3-D data, such as video and geometric data. However, the heavy algorithmic complexity of 2-D and 3-D CNNs imposes a substantial overhead on the speed of these networks, which limits their deployment in real-life applications. Although various domain-specific accelerators have been proposed to address this challenge, most of them focus only on accelerating 2-D CNNs, without considering their computational efficiency on 3-D CNNs. In this article, we propose a unified hardware architecture to accelerate both 2-D and 3-D CNNs with high hardware efficiency. Our experiments demonstrate that the proposed accelerator can achieve up to 92.4% and 85.2% multiply-accumulate efficiency on 2-D and 3-D CNNs, respectively. To improve the hardware performance, we propose a hardware-friendly quantization approach called static block floating point (BFP), which eliminates the frequent representation conversions required in traditional dynamic BFP arithmetic. Compared with integer linear quantization using a zero-point, static BFP quantization can decrease the logic resource consumption of the convolutional kernel design by nearly 50% on a field-programmable gate array (FPGA). Without time-consuming retraining, the proposed static BFP quantization is able to quantize the precision to an 8-bit mantissa with negligible accuracy loss. As different CNNs on our reconfigurable system require different hardware and software parameters to achieve optimal hardware performance and accuracy, we also propose an automatic tool for parameter optimization.
Based on our hardware design and optimization, we demonstrate that the proposed accelerator can achieve 3.8-5.6 times higher energy efficiency than a graphics processing unit (GPU) implementation. Compared with state-of-the-art FPGA-based accelerators, our design achieves higher generality and up to 1.4-2.2 times higher resource efficiency on both 2-D and 3-D CNNs.
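To illustrate the block floating point idea the abstract refers to, the sketch below quantizes a block of weights to signed 8-bit mantissas sharing one exponent, which is what lets an FPGA kernel use cheap integer multiply-accumulate logic. This is a minimal NumPy illustration of generic BFP quantization, not the authors' implementation; the block size, rounding mode, and function names are assumptions.

```python
import numpy as np

def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block of floats to block floating point:
    one shared exponent for the whole block, signed fixed-point mantissas.
    (Illustrative sketch; not the paper's exact scheme.)"""
    max_abs = np.max(np.abs(block))
    if max_abs == 0:
        return np.zeros(block.shape, dtype=np.int32), 0
    # Shared exponent chosen so the largest magnitude fits the mantissa range.
    shared_exp = int(np.floor(np.log2(max_abs))) + 1
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    lo, hi = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(block / scale), lo, hi).astype(np.int32)
    return mantissas, shared_exp

def bfp_dequantize(mantissas, shared_exp, mantissa_bits=8):
    """Reconstruct approximate float values from mantissas + shared exponent."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return mantissas.astype(np.float64) * scale

# Example: a small block of hypothetical convolution weights.
weights = np.array([0.91, -0.42, 0.07, 0.003])
mants, exp = bfp_quantize(weights)
approx = bfp_dequantize(mants, exp)
```

Because every value in the block shares the exponent, multiply-accumulate reduces to integer arithmetic on the mantissas, with the exponents handled once per block; "static" BFP fixes these exponents offline so no per-layer representation conversion is needed at run time.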
