O⁴-DNN: A Hybrid DSP-LUT-Based Processing Unit With Operation Packing and Out-of-Order Execution for Efficient Realization of Convolutional Neural Networks on FPGA Devices
IEEE Transactions on Circuits and Systems I: Regular Papers (IF 5.2) | Pub Date: 2020-09-01 | DOI: 10.1109/tcsi.2020.2986350
Pouya Haghi, Mehdi Kamal, Ali Afzali-Kusha, Massoud Pedram

In this paper, we propose O⁴-DNN, a high-performance FPGA-based architecture for convolutional neural network (CNN) accelerators that relies on operation packing and out-of-order (OoO) execution for DSP blocks augmented with LUT-based glue logic. The high-level architecture comprises a systolic array of processing elements (PEs) supporting an output-stationary dataflow. In this architecture, the computational unit of each PE is realized using a DSP block together with a small number of LUTs. Given the limited number of DSP blocks on FPGAs, this combination (a DSP block plus some LUTs) increases the computational power obtainable from each DSP block. The proposed computational unit performs eight convolutional operations on five input operands, one of which is an 8-bit weight while the others are four 8-bit input feature (IF) values. In addition, to improve the energy efficiency of the proposed computational unit, we present an approximate form of the unit suited to neural network applications. To reduce the memory bandwidth as well as increase the utilization of the computational units, a data reuse technique based on weight sharing is also presented. To further improve the performance of the proposed computational unit, an addressing approach for computing partial sums out of order is proposed. The efficacy of the architecture is assessed on two FPGA devices executing four state-of-the-art neural networks. Experimental results show that this architecture achieves, on average (at most), $2.5\times$ ($3.44\times$) higher throughput than a baseline structure. In addition, employing O⁴-DNN yields an average (maximum) energy-efficiency improvement of 12% (40%) over the baseline structure.
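To make the operation-packing idea concrete, the following is a minimal Python sketch of how two 8-bit multiplications can share one wide hardware multiplier, which is the principle behind packing several operations into a single DSP block. The two-operand unsigned scheme, the 18-bit lane separation (matching a 27×18 DSP-style multiplier port), and the function name `packed_multiply` are illustrative assumptions, not the paper's exact design; the actual O⁴-DNN unit packs four IF operands per weight and relies on additional LUT-based glue logic, and signed operands would require correction terms.

```python
# Minimal sketch of DSP operation packing: two 8-bit multiplies share one
# wide multiplier, in the spirit of the paper's DSP+LUT computational unit.
# The specific scheme below (two unsigned activations per weight, 18-bit
# lane separation) is an illustrative assumption.

SHIFT = 18  # lane separation; each 8x8 product fits in 16 bits, so lanes cannot overlap

def packed_multiply(weight: int, if0: int, if1: int) -> tuple[int, int]:
    """Compute weight*if0 and weight*if1 with a single wide multiplication.

    weight, if0, if1 are unsigned 8-bit values (0..255).
    """
    assert 0 <= weight < 256 and 0 <= if0 < 256 and 0 <= if1 < 256
    packed_ifs = (if1 << SHIFT) | if0   # concatenate both IF operands into one wide operand
    product = weight * packed_ifs       # the single hardware multiply a DSP block would perform
    p0 = product & ((1 << SHIFT) - 1)   # low lane:  weight * if0
    p1 = product >> SHIFT               # high lane: weight * if1
    return p0, p1

# Quick check against independent multiplies.
for w, a, b in [(255, 255, 255), (17, 3, 200), (0, 9, 9)]:
    assert packed_multiply(w, a, b) == (w * a, w * b)
```

The lane separation works because each 8×8 unsigned product is at most 255 × 255 = 65025 < 2¹⁸, so the low product can never carry into the high lane; this is why one wide multiply recovers both results exactly, and why sharing a weight across several packed IF operands (as the weight-sharing reuse in the paper exploits) is a natural fit for this style of packing.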

Updated: 2020-09-01