An Efficient Hardware Accelerator for Structured Sparse Convolutional Neural Networks on FPGAs
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (IF 2.8) Pub Date: 2020-09-01, DOI: 10.1109/tvlsi.2020.3002779
Chaoyang Zhu, Kejie Huang, Shuyuan Yang, Ziqi Zhu, Hejia Zhang, Haibin Shen

Deep convolutional neural networks (CNNs) have achieved state-of-the-art performance in a wide range of applications. However, complex artificial intelligence (AI) tasks often demand deeper CNN models, which are computationally expensive. Although network compression techniques such as pruning have emerged as a promising direction for mitigating this computational burden, the irregularity introduced by pruning still prevents existing accelerators from fully exploiting the benefits of sparsity. Meanwhile, field-programmable gate arrays (FPGAs) are regarded as a promising hardware platform for CNN inference acceleration, yet most existing FPGA accelerators target dense CNNs and do not address the irregularity problem. In this article, we propose a sparsewise dataflow that skips the cycles spent on multiply-and-accumulate (MAC) operations with zero weights and exploits data statistics through zero gating to avoid unnecessary computation and minimize energy. The proposed sparsewise dataflow requires low memory bandwidth and enables a high degree of data sharing. We then design an FPGA accelerator containing a vector generator module (VGM) that matches the indices of sparse weights and input activations according to the proposed dataflow. Experimental results demonstrate that our implementation achieves 987, 46, and 57 images/s for AlexNet, VGG-16, and ResNet-50 on a Xilinx ZCU102, respectively, providing a $1.5\times$–$6.7\times$ speedup and $2.0\times$–$6.0\times$ higher energy efficiency than previous CNN FPGA accelerators.
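The core idea of the sparsewise dataflow can be illustrated in software. The following is a minimal sketch, not the paper's actual hardware design: sparse weights are stored as (index, value) pairs so that MAC cycles for zero weights are skipped entirely, the index lookup plays the role the VGM performs in hardware, and zero-valued activations are gated to avoid needless multiplies. All function names here are hypothetical.

```python
def compress(weights):
    """Store only nonzero weights with their positions
    (a software stand-in for a compressed sparse weight format)."""
    return [(i, w) for i, w in enumerate(weights) if w != 0]

def sparse_mac(sparse_weights, activations):
    """Accumulate products only over nonzero weights; zero weights
    never enter the loop, so their MAC cycles are skipped."""
    acc = 0
    for i, w in sparse_weights:
        a = activations[i]      # index matching between weight and activation
        if a != 0:              # zero gating: skip multiplies on zero inputs
            acc += w * a
    return acc

weights = [0, 3, 0, 0, -2, 0, 1, 0]
activations = [5, 2, 7, 0, 4, 1, 0, 9]
print(sparse_mac(compress(weights), activations))  # 3*2 + (-2)*4 = -2
```

Only three of the eight weights are nonzero here, so the loop runs three iterations instead of eight, and one of those is further gated away by a zero activation; this is the cycle- and energy-saving behavior the dataflow exploits.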

Updated: 2020-09-01