An Efficient and Flexible Accelerator Design for Sparse Convolutional Neural Networks
IEEE Transactions on Circuits and Systems I: Regular Papers (IF 5.2), Pub Date: 2021-05-20, DOI: 10.1109/tcsi.2021.3074300
Xiaoru Xie, Jun Lin, Zhongfeng Wang, Jinghe Wei

Designing hardware accelerators for convolutional neural networks (CNNs) has recently attracted tremendous attention. Many existing accelerators are built for dense CNNs or structured sparse CNNs. By contrast, unstructured sparse CNNs can achieve a higher compression ratio at equivalent accuracy. However, their hardware implementations generally suffer from load imbalance and conflicting accesses to on-chip buffers, which leads to underutilization of processing elements (PEs). To tackle these issues, we propose a hardware- and power-efficient, highly flexible architecture that supports both unstructured and structured sparse CNNs in various configurations. First, we propose an efficient weight reordering algorithm that preprocesses the compressed weights and balances the workload across PEs. Second, an adaptive on-chip dataflow, namely the hybrid parallel (HP) dataflow, is introduced to promote weight reuse. Third, the partial fusion scheme, first introduced in one of our prior works, is incorporated as the off-chip dataflow. Benefiting from these dataflow optimizations, repetitive data exchanges between on-chip buffers and external memory are significantly reduced. We implement the design on the Intel Arria 10 SX660 platform and evaluate it with MobileNet-v2, ResNet-50, and ResNet-18 on the ImageNet dataset. Compared to existing sparse accelerators on FPGAs, the proposed accelerator achieves a 1.35× to 1.81× improvement in power efficiency at the same sparsity. Compared to prior dense accelerators, it achieves a 1.92× to 5.84× improvement in DSP efficiency.
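The abstract does not spell out the weight reordering algorithm, but the load-balancing idea it names can be illustrated with a small sketch. The snippet below is a minimal, hypothetical Python model (not the paper's published method): it assumes per-output-channel granularity and a greedy longest-processing-time heuristic, assigning each pruned filter to the currently least-loaded PE so that nonzero-weight counts, and hence PE work, stay roughly even. The function name `reorder_filters_for_balance` and the choice of heuristic are illustrative assumptions.

```python
# Illustrative sketch only: greedy load balancing of pruned filters across PEs,
# assuming per-filter (output-channel) work granularity. Not the paper's exact
# reordering algorithm, which is not given in the abstract.
import heapq
import numpy as np

def reorder_filters_for_balance(weights: np.ndarray, num_pes: int):
    """Assign each filter to the currently least-loaded PE.

    weights: pruned 4-D tensor of shape (out_channels, in_channels, kh, kw),
             where zeros mark removed weights.
    Returns a list of per-PE filter-index lists.
    """
    # Work per filter = number of surviving (nonzero) weights.
    nnz = [(int(np.count_nonzero(weights[oc])), oc)
           for oc in range(weights.shape[0])]
    # Place heavy filters first (longest-processing-time heuristic).
    nnz.sort(reverse=True)

    # Min-heap of (current_load, pe_id): always pop the lightest PE.
    heap = [(0, pe) for pe in range(num_pes)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_pes)]

    for work, oc in nnz:
        load, pe = heapq.heappop(heap)
        assignment[pe].append(oc)
        heapq.heappush(heap, (load + work, pe))
    return assignment

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((64, 32, 3, 3))
    w[rng.random(w.shape) < 0.7] = 0.0   # ~70% unstructured sparsity
    lanes = reorder_filters_for_balance(w, num_pes=8)
    loads = [sum(int(np.count_nonzero(w[oc])) for oc in lane) for lane in lanes]
    print("per-PE nonzero loads:", loads)
```

Under unstructured sparsity the nonzero count varies widely between filters, so a static round-robin mapping leaves some PEs idle; a balancing pass like this one equalizes the loads before the compressed weights are streamed to the array, which is the effect the paper's reordering step targets.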
