BISWSRBS: A Winograd-based CNN Accelerator with a Fine-grained Regular Sparsity Pattern and Mixed Precision Quantization
ACM Transactions on Reconfigurable Technology and Systems (IF 2.3) Pub Date: 2021-09-14, DOI: 10.1145/3467476
Tao Yang¹, Zhezhi He¹, Tengchuan Kou¹, Qingzheng Li², Qi Han², Haibao Yu², Fangxin Liu¹, Yun Liang³, Li Jiang¹

Field-programmable Gate Arrays (FPGAs) are a high-performance computing platform for Convolutional Neural Network (CNN) inference. The Winograd algorithm, weight pruning, and quantization are widely adopted to reduce the storage and arithmetic overhead of CNNs on FPGAs. Recent studies strive to prune the weights in the Winograd domain; however, the resulting sparsity patterns are irregular, leading to low parallelism and reduced resource utilization. Besides, few works discuss a quantization scheme suited to Winograd. In this article, we propose a regular sparse pruning pattern for Winograd-based CNNs, namely the Sub-row-balanced Sparsity (SRBS) pattern, to overcome the challenge of irregular sparsity. We then develop a two-step hardware co-optimization approach to improve model accuracy under the SRBS pattern. Based on the pruned model, we implement mixed-precision quantization to further reduce the computational complexity of bit operations. Finally, we design an FPGA accelerator that exploits both the SRBS pattern, to eliminate low-parallelism computation and irregular memory accesses, and the mixed-precision quantization, to obtain layer-wise bit widths. Experimental results on VGG16/VGG-nagadomi with CIFAR-10 and ResNet-18/34/50 with ImageNet show up to 11.8×/8.67× and 8.17×/8.31×/10.6× speedup and 12.74×/9.19× and 8.75×/8.81×/11.1× energy-efficiency improvement, respectively, compared with the state-of-the-art dense Winograd accelerator [20], with negligible loss of model accuracy. We also show that our design achieves a 4.11× speedup over the state-of-the-art sparse Winograd accelerator [19] on VGG16.
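To make the Winograd and SRBS ideas concrete, below is a minimal NumPy sketch of one F(2×2, 3×3) Winograd tile with a toy sub-row-balanced pruning step applied to the transformed kernel. The transform matrices B^T, G, and A^T are the standard F(2×2, 3×3) values; the `srbs_prune` helper, its `sub_row_len`/`keep` parameters, and the per-tile granularity are illustrative assumptions for this sketch, not the paper's exact scheme (which balances nonzeros across sub-rows of the Winograd-domain weight matrix).

```python
import numpy as np

# Standard Winograd F(2x2, 3x3) transform matrices.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float32)

def srbs_prune(U, sub_row_len=2, keep=1):
    """Toy sub-row-balanced pruning (hypothetical parameters): split each
    row of the 4x4 Winograd-domain kernel into sub-rows of `sub_row_len`
    and keep only the `keep` largest-magnitude weights per sub-row, so
    every sub-row carries the same number of nonzeros."""
    U = U.copy()
    for r in range(U.shape[0]):
        for s in range(0, U.shape[1], sub_row_len):
            seg = U[r, s:s + sub_row_len]          # view into U
            drop = np.argsort(np.abs(seg))[:-keep]  # all but top-`keep`
            seg[drop] = 0.0
    return U

def winograd_f2x2_3x3(d, g, sparse=True):
    """One F(2x2, 3x3) tile: 4x4 input tile d, 3x3 kernel g -> 2x2 output."""
    U = G @ g @ G.T            # kernel in the Winograd domain (4x4)
    if sparse:
        U = srbs_prune(U)      # regular, sub-row-balanced sparsity
    V = B_T @ d @ B_T.T        # input tile in the Winograd domain (4x4)
    return A_T @ (U * V) @ A_T.T  # element-wise product, inverse transform

d = np.arange(16, dtype=np.float32).reshape(4, 4)
g = np.ones((3, 3), dtype=np.float32) / 9.0
# With sparse=False this matches the dense 3x3 'valid' convolution on d.
print(winograd_f2x2_3x3(d, g, sparse=False))
```

Because every sub-row holds the same number of nonzeros, processing elements that each consume one sub-row finish in lockstep and index memory predictably, which is the regularity the accelerator exploits to avoid the low parallelism and irregular accesses of unstructured Winograd-domain pruning.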

Updated: 2021-09-14