SWM: A High-Performance Sparse-Winograd Matrix Multiplication CNN Accelerator
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (IF 2.8). Pub Date: 2021-03-09. DOI: 10.1109/tvlsi.2021.3060041
Di Wu, Xitian Fan, Wei Cao, Lingli Wang

Many convolutional neural network (CNN) accelerators have recently been proposed to exploit network sparsity for reduced computation and memory traffic. However, most accelerators cannot exploit the sparsity of both activations and weights, and those that do exploit both cannot maintain stable load balance under a static scheduling (SS) strategy, which is vulnerable to the sparsity distribution. In this work, a balanced compressed sparse row (CSR) format and a dynamic scheduling strategy are proposed to improve load balance. A set-associative structure is also presented to trade off load balance against hardware resource overhead. We propose SWM to accelerate CNN inference, supporting both sparse convolution and sparse fully connected (FC) layers. SWM provides Winograd adaptability for large convolution kernels and supports both 16-bit and 8-bit quantized CNNs. Because activations are shared, 8-bit processing can theoretically achieve twice the performance of 16-bit processing at the same sparsity. The architecture is evaluated with VGG16 and ResNet50 on the Xilinx VCU1525 platform, achieving up to 7.6 TOP/s for sparse-Winograd convolution and 3 TOP/s for sparse matrix multiplication with 16-bit quantization. SWM processes 310/725 images per second for VGG16/ResNet50 with 16-bit quantization. Compared with state-of-the-art works, our design achieves at least a $1.53\times$ speedup and a $1.8\times$ energy efficiency improvement.
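To make the sparse-Winograd idea concrete, below is a minimal NumPy sketch of the standard Winograd F(2,3) minimal-filtering algorithm (the textbook transforms of Lavin and Gray) that sparse-Winograd accelerators such as SWM build on. The function name and example values are illustrative assumptions, not the paper's code; the balanced CSR storage, set-associative structure, and dynamic scheduling are hardware-level details of the paper not reproduced here.

import numpy as np

# F(2,3): 2 outputs from a 4-element input tile and a 3-tap filter,
# using 4 element-wise multiplications instead of the direct method's 6.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of a valid 1-D correlation via Winograd F(2,3)."""
    U = G @ g          # transformed filter (pruned offline in sparse-Winograd)
    V = B_T @ d        # transformed input tile
    M = U * V          # element-wise products: the only multiplications
    return A_T @ M     # inverse transform back to 2 outputs

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.0, -1.0])

# Reference: direct valid correlation with the 3-tap filter.
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)

# Sparse-Winograd designs prune the TRANSFORMED filter U, so zero entries
# of U let the accelerator skip the corresponding multiplications.
U = G @ g                      # here U = [1, 0, 0, -1]
nonzero = np.flatnonzero(U)    # indices a CSR-like format would store
print("transformed filter:", U, "-> multiplications needed:", len(nonzero))

Because zeros appear in the transform domain, the number of surviving multiplications varies from tile to tile, which is exactly why a static schedule is sensitive to the sparsity distribution and why the paper pairs its balanced CSR format with dynamic scheduling.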

Updated: 2021-04-30