Systolic Tensor Array: An Efficient Structured-Sparse GEMM Accelerator for Mobile CNN Inference



arXiv - CS - Hardware Architecture. Pub Date: 2020-05-16, DOI: arxiv-2005.08098
Zhi-Gang Liu, Paul N. Whatmough, Matthew Mattina

Convolutional neural network (CNN) inference on mobile devices demands efficient hardware acceleration of low-precision (INT8) general matrix multiplication (GEMM). The systolic array (SA) is a pipelined 2D array of processing elements (PEs) with very efficient local data movement; it is well suited to accelerating GEMM and is widely deployed in industry. In this work, we describe two significant improvements to the traditional SA architecture that optimize it specifically for CNN inference. First, we generalize the traditional scalar PE into a Tensor-PE, which gives rise to a family of new Systolic Tensor Array (STA) microarchitectures. The STA family increases intra-PE operand reuse and datapath efficiency, reducing circuit area and power dissipation by as much as 2.08x and 1.36x, respectively, compared to the conventional SA at iso-throughput with INT8 operands. Second, we extend this design to support a novel block-sparse data format called density-bound block (DBB). When processing specially trained DBB-sparse models, this variant (STA-DBB) improves on the SA baseline at iso-throughput by 3.14x in area and 1.97x in power, while remaining fully backwards compatible with dense models.
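
The abstract does not spell out the DBB layout, so the following is only a rough sketch of what a density-bound block format could look like. It assumes, as the name suggests, that each fixed-size block of weights may hold at most a bounded number of non-zeros. The block size of 8, the bound of 4, and the names `dbb_compress` and `dbb_dot` are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of a density-bound block (DBB) weight layout.
# Assumption: every block of `block` weights holds at most `max_nnz`
# non-zeros; block=8 and max_nnz=4 are illustrative values only.

def dbb_compress(weights, block=8, max_nnz=4):
    """Store each block as exactly max_nnz (index, value) pairs."""
    if len(weights) % block != 0:
        raise ValueError("weight count must be a multiple of the block size")
    compressed = []
    for start in range(0, len(weights), block):
        blk = weights[start:start + block]
        pairs = [(i, v) for i, v in enumerate(blk) if v != 0]
        if len(pairs) > max_nnz:
            raise ValueError("block exceeds the density bound")
        # Pad with zero-valued entries so every block has the same
        # length; a fixed length keeps the multiplier count per block
        # constant in this sketch.
        pairs += [(0, 0)] * (max_nnz - len(pairs))
        compressed.append(pairs)
    return compressed


def dbb_dot(compressed, activations, block=8):
    """Dot product using only the stored (index, value) pairs."""
    acc = 0
    for b, pairs in enumerate(compressed):
        base = b * block
        for i, v in pairs:
            # The stored index selects the matching activation.
            acc += v * activations[base + i]
    return acc


weights = [0, 3, 0, 0, -2, 0, 0, 1, 5, 0, 0, 0, 0, 0, 0, 0]
activations = list(range(16))
dense = sum(w * a for w, a in zip(weights, activations))
assert dbb_dot(dbb_compress(weights), activations) == dense
```

Under these assumptions each block needs only `max_nnz` multiplies instead of `block`, which is one plausible source of the area and power savings the abstract reports. A dense model would correspond to setting `max_nnz` equal to `block`.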


Updated: 2020-05-19
