XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Networks on RISC-V Based IoT End Nodes
IEEE Transactions on Emerging Topics in Computing (IF 5.1) Pub Date: 2021-04-16, DOI: 10.1109/tetc.2021.3072337
Angelo Garofalo, Giuseppe Tagliavini, Francesco Conti, Luca Benini, Davide Rossi

Heavily quantized fixed-point arithmetic is becoming a common approach to deploy Convolutional Neural Networks (CNNs) on limited-memory, low-power IoT end-nodes. However, this trend is limited by the lack of support for low-bitwidth operations in the arithmetic units of state-of-the-art embedded Microcontrollers (MCUs). This work proposes a multi-precision arithmetic unit fully integrated into a RISC-V processor at the micro-architectural and ISA level to boost the efficiency of heavily Quantized Neural Network (QNN) inference on microcontroller-class cores. By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we show near-linear speedup with respect to higher-precision integer computation on the key kernels for QNN computation. We also propose a custom execution paradigm for SIMD sum-of-dot-product operations, which fuses a dot product with a load operation, improving peak MAC/cycle throughput by up to 1.64× compared to a standard execution scenario. To further push efficiency, we integrate the extended RISC-V core in a parallel cluster of 8 processors, with near-linear improvement with respect to a single-core architecture. To evaluate the proposed extensions, we fully implement the cluster of processors in GF22FDX technology. QNN convolution kernels on a parallel cluster implementing the proposed extension run 6× and 8× faster on 4- and 2-bit data operands, respectively, compared to a baseline processing cluster supporting only 8-bit SIMD instructions. With a peak of 2.22 TOPS/W, the proposed solution achieves efficiency levels comparable with dedicated DNN inference accelerators and up to three orders of magnitude better than state-of-the-art ARM Cortex-M based microcontroller systems such as the low-end STM32L4 MCU and the high-end STM32H7 MCU.
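To make the arithmetic behind these extensions concrete, the sketch below is a plain-C scalar reference of what a SIMD sum-of-dot-product over nibble (4-bit) operands computes: two 32-bit words, each packing eight signed 4-bit values, are multiplied element-wise and accumulated into a 32-bit register. This is only an illustration of the operation being accelerated, not the XpulpNN implementation; the proposed extension executes the per-word loop as a single instruction (optionally fused with the operand load), and the names sdotp_nibble and conv_inner_loop are hypothetical.

    #include <stdint.h>

    /* Scalar reference of a SIMD sum-of-dot-product on packed 4-bit (nibble)
     * operands. In the proposed extension this per-word loop corresponds to one
     * instruction, optionally fused with the load of the next operands; here it
     * is spelled out only to show the arithmetic. Names are illustrative. */
    static int32_t sdotp_nibble(uint32_t a_packed, uint32_t b_packed, int32_t acc)
    {
        for (int i = 0; i < 8; i++) {             /* 8 nibbles per 32-bit word */
            int32_t a = (a_packed >> (4 * i)) & 0xF;
            int32_t b = (b_packed >> (4 * i)) & 0xF;
            if (a & 0x8) a -= 16;                 /* sign-extend to [-8, 7] */
            if (b & 0x8) b -= 16;
            acc += a * b;                         /* multiply-accumulate */
        }
        return acc;
    }

    /* Inner loop of a QNN convolution: accumulate over the packed weight and
     * activation words contributing to one output value. */
    int32_t conv_inner_loop(const uint32_t *w, const uint32_t *x, int n_words)
    {
        int32_t acc = 0;
        for (int j = 0; j < n_words; j++)
            acc = sdotp_nibble(w[j], x[j], acc);
        return acc;
    }

The 2-bit (crumb) case packs sixteen values per 32-bit word and follows the same pattern, which is why the speedup over an 8-bit-only SIMD baseline scales nearly linearly as the operand bitwidth shrinks.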

Updated: 2021-04-16