XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Network on RISC-V based IoT End Nodes
arXiv - CS - Hardware Architecture. Pub Date: 2020-11-29, DOI: arxiv-2011.14325
Angelo Garofalo, Giuseppe Tagliavini, Francesco Conti, Luca Benini, Davide Rossi

This work introduces lightweight extensions to the RISC-V ISA to boost the efficiency of heavily quantized neural network (QNN) inference on microcontroller-class cores. By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we achieve near-linear speedup with respect to higher-precision integer computation on the key QNN kernels. We also propose a custom execution paradigm for SIMD sum-of-dot-product operations, which fuses the dot product with a load operation and improves peak MAC/cycle by up to 1.64x compared to a standard execution scenario. To push efficiency further, we integrate the extended RISC-V core into a parallel cluster of eight processors, obtaining near-linear improvement with respect to a single-core architecture. To evaluate the proposed extensions, we fully implement the cluster of processors in GF22FDX technology. QNN convolution kernels on a parallel cluster implementing the proposed extensions run 6x and 8x faster on 4-bit and 2-bit data operands, respectively, compared to a baseline processing cluster supporting only 8-bit SIMD instructions. With a peak of 2.22 TOPS/W, the proposed solution achieves efficiency levels comparable with dedicated DNN inference accelerators, and up to three orders of magnitude better than state-of-the-art ARM Cortex-M based microcontroller systems such as the low-end STM32L4 MCU and the high-end STM32H7 MCU.
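The kernels above rely on SIMD sum-of-dot-product operations over packed sub-byte operands. As an illustration only, the sketch below is a portable C reference of a nibble (4-bit) sum-of-dot-product: each 32-bit word packs eight unsigned 4-bit values, and the whole unpack-multiply-accumulate loop is what a single SIMD instruction of the proposed extension would perform. The function name and the commented intrinsic are hypothetical placeholders, not the actual ISA mnemonics or compiler builtins.

```c
#include <stdint.h>
#include <stdio.h>

/* Portable reference for a 4-bit (nibble) sum-of-dot-product.
 * Each 32-bit word packs eight unsigned 4-bit operands; the extension
 * described in the paper would execute this whole loop body as one SIMD
 * sum-of-dot-product instruction. Names here are illustrative only. */
static int32_t sdotp_nibble(uint32_t a, uint32_t w, int32_t acc)
{
    for (int i = 0; i < 8; i++) {
        uint32_t ai = (a >> (4 * i)) & 0xF;   /* i-th 4-bit activation */
        uint32_t wi = (w >> (4 * i)) & 0xF;   /* i-th 4-bit weight     */
        acc += (int32_t)(ai * wi);            /* 32-bit accumulation   */
    }
    /* With the extension, roughly: acc = __builtin_sdotp_u4(a, w, acc);
     * (hypothetical intrinsic, shown only to indicate the mapping). */
    return acc;
}

int main(void)
{
    /* Two packed words, i.e. 16 nibble MACs, reduced into one 32-bit sum. */
    uint32_t act[2] = { 0x12345678u, 0x0F0F0F0Fu };
    uint32_t wgt[2] = { 0x11111111u, 0x01020304u };
    int32_t acc = 0;
    for (int k = 0; k < 2; k++)
        acc = sdotp_nibble(act[k], wgt[k], acc);
    printf("acc = %d\n", acc);
    return 0;
}
```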

Updated: 2020-12-01