XpulpNN: Enabling Energy Efficient and Flexible Inference of Quantized Neural Network on RISC-V based IoT End Nodes
arXiv - CS - Hardware Architecture. Pub Date: 2020-11-29, DOI: arxiv-2011.14325
Angelo Garofalo, Giuseppe Tagliavini, Francesco Conti, Luca Benini, Davide Rossi

This work introduces lightweight extensions to the RISC-V ISA to boost the efficiency of heavily quantized neural network (QNN) inference on microcontroller-class cores. By extending the ISA with nibble (4-bit) and crumb (2-bit) SIMD instructions, we achieve near-linear speedup with respect to higher-precision integer computation on the key QNN kernels. We also propose a custom execution paradigm for SIMD sum-of-dot-product operations, which fuses the dot product with a load operation and improves peak MAC/cycle by up to 1.64x compared to a standard execution scenario. To push efficiency further, we integrate the extended RISC-V core into a parallel cluster of eight processors, obtaining near-linear improvement with respect to a single-core architecture. To evaluate the proposed extensions, we fully implement the cluster of processors in GF22FDX technology. QNN convolution kernels on a parallel cluster implementing the proposed extensions run 6x and 8x faster on 4-bit and 2-bit data operands, respectively, compared to a baseline processing cluster supporting only 8-bit SIMD instructions. With a peak of 2.22 TOPS/W, the proposed solution achieves efficiency levels comparable with dedicated DNN inference accelerators, and up to three orders of magnitude better than state-of-the-art ARM Cortex-M based microcontroller systems such as the low-end STM32L4 MCU and the high-end STM32H7 MCU.
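The kernels above rely on SIMD sum-of-dot-product operations over packed sub-byte operands. As an illustration only, the sketch below is a portable C reference of a nibble (4-bit) sum-of-dot-product: each 32-bit word packs eight unsigned 4-bit values, and the whole unpack-multiply-accumulate loop is what a single SIMD instruction of the proposed extension would perform. The function name and the commented intrinsic are hypothetical placeholders, not the actual ISA mnemonics or compiler builtins.

```c
#include <stdint.h>
#include <stdio.h>

/* Portable reference for a 4-bit (nibble) sum-of-dot-product.
 * Each 32-bit word packs eight unsigned 4-bit operands; the extension
 * described in the paper would execute this whole loop body as one SIMD
 * sum-of-dot-product instruction. Names here are illustrative only. */
static int32_t sdotp_nibble(uint32_t a, uint32_t w, int32_t acc)
{
    for (int i = 0; i < 8; i++) {
        uint32_t ai = (a >> (4 * i)) & 0xF;   /* i-th 4-bit activation */
        uint32_t wi = (w >> (4 * i)) & 0xF;   /* i-th 4-bit weight     */
        acc += (int32_t)(ai * wi);            /* 32-bit accumulation   */
    }
    /* With the extension, roughly: acc = __builtin_sdotp_u4(a, w, acc);
     * (hypothetical intrinsic, shown only to indicate the mapping). */
    return acc;
}

int main(void)
{
    /* Two packed words, i.e. 16 nibble MACs, reduced into one 32-bit sum. */
    uint32_t act[2] = { 0x12345678u, 0x0F0F0F0Fu };
    uint32_t wgt[2] = { 0x11111111u, 0x01020304u };
    int32_t acc = 0;
    for (int k = 0; k < 2; k++)
        acc = sdotp_nibble(act[k], wgt[k], acc);
    printf("acc = %d\n", acc);
    return 0;
}
```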

Updated: 2020-12-01