Binary Precision Neural Network Manycore Accelerator
ACM Journal on Emerging Technologies in Computing Systems (IF 2.1). Pub Date: 2021-04-05. DOI: 10.1145/3423136
Morteza Hosseini, Tinoosh Mohsenin

This article presents a low-power, programmable, domain-specific manycore accelerator, the Binarized Neural Network Manycore Accelerator (BiNMAC), which adopts and efficiently executes binary-precision weight/activation neural network models. Such networks have compact models in which weights are constrained to 1 bit, so several weights can be packed into a single memory entry, minimizing the memory footprint. Packing weights also facilitates single-instruction, multiple-data execution with simple circuitry, maximizing performance and efficiency. The proposed BiNMAC has lightweight cores that support domain-specific instructions, and a router-based memory access architecture that enables efficient implementation of layers in binary-precision weight/activation neural networks of suitable size. With only 3.73% area and 1.98% average power overhead, novel instructions such as Combined Population-Count-XNOR, Patch-Select, and Bit-based Accumulation are added to the BiNMAC instruction set architecture; each replaces a frequently used function with a single clock cycle that would otherwise have taken 54, 4, and 3 clock cycles, respectively. Additionally, customized logic is added to every core to transpose 16×16-bit blocks of memory at the bit level, which expedites reshaping intermediate data so that it is well aligned for bitwise operations. A 64-cluster BiNMAC architecture is fully placed and routed in 65-nm TSMC CMOS technology, where a single cluster occupies 0.53 mm² with an average power of 232 mW at a 1-GHz clock frequency and 1.1 V. The 64-cluster architecture occupies 36.5 mm² and, if fully exploited, consumes a total of 16.4 W and performs 1,360 Giga Operations Per Second (GOPS) while providing full programmability. To demonstrate its scalability, four binarized case studies were implemented on BiNMAC: ResNet-20 and LeNet-5 for high-performance image classification, and a ConvNet and a multilayer perceptron for low-power physiological applications. The implementation results indicate that the population-count instruction alone speeds up performance by approximately 5×. When the other new instructions are added to a RISC machine that already has a population-count instruction, performance increases by 58% on average. To compare the performance of BiNMAC with commercial off-the-shelf platforms, the case studies were also implemented with double-precision floating-point models on the NVIDIA Jetson TX2 SoC (CPU+GPU). The results indicate that, within a margin of ∼2.1%–9.5% accuracy loss, BiNMAC on average outperforms the TX2 GPU by approximately 1.9× (or 7.5× with fabrication technology scaled) in energy consumption for image classification applications. In low-power settings, and within a margin of ∼3.7%–5.5% accuracy loss compared to an ARM Cortex-A57 CPU implementation, BiNMAC is roughly 9.7×–17.2× (or 38.8×–68.8× with fabrication technology scaled) more energy efficient for physiological applications while meeting the application deadline.
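
For readers unfamiliar with binarized arithmetic, the packing scheme described in the abstract reduces a dot product over ±1 weights and activations to an XNOR followed by a population count. The following is a minimal software sketch of that identity in C, assuming a 0→−1, 1→+1 bit encoding; `binary_dot64` is a hypothetical helper for illustration, not part of the BiNMAC ISA, whose fused Population-Count-XNOR instruction performs the equivalent work in a single cycle.

```c
#include <stdint.h>

/* Sketch of the XNOR/popcount identity behind binarized layers.
 * Bits encode {-1,+1} values (0 -> -1, 1 -> +1). XNOR sets a bit
 * where the signs agree, so the dot product over 64 lanes is
 *   (#agreements) - (#disagreements) = 2*popcount(~(w ^ a)) - 64.
 * binary_dot64 is a hypothetical helper, shown for illustration only. */
static inline int binary_dot64(uint64_t w, uint64_t a)
{
    uint64_t agree = ~(w ^ a);              /* XNOR of packed sign bits */
    return 2 * __builtin_popcountll(agree) - 64;
}
```

Packing many 1-bit weights per memory word is also what makes the SIMD execution mentioned in the abstract cheap: one XNOR and one popcount stand in for 64 multiply-accumulates.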
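
The per-core transpose logic operates on 16×16-bit blocks at the bit level; a standard software reference for that operation is the recursive block-swap bit-matrix transpose (as in Hacker's Delight). The sketch below, assuming the block is stored as sixteen `uint16_t` row words, shows what the dedicated logic computes; it is illustrative and not the BiNMAC circuit itself.

```c
#include <stdint.h>

/* Transpose a 16x16 bit matrix stored as 16 row words, where bit c of
 * A[r] is element (r, c). Classic recursive block swap: exchange 8x8,
 * then 4x4, 2x2, and 1x1 sub-blocks across the diagonal. BiNMAC's
 * per-core logic performs this reshaping in hardware; this loop is a
 * software reference for the same result. */
void transpose16(uint16_t A[16])
{
    uint16_t m = 0x00FFu;                   /* mask for the lower sub-block */
    for (int j = 8; j != 0; j >>= 1, m ^= (uint16_t)(m << j)) {
        for (int k = 0; k < 16; k = (k + j + 1) & ~j) {
            uint16_t t = (A[k] ^ (A[k + j] >> j)) & m;
            A[k]     ^= t;
            A[k + j] ^= (uint16_t)(t << j);
        }
    }
}
```

The loop above touches every row word several times per block; that repeated shuffling is the overhead the dedicated transpose logic removes when aligning intermediate data for bitwise operations.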

Updated: 2021-04-05