TMA: Tera‐MACs/W neural hardware inference accelerator with a multiplier‐less massive parallel processor, International Journal of Circuit Theory and Applications



International Journal of Circuit Theory and Applications ( IF 1.8 ) Pub Date : 2021-01-07 , DOI: 10.1002/cta.2917
Hyunbin Park, Dohyun Kim, Shiho Kim


Computationally intensive inference tasks of deep neural networks have brought about a revolution in accelerator architecture, which aims to reduce both power consumption and latency. The key figure of merit for hardware inference accelerators is the number of multiply‐and‐accumulate operations per watt (MACs/W); the state of the art so far has been several hundred giga‐MACs/W. We propose a tera‐MACs/W neural hardware inference accelerator (TMA) with 8‐bit activations and scalable integer weights of less than 1 byte. The architecture's main feature is a configurable neural processing element for matrix‐vector operations. This processing element uses a massively parallel processor that works without multipliers, which makes it attractive for energy‐efficient, high‐performance neural network applications. We benchmark the system's latency, power, and performance using AlexNet trained on ImageNet, and we compare the accelerator's throughput and power consumption with those of prior works. The proposed accelerator outperforms state‐of‐the‐art counterparts in energy and area efficiency, achieving 2.3 TMACs/W at 1.0 V on a 28‐nm Virtex‐7 FPGA chip.
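The abstract does not say how the processing element avoids multipliers. As a rough illustration only, the sketch below uses shift‐and‐add, one common multiplier‐free technique, to compute a matrix‐vector product with unsigned 8‐bit activations and small signed integer weights. The function names (shift_add_mul, matvec, macs_per_watt), the unsigned activation range, and the reading of MACs/W as MAC throughput divided by power are assumptions made for this example, not details taken from the paper.

```python
def shift_add_mul(activation: int, weight: int) -> int:
    """Multiply an unsigned 8-bit activation by a signed integer weight
    using only shifts, additions, and a final sign flip (no '*')."""
    if not 0 <= activation <= 255:
        raise ValueError("activation must fit in 8 unsigned bits (assumed range)")
    negative = weight < 0
    magnitude = -weight if negative else weight
    acc = 0
    bit = 0
    while magnitude:
        if magnitude & 1:
            # Each set weight bit contributes a shifted copy of the activation.
            acc += activation << bit
        magnitude >>= 1
        bit += 1
    return -acc if negative else acc


def matvec(weights, activations):
    """Matrix-vector product built only from shift_add_mul and additions."""
    out = []
    for row in weights:
        total = 0
        for w, a in zip(row, activations):
            total += shift_add_mul(a, w)  # one multiply-accumulate (MAC)
        out.append(total)
    return out


def macs_per_watt(macs_per_second: float, watts: float) -> float:
    """Figure of merit, assumed here to be MAC throughput divided by power."""
    return macs_per_second / watts


if __name__ == "__main__":
    weights = [[3, -2, 0], [1, 7, -5]]   # hypothetical small integer weights
    activations = [10, 200, 255]         # hypothetical 8-bit activations
    print(matvec(weights, activations))  # [-370, 135]
    # Made-up operating point for illustration, not a measurement from the paper.
    print(macs_per_watt(4.6e12, 2.0))
```

Narrower weights mean fewer shifted additions per MAC in this scheme, which suggests why a design might pair sub‐byte weights with a multiplier‐free datapath; whether the TMA works this way is not stated in the abstract.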


Updated: 2021-01-07
