FantastIC4: A Hardware-Software Co-Design Approach for Efficiently Running 4Bit-Compact Multilayer Perceptrons
IEEE Open Journal of Circuits and Systems (IF 2.4). Pub Date: 2021-05-25. DOI: 10.1109/ojcas.2021.3083332
Simon Wiedemann 1 , Suhas Shivapakash 2 , Daniel Becking 1 , Pablo Wiedemann 1 , Wojciech Samek 1 , Friedel Gerfers 2 , Thomas Wiegand 1

With the growing demand for deploying Deep Learning models to the “edge”, it is paramount to develop techniques that allow state-of-the-art models to execute within very tight and limited resource constraints. In this work we propose a software-hardware optimization paradigm for obtaining a highly efficient execution engine for deep neural networks (DNNs) that are based on fully-connected layers. The work’s approach is centred around compression as a means for reducing the area as well as the power requirements of, concretely, multilayer perceptrons (MLPs) with high predictive performance. First, we design a novel hardware architecture named FantastIC4, which (1) supports the efficient on-chip execution of multiple compact representations of fully-connected layers and (2) minimizes the number of multipliers required for inference to only 4 (thus the name). Moreover, in order to make the models amenable to efficient execution on FantastIC4, we introduce a novel entropy-constrained training method that renders them simultaneously robust to 4bit quantization and highly compressible in size. The experimental results show that we can achieve a throughput of 2.45 TOPS with a total power consumption of 3.6W on a Virtex UltraScale FPGA XCVU440 device implementation, and a total power efficiency of 20.17 TOPS/W on a 22nm process ASIC version. When compared to other state-of-the-art accelerators designed for the Google Speech Command (GSC) dataset, FantastIC4 is better by 51× in terms of throughput and 145× in terms of area efficiency (GOPS/mm²).
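To make the abstract's central idea concrete, the sketch below shows generic uniform 4-bit weight quantization for a fully-connected layer, together with the empirical entropy of the quantized values as a proxy for compressibility. This is an illustrative assumption-laden example, not the paper's actual entropy-constrained training procedure (which operates during training); the function names and the symmetric quantization scheme are chosen for clarity.

```python
import numpy as np

def quantize_4bit(w, num_levels=16):
    """Uniform symmetric 4-bit quantization: map float weights onto
    at most 16 integer levels in [-8, 7]. (Illustrative scheme only;
    the paper's method is an entropy-constrained training procedure.)"""
    w_max = np.max(np.abs(w))
    scale = w_max / (num_levels // 2 - 1) if w_max > 0 else 1.0
    q = np.clip(np.round(w / scale), -(num_levels // 2), num_levels // 2 - 1)
    return q.astype(np.int8), scale

def weight_entropy(q):
    """Empirical entropy in bits/weight of the quantized values.
    Lower entropy means the 4-bit codes compress further with an
    entropy coder -- the effect the training method aims for."""
    _, counts = np.unique(q, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(256, 256))  # weights of one FC layer
q, scale = quantize_4bit(w)

# All values fit in a signed 4-bit range, and the bell-shaped weight
# distribution yields an entropy below 4 bits/weight, i.e. the layer
# is compressible beyond plain 4-bit storage.
print(int(q.min()), int(q.max()))
print(round(weight_entropy(q), 2))
```

Under this scheme the dequantized weights `q * scale` differ from the originals by at most half a quantization step, which is why a narrow, low-entropy weight distribution (as encouraged by entropy-constrained training) keeps both the error and the compressed size small.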

Updated: 2021-05-25