Low Latency CMOS Hardware Acceleration for Fully Connected Layers in Deep Neural Networks
arXiv - CS - Hardware Architecture. Pub Date: 2020-11-25, DOI: arxiv-2011.12839
Nick Iliev, Amit Ranjan Trivedi

We present a novel low-latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The FC accelerator, FC-ACCL, is based on 128 8x8 or 16x16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 128 High Bandwidth Memory (HBM) units that store the pretrained weights. Micro-architectural details for CMOS ASIC implementations are presented, and simulated performance is compared to recent DNN hardware accelerators on AlexNet and VGG16. When comparing simulated processing latency for a 4096-1000 FC8 layer, our FC-ACCL achieves 48.4 GOPS (with a 100 MHz clock), improving on a recent FC8-layer accelerator quoted at 28.8 GOPS with a 150 MHz clock. We achieve this considerable improvement by fully utilizing the HBM units to store and read out column-specific FC-layer weights in one cycle with a novel column-row-column schedule, and by implementing a maximally parallel datapath that processes these weights with the corresponding MAC and PE units. When up-scaled to 128 16x16 PEs, for 16x16 tiles of weights, the design reduces latency for the large FC6 layer by 60% in AlexNet and by 3% in VGG16 compared to an alternative EIE solution that uses compression.
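The core computation an FC layer delegates to the PE array is a tiled matrix-vector product: each 8x8 (or 16x16) PE multiplies one tile of weights against the matching slice of the input activation, and the MAC units accumulate per-row partial sums. A minimal NumPy sketch of that tiling is below; it is illustrative only (the tiles here are processed sequentially in software, whereas FC-ACCL streams column tiles from its 128 HBM units and runs the 128 PEs in parallel), and the function and parameter names are hypothetical:

```python
import numpy as np

def tiled_matvec(W, x, tile=8):
    """Tiled matrix-vector product y = W @ x, mimicking an array of
    processing elements (PEs) that each handle one tile x tile block
    of weights, with MAC-style accumulation of per-row partial sums.
    Software stand-in only: the hardware processes tiles in parallel."""
    rows, cols = W.shape
    y = np.zeros(rows)
    # Outer loop over weight columns, inner loop over rows: a sequential
    # stand-in for the column-row-column schedule named in the abstract.
    for c0 in range(0, cols, tile):
        for r0 in range(0, rows, tile):
            Wt = W[r0:r0 + tile, c0:c0 + tile]  # one PE's weight tile
            xt = x[c0:c0 + tile]                # matching input slice
            y[r0:r0 + tile] += Wt @ xt          # MAC: accumulate partials
    return y
```

For scale, a 4096-1000 FC8 layer performs roughly 2 x 4096 x 1000 ≈ 8.2 M operations (counting each MAC as two operations, the usual convention), so the quoted 48.4 GOPS corresponds to a per-layer latency on the order of 8.2e6 / 48.4e9 ≈ 0.17 ms.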

Updated: 2020-11-27