Low Latency CMOS Hardware Acceleration for Fully Connected Layers in Deep Neural Networks
arXiv - CS - Hardware Architecture. Pub Date: 2020-11-25, DOI: arxiv-2011.12839
Nick Iliev, Amit Ranjan Trivedi

We present a novel low-latency CMOS hardware accelerator for fully connected (FC) layers in deep neural networks (DNNs). The FC accelerator, FC-ACCL, is based on 128 8x8 or 16x16 processing elements (PEs) for matrix-vector multiplication, and 128 multiply-accumulate (MAC) units integrated with 128 High Bandwidth Memory (HBM) units that store the pretrained weights. Micro-architectural details for CMOS ASIC implementations are presented, and simulated performance is compared to recent DNN hardware accelerators on AlexNet and VGG16. When comparing simulated processing latency for a 4096-1000 FC8 layer, our FC-ACCL achieves 48.4 GOPS (with a 100 MHz clock), improving on a recent FC8-layer accelerator quoted at 28.8 GOPS with a 150 MHz clock. We achieve this considerable improvement by fully utilizing the HBM units to store and read out column-specific FC-layer weights in one cycle with a novel column-row-column schedule, and by implementing a maximally parallel datapath that processes these weights with the corresponding MAC and PE units. When up-scaled to 128 16x16 PEs, for 16x16 tiles of weights, the design reduces latency for the large FC6 layer by 60% in AlexNet and by 3% in VGG16 compared to an alternative EIE solution that uses compression.
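The core computation an FC layer delegates to the PE array is a tiled matrix-vector product: each 8x8 (or 16x16) PE multiplies one tile of weights against the matching slice of the input activation, and the MAC units accumulate per-row partial sums. A minimal NumPy sketch of that tiling is below; it is illustrative only (the tiles here are processed sequentially in software, whereas FC-ACCL streams column tiles from its 128 HBM units and runs the 128 PEs in parallel), and the function and parameter names are hypothetical:

```python
import numpy as np

def tiled_matvec(W, x, tile=8):
    """Tiled matrix-vector product y = W @ x, mimicking an array of
    processing elements (PEs) that each handle one tile x tile block
    of weights, with MAC-style accumulation of per-row partial sums.
    Software stand-in only: the hardware processes tiles in parallel."""
    rows, cols = W.shape
    y = np.zeros(rows)
    # Outer loop over weight columns, inner loop over rows: a sequential
    # stand-in for the column-row-column schedule named in the abstract.
    for c0 in range(0, cols, tile):
        for r0 in range(0, rows, tile):
            Wt = W[r0:r0 + tile, c0:c0 + tile]  # one PE's weight tile
            xt = x[c0:c0 + tile]                # matching input slice
            y[r0:r0 + tile] += Wt @ xt          # MAC: accumulate partials
    return y
```

For scale, a 4096-1000 FC8 layer performs roughly 2 x 4096 x 1000 ≈ 8.2 M operations (counting each MAC as two operations, the usual convention), so the quoted 48.4 GOPS corresponds to a per-layer latency on the order of 8.2e6 / 48.4e9 ≈ 0.17 ms.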

Updated: 2020-11-27