A Communication-Aware DNN Accelerator on ImageNet Using in-Memory Entry-Counting Based Algorithm-Circuit-Architecture Co-Design in 65nm CMOS
IEEE Journal on Emerging and Selected Topics in Circuits and Systems (IF 3.7), Pub Date: 2020-09-01, DOI: 10.1109/jetcas.2020.3014920
Haozhe Zhu , Chixiao Chen , Shiwei Liu , Qiaosha Zou , Mingyu Wang , Lihua Zhang , Xiaoyang Zeng , C.-J. Richard Shi

This article presents a communication-aware processing-in-memory deep neural network accelerator, which implements an in-memory entry-counting scheme for low bit-width quantized multiply-and-accumulate (MAC) operations. To maintain good accuracy on ImageNet, the proposed design adopts a full-stack co-design methodology, from algorithms and circuits to architectures. At the algorithm level, an entry-counting based MAC is proposed to fit the learned step-size quantization scheme and to intrinsically exploit the sparsity of both activations and weights. At the circuit level, content-addressable memory cells and multiplexed arrays are developed in the processing-in-memory macro. At the architecture level, the proposed design is compatible with different stationary dataflow mappings, further reducing memory access. An in-memory entry-counting silicon prototype and its entire peripheral circuits are fabricated in 65nm LP CMOS technology with an active area of 0.76 × 0.66 mm². The 7.36-Kb processing-in-memory macro with 128 search entries reduces the number of multiplications by 12.8×. The peak throughput is 3.58 GOPS, achieved at a clock rate of 143 MHz and a supply voltage of 1.23 V. The peak energy efficiency of the processing-in-memory macro is 11.6 TOPS/W, achieved at a clock rate of 40 MHz and a supply voltage of 1.01 V. Note that the physical design of the entry-counting memory is completed in a standard digital placement-and-routing flow by augmenting the library with two dedicated memory cells. A 3-bit quantized ResNet-18 is evaluated on the ImageNet dataset, achieving a top-1 accuracy of 64.4%.
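The abstract does not spell out how entry counting replaces multiplications, but the general idea behind such schemes can be sketched: with low bit-width quantization there are only a few distinct (activation, weight) value pairs, so a dot product can be computed by counting how often each pair occurs (in hardware, one count per matching CAM search entry) and performing a single multiply per distinct pair. Zero-valued pairs contribute nothing, so sparsity in both operands is skipped for free. The following is a minimal software sketch of this counting idea, not the authors' exact circuit-level scheme:

```python
# Hedged sketch of an entry-counting MAC for low bit-width quantized values.
# Assumption: acts/wts are small integers (e.g. 3-bit quantization levels);
# the in-memory CAM-search details of the paper are abstracted away.
from collections import Counter

def entry_counting_mac(acts, wts):
    """Dot product computed by counting distinct nonzero value pairs."""
    counts = Counter()
    for a, w in zip(acts, wts):
        if a != 0 and w != 0:       # skip zero entries: exploits sparsity
            counts[(a, w)] += 1     # one "search hit" per matching entry
    # one multiplication per distinct nonzero value pair, not per element
    return sum(a * w * n for (a, w), n in counts.items())

acts = [0, 3, 1, 3, 0, 2, 3, 1]
wts  = [1, 2, 0, 2, 3, 1, 2, 1]
# 8 elements but only 3 distinct nonzero pairs -> 3 multiplications
assert entry_counting_mac(acts, wts) == sum(a * w for a, w in zip(acts, wts))
```

With 3-bit operands there are at most 49 distinct nonzero value pairs regardless of vector length, which is why the multiplication count can drop by a large factor (the paper reports 12.8× for its 128-entry macro).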

Updated: 2020-09-01