A Logic-Compatible eDRAM Compute-In-Memory With Embedded ADCs for Processing Neural Networks
IEEE Transactions on Circuits and Systems I: Regular Papers (IF 5.2), Pub Date: 2021-02-01, DOI: 10.1109/tcsi.2020.3036209
Chengshuo Yu, Taegeun Yoo, Hyunjoon Kim, Tony Tae-Hyoung Kim, Kevin Chai Tshun Chuan, Bongjin Kim

A novel 4T2C ternary embedded DRAM (eDRAM) cell is proposed for computing vector-matrix multiplication inside the memory array. The proposed eDRAM-based compute-in-memory (CIM) architecture addresses the well-known von Neumann bottleneck of traditional computer architecture and improves both latency and energy in processing neural networks. The proposed ternary eDRAM cell occupies a smaller area than prior SRAM-based bitcells, which use 6–12 transistors. Despite being more compact, the eDRAM cell stores a ternary state (−1, 0, or +1), whereas the SRAM bitcells can store only a binary state. We also present a method to mitigate the compute accuracy degradation caused by device mismatches and variations. In addition, we extend the eDRAM cell retention time to $200~\mu\text{s}$ by adding a custom metal capacitor at the storage node. With the improved retention time, the overall energy consumption of the eDRAM macro, including the regular refresh operation, is lower than that of most prior SRAM-based CIM macros. A $128\times 128$ ternary eDRAM macro computes a vector-matrix multiplication between a vector of 64 binary inputs and a matrix of $64\times 128$ ternary weights, so 128 outputs are generated in parallel. Both the weight and input bit-precisions are programmable to support a wide range of edge computing applications with different performance requirements: they are tuned by assigning a variable number of eDRAM cells per weight or by adding multiple pulses to the input. An embedded column ADC based on replica cells sweeps the reference level for $2^{\mathrm{N}}-1$ cycles and converts the analog accumulated bitline voltage to a 1–5 bit digital output. A Monte Carlo simulation (3K runs) of the critical bitline accumulate operation shows a standard deviation of 2.84%, which could degrade the classification accuracy by 0.6% on the MNIST dataset and by 1.3% on the CIFAR-10 dataset versus a baseline with no variation. The simulated energy is 1.81 fJ/operation, and the energy efficiency is 552.5–17.8 TOPS/W (for 1–5 bit ADC) at 200 MHz in 65 nm technology.
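
The dataflow described in the abstract can be illustrated with a short behavioral sketch in Python. This is not the authors' circuit or simulation model: it treats the bitline accumulation as an ideal integer sum, assumes the binary inputs take values in {0, 1}, and assumes the swept ADC reference levels are uniformly spaced over the full output range. The function names (`ternary_vmm`, `sweep_adc`, `efficiency_tops_per_watt`) are hypothetical. The last function rests on one further assumption, namely that the energy per operation grows in proportion to the $2^{\mathrm{N}}-1$ sweep cycles; that reading happens to reproduce both quoted end points (552.5 and 17.8 TOPS/W) from the 1.81 fJ/operation figure.

```python
import numpy as np

def ternary_vmm(inputs, weights):
    """Ideal column-wise accumulation: one sum per weight column.

    inputs  : shape (64,), assumed to hold binary values in {0, 1}
    weights : shape (64, 128), ternary values in {-1, 0, +1}
    """
    return inputs @ weights

def sweep_adc(column_sums, n_bits, full_scale):
    """Behavioral model of a reference-sweeping ADC.

    The reference is stepped through 2**n_bits - 1 levels (one per cycle);
    the output code is the number of levels the column sum exceeds.
    Uniform level spacing is an assumption, not taken from the paper.
    """
    n_cycles = 2 ** n_bits - 1
    # Drop the two end points so the levels lie strictly inside the range.
    refs = np.linspace(-full_scale, full_scale, n_cycles + 2)[1:-1]
    codes = np.zeros(np.shape(column_sums), dtype=int)
    for ref in refs:  # one comparison per sweep cycle
        codes += (np.asarray(column_sums) > ref).astype(int)
    return codes

def efficiency_tops_per_watt(energy_fj_per_op, n_bits):
    """1 fJ/op corresponds to 1000 TOPS/W.

    Dividing by the number of sweep cycles is an assumed scaling.
    """
    return 1e3 / (energy_fj_per_op * (2 ** n_bits - 1))

# Example with random data (illustrative only).
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=64)           # 64 binary inputs
w = rng.integers(-1, 2, size=(64, 128))   # 64 x 128 ternary weights
sums = ternary_vmm(x, w)                  # 128 outputs in parallel
codes = sweep_adc(sums, n_bits=5, full_scale=64)
```

With a 5-bit setting the sweep takes 31 cycles, which is why higher output resolution trades directly against throughput and energy efficiency in this model.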

Updated: 2021-02-01