Design of Processing-in-Memory With Triple Computational Path and Sparsity Handling for Energy-Efficient DNN Training
IEEE Journal on Emerging and Selected Topics in Circuits and Systems (IF 4.6), Pub Date: 2022-04-20, DOI: 10.1109/jetcas.2022.3168852
Wontak Han, Jaehoon Heo, Junsoo Kim, Sukbin Lim, Joo-Young Kim

As machine learning (ML) and artificial intelligence (AI) have become mainstream technologies, many accelerators have been proposed to cope with their computation kernels. However, because deep neural network models are large, these accelerators access external memory frequently and suffer from the von Neumann bottleneck. Moreover, as privacy issues become more critical, on-device training is emerging as a solution. On-device training is challenging, however, because it must run under a limited power budget while requiring far more computation and memory access than inference. In this paper, we present T-PIM, an energy-efficient processing-in-memory (PIM) architecture that supports end-to-end on-device training. Its macro design includes an 8T-SRAM cell-based PIM block that computes in-memory AND operations, along with three computational datapaths for end-to-end training. The three paths integrate arithmetic units for forward propagation, backward propagation, and gradient calculation with weight update, respectively, allowing the weight data stored in the memory to remain stationary. T-PIM also supports variable bit precision to cover various ML scenarios: fully variable input bit precision with 2-bit, 4-bit, 8-bit, or 16-bit weight precision for forward propagation, and the same input precision with 16-bit weight precision for backward propagation. In addition, T-PIM implements sparsity-handling schemes that skip computation for zero input data and turn off the arithmetic units for zero weight data, reducing both unnecessary computation and leakage power. Finally, we fabricate the T-PIM chip on a 5.04 mm² die in a 28-nm CMOS logic process. It operates at 50–280 MHz with a supply voltage of 0.75–1.05 V, dissipating 5.25–51.23 mW in inference and 6.10–37.75 mW in training. As a result, it achieves 17.90–161.08 TOPS/W energy efficiency for inference with 1-bit activations and 2-bit weights, and 0.84–7.59 TOPS/W for training with 8-bit activations/errors and 16-bit weights. In conclusion, T-PIM is the first PIM chip that supports end-to-end training, demonstrating a 2.02× performance improvement over the latest PIM chip that only partially supports training.
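The in-memory AND operation described above is the building block for multiplication: ANDing streamed input bit-planes against stored weight bits and accumulating the partial products with shift-adds yields a multiply-accumulate whose input precision is fully variable. The following is a minimal behavioral sketch in Python (illustrative, not the authors' implementation; names and bit ordering are assumptions):

```python
# Behavioral sketch of a bit-serial multiply-accumulate built from in-memory
# AND operations, as in 8T-SRAM-based PIM. Illustrative only.

def bit_serial_mac(inputs, weights, in_bits, w_bits):
    """Dot product of unsigned `inputs` and `weights` using only AND and
    shift-add. Inputs stream one bit-plane per cycle (LSB first), which is
    what makes input precision fully variable: fewer bits, fewer cycles."""
    acc = 0
    for i_pos in range(in_bits):                  # one cycle per input bit-plane
        input_bits = [(x >> i_pos) & 1 for x in inputs]
        for w_pos in range(w_bits):               # weight bit columns in the array
            # In-memory operation: each cell ANDs its stored weight bit with
            # the broadcast input bit; an adder tree sums the partial products.
            partials = [ib & ((w >> w_pos) & 1)
                        for ib, w in zip(input_bits, weights)]
            acc += sum(partials) << (i_pos + w_pos)  # align by bit significance
    return acc

inputs, weights = [3, 0, 5, 1], [2, 7, 1, 3]
assert bit_serial_mac(inputs, weights, in_bits=4, w_bits=4) == sum(
    x * w for x, w in zip(inputs, weights))
```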
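For context on the triple datapath, the three per-layer training computations all revolve around the same weight matrix, which is why keeping it stationary in the array pays off. A compact NumPy sketch of a fully connected layer follows; plain SGD is an assumption, since the abstract does not name the update rule:

```python
import numpy as np

# Per-layer training steps for a fully connected layer. All three steps use
# the same weight matrix W, which is why keeping W stationary in the memory
# and routing three datapaths around it avoids moving weights off-chip.
# Plain SGD is an assumption; the abstract does not name the update rule.

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))   # weights, held stationary in the array
x = rng.standard_normal(32)         # layer input (activation)
lr = 1e-2

y = W @ x                           # 1) forward propagation: reads W
dy = rng.standard_normal(16)        # error from the next layer (stand-in)
dx = W.T @ dy                       # 2) backward propagation: reads W transposed
dW = np.outer(dy, x)                # 3) gradient calculation ...
W -= lr * dW                        #    ... and in-place weight update
```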
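The sparsity handling can be sketched the same way: an all-zero input bit-plane lets the whole cycle be skipped, and units whose stored weights are zero can be gated off, neither of which changes the result. The zero tests below are a software stand-in for the chip's skip and power-gating logic, not the hardware scheme itself:

```python
def sparse_bit_serial_mac(inputs, weights, in_bits, w_bits):
    """Same dot product as bit_serial_mac, with the two sparsity
    optimizations modeled in software: all-zero input bit-planes are
    skipped entirely, and rows with zero weights are left out (standing in
    for power-gated arithmetic units). Only the work done changes."""
    acc, skipped = 0, 0
    active = [i for i, w in enumerate(weights) if w != 0]  # gated-off rows excluded
    for i_pos in range(in_bits):
        input_bits = [(x >> i_pos) & 1 for x in inputs]
        if not any(input_bits):          # zero input bit-plane: skip the cycle
            skipped += 1
            continue
        for w_pos in range(w_bits):
            acc += sum(input_bits[i] & ((weights[i] >> w_pos) & 1)
                       for i in active) << (i_pos + w_pos)
    return acc, skipped

result, skipped = sparse_bit_serial_mac([3, 0, 5, 1], [2, 7, 0, 3], 4, 4)
assert result == 3*2 + 0*7 + 5*0 + 1*3   # sparsity changes work, not the answer
```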
