An Energy-Efficient Deep Convolutional Neural Network Training Accelerator for In Situ Personalization on Smart Devices
IEEE Journal of Solid-State Circuits (IF 4.6), Pub Date: 2020-10-01, DOI: 10.1109/jssc.2020.3005786
Seungkyu Choi , Jaehyeong Sim , Myeonggu Kang , Yeongjae Choi , Hyeonuk Kim , Lee-Sup Kim

A scalable deep-learning accelerator supporting the training process is implemented for device personalization of deep convolutional neural networks (CNNs). It consists of three processor cores, each operating with a distinct energy-efficient dataflow for a different type of computation in CNN training. Unlike previous works, which apply design techniques that exploit the same characteristics as inference, we analyze the major issues that arise from training on a resource-constrained system and resolve the resulting bottlenecks. A masking scheme in the propagation core greatly reduces the storage required for intermediate activation data, eliminating the frequent off-chip memory accesses otherwise needed to hold the generated activations until the backward pass. A separate dataflow architecture is implemented for the weight-gradient computation to enhance PE utilization while maximally reusing the input data. Furthermore, the modified weight-update system enables an 8-bit fixed-point computing datapath. The processor is implemented in 65-nm CMOS technology and occupies a core area of 10.24 mm². It operates at supply voltages from 0.63 to 1.0 V, and the computing engine runs at a near-threshold voltage of 0.5 V. The chip consumes 40.7 mW at 50 MHz at its highest-efficiency operating point and achieves a training efficiency of 47.4 µJ/epoch for the customized CNN model.
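To make the masking idea concrete, the sketch below shows the general principle in NumPy: during the forward pass only a binary ReLU mask is retained per activation, and the backward pass reuses that mask, so the full-precision activation tensor does not have to be spilled to off-chip memory solely for gradient computation. This is a minimal illustrative sketch, not the authors' hardware design; the function names, tensor shapes, and the assumption that the mask is the 1-bit ReLU derivative are introduced here for illustration only.

```python
import numpy as np

def relu_forward(x):
    # Keep only the mask; conceptually 1 bit per element
    # (NumPy stores booleans as 1 byte, hardware can pack to 1 bit).
    mask = x > 0
    return np.where(mask, x, 0.0), mask

def relu_backward(grad_out, mask):
    # ReLU's gradient needs only the sign information captured by the mask,
    # so the forward activations themselves are never needed again.
    return np.where(mask, grad_out, 0.0)

# Toy example: a 64-channel 32x32 feature map.
x = np.random.randn(64, 32, 32).astype(np.float32)
y, mask = relu_forward(x)
grad_in = relu_backward(np.ones_like(y), mask)
print(f"mask: {mask.nbytes} B vs. activations: {x.nbytes} B")
```

Even in this software sketch the mask is several times smaller than the activation tensor; in a fixed-point hardware datapath with true 1-bit packing the saving is correspondingly larger, which is what removes the off-chip traffic described above.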

Updated: 2020-10-01