Dynamic Reliability Management in Neuromorphic Computing
ACM Journal on Emerging Technologies in Computing Systems (IF 2.1), Pub Date: 2021-07-21, DOI: 10.1145/3462330
Shihao Song 1, Jui Hanamshet 1, Adarsha Balaji 1, Anup Das 1, Jeffrey L. Krichmar 2, Nikil D. Dutt 2, Nagarajan Kandasamy 1, Francky Catthoor 3

Neuromorphic computing systems execute machine learning tasks designed with spiking neural networks. These systems are embracing non-volatile memory to implement high-density and low-energy synaptic storage. Elevated voltages and currents needed to operate non-volatile memories cause aging of CMOS-based transistors in each neuron and synapse circuit in the hardware, drifting the transistor’s parameters from their nominal values. If these circuits are used continuously for too long, the parameter drifts cannot be reversed, resulting in permanent degradation of circuit performance over time, eventually leading to hardware faults. Aggressive device scaling increases power density and temperature, which further accelerates the aging, challenging the reliable operation of neuromorphic systems. Existing reliability-oriented techniques periodically de-stress all neuron and synapse circuits in the hardware at fixed intervals, assuming worst-case operating conditions, without actually tracking their aging at run-time. To de-stress these circuits, normal operation must be interrupted, which introduces latency in spike generation and propagation, impacting the inter-spike interval and hence, performance (e.g., accuracy). We observe that in contrast to long-term aging, which permanently damages the hardware, short-term aging in scaled CMOS transistors is mostly due to bias temperature instability. The latter is heavily workload-dependent and, more importantly, partially reversible. We propose a new architectural technique to mitigate the aging-related reliability problems in neuromorphic systems by designing an intelligent run-time manager (NCRTM), which dynamically de-stresses neuron and synapse circuits in response to the short-term aging in their CMOS transistors during the execution of machine learning workloads, with the objective of meeting a reliability target. 
NCRTM de-stresses these circuits only when it is absolutely necessary to do so, otherwise reducing the performance impact by scheduling de-stress operations off the critical path. We evaluate NCRTM with state-of-the-art machine learning workloads on neuromorphic hardware. Our results demonstrate that NCRTM significantly improves the reliability of neuromorphic hardware, with marginal impact on performance.
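
The contrast the abstract draws between fixed-interval de-stressing and de-stressing driven by tracked aging can be illustrated with a toy simulation. The sketch below is not the NCRTM algorithm and does not come from the paper: the linear aging model, the `RECOVERY` fraction, the threshold, and all names (`Circuit`, `run_fixed`, `run_dynamic`) are assumptions made purely for illustration. It also leaves out the scheduling of de-stress operations off the critical path.

```python
from dataclasses import dataclass

# Assumed fraction of short-term aging undone by one de-stress operation.
# The abstract only says short-term aging is "partially reversible".
RECOVERY = 0.5


@dataclass
class Circuit:
    activity: float     # hypothetical per-step stress rate (workload-dependent)
    aging: float = 0.0  # hypothetical short-term aging metric, arbitrary units


def _advance(circuits):
    """Accumulate one time step of stress and return the worst aging seen."""
    for c in circuits:
        c.aging += c.activity
    return max(c.aging for c in circuits)


def _destress(circuit):
    """Partially reverse the short-term aging of one circuit."""
    circuit.aging *= 1.0 - RECOVERY


def run_fixed(activities, steps, interval):
    """Baseline: de-stress every circuit every `interval` steps, no tracking."""
    circuits = [Circuit(a) for a in activities]
    interruptions, peak = 0, 0.0
    for t in range(1, steps + 1):
        peak = max(peak, _advance(circuits))
        if t % interval == 0:
            for c in circuits:
                _destress(c)
                interruptions += 1
    return interruptions, peak


def run_dynamic(activities, steps, threshold):
    """Tracked policy: de-stress a circuit only when its aging hits `threshold`."""
    circuits = [Circuit(a) for a in activities]
    interruptions, peak = 0, 0.0
    for _ in range(steps):
        peak = max(peak, _advance(circuits))
        for c in circuits:
            if c.aging >= threshold:
                _destress(c)
                interruptions += 1
    return interruptions, peak


if __name__ == "__main__":
    # One heavily used circuit and three lightly used ones (made-up workload).
    activities = [1.0, 0.0625, 0.0625, 0.0625]
    print("fixed  :", run_fixed(activities, steps=100, interval=5))
    print("dynamic:", run_dynamic(activities, steps=100, threshold=8.0))
```

Under these made-up numbers the tracked policy interrupts only the heavily used circuit, so it issues far fewer de-stress operations than the fixed schedule while keeping the worst-case aging at the chosen threshold. Real aging behavior, and the paper's actual policy, are considerably more involved.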

Updated: 2021-07-21