Hardware and software co-optimization for the initialization failure of the ReRAM based cross-bar array
arXiv - CS - Emerging Technologies Pub Date : 2020-02-11 , DOI: arxiv-2002.04605
Youngseok Kim, Seyoung Kim, Chun-chen Yeh, Vijay Narayanan, Jungwook Choi

Recent advances in deep neural networks require models with millions of parameters and demand high-performance computing resources with improved efficiency. The cross-bar array architecture is considered one of the promising deep learning architectures, showing a significant computing gain over conventional processors. To investigate the feasibility of the architecture, we examine non-idealities and their impact on performance. Specifically, we study the impact of cells that fail during the initialization process of a resistive-memory-based cross-bar array. Unlike in a conventional memory array, individual memory elements cannot be rerouted, so failed cells may have a critical impact on model accuracy. We categorize the possible failures and propose a hardware implementation that minimizes catastrophic failures. Such hardware optimization bounds the possible logical values of the failed cells and gives us the opportunity to compensate for the loss of accuracy via off-line training. By introducing random weight defects during training, we show that the model becomes more resilient to device initialization failures and is therefore less prone to degraded inference performance caused by failed devices. Our study sheds light on the hardware and software co-optimization procedure for coping with potentially catastrophic failures in the cross-bar array.
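The idea of introducing random weight defects during training can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `inject_defects`, the `failure_rate` and `stuck_value` parameters, and the choice of clamping failed cells to a single bounded value are all assumptions made for illustration. The abstract states only that the hardware bounds the logical value of failed cells and that random defects are injected during off-line training.

```python
import random


def inject_defects(weights, failure_rate, stuck_value=0.0, seed=None):
    """Return (defective_weights, mask) for a weight matrix given as a list of rows.

    Each cell independently "fails" with probability `failure_rate` and is then
    forced to `stuck_value`. Both parameters are hypothetical: the abstract does
    not specify failure rates or the bounded value a failed cell takes.
    """
    rng = random.Random(seed)
    defective, mask = [], []
    for row in weights:
        new_row, mask_row = [], []
        for w in row:
            failed = rng.random() < failure_rate
            new_row.append(stuck_value if failed else w)
            mask_row.append(failed)
        defective.append(new_row)
        mask.append(mask_row)
    return defective, mask


def matvec(weights, x):
    """Matrix-vector product, the operation a cross-bar array computes."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


if __name__ == "__main__":
    weights = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.6]]
    x = [1.0, 2.0, 3.0]
    # Draw a fresh defect pattern for each step, as a defect-aware
    # training loop might do before every forward pass.
    for step in range(3):
        noisy, mask = inject_defects(weights, failure_rate=0.2, seed=step)
        print(step, mask, matvec(noisy, x))
```

In a training loop, the forward pass would use the defective copy while the update is applied to the clean weights, so the model sees a different defect pattern on each step. That training procedure is likewise an assumption here, not a detail taken from the paper.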

Updated: 2020-08-24