Reliability Evaluation and Analysis of FPGA-Based Neural Network Acceleration System
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (IF 2.8), Pub Date: 2021-01-10, DOI: 10.1109/tvlsi.2020.3046075
Dawen Xu, Ziyang Zhu, Cheng Liu, Ying Wang, Shuang Zhao, Lei Zhang, Huaguo Liang, Huawei Li, Kwang-Ting Cheng

Prior works typically conducted the fault analysis of neural network accelerator computing arrays with simulation and focused on the prediction accuracy loss of the neural network models. There is still a lack of systematic fault analysis of the neural network acceleration system that considers both the accuracy degradation and system exceptions, such as system stall and running overtime. To that end, we implemented a representative neural network accelerator and corresponding fault injection modules on a Xilinx ARM-FPGA platform and evaluated the reliability of the system under different fault injection rates when a series of typical neural network models are deployed on the neural network acceleration system. The entire fault injection and reliability evaluation system is open-sourced on GitHub. With comprehensive experiments on the system, we identify the system exceptions based on the various abnormal behaviors of the FPGA-based neural network acceleration system and analyze the underlying reasons. Particularly, we find that the probability of the system exceptions dominates the reliability of the system. The faults also incur accuracy degradation of the neural network models, but the influence depends on the applications of the models and can vary greatly. In addition, we also evaluated the use of conventional triple modular redundancy (TMR) and demonstrated the challenge of TMR with both experiments and analytical models, which may shed light on the reliability design of the FPGA-based neural network acceleration system.

Updated: 2021-01-10