
Low-Cost Error Detection in Deep Neural Network Accelerators with Linear Algorithmic Checksums

Journal of Electronic Testing

Abstract

The widespread adoption of deep neural networks (DNNs) in safety-critical systems necessitates a careful examination of the safety issues raised by hardware errors. We confirm the severity of this concern by demonstrating the potentially catastrophic impact of hardware bit errors on DNN accuracy. The resulting need for fault tolerance methods that are both comprehensive and low-cost, in line with the tight cost margins of consumer deep learning applications, can be met through a rigorous exploration of the mathematical properties of deep neural network computations. Our technique, Sanity-Check, detects errors in fully-connected and convolutional layers through linear algorithmic checksums. A purely software-based implementation of Sanity-Check runs on a variety of off-the-shelf execution platforms with no hardware modification, facilitating widespread adoption. We further propose a dedicated hardware unit that integrates seamlessly with modern deep learning accelerators and eliminates the performance overhead of the software-based implementation at a negligible area and power cost. Sanity-Check delivers perfect critical error coverage in our error injection experiments and offers a promising alternative for low-cost error detection in safety-critical deep neural network applications.
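The linear checksum idea can be illustrated for a fully-connected layer: because the layer computation is linear, the sum of its pre-activation outputs can be predicted from a precomputed checksum weight vector and compared against the actual output sum at run time. The sketch below is a minimal illustration in this spirit (algorithm-based fault tolerance style), not the exact Sanity-Check implementation; the function name, tolerance, and layer dimensions are illustrative assumptions.

```python
# A minimal, illustrative sketch of linear checksum-based error detection for a
# fully-connected layer (assumed names and dimensions; not the paper's exact code).
import numpy as np

def checked_fc_layer(W, b, x, atol=1e-6):
    """Compute y = W @ x + b and verify it with a linear checksum.

    Offline, a checksum weight vector c (the column sums of W) and a checksum
    bias s_b = sum(b) are derived; at run time, the sum of the layer outputs
    must match c @ x + s_b up to a floating-point tolerance.
    """
    c = W.sum(axis=0)            # checksum weights: one column sum per input neuron
    s_b = b.sum()                # checksum bias

    y = W @ x + b                # ordinary pre-activation layer computation

    predicted_sum = c @ x + s_b  # checksum prediction of sum(y)
    if not np.isclose(y.sum(), predicted_sum, atol=atol):
        raise RuntimeError("Checksum mismatch: possible hardware error in this layer")
    return y

# Hypothetical usage with random layer parameters and input.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))
b = rng.standard_normal(64)
x = rng.standard_normal(128)
y = checked_fc_layer(W, b, x)
```

The same column-sum checksum extends to convolutional layers by viewing the convolution as a matrix multiplication over unrolled input patches; a mismatch flags the affected layer for detection purposes only and does not by itself correct the error.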




Author information

Corresponding author

Correspondence to Elbruz Ozen.

Additional information

Responsible Editor: J. C.-M. Li

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ozen, E., Orailoglu, A. Low-Cost Error Detection in Deep Neural Network Accelerators with Linear Algorithmic Checksums. J Electron Test 36, 703–718 (2020). https://doi.org/10.1007/s10836-020-05920-2

