Abstract
We present FPDetect, a low-overhead approach for detecting logical errors and soft errors affecting stencil computations without generating false positives. We develop an offline analysis that tightly estimates the number of floating-point bits preserved across stencil applications. This estimate rigorously bounds the values expected in the data space of the computation. Violations of this bound can be attributed with certainty to errors. FPDetect helps synthesize error detectors customized for user-specified levels of accuracy and coverage. FPDetect also enables overhead reduction techniques based on deploying these detectors coarsely in space and time. Experimental evaluations demonstrate the practicality of our approach.
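The bound-checking idea can be illustrated with a minimal sketch. It is not the FPDetect analysis itself: the 3-point stencil, the periodic boundary, all function names, and the hand-derived tolerance are assumptions made for illustration. The sketch iterates a stencil for T steps, then re-evaluates selected points directly from the initial data using the composite T-step stencil, and flags any point whose two values differ by more than a conservative rounding tolerance.

```python
import struct

COEFFS = (0.25, 0.5, 0.25)  # hypothetical 3-point averaging stencil
EPS = 2.0 ** -52            # double-precision machine epsilon


def step(x):
    """Apply the stencil once, with periodic boundaries."""
    n = len(x)
    return [COEFFS[0] * x[i - 1] + COEFFS[1] * x[i] + COEFFS[2] * x[(i + 1) % n]
            for i in range(n)]


def composite_kernel(T):
    """Coefficients of the single stencil equivalent to T applications."""
    k = [1.0]
    for _ in range(T):
        nxt = [0.0] * (len(k) + 2)
        for a, ka in enumerate(k):
            for b, cb in enumerate(COEFFS):
                nxt[a + b] += ka * cb
        k = nxt
    return k  # length 2*T + 1


def direct_value(x0, i, kernel):
    """Evaluate point i after T steps directly from the initial data x0."""
    T = (len(kernel) - 1) // 2
    n = len(x0)
    return sum(kq * x0[(i + q - T) % n] for q, kq in enumerate(kernel))


def tolerance(x0, T):
    """Loose, hand-derived rounding allowance (an assumption of this sketch,
    not the tight bit-preservation bound the paper computes offline)."""
    return 2.0 * (5 * T + 1) * EPS * max(abs(v) for v in x0)


def detect(x0, xT, T, points):
    """Return the checked points whose iterated value violates the bound."""
    kernel = composite_kernel(T)
    tol = tolerance(x0, T)
    return [i for i in points if abs(xT[i] - direct_value(x0, i, kernel)) > tol]


def flip_bit(value, bit):
    """Flip one bit of a double to emulate a soft error."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def run(x0, T, fault_step=None, fault_index=0, bit=51):
    """Iterate T steps, optionally injecting a bit flip before one step."""
    x = list(x0)
    for t in range(T):
        if t == fault_step:
            x[fault_index] = flip_bit(x[fault_index], bit)
        x = step(x)
    return x
```

Passing only a subset of indices as `points`, or calling `detect` only every T steps, corresponds loosely to deploying detectors coarsely in space and time. The tolerance here is deliberately generous, so it trades away coverage of small errors in exchange for never flagging a fault-free run of this particular stencil.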
Index Terms
- FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation