Open Access

FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation

Authors:
Arnab Das, University of Utah, Salt Lake City, UT (ORCID 0000-0002-8421-4641)
Sriram Krishnamoorthy, Pacific Northwest National Laboratory, Richland, WA
Ian Briggs, University of Utah, Salt Lake City, UT
Ganesh Gopalakrishnan, University of Utah, Salt Lake City, UT
Ramakrishna Tipireddy, Pacific Northwest National Laboratory, Richland, WA

ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 3, Article No. 19, pp. 1–27. https://doi.org/10.1145/3402451

Published: 17 August 2020

Abstract

We present FPDetect, a low-overhead approach for detecting logical errors and soft errors affecting stencil computations without generating false positives. We develop an offline analysis that tightly estimates the number of floating-point bits preserved across stencil applications. This estimate rigorously bounds the values expected in the data space of the computation. Violations of this bound can be attributed with certainty to errors. FPDetect helps synthesize error detectors customized for user-specified levels of accuracy and coverage. FPDetect also enables overhead reduction techniques based on deploying these detectors coarsely in space and time. Experimental evaluations demonstrate the practicality of our approach.
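
The detection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a hypothetical 3-point periodic averaging stencil, and the `preserved_bits` value is picked by hand here, whereas the abstract describes deriving such a bound through an offline analysis. All function names (`step`, `run`, `direct_eval`, `detect`, `flip_bit`) are made up for illustration. A detector at point `i` evaluates the T-step composed stencil directly from the initial data and flags an error when the iteratively computed value disagrees in more than the assumed number of preserved bits.

```python
import struct

# Hypothetical 3-point averaging stencil (an assumption, not taken from the paper).
COEFFS = (0.25, 0.5, 0.25)

def step(u):
    """Apply the stencil once with periodic boundaries."""
    n = len(u)
    return [COEFFS[0] * u[(i - 1) % n] + COEFFS[1] * u[i] + COEFFS[2] * u[(i + 1) % n]
            for i in range(n)]

def flip_bit(x, bit):
    """Flip one bit of the IEEE 754 double x to mimic a soft error."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return y

def run(u0, t, fault=None):
    """Apply the stencil t times; fault=(step_index, point, bit) injects one bit flip."""
    u = list(u0)
    for s in range(t):
        u = step(u)
        if fault is not None and fault[0] == s:
            u[fault[1]] = flip_bit(u[fault[1]], fault[2])
    return u

def composed_coeffs(t):
    """Coefficients of the stencil applied t times, built by repeated convolution."""
    c = [1.0]
    for _ in range(t):
        nxt = [0.0] * (len(c) + 2)
        for j, cj in enumerate(c):
            for k, w in enumerate(COEFFS):
                nxt[j + k] += cj * w
        c = nxt
    return c

def direct_eval(u0, i, t):
    """Directly evaluate point i after t steps from the initial data u0."""
    n = len(u0)
    c = composed_coeffs(t)
    return sum(w * u0[(i - t + k) % n] for k, w in enumerate(c))

def detect(u0, ut, i, t, preserved_bits):
    """Flag an error if ut[i] and the direct evaluation differ beyond the assumed bound."""
    expected = direct_eval(u0, i, t)
    return abs(ut[i] - expected) > abs(expected) * 2.0 ** (-preserved_bits)

if __name__ == "__main__":
    u0 = [1.0 + 0.1 * i for i in range(16)]
    faulty = run(u0, 4, fault=(1, 5, 45))    # flip mantissa bit 45 at point 5 after 2 steps
    print(detect(u0, faulty, 5, 4, 40))      # detector inside the corrupted region
    print(detect(u0, faulty, 12, 4, 40))     # detector outside the corrupted region
```

In this sketch a detector only observes errors that propagate into its own dependence cone, which is why the placement of detectors in space and time trades coverage against overhead.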

Supplemental Material

das.zip (1.2 MB): Supplemental movie, appendix, image, and software files for FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation


Published in
ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 3
September 2020
200 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/3415154
Editor: David Kaeli, Northeastern University, USA
Copyright © 2020 ACM
© 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher: Association for Computing Machinery, New York, NY, United States
Publication History
- Published: 17 August 2020
- Accepted: 1 May 2020
- Revised: 1 April 2020
- Received: 1 November 2019

Author Tags
Soft error detection
affine analysis
floating point round-off error
interval analysis
silent data corruption
software bug detection
stencil computations
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 3
- Total Downloads: 563
- Downloads (Last 12 months): 90
- Downloads (Last 6 weeks): 21
