research-article
Open Access

FPDetect: Efficient Reasoning About Stencil Programs Using Selective Direct Evaluation

Published: 17 August 2020

Abstract

We present FPDetect, a low-overhead approach for detecting logical errors and soft errors affecting stencil computations without generating false positives. We develop an offline analysis that tightly estimates the number of floating-point bits preserved across stencil applications. This estimate rigorously bounds the values expected in the data space of the computation. Violations of this bound can be attributed with certainty to errors. FPDetect helps synthesize error detectors customized for user-specified levels of accuracy and coverage. FPDetect also enables overhead reduction techniques based on deploying these detectors coarsely in space and time. Experimental evaluations demonstrate the practicality of our approach.
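The detection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a hypothetical 1D three-point stencil with periodic boundaries, uses exact rational arithmetic as a stand-in for direct evaluation of a multi-step stencil at one selected point, and takes the number of preserved bits (`bits`) as a hand-picked parameter rather than a bound derived by the offline analysis. The names `WEIGHTS`, `simulate`, `direct_eval`, `agrees`, and `detect` are all illustrative assumptions. Because the threshold is hand-picked, this sketch does not by itself carry the no-false-positive guarantee the abstract describes.

```python
import math
import struct
from fractions import Fraction

# Hypothetical 3-point stencil weights (an assumption, not taken from the paper).
WEIGHTS = (0.25, 0.5, 0.25)


def step(u):
    """Apply one stencil step in double precision with periodic boundaries."""
    n = len(u)
    return [WEIGHTS[0] * u[(i - 1) % n] + WEIGHTS[1] * u[i] + WEIGHTS[2] * u[(i + 1) % n]
            for i in range(n)]


def flip_bit(x, bit):
    """Flip one bit of the IEEE 754 binary64 encoding of x (simulated soft error)."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return y


def simulate(u0, steps, fault=None):
    """Run `steps` stencil steps; fault=(step, index, bit) flips one bit after that step."""
    u = list(u0)
    for t in range(1, steps + 1):
        u = step(u)
        if fault is not None and fault[0] == t:
            u[fault[1]] = flip_bit(u[fault[1]], fault[2])
    return u


def direct_eval(u0, i, steps):
    """Evaluate point i after `steps` steps directly from u0, exactly, using rationals."""
    n = len(u0)
    w = [Fraction(x) for x in WEIGHTS]
    # Only the 2*steps+1 inputs around point i can influence the result.
    vals = {j: Fraction(u0[j % n]) for j in range(i - steps, i + steps + 1)}
    for s in range(steps, 0, -1):
        vals = {j: w[0] * vals[j - 1] + w[1] * vals[j] + w[2] * vals[j + 1]
                for j in range(i - s + 1, i + s)}
    return vals[i]


def agrees(computed, exact, bits):
    """True if `computed` matches `exact` to a relative error of at most 2**-bits."""
    if not math.isfinite(computed):
        return False
    err = abs(Fraction(computed) - exact)
    return err <= abs(exact) * Fraction(1, 2 ** bits)


def detect(u0, u_final, i, steps, bits):
    """Return True if point i of u_final violates the assumed precision threshold."""
    return not agrees(u_final[i], direct_eval(u0, i, steps), bits)


if __name__ == "__main__":
    u0 = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0]
    clean = simulate(u0, 4)
    faulty = simulate(u0, 4, fault=(2, 4, 50))  # flip a high mantissa bit mid-run
    print("clean run flagged: ", detect(u0, clean, 4, 4, bits=45))
    print("faulty run flagged:", detect(u0, faulty, 4, 4, bits=45))
```

Checking only one point every few steps, as `detect` does here, mirrors the abstract's remark about deploying detectors coarsely in space and time: a detector placed at point `i` after `steps` steps can observe any corruption inside the `2*steps+1` wide cone of inputs that feed it.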

          • Published in

            ACM Transactions on Architecture and Code Optimization, Volume 17, Issue 3
            September 2020
            200 pages
            ISSN: 1544-3566
            EISSN: 1544-3973
            DOI: 10.1145/3415154

            Copyright © 2020 ACM

            © 2020 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 17 August 2020
            • Accepted: 1 May 2020
            • Revised: 1 April 2020
            • Received: 1 November 2019

            Qualifiers

            • research-article
            • Research
            • Refereed
