Fault Recovery Methods for Asynchronous Linear Solvers

International Journal of Parallel Programming

Abstract

This study seeks to understand the soft error vulnerability of asynchronous iterative methods, with a focus on stationary iterative solvers such as Jacobi. A theoretical investigation into the performance of asynchronous iterative methods is presented and used to motivate several fault recovery methods for asynchronous linear solvers. The numerical experiments use a hybrid-parallel implementation in which the computational work is distributed over multiple nodes using MPI and parallelized on each node using OpenMP, and a series of runs is conducted to measure both the impact of soft faults and the effectiveness of the recovery methods. Trials compare two models for simulating the occurrence of a fault, as well as techniques for recovering from the effects of a fault. The results show that the proposed strategies can effectively recover from the impact of a fault, and that the numerical model for simulating soft faults consistently produces fault effects that enable the investigation and tuning of recovery techniques in action.
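To make the setting concrete, the following is a minimal single-process sketch of a Jacobi solver that suffers one simulated soft fault and recovers from it. It is not the implementation studied in the article: the sweep here is synchronous rather than asynchronous, there is no MPI or OpenMP, and both the fault model (a large additive perturbation of one solution component) and the recovery rule (roll back to the last accepted iterate when the residual jumps) are assumptions chosen for illustration. All function names and parameters (`solve_with_recovery`, `fault_at`, `fault_scale`, `jump_factor`) are hypothetical.

```python
import random

def jacobi_sweep(A, b, x):
    """One Jacobi sweep: x_new[i] = (b[i] - sum_{j != i} A[i][j] * x[j]) / A[i][i]."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]

def residual_norm(A, b, x):
    """Infinity norm of the residual b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))

def solve_with_recovery(A, b, fault_at=None, fault_scale=1e6, tol=1e-10,
                        max_sweeps=500, jump_factor=10.0, seed=0):
    """Jacobi iteration with an optional injected fault and rollback recovery.

    Returns (x, sweeps_used, rollbacks). The fault model and the recovery
    rule are illustrative assumptions, not the methods from the article.
    """
    rng = random.Random(seed)
    x = [0.0] * len(b)
    checkpoint, checkpoint_res = list(x), residual_norm(A, b, x)
    rollbacks = 0
    for sweep in range(1, max_sweeps + 1):
        x = jacobi_sweep(A, b, x)
        if sweep == fault_at:
            # Assumed fault model: corrupt one randomly chosen component once.
            x[rng.randrange(len(x))] += fault_scale
        res = residual_norm(A, b, x)
        if res > jump_factor * checkpoint_res:
            # Residual jumped sharply: discard this iterate and restore
            # the last accepted one.
            x = list(checkpoint)
            rollbacks += 1
            continue
        checkpoint, checkpoint_res = list(x), res
        if res < tol:
            return x, sweep, rollbacks
    return x, max_sweeps, rollbacks

# Small diagonally dominant test system: tridiagonal with 4 on the diagonal.
n = 8
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [1.0] * n

_, sweeps_clean, rollbacks_clean = solve_with_recovery(A, b)
_, sweeps_fault, rollbacks_fault = solve_with_recovery(A, b, fault_at=5)
print(sweeps_clean, rollbacks_clean, sweeps_fault, rollbacks_fault)
```

In this sketch the faulty run discards exactly one sweep, so it reaches the same tolerance at the cost of one extra sweep. A residual-jump check is only one possible detector; it relies on the residual decreasing steadily in fault-free operation, which holds for this test matrix but would need care in a truly asynchronous setting where updates use stale data.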

Acknowledgements

This work was supported in part by the U.S. Department of Energy (DOE) Office of Advanced Scientific Computing Research under Grant DE-SC-0016564 and by the Exascale Computing Project (ECP) through the Ames Laboratory, operated by Iowa State University under contract No. DE-AC00-07CH11358; by the Turing High Performance Computing cluster at Old Dominion University; by the National Science Foundation under Grant CNS-1828593; and through the In-House Laboratory Independent Research (ILIR) program at the Naval Surface Warfare Center, Dahlgren Division. The authors also thank the reviewers for their thoughtful comments, which helped to improve this paper.

Author information

Corresponding author

Correspondence to Evan Coleman.

About this article

Cite this article

Coleman, E., Jensen, E.J. & Sosonkina, M. Fault Recovery Methods for Asynchronous Linear Solvers. Int J Parallel Prog 49, 51–80 (2021). https://doi.org/10.1007/s10766-020-00676-w

  • DOI: https://doi.org/10.1007/s10766-020-00676-w
