Abstract
This study examines the soft error vulnerability of asynchronous iterative methods, focusing on stationary iterative solvers such as Jacobi. A theoretical investigation of the performance of asynchronous iterative methods is presented and used to motivate several fault recovery methods for asynchronous linear solvers. The numerical experiments use a hybrid-parallel implementation in which the computational work is distributed over multiple nodes using MPI and parallelized on each node using OpenMP. A series of runs measures both the impact of soft faults and the effectiveness of the recovery methods, comparing two models for simulating the occurrence of a fault as well as techniques for recovering from its effects. The results show that the proposed strategies can effectively recover from the impact of a fault and that the numerical model for simulating soft faults consistently produces fault effects that make it possible to investigate and tune the recovery techniques.
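To make the setting concrete, the following is a minimal sketch of a Jacobi solver subjected to a single transient fault, with a simple residual-based rollback as the recovery step. It is an illustration under stated assumptions and not the implementation or the recovery methods evaluated in the paper: the sketch is synchronous and single-process (no MPI or OpenMP), the fault is modeled as a one-time additive perturbation of one solution component, and the names `solve`, `fault_at`, `fault_size`, and `growth_limit` are hypothetical.

```python
def jacobi_sweep(A, b, x):
    """One synchronous Jacobi sweep: every component is updated from the old x."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def residual_norm(A, b, x):
    """Infinity norm of the residual b - A x."""
    n = len(b)
    return max(abs(b[i] - sum(A[i][j] * x[j] for j in range(n))) for i in range(n))


def solve(A, b, fault_at=None, fault_size=1e6, recover=True,
          growth_limit=10.0, tol=1e-10, max_iters=500):
    """Jacobi iteration with an optional injected fault and rollback recovery.

    Assumed fault model: at iteration `fault_at`, x[0] is perturbed once by
    `fault_size`. Assumed recovery: if the residual grows by more than
    `growth_limit` over the best residual seen so far, restore the iterate
    saved at that best residual.
    Returns the final iterate and the number of iterations used.
    """
    x = [0.0] * len(b)
    checkpoint = list(x)
    best = residual_norm(A, b, x)
    for k in range(1, max_iters + 1):
        x = jacobi_sweep(A, b, x)
        if k == fault_at:
            x[0] += fault_size  # hypothetical transient soft fault
        r = residual_norm(A, b, x)
        if recover and r > growth_limit * best:
            x = list(checkpoint)  # roll back to the last trusted iterate
            continue
        if r < best:
            best, checkpoint = r, list(x)
        if r < tol:
            return x, k
    return x, max_iters


# Small diagonally dominant test system with exact solution [1, 2, 3].
A = [[4.0, -1.0, 0.0], [-1.0, 4.0, -1.0], [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]

x_clean, k_clean = solve(A, b)
x_rec, k_rec = solve(A, b, fault_at=5, recover=True)
x_raw, k_raw = solve(A, b, fault_at=5, recover=False)
print(k_clean, k_rec, k_raw)
```

Because Jacobi is a contraction on this system, the run without recovery still converges after the fault, only more slowly; the rollback variant discards the corrupted iterate and loses a single iteration. The residual check used here is only one possible detector and costs an extra matrix-vector product per iteration.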
Acknowledgements
This work was supported in part by the U.S. Department of Energy (DOE) Office of Advanced Scientific Computing Research under Grant DE-SC-0016564 and the Exascale Computing Project (ECP) through the Ames Laboratory, operated by Iowa State University under contract No. DE-AC00-07CH11358, by the Turing High Performance Computing cluster at Old Dominion University, by the National Science Foundation under Grant CNS-1828593, and through the In-House Laboratory Independent Research (ILIR) program at the Naval Surface Warfare Center, Dahlgren Division. The authors would also like to thank the reviewers for their thoughtful comments, which helped to improve this paper.
About this article
Cite this article
Coleman, E., Jensen, E.J. & Sosonkina, M. Fault Recovery Methods for Asynchronous Linear Solvers. Int J Parallel Prog 49, 51–80 (2021). https://doi.org/10.1007/s10766-020-00676-w
DOI: https://doi.org/10.1007/s10766-020-00676-w