Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction
The International Journal of High Performance Computing Applications (IF 3.1), Pub Date: 2021-02-08, DOI: 10.1177/1094342021990433
Tommaso Benacchio 1, Luca Bonaventura 1, Mirco Altenbernd 2, Chris D Cantwell 3, Peter D Düben 4, 5, Mike Gillard 6, Luc Giraud 7, Dominik Göddeke 2, Erwan Raffin 8, Keita Teranishi 9, Nils Wedi 4
Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.
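As a rough illustration of the in-memory checkpointing idea mentioned above, the following minimal sketch applies checkpoint and rollback to an iterative linear solver. It is not the authors' implementation: the Jacobi iteration stands in for the solvers discussed in the paper, and the function names (`jacobi_step`, `solve_with_checkpoints`), the `every` and `fault_at` parameters, the NaN-based fault injection and the finiteness check used for detection are all assumptions made for this example.

```python
import math


def jacobi_step(A, b, x):
    """One Jacobi sweep for A x = b (A is assumed diagonally dominant)."""
    n = len(b)
    return [
        (b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
        for i in range(n)
    ]


def solve_with_checkpoints(A, b, iters=200, every=10, fault_at=None):
    """Run Jacobi iterations, keeping a copy of the state in memory.

    Hypothetical parameters: `every` is the checkpoint interval and
    `fault_at` is the iteration at which one soft fault is injected.
    """
    x = [0.0] * len(b)
    checkpoint = (0, list(x))  # (iteration index, copy of the state)
    rollbacks = 0
    k = 0
    while k < iters:
        x = jacobi_step(A, b, x)
        k += 1
        if fault_at is not None and k == fault_at:
            x[0] = float("nan")  # simulated soft fault (assumption)
            fault_at = None      # inject the fault only once
        if not all(math.isfinite(v) for v in x):
            # Fault detected: restore the last in-memory checkpoint.
            k, x = checkpoint[0], list(checkpoint[1])
            rollbacks += 1
            continue
        if k % every == 0:
            checkpoint = (k, list(x))
    return x, rollbacks


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 4.0]]
    b = [5.0, 6.0, 5.0]
    x, rollbacks = solve_with_checkpoints(A, b, fault_at=25)
    print(x, rollbacks)
```

In this sketch a fault costs at most `every` recomputed iterations, which hints at the trade-off between checkpoint frequency and recovery cost that the abstract refers to. A production system would also need a fault detector that catches finite but wrong values, which the finiteness check here does not.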




Updated: 2021-02-08