Understanding failures through the lifetime of a top-level supercomputer, Journal of Parallel and Distributed Computing

Understanding failures through the lifetime of a top-level supercomputer
Journal of Parallel and Distributed Computing (IF 3.4), Pub Date: 2021-04-20, DOI: 10.1016/j.jpdc.2021.04.001
Elvis Rojas, Esteban Meneses, Terry Jones, Don Maxwell

High-performance computing systems are required to solve grand challenges in many scientific disciplines. These systems assemble many components to be powerful enough to solve extremely complex problems. An inherent consequence is the intricacy of the interactions among all those components, especially when failures come into the picture. To design reliable supercomputing platforms in the future, it is crucial to understand how these systems fail. This paper presents the results of studying multi-year failure and workload records of a powerful supercomputer that topped the world rankings. We provide a thorough analysis of the data and characterize the reliability of the system along several dimensions: failure classification, failure-rate modelling, and the interplay between failures and workload. The results shed some light on the dynamics of top-level supercomputers and on sensitive areas ripe for improvement.

Updated: 2021-04-27
