“Endless” Workload Analysis of Large-Scale Supercomputers
Lobachevskii Journal of Mathematics, Pub Date: 2021-02-26, DOI: 10.1134/s1995080221010236
P. A. Shvets, V. V. Voevodin

Abstract

Modern supercomputers are so large and complex that some of their hardware components inevitably fail from time to time. Supercomputer systems therefore require constant and careful health monitoring, and such monitoring is part of everyday practice at any large HPC center. Considerable attention should also be paid to the quality of supercomputer usage, that is, how fully and efficiently computational resources are utilized. This task is still far from solved, so system administrators of most supercomputers know very little about the quality of their job flow or about possible ways to improve it. In this paper, we present a looped report system that makes it possible to obtain and analyze information at any level of detail about all important aspects of supercomputer workload quality, from overall system functioning down to individual job launches. It provides great flexibility by offering an “endless” number of workload analysis scenarios, which makes it possible to determine the root causes of various cases of performance degradation using the same approach. This report system is built upon the previously developed TASC software package, which is aimed at identifying and analyzing performance issues both at the level of individual parallel applications and at the level of the entire supercomputer.




Updated: 2021-02-26