Study of interconnect errors, network congestion, and applications characteristics for throttle prediction on a large scale HPC system
Journal of Parallel and Distributed Computing (IF 3.4), Pub Date: 2021-03-22, DOI: 10.1016/j.jpdc.2021.03.001
Mohit Kumar, Saurabh Gupta, Tirthak Patel, Michael Wilder, Weisong Shi, Song Fu, Christian Engelmann, Devesh Tiwari

Today’s High Performance Computing (HPC) systems contain thousands of nodes that work together to deliver performance on the order of petaflops. The performance of these systems depends on components such as processors, memory, and the interconnect. Of these, the interconnect plays a major role because it glues together all the hardware components in an HPC system. A slow interconnect can severely impact scientific applications that run on multiple processes, because they rely on fast network messages to communicate and synchronize frequently. Unfortunately, the HPC community lacks a study that explores different interconnect errors, congestion events, and application characteristics on a large-scale HPC system. In our previous work, we processed and analyzed interconnect data from the Titan supercomputer to develop a thorough understanding of interconnect faults, errors, and congestion events. In this work, we first show how congestion events can impact application performance. We then investigate how application characteristics interact with interconnect errors and network congestion, and use these interactions to predict which applications will encounter congestion with more than 90% accuracy.




Updated: 2021-04-04