Analytic Modeling of Idle Waves in Parallel Programs: Communication, Cluster Topology, and Noise Impact
arXiv - CS - Performance, Pub Date: 2021-03-04, DOI: arxiv-2103.03175
Ayesha Afzal, Georg Hager, Gerhard Wellein

Most distributed-memory bulk-synchronous parallel programs in HPC assume that compute resources are available continuously and homogeneously across the allocated set of compute nodes. However, long one-off delays on individual processes can cause global disturbances, so-called idle waves, by rippling through the system. This process is mainly governed by the communication topology of the underlying parallel code. This paper makes significant contributions to the understanding of idle wave dynamics. We study the propagation mechanisms of idle waves across the ranks of MPI-parallel programs. We present a validated analytic model for their propagation velocity with respect to communication parameters and topology, with a special emphasis on sparse communication patterns. We study the interaction of idle waves with MPI collectives and show that, depending on the implementation, a collective may be transparent to the wave. Finally, we analyze two mechanisms of idle wave decay: topological decay, which is rooted in differences in communication characteristics among parts of the system, and noise-induced decay, which is caused by system or application noise. We show that noise-induced decay is largely independent of noise characteristics and depends only on the overall noise power. An analytic expression for the idle wave decay rate with respect to noise power is derived. For model validation, we use microbenchmarks and stencil algorithms on three different supercomputing platforms.
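
As a rough illustration of how a one-off delay can ripple across ranks, the following is a minimal toy simulation, not the authors' analytic model. It assumes a noise-free open chain of ranks that compute for a fixed time and then block until every neighbor within a given distance has finished the same iteration, with zero communication time. All names and parameters (`simulate_idle_wave`, `wave_front`, `distance`, `delay`) are hypothetical and chosen only for this sketch.

```python
def simulate_idle_wave(n_ranks=8, n_iters=6, t_exec=1.0,
                       delay=5.0, delayed_rank=0, distance=1):
    """Toy lockstep model; returns idle[k][i], the time rank i waits in iteration k.

    Assumptions (not from the paper): open chain of ranks, blocking exchange
    with all neighbors up to `distance` ranks away, zero communication time,
    no system noise.
    """
    start = [0.0] * n_ranks
    idle = []
    for k in range(n_iters):
        # Every rank computes for t_exec; one rank is delayed once, in iteration 0.
        finish = [s + t_exec for s in start]
        if k == 0:
            finish[delayed_rank] += delay
        # A rank may proceed only after it and all its neighbors have finished.
        next_start = []
        for i in range(n_ranks):
            lo = max(0, i - distance)
            hi = min(n_ranks - 1, i + distance)
            next_start.append(max(finish[lo:hi + 1]))
        idle.append([ns - f for ns, f in zip(next_start, finish)])
        start = next_start
    return idle


def wave_front(idle, threshold=1e-9):
    """For each iteration, list the ranks that are currently waiting."""
    return [[i for i, w in enumerate(row) if w > threshold] for row in idle]


if __name__ == "__main__":
    for k, ranks in enumerate(wave_front(simulate_idle_wave())):
        print(f"iteration {k}: idle ranks {ranks}")
```

In this toy setting the idle period moves away from the delayed rank by `distance` ranks per iteration and keeps its full length, since there is no noise or topological inhomogeneity to make it decay. Real MPI programs depend on further factors such as the message protocol and the communication pattern, which this sketch does not capture.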

Updated: 2021-03-05