Identifying Degree and Sources of Non-Determinism in MPI Applications Via Graph Kernels
IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2021-05-18, DOI: 10.1109/tpds.2021.3081530
Dylan Chapp 1, Nigel Tan 1, Sanjukta Bhowmick 2, Michela Taufer 1

As the scientific community prepares to deploy an increasingly complex and diverse set of applications on exascale platforms, the need to assess the reproducibility of simulations and to identify the root causes of reproducibility failures increases correspondingly. One of the greatest challenges to reproducibility at exascale is the inherent non-determinism at the level of inter-process communication. Non-deterministic communication constructs are necessary to boost performance, but communication non-determinism can also hamper software correctness and result reproducibility. To address this challenge, we propose a software framework for identifying the percentage and sources of communication non-determinism. We model parallel executions as directed graphs and leverage graph kernels to characterize run-to-run variations in inter-process communication. We demonstrate the effectiveness of graph kernel similarity as a proxy for non-determinism by showing that these kernels can quantify the type and degree of non-determinism present in communication patterns. To demonstrate our framework's ability to link runtime non-determinism to its root sources and to quantify it, we present results for an adaptive mesh refinement application, where our framework automatically quantifies the impact of function calls on non-determinism, and for a Monte Carlo application, where our framework automatically quantifies the impact of parameter configurations on non-determinism.
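
The idea of comparing two runs through a graph kernel can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' framework: the graph encoding (node labels such as `send@1`, edges for program order and message delivery), the function names, and the choice of a Weisfeiler-Lehman subtree kernel, which is one common graph kernel and may differ from the kernels used in the paper. The two toy runs model rank 0 receiving one message each from ranks 1 and 2, with the arrival order swapped between runs.

```python
from collections import Counter
from math import sqrt


def wl_features(labels, edges, iterations=2):
    """Count Weisfeiler-Lehman subtree labels of a directed, labeled graph."""
    # Each node is relabeled from the labels of the nodes pointing to it.
    preds = {node: [] for node in labels}
    for src, dst in edges:
        preds[dst].append(src)
    current = dict(labels)
    features = Counter(current.values())
    for _ in range(iterations):
        updated = {}
        for node in current:
            incoming = sorted(current[p] for p in preds[node])
            updated[node] = current[node] + "(" + ",".join(incoming) + ")"
        current = updated
        features.update(current.values())
    return features


def kernel(fa, fb):
    """Dot product of two label histograms."""
    return sum(count * fb[label] for label, count in fa.items())


def similarity(graph_a, graph_b, iterations=2):
    """Normalized kernel value: 1.0 for identical label histograms."""
    fa = wl_features(*graph_a, iterations)
    fb = wl_features(*graph_b, iterations)
    return kernel(fa, fb) / sqrt(kernel(fa, fa) * kernel(fb, fb))


# Hypothetical event graphs: (node labels, edges).
labels = {"s1": "send@1", "s2": "send@2", "r0": "recv@0", "r1": "recv@0"}
# Run A: the message from rank 1 is matched by the first receive.
run_a = (labels, [("s1", "r0"), ("s2", "r1"), ("r0", "r1")])
# Run B: the message from rank 2 arrives first instead.
run_b = (labels, [("s2", "r0"), ("s1", "r1"), ("r0", "r1")])

print(similarity(run_a, run_a))
print(similarity(run_a, run_b))
```

In this sketch a run compared with itself scores 1.0, while the two runs with swapped message order score lower, so a drop in similarity across repeated runs serves as a rough signal of communication non-determinism.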

Updated: 2021-06-04