MPI-RCDD: A Framework for MPI Runtime Communication Deadlock Detection
Journal of Computer Science and Technology (IF 1.2), Pub Date: 2020-03-01, DOI: 10.1007/s11390-020-9701-4
Hong-Mei Wei, Jian Gao, Peng Qing, Kang Yu, Yan-Fei Fang, Ming-Lu Li

The message passing interface (MPI) has become the de facto standard programming model for high-performance computing, but its rich and flexible interface semantics make it easy for programs to run into communication deadlocks, which seriously affects the usability of the system. Existing tools for detecting MPI communication deadlock, however, do not scale well enough to keep pace with the continuing growth in system size. In this context, we propose MPI-RCDD, a framework for MPI runtime communication deadlock detection built on three main mechanisms. First, MPI-RCDD has a message logging protocol tied to deadlock detection, which ensures that the communication messages required for deadlock analysis are not lost. Second, it uses the asynchronous processing thread provided by MPI to transfer dependencies between processes, so that multiple processes can participate in deadlock detection simultaneously, alleviating the performance bottleneck of centralized analysis. Third, it uses an algorithm based on the AND⊕OR model, named AODA, to perform the deadlock analysis. AODA combines the advantages of timeout-based and dependency-based deadlock analysis: processes in the timeout state search for a deadlock cycle or knot during dependency transfer. Furthermore, AODA does not produce false positives and accurately identifies the source of the deadlock. Experimental results on typical MPI communication deadlock benchmarks such as the Umpire Test Suite demonstrate the capability of MPI-RCDD. In addition, experiments on the NPB benchmarks show a satisfactory performance cost, which indicates that MPI-RCDD scales well.
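
The distinction between a deadlock cycle and a knot can be illustrated on a small wait-for graph. The sketch below is not the AODA algorithm, whose details the abstract does not give; it only applies the textbook definitions, assuming that an AND wait (for example, a process blocked until all of several requests complete) is deadlocked when it lies on a cycle, and that an OR wait (for example, a receive posted with `MPI_ANY_SOURCE`) is deadlocked only when it lies in a knot. The graph layout and the function names `reachable`, `in_cycle`, and `in_knot` are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Set

# Hypothetical wait-for graph: process rank -> ranks it is blocked on.
WaitFor = Dict[int, List[int]]


def reachable(graph: WaitFor, start: int) -> Set[int]:
    """Return every rank reachable from `start` by following one or more edges."""
    seen: Set[int] = set()
    stack = list(graph.get(start, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen


def in_cycle(graph: WaitFor, start: int) -> bool:
    """AND semantics: `start` needs all of its targets, so lying on a cycle
    is enough for it to be deadlocked."""
    return start in reachable(graph, start)


def in_knot(graph: WaitFor, start: int) -> bool:
    """OR semantics: `start` needs any one target, so it is deadlocked only
    if every rank it can reach can also reach it back (a knot)."""
    targets = reachable(graph, start)
    return bool(targets) and all(start in reachable(graph, t) for t in targets)


# Ranks 0 and 1 each block on a message from the other.
mutual_wait: WaitFor = {0: [1], 1: [0]}

# Rank 0 accepts a message from rank 1 or rank 2; rank 1 waits on rank 0;
# rank 2 is not blocked and may still send.
wildcard_wait: WaitFor = {0: [1, 2], 1: [0], 2: []}
```

In `wildcard_wait`, rank 0 lies on a cycle with rank 1 but not in a knot, because rank 2 is still free to satisfy the wildcard receive. A detector that reported every cycle as a deadlock would raise a false positive here, which is one reason a combined AND/OR treatment matters.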

Updated: 2020-03-01