Running resilient MPI applications on a Dynamic Group of Recommended Processes
Journal of the Brazilian Computer Society, Pub Date: 2018-03-12, DOI: 10.1186/s13173-018-0069-z
Edson Tavares de Camargo, Elias P. Duarte

High-performance computing systems run applications that can take several hours to execute and must cope with a potentially large number of faults. Most existing fault-tolerant strategies for these systems assume crash faults that are permanent events and are easily detected. This is not the case in several real systems, in particular in shared clusters, where even load variation may cause performance problems that are virtually equivalent to faults. In this work, we present a new model to deal with this problem, in which processes execute tests among themselves to determine whether the processors (or cores) on which they are running are recommended or non-recommended. Processes classified as recommended form a Dynamic Group of Recommended Processes (DGRP) that runs the application. The DGRP is formed only by processes that have not been tested as non-recommended by all DGRP processes. A process not in the DGRP that is continuously tested as recommended can rejoin the DGRP after a round of consensus executed by the DGRP processes. We present experimental results from an MPI-based implementation in which the HyperQuickSort parallel sorting algorithm reconfigures itself at runtime to tolerate up to N − 1 faults (in a system with N processes) while sorting up to 1 billion integers.
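
The membership rule described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' MPI implementation: the function names `update_dgrp` and `can_rejoin`, the `results` dictionary layout, the `rounds_required` threshold, and the use of unanimous agreement as a stand-in for the consensus round are all hypothetical. It also assumes one reading of the rule, namely that a member is dropped only when every other current member has tested it as non-recommended.

```python
def update_dgrp(dgrp, results):
    """Return the DGRP after one round of tests.

    dgrp:    set of process ids currently in the group.
    results: dict mapping (tester, tested) -> True if the tester
             classified the tested process as recommended.
    Assumption: a missing test result counts as "recommended".
    """
    survivors = set()
    for p in dgrp:
        testers = [q for q in dgrp if q != p]
        # Drop p only if there is at least one tester and all of
        # them classified p as non-recommended.
        rejected_by_all = bool(testers) and all(
            not results.get((q, p), True) for q in testers
        )
        if not rejected_by_all:
            survivors.add(p)
    return survivors


def can_rejoin(candidate, dgrp, history, rounds_required=3):
    """Decide whether an excluded process may rejoin the DGRP.

    history: list of per-round results dicts, oldest first.
    Assumption: "continuously tested as recommended" means every
    DGRP member reported the candidate as recommended in each of
    the last `rounds_required` rounds; the unanimity check stands
    in for the consensus round.
    """
    if not dgrp or len(history) < rounds_required:
        return False
    recent = history[-rounds_required:]
    return all(
        r.get((q, candidate), False) for r in recent for q in dgrp
    )
```

For example, with `dgrp = {0, 1, 2, 3}`, a round in which processes 0, 1, and 2 all test process 3 as non-recommended removes 3 from the group, while a process rejected by only one member stays. A group of size one never removes its last member, which matches the abstract's claim of tolerating up to N − 1 faults.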

Updated: 2018-03-12