TeaMPI -- Replication-based Resilience without the (Performance) Pain
arXiv - CS - Performance, Pub Date: 2020-05-25, DOI: arxiv-2005.12091
Philipp Samfass, Tobias Weinzierl, Benjamin Hazelwood, Michael Bader

In an era where we cannot afford to checkpoint frequently, replication is a generic way forward to construct numerical simulations that can continue to run even if hardware parts fail. Yet, replication often is not employed on larger scales, as naïvely mirroring a computation once effectively halves the machine size, and as keeping replicated simulations consistent with each other is not trivial. We demonstrate for the ExaHyPE engine -- a task-based solver for hyperbolic equation systems -- that it is possible to realise resiliency without major code changes on the user side, while we introduce a novel algorithmic idea where replication reduces the time-to-solution. The redundant CPU cycles are not burned "for nothing". Our work employs a weakly consistent data model where replicas run independently yet inform each other through heartbeat messages whether they are still up and running. Our key performance idea is to let the tasks of the replicated simulations share some of their outcomes, while we shuffle the actual task execution order per replica. This way, replicated ranks can skip some local computations and automatically start to synchronise with each other. Our experiments with a production-level seismic wave-equation solver provide evidence that this novel concept has the potential to make replication affordable for large-scale simulations in high-performance computing.
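
The task-sharing idea in the abstract can be illustrated with a small toy model. The sketch below is not TeaMPI or ExaHyPE code: the function `run_replicas`, the lockstep rounds, the squaring "task", and the seeded shuffle are all assumptions made purely for illustration, and heartbeats and failures are not modelled. Each replica walks the same task set in its own shuffled order, shares every outcome it computes, and skips any task whose outcome it has already received.

```python
import random


def run_replicas(num_tasks, num_replicas, seed=0):
    """Toy lockstep model of replicas that share task outcomes.

    Hypothetical illustration only; not the TeaMPI/ExaHyPE implementation.
    Returns (results per replica, number of tasks each replica computed).
    """
    def task(i):
        return i * i  # stand-in for an expensive, deterministic task

    # Every replica sees the same tasks but in its own shuffled order.
    orders = []
    for r in range(num_replicas):
        order = list(range(num_tasks))
        random.Random(seed + r).shuffle(order)
        orders.append(order)

    results = [dict() for _ in range(num_replicas)]
    computed = [0] * num_replicas

    while any(len(res) < num_tasks for res in results):
        # Step 1: each replica picks its next task without a known outcome.
        picks = [
            next((t for t in orders[r] if t not in results[r]), None)
            for r in range(num_replicas)
        ]
        # Step 2: replicas compute their picks and share the outcomes.
        # Two replicas may pick the same task in one round; both then
        # compute it, which models outcomes arriving with a delay.
        for r, t in enumerate(picks):
            if t is None:
                continue
            value = task(t)
            computed[r] += 1
            for res in results:
                res.setdefault(t, value)

    return results, computed


if __name__ == "__main__":
    results, computed = run_replicas(num_tasks=16, num_replicas=2)
    print("tasks computed per replica:", computed)
```

In this model every replica ends up with the full set of outcomes, while the per-replica count of locally computed tasks can fall below the total number of tasks. That is the sense in which the redundant cycles are not spent "for nothing".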

Updated: 2020-07-02