Checkpointing and Localized Recovery for Nested Fork-Join Programs,arXiv - CS - Distributed, Parallel, and Cluster Computing



Checkpointing and Localized Recovery for Nested Fork-Join Programs
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2021-02-25, DOI: arxiv-2102.12941
Claudia Fohry

While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster programming, for instance, the application can then be finished on the intact nodes, and the lost tasks reassigned. This extended abstract proposes adapting a checkpointing and localized recovery technique, originally developed for independent tasks, to nested fork-join programs. We consider a Cilk-like work-stealing scheme with a work-first policy in a distributed-memory setting and describe the required algorithmic changes. The original technique has checkpointing overheads below 1% and negligible recovery costs; we expect the new algorithm to achieve similar performance.
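To make the idea of localized recovery concrete, the following is a minimal single-process simulation for independent tasks. It is an illustrative sketch only, not the algorithm from the paper: the names `Worker`, `snapshot`, and `run`, the round-based scheduling, and the choice of the lowest-numbered survivor as the heir are all assumptions made for this example. Each worker periodically saves its pending tasks and partial result; when one worker fails, a surviving worker adopts the failed worker's last checkpoint while all other workers keep their state and continue.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    wid: int
    tasks: list        # pending independent tasks (here: integers to square)
    partial: int = 0   # result accumulated locally so far

    def step(self):
        # Process one pending task, if any.
        if self.tasks:
            x = self.tasks.pop()
            self.partial += x * x


def snapshot(worker):
    # Copy of the worker's state; assumed to be stored off the worker's node.
    return list(worker.tasks), worker.partial


def run(n_workers=3, n_tasks=12, fail_wid=1, fail_at_step=3, ckpt_every=2):
    # Hypothetical setup: tasks are dealt round-robin to the workers.
    workers = {w: Worker(w, [t for t in range(n_tasks) if t % n_workers == w])
               for w in range(n_workers)}
    ckpts = {w: snapshot(wk) for w, wk in workers.items()}
    step = 0
    while any(wk.tasks for wk in workers.values()):
        step += 1
        for wk in workers.values():
            wk.step()
        if step == fail_at_step and fail_wid in workers:
            # The worker fails: its live state is lost, but not its checkpoint.
            del workers[fail_wid]
            tasks, partial = ckpts.pop(fail_wid)
            # Localized recovery: one survivor adopts the checkpointed state;
            # no other worker is rolled back or restarted.
            heir = workers[min(workers)]
            heir.tasks.extend(tasks)
            heir.partial += partial
            ckpts[heir.wid] = snapshot(heir)
        if step % ckpt_every == 0:
            for w, wk in workers.items():
                ckpts[w] = snapshot(wk)
    return sum(wk.partial for wk in workers.values())


if __name__ == "__main__":
    print(run() == sum(x * x for x in range(12)))
```

In this sketch, work the failed worker did after its last checkpoint is simply redone by the heir, which is safe because the tasks are independent. Nested fork-join programs add parent-child dependencies between tasks, which this simulation does not model.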


Updated: 2021-02-26
