McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression, Scientific Programming



McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression
Scientific Programming (IF 1.672), Pub Date: 2013, DOI: 10.3233/spr-130371
Tanzima Zerin Islam, Kathryn Mohror, Saurabh Bagchi, Adam Moody, Bronis R. de Supinski, Rudolf Eigenmann

High-performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their state in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. These overheads force large-scale applications to checkpoint less frequently, which means more compute time is lost in the event of a failure. We alleviate this problem with a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes, using knowledge of the data semantics available through widely used I/O libraries such as HDF5 and netCDF, and then compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
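The difference between simple concatenation and data-aware aggregation can be illustrated with a minimal Python sketch. This is not the mcrEngine implementation: it assumes each process's checkpoint is a plain dictionary mapping variable names to raw bytes (standing in for the variable metadata an I/O library such as HDF5 or netCDF would expose), uses `zlib` as a stand-in compressor, and all function names (`concat_compress`, `aggregate_compress`, `restore`, `make_checkpoint`) are hypothetical.

```python
import json
import struct
import zlib


def concat_compress(checkpoints):
    """Baseline: append each process's checkpoint whole, then compress."""
    blob = b"".join(
        data for ckpt in checkpoints for _, data in sorted(ckpt.items())
    )
    return zlib.compress(blob)


def aggregate_compress(checkpoints):
    """Data-aware sketch: store same-named variables from all processes
    next to each other before compressing (assumed layout, not mcrEngine's)."""
    names = sorted({name for ckpt in checkpoints for name in ckpt})
    index = []   # (variable name, process rank, byte length), needed for restart
    chunks = []
    for name in names:
        for rank, ckpt in enumerate(checkpoints):
            if name in ckpt:
                index.append((name, rank, len(ckpt[name])))
                chunks.append(ckpt[name])
    header = json.dumps(index).encode("utf-8")
    payload = struct.pack("<I", len(header)) + header + b"".join(chunks)
    return zlib.compress(payload)


def restore(blob, num_procs):
    """Rebuild the per-process checkpoints from an aggregated blob."""
    payload = zlib.decompress(blob)
    (header_len,) = struct.unpack_from("<I", payload, 0)
    index = json.loads(payload[4:4 + header_len].decode("utf-8"))
    offset = 4 + header_len
    checkpoints = [{} for _ in range(num_procs)]
    for name, rank, length in index:
        checkpoints[rank][name] = payload[offset:offset + length]
        offset += length
    return checkpoints


def make_checkpoint(rank):
    """Synthetic checkpoint: one float array and one integer array."""
    temperature = struct.pack(
        "<256d", *(300.0 + 0.01 * i + rank for i in range(256))
    )
    step = struct.pack("<256i", *([1000] * 256))
    return {"temperature": temperature, "step": step}


checkpoints = [make_checkpoint(rank) for rank in range(4)]
merged = aggregate_compress(checkpoints)
assert restore(merged, 4) == checkpoints  # restart recovers every process's state
print("concatenated:", len(concat_compress(checkpoints)), "bytes")
print("aggregated:  ", len(merged), "bytes")
```

The sketch only shows the mechanism: grouping like variables places similar byte patterns close together, which is what gives a compressor more redundancy to exploit. The sizes it prints depend on the synthetic data and say nothing about the gains reported above.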


Updated: 2020-09-25
