HyCoR: Fault-Tolerant Replicated Containers Based on Checkpoint and Replay
arXiv - CS - Operating Systems Pub Date : 2021-01-23 , DOI: arxiv-2101.09584
Diyu Zhou, Yuval Tamir

HyCoR is a fully operational fault tolerance mechanism for multiprocessor workloads, based on container replication, using a hybrid of checkpointing and replay. HyCoR derives from two insights regarding replication mechanisms: 1) deterministic replay can overcome a key disadvantage of checkpointing alone, namely unacceptably long delays of outputs to clients, and 2) checkpointing can overcome a key disadvantage of active replication with deterministic replay alone, namely vulnerability to even rare replay failures due to untracked nondeterministic events. With HyCoR, the primary sends periodic checkpoints to the backup and logs the outcomes of sources of nondeterminism. Outputs to clients are delayed only by the short time it takes to send the corresponding log to the backup. Upon primary failure, the backup replays only the short interval since the last checkpoint, thus minimizing the window of vulnerability. HyCoR includes a "best effort" mechanism that results in a high recovery rate even in the presence of data races, as long as their rate is low. The evaluation includes measurement of the recovery rate and recovery latency based on fault injection. On average, HyCoR delays responses to clients by less than 1 ms and recovers in less than 1 s. For a set of eight real-world benchmarks, if data races are eliminated, the performance overhead of HyCoR is under 59%.
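
The interplay of checkpointing and replay described in the abstract can be illustrated with a toy model. The sketch below is not HyCoR's implementation: the `Primary` and `Backup` classes, the `apply_request` function, the use of a seeded random number as a stand-in for a nondeterministic event, and the checkpoint interval of three requests are all assumptions made for illustration. It only shows the general idea that the primary forwards each logged nondeterministic outcome before releasing an output, and that the backup recovers by restoring the last checkpoint and replaying the log entries received since then.

```python
import copy
import random


def apply_request(state, nondeterministic_value):
    """Deterministic state update once the nondeterministic outcome is fixed."""
    state["total"] += nondeterministic_value
    return state["total"]


class Backup:
    """Hypothetical backup: keeps the latest checkpoint and the log since it."""

    def __init__(self):
        self.checkpoint = {"total": 0}
        self.log = []

    def receive_checkpoint(self, state):
        self.checkpoint = copy.deepcopy(state)
        self.log = []  # entries covered by the checkpoint are no longer needed

    def receive_log(self, entries):
        self.log.extend(entries)

    def recover(self):
        # Restore the last checkpoint, then replay only the logged outcomes.
        state = copy.deepcopy(self.checkpoint)
        for value in self.log:
            apply_request(state, value)
        return state


class Primary:
    """Hypothetical primary: logs nondeterminism and checkpoints periodically."""

    def __init__(self, backup, checkpoint_every=3, seed=0):
        self.state = {"total": 0}
        self.backup = backup
        self.checkpoint_every = checkpoint_every
        self.rng = random.Random(seed)
        self.handled = 0

    def handle_request(self):
        value = self.rng.randint(1, 100)  # stand-in for a nondeterministic event
        output = apply_request(self.state, value)
        # The output is returned only after the backup has the matching log entry.
        self.backup.receive_log([value])
        self.handled += 1
        if self.handled % self.checkpoint_every == 0:
            self.backup.receive_checkpoint(self.state)
        return output


backup = Backup()
primary = Primary(backup)
outputs = [primary.handle_request() for _ in range(5)]

# Simulate a primary failure: the backup rebuilds the state on its own.
recovered = backup.recover()
```

In this toy run the backup replays only the two entries logged after the checkpoint taken at the third request, and the recovered state matches the primary's state at the time of the simulated failure. A real system would also have to handle the failure modes the abstract mentions, such as untracked nondeterminism and data races, which this sketch ignores.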

Updated: 2021-01-26