Self-healing Dilemmas in Distributed Systems: Fault-correction vs. Fault-tolerance



arXiv - CS - Performance, Pub Date: 2020-07-10, DOI: arxiv-2007.05261
Jovan Nikolic, Nursultan Jubatyrov, Evangelos Pournaras

Large-scale decentralized systems of autonomous agents that interact via asynchronous communication often face the following self-healing dilemma: fault detection inherits network uncertainties, which makes a faulty process indistinguishable from a slow one. The implications can be dramatic: self-healing mechanisms become biased and cost-ineffective. In particular, triggering an undesirable fault-correction results in new faults that could have been prevented with fault-tolerance instead. Nevertheless, fault-tolerance alone, without eventually correcting persistent faults, also causes systems to underperform. Measuring, understanding and resolving such self-healing dilemmas is a timely challenge and a critical requirement given the rise of distributed ledgers, edge computing and the Internet of Things in the application domains of energy, transport and health. This paper introduces a novel and general-purpose modeling of fault scenarios. The modeled scenarios can accurately measure and predict the inconsistencies generated by fault-correction and fault-tolerance when each node in a network can monitor the health status of another node, while both can defect. In contrast to related work, no information about the computational/application scenario, overlying algorithms or application data is required. A rigorous experimental methodology is designed that evaluates 696 experimental settings of different fault scales, fault profiles and fault detection thresholds, each with almost 9M measurements of inconsistencies in a prototyped decentralized network of 3000 nodes. The prediction performance of the modeled fault scenarios is validated in a challenging application scenario of decentralized and dynamic in-network aggregation using real-world data from a Smart Grid pilot project. The findings confirm the origin of inconsistencies at the design phase and provide new insights into how to tune self-healing mechanisms at that phase.
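
The dilemma described in the abstract can be illustrated with a minimal timeout-based monitor. This is only a sketch under stated assumptions, not the paper's fault-scenario model: the names `Node`, `probe`, `reply_delay` and `timeout` are hypothetical, and a single reply deadline stands in for the fault detection threshold. The monitor observes only whether a reply arrived before the deadline, so a slow healthy node and a crashed node can produce the same observation.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """Hypothetical monitored node (illustration only, not from the paper)."""
    name: str
    alive: bool         # ground truth, never visible to the monitor
    reply_delay: float  # seconds before a reply arrives; unused if not alive


def probe(node: Node, timeout: float) -> Tuple[bool, float]:
    """Return (suspected, seconds_waited) as seen by a monitoring node.

    The monitor cannot read `node.alive`; it only learns whether a reply
    arrived within `timeout` (the assumed fault detection threshold).
    """
    if node.alive and node.reply_delay <= timeout:
        return False, node.reply_delay
    # No reply in time: either crashed or merely slow, the monitor cannot tell.
    return True, timeout


if __name__ == "__main__":
    nodes = [
        Node("fast", alive=True, reply_delay=0.1),
        Node("slow", alive=True, reply_delay=2.0),
        Node("crashed", alive=False, reply_delay=0.0),
    ]
    for timeout in (0.5, 5.0):
        for node in nodes:
            suspected, waited = probe(node, timeout)
            print(f"timeout={timeout} node={node.name} "
                  f"suspected={suspected} waited={waited}")
```

With the short deadline, the slow node is suspected exactly like the crashed one, which would trigger an unnecessary fault-correction. With the long deadline, the slow node is tolerated, but the crashed node stays undetected for the full waiting period.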


Updated: 2020-07-13
