Self-healing Dilemmas in Distributed Systems: Fault-correction vs. Fault-tolerance
arXiv - CS - Performance, Pub Date: 2020-07-10, DOI: arxiv-2007.05261
Jovan Nikolic, Nursultan Jubatyrov, Evangelos Pournaras
Large-scale decentralized systems of autonomous agents interacting via
asynchronous communication often experience the following self-healing dilemma:
fault-detection inherits network uncertainties, which makes a faulty process
indistinguishable from a slow process. The implications can be dramatic:
self-healing mechanisms become biased and cost-ineffective. In particular,
triggering an undesirable fault-correction results in new faults that could
have been prevented with fault-tolerance instead. Nevertheless, fault-tolerance
alone, without eventually correcting persistent faults, also makes systems
underperform. Measuring, understanding and resolving such self-healing dilemmas
is a timely challenge and critical requirement given the rise of distributed
ledgers, edge computing and the Internet of Things in several application
domains such as energy, transport and health. This paper introduces a novel and
general-purpose modeling of fault scenarios. These scenarios can accurately measure and
predict inconsistencies generated by fault-correction and fault-tolerance when
each node in a network can monitor the health status of another node, while
both can defect. In contrast to related work, no information about the
computational/application scenario, overlying algorithms or application data is
required. A rigorous experimental methodology is designed that evaluates 696
experimental settings of different fault scales, fault profiles and fault
detection thresholds, each with almost 9M measurements of inconsistencies in a
prototyped decentralized network of 3000 nodes. The prediction performance of
the modeled fault scenarios is validated in a challenging application scenario
of decentralized and dynamic in-network aggregation using real-world data from
a Smart Grid pilot project. The findings confirm the origin of inconsistencies
at the design phase and provide new insights into how to tune self-healing
mechanisms at that phase.
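
The dilemma stated in the abstract can be illustrated with a minimal sketch. It is not the paper's model: the `Observation` record, the `classify` function, the silence values and the thresholds are all hypothetical, chosen only to show why a timeout-based detector cannot separate a slow node from a failed one, so that any detection threshold yields either an undesirable fault-correction or a tolerated persistent fault.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    node: str
    silence: float  # seconds since the monitor last heard from the node (hypothetical)
    crashed: bool   # ground truth, which the monitor cannot observe


def classify(obs, threshold):
    """Compare a timeout-based suspicion with the ground truth."""
    suspected = obs.silence > threshold
    if suspected and not obs.crashed:
        return "false positive"  # fault-correction triggered on a merely slow node
    if not suspected and obs.crashed:
        return "false negative"  # a real fault is tolerated instead of corrected
    return "consistent"


# A slow node and a crashed node look identical to the monitor: same silence.
observations = [
    Observation("slow", silence=4.0, crashed=False),
    Observation("crashed", silence=4.0, crashed=True),
    Observation("healthy", silence=0.5, crashed=False),
]


def count_inconsistencies(threshold):
    """Number of decisions that disagree with the ground truth."""
    return sum(classify(o, threshold) != "consistent" for o in observations)


for threshold in (2.0, 5.0):
    decisions = {o.node: classify(o, threshold) for o in observations}
    print(threshold, decisions, count_inconsistencies(threshold))
```

In this toy setting, a low threshold wrongly corrects the slow node and a high threshold leaves the crashed node uncorrected, so at least one inconsistency remains whatever the threshold. The paper's fault scenarios measure such inconsistencies at scale; this sketch only shows where they come from.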
Updated: 2020-07-13