Toward a Smart Cloud: A Review of Fault-tolerance Methods in Cloud Systems
IEEE Transactions on Services Computing (IF 8.1), Pub Date: 2018-01-01, DOI: 10.1109/tsc.2018.2816644
Mukosi Abraham Mukwevho, Turgay Celik

This paper presents a comprehensive survey of state-of-the-art fault-tolerance methods proposed for cloud computing. The survey classifies fault-tolerance methods into three categories: 1) ReActive Methods (RAMs); 2) PRoactive Methods (PRMs); and 3) ReSilient Methods (RSMs). RAMs allow the system to enter a fault status and then try to recover it. PRMs try to prevent the system from entering a fault status by implementing mechanisms that avoid errors before they affect the system. The recently emerging RSMs, in contrast, aim to minimize the time it takes a system to recover from a fault. Machine learning and artificial intelligence have played an active role in the RSM domain: the recovery time is mapped to a function to be optimized (i.e., by converging the recovery time to a fraction of milliseconds). As the system learns to deal with new faults, the recovery time becomes shorter. Current issues and challenges in cloud fault tolerance are also discussed to identify promising areas for future research.

Updated: 2018-01-01