Design and Evaluation of a Risk-Aware Failure Identification Scheme for Improved RAS in Erasure-Coded Data Centers
IEEE Transactions on Parallel and Distributed Systems (IF 5.6), Pub Date: 2021-01-01, DOI: 10.1109/tpds.2020.3010048
Weichen Huang, Juntao Fang, Shenggang Wan, Changsheng Xie, Xubin He

The data reliability, availability, and serviceability (RAS) of erasure-coded data centers are strongly affected by the data repair that node failures induce. In a traditional failure identification scheme, all chunks share the same identification time threshold, which forgoes opportunities to further improve the RAS. To solve this problem, we propose RAFI, a novel risk-aware failure identification scheme. In RAFI, chunk failures are identified using different time thresholds depending on how many failed chunks their stripe is experiencing. For chunks in a high-risk stripe, a shorter identification time is adopted, thus improving the overall data reliability and availability. For chunks in a low-risk stripe, a longer identification time is adopted, thus reducing the repair network traffic. Therefore, all three aspects of RAS can be improved simultaneously. We also propose three optimization techniques to reduce the additional overhead that RAFI imposes on management nodes and to ensure that RAFI works properly in large-scale clusters. We use simulation, emulation, and a prototype implementation to evaluate RAFI from multiple aspects. Simulation and prototype results demonstrate the effectiveness and correctness of RAFI, and running the emulator demonstrates the performance improvement that the optimization techniques bring to RAFI.
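
The core idea of risk-dependent thresholds can be illustrated with a minimal sketch. This is not the authors' implementation: the threshold values, the linear interpolation rule, and the names `Stripe`, `identification_threshold`, and `chunks_to_repair` are all assumptions made for illustration, since the abstract gives no concrete parameters.

```python
from dataclasses import dataclass

# Hypothetical thresholds in seconds; the abstract states no concrete values.
BASE_THRESHOLD_S = 900.0  # assumed timeout for a low-risk stripe
MIN_THRESHOLD_S = 60.0    # assumed timeout for a stripe at the highest risk


@dataclass
class Stripe:
    stripe_id: int
    parity_chunks: int         # m in a (k, m) code: chunk losses the stripe tolerates
    unresponsive_since: dict   # chunk index -> time (s) at which its node went silent


def identification_threshold(stripe: Stripe) -> float:
    """Return a shorter threshold the more chunks of the stripe are unresponsive.

    Assumed rule: interpolate linearly from BASE_THRESHOLD_S (one suspected
    chunk) down to MIN_THRESHOLD_S (suspected chunks == parity_chunks).
    """
    suspected = len(stripe.unresponsive_since)
    if suspected <= 1:
        return BASE_THRESHOLD_S
    steps = max(stripe.parity_chunks - 1, 1)
    risk = min(suspected - 1, steps) / steps if stripe.parity_chunks > 1 else 0.0
    return BASE_THRESHOLD_S - risk * (BASE_THRESHOLD_S - MIN_THRESHOLD_S)


def chunks_to_repair(stripe: Stripe, now: float) -> list:
    """Chunks whose silence has lasted at least the stripe's current threshold."""
    threshold = identification_threshold(stripe)
    return sorted(
        idx for idx, since in stripe.unresponsive_since.items()
        if now - since >= threshold
    )


low_risk = Stripe(stripe_id=1, parity_chunks=3, unresponsive_since={4: 0.0})
high_risk = Stripe(stripe_id=2, parity_chunks=3,
                   unresponsive_since={0: 0.0, 1: 100.0, 2: 200.0})
```

With these assumed values, at `now=300.0` the low-risk stripe is still waiting (its single silent chunk has not reached the long threshold, so a transient outage would trigger no repair traffic), while every silent chunk in the high-risk stripe is already identified as failed and can be queued for repair.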

Updated: 2021-01-01