Fail-Slow at Scale
ACM Transactions on Storage (IF 2.1), Pub Date: 2018-10-03, DOI: 10.1145/3242086
Haryadi S. Gunawi 1 , Riza O. Suminto 1 , Russell Sears 2 , Casey Golliher 2 , Swaminathan Sundararaman 3 , Xing Lin 4 , Tim Emami 4 , Weiguang Sheng 5 , Nematollah Bidokhti 5 , Caitie McCaffrey 6 , Deepthi Srinivasan 7 , Biswaranjan Panda 7 , Andrew Baptist 8 , Gary Grider 9 , Parks M. Fields 9 , Kevin Harms 10 , Robert B. Ross 10 , Andree Jacobson 11 , Robert Ricci 12 , Kirk Webb 12 , Peter Alvaro 13 , H. Birali Runesha 14 , Mingzhe Hao 1 , Huaicheng Li 1

Fail-slow hardware is an under-studied failure mode. We present a study of 114 reports of fail-slow hardware incidents, collected from large-scale cluster deployments in 14 institutions. We show that all hardware types, including disk, SSD, CPU, memory, and network components, can exhibit performance faults. We make several important observations: faults convert from one form to another, the cascading root causes and impacts can be long, and fail-slow faults can have varying symptoms. From this study, we make suggestions to vendors, operators, and systems designers.

Updated: 2018-10-03