On Failure Diagnosis of the Storage Stack
arXiv - CS - Operating Systems. Pub Date: 2020-05-06, DOI: arxiv-2005.02547
Duo Zhang, Om Rameshwar Gatla, Runzhou Han, Mai Zheng

Diagnosing storage system failures is challenging even for professionals. One example is the "When Solid State Drives Are Not That Solid" incident that occurred at an Algolia data center, where Samsung SSDs were mistakenly blamed for failures caused by a Linux kernel bug. As system complexity keeps increasing, such obscure failures will likely occur more often. As one step toward addressing this challenge, we present our ongoing effort called X-Ray. Unlike traditional methods that focus on either the software or the hardware, X-Ray leverages virtualization to collect events across layers and correlates them to generate a correlation tree. Moreover, by applying simple rules, X-Ray can highlight critical nodes automatically. Preliminary results based on 5 failure cases show that X-Ray can effectively narrow down the search space for failures.


