A Modeling Framework for Reliability of Erasure Codes in SSD Arrays, IEEE Transactions on Computers

A Modeling Framework for Reliability of Erasure Codes in SSD Arrays
IEEE Transactions on Computers (IF 3.6), Pub Date: 2020-05-01, DOI: 10.1109/tc.2019.2962691
Mostafa Kishani, Saba Ahmadian, Hossein Asadi

The emergence of Solid-State Drives (SSDs) has transformed the data storage industry, where they are rapidly replacing Hard Disk Drives (HDDs) due to their superiority in performance and power. However, SSDs have reliability issues due to bit errors, bad blocks, and bad chips. To improve reliability, Redundant Array of Independent Disks (RAID) configurations, originally proposed to increase both the performance and reliability of HDDs, are also applied to SSD arrays. Conventional reliability models of HDD RAID, however, cannot be applied directly to SSD arrays, as the nature of failures in SSDs is entirely different from that in HDDs. Previous studies on the reliability of SSD arrays are based on outdated SSD failure data and focus only on a limited set of failure types (device failures and page failures caused by bit errors), while recent field studies have reported other failure types, including bad blocks and bad chips, as well as a high correlation between failures. In this paper, we investigate the reliability of SSD arrays using field storage traces and a real-system implementation of conventional and emerging erasure codes. Reliability is evaluated by statistical fault injection experiments that post-process the usage logs obtained from the real-system implementation, while the fault/failure attributes are obtained from state-of-the-art field data reported in previous works. As a case study, we examine conventional RAID5 and RAID6 and emerging Partial-MDS (PMDS) codes, Sector-Disk (SD) codes, and STAIR codes in terms of both reliability and performance using an open-source software RAID controller, MD (in Linux kernel version 3.10.0-327), and arrays of Samsung 850 Pro SSDs.
Our detailed analysis of the data loss breakdown shows that a) emerging erasure codes fail to replace RAID6 in terms of reliability, b) row-wise erasure codes are the most efficient choices for contemporary SSD devices, and c) previous models overestimate SSD array reliability by up to six orders of magnitude, as they focus only on the coincidence of bad pages (bit errors) and bad chips within a data stripe, which accounts for only a minority of the root causes of data loss in SSD arrays. Our experiments show that the combination of bad chips with bad blocks is the major source of data loss in RAID5 and emerging codes (contributing more than 54 and 90 percent of data loss in RAID5 and emerging codes, respectively), while RAID6 remains robust under these failure combinations. Finally, the fault injection results reveal that SSD array reliability, as well as the failure breakdown, is significantly correlated with SSD type.
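
For readers unfamiliar with statistical fault injection, the general idea can be illustrated with a very small Monte Carlo sketch. This is not the authors' framework: it assumes each device in a stripe fails independently with the same probability (the abstract notes that real SSD failures are highly correlated and of several types), and the array width, failure probability, and function names below are hypothetical values chosen only for illustration.

```python
import random


def stripe_lost(n_devices, tolerance, p_fail, rng):
    """Inject faults into one stripe; report loss if failures exceed the code's tolerance."""
    # Assumption: every device fails independently with probability p_fail.
    failures = sum(rng.random() < p_fail for _ in range(n_devices))
    return failures > tolerance


def estimate_loss_probability(n_devices, tolerance, p_fail, trials, seed=0):
    """Estimate the fraction of trials in which the stripe loses data."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    lost = sum(stripe_lost(n_devices, tolerance, p_fail, rng) for _ in range(trials))
    return lost / trials


if __name__ == "__main__":
    # Hypothetical parameters: 8 devices per stripe, 5% failure probability per device.
    for name, tolerance in (("RAID5", 1), ("RAID6", 2)):
        estimate = estimate_loss_probability(8, tolerance, 0.05, 100_000)
        print(name, estimate)
```

RAID5 tolerates one failed device per stripe and RAID6 tolerates two, so with the same seed the RAID6 estimate can never exceed the RAID5 estimate. A study like the one described would replace the independent coin flips with failure attributes drawn from field data and usage logs.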


Updated: 2020-05-01
