Reliability Analysis of Storage Systems With Partially Repairable Devices
IEEE Transactions on Device and Materials Reliability (IF 2), Pub Date: 2021-05-05, DOI: 10.1109/tdmr.2021.3077848
Serkay Olmez

Modern storage devices such as hard disk drives (HDDs) and solid state drives (SSDs) have reached capacities beyond 18 TB. When such a device fails, its data must be recovered from parity. Given the large capacities, the recovery process may take up to a few days, depending on the available bandwidth and the erasure coding scheme implemented. During the recovery, the system is vulnerable to data loss if additional devices fail, so it is important to complete the recovery as quickly as possible. The recovery can be accelerated if the data on the failed device is only partially corrupted and the remaining portion is still accessible. This is indeed the case for storage devices that consist of multiple physical recording subsystems. For example, modern HDDs have up to 18 heads, and SSDs have multiple flash chips. These subsystems may fail independently without affecting the rest of the components in the device. In this work, we study the durability of data when a device is allowed to stay online even after a number of its subcomponents have failed. In addition to extending the lifetime of the devices, this allows for faster recovery of the critical data stored on the failed subsystem, which results in significant gains in the overall data durability of the storage system.
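The recovery-time argument in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not the paper's model: the function name `rebuild_time_hours`, the 100 MB/s rebuild bandwidth, and the assumption that data is spread evenly across the 18 heads are all hypothetical choices made only for illustration.

```python
def rebuild_time_hours(capacity_tb: float,
                       bandwidth_mb_s: float,
                       failed_fraction: float = 1.0) -> float:
    """Hours needed to re-create the lost portion of a device.

    Assumes a constant rebuild bandwidth and decimal units
    (1 TB = 1e12 bytes, 1 MB = 1e6 bytes). `failed_fraction` is the
    share of the device's data that became inaccessible.
    """
    lost_bytes = capacity_tb * 1e12 * failed_fraction
    seconds = lost_bytes / (bandwidth_mb_s * 1e6)
    return seconds / 3600.0


# Hypothetical scenario: an 18 TB HDD rebuilt at 100 MB/s.
full = rebuild_time_hours(18, 100)  # whole device lost
# Single head lost, assuming data is spread evenly over 18 heads.
partial = rebuild_time_hours(18, 100, failed_fraction=1 / 18)

print(f"full-device rebuild: {full:.1f} h")
print(f"single-head rebuild: {partial:.1f} h")
```

Under these assumed numbers, a full rebuild takes about two days while rebuilding the data behind one failed head takes a few hours, which is the shortened window of vulnerability the abstract refers to.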

Updated: 2021-06-08