Minority Disk Failure Prediction based on Transfer Learning in Large Data Centers of Heterogeneous Disk Systems
IEEE Transactions on Parallel and Distributed Systems (IF 5.3), Pub Date: 2020-09-01, DOI: 10.1109/tpds.2020.2985346
Ji Zhang, Ke Zhou, Ping Huang, Xubin He, Ming Xie, Bin Cheng, Yongguang Ji, Yinhu Wang

The storage system in a large-scale data center is typically built upon thousands or even millions of disks, where disk failures happen constantly. A disk failure can lead to serious data loss and thus system unavailability, or even catastrophic consequences if the lost data cannot be recovered. While replication and erasure coding techniques have been widely deployed to guarantee storage availability and reliability, disk failure prediction is gaining popularity because it has the potential to prevent disk failures from occurring in the first place. Recent work has turned toward applying machine learning approaches based on disk SMART attributes to disk failure prediction. However, traditional machine learning (ML) approaches require a large set of training data in order to deliver good predictive performance. In large-scale storage systems, new disks enter gradually to augment storage capacity or to replace failed disks, so that over time a storage system comes to include small numbers of new disks from different vendors and/or of different models from the same vendor. We refer to these relatively small numbers of disks as minority disks. Due to the lack of sufficient training data, traditional ML approaches fail to deliver satisfactory predictive performance in evolving storage systems that consist of heterogeneous minority disks. To address this challenge and improve the predictive performance for minority disks in large data centers, we propose a minority disk failure prediction model named TLDFP based on a transfer learning approach. Our evaluation on two realistic datasets demonstrates that TLDFP delivers much more precise results and lower additional maintenance cost than four popular prediction models based on traditional ML algorithms and two state-of-the-art transfer learning methods.

Updated: 2020-09-01