Deterministic Data Distribution for Efficient Recovery in Erasure-Coded Storage Systems
IEEE Transactions on Parallel and Distributed Systems (IF 5.6), Pub Date: 2020-10-01, DOI: 10.1109/tpds.2020.2987837
Liangliang Xu, Min Lyu, Zhipeng Li, Yongkun Li, Yinlong Xu

Because individual commodity components are unreliable, failures are common in large-scale distributed storage systems. Erasure codes are widely deployed in practical storage systems to provide fault tolerance with low storage overhead. However, random data distribution (RDD), commonly used in erasure-coded storage systems, induces heavy cross-rack traffic, load imbalance, and random access, all of which adversely affect failure recovery. In this article, we use orthogonal arrays to define a Deterministic Data Distribution ($D^3$) that uniformly distributes data/parity blocks among nodes, and we propose an efficient failure recovery approach based on $D^3$ that minimizes the cross-rack repair traffic for a single node failure. Thanks to the uniformity of $D^3$, the proposed recovery approach balances the repair traffic not only among nodes within a rack but also among racks. We implement $D^3$ over Reed-Solomon (RS) codes and Locally Repairable Codes (LRCs) in the Hadoop Distributed File System (HDFS) on a cluster of 28 machines. Compared with RDD, our experiments show that $D^3$ speeds up failure recovery by up to 2.49 times for RS codes and 1.38 times for LRCs. Moreover, $D^3$ supports front-end applications better than RDD in both normal and recovery states.
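
To give a feel for why orthogonal arrays lend themselves to uniform placement, the following sketch builds a small strength-2 orthogonal array and checks its balance property. It is only an illustration of the combinatorial object named in the abstract, not the paper's $D^3$ layout: the function names, the textbook linear construction over a prime `p`, and the reading of rows as stripes and symbols as rack or node indices are all assumptions made here.

```python
from collections import Counter
from itertools import combinations, product


def orthogonal_array(p, k):
    """Rows of a strength-2 orthogonal array with p*p rows and k columns.

    Assumes p is prime and k <= p (not checked for primality).
    Row (i, j) holds the symbol (i + c*j) mod p in column c.
    """
    if not 1 <= k <= p:
        raise ValueError("need 1 <= k <= p")
    return [[(i + c * j) % p for c in range(k)]
            for i, j in product(range(p), repeat=2)]


def is_strength_two(rows, p):
    """True if every pair of columns shows each ordered symbol pair exactly once."""
    k = len(rows[0])
    for a, b in combinations(range(k), 2):
        counts = Counter((row[a], row[b]) for row in rows)
        if len(counts) != p * p or set(counts.values()) != {1}:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical reading: 25 stripes, 4 blocks per stripe, 5 placement targets.
    rows = orthogonal_array(5, 4)
    print(is_strength_two(rows, 5))
    # Per-column symbol counts: every target is used equally often.
    print([Counter(row[c] for row in rows) for c in range(4)])
```

Under this hypothetical reading, each column (block position) uses every symbol (placement target) equally often, and any two block positions are paired with every combination of targets equally often. That kind of uniformity is what a deterministic layout can rely on to spread repair reads evenly, whereas a random layout only approximates it on average.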

Updated: 2020-10-01