HaRD: a heterogeneity-aware replica deletion for HDFS
Journal of Big Data ( IF 8.1 ) Pub Date : 2019-10-21 , DOI: 10.1186/s40537-019-0256-6
Hilmi Egemen Ciritoglu , John Murphy , Christina Thorpe

The Hadoop Distributed File System (HDFS) is responsible for storing very large datasets reliably on clusters of commodity machines. HDFS takes advantage of replication to serve data requested by clients with high throughput. Data replication is a trade-off between better data availability and higher disk usage. Recent studies propose different data replication management frameworks that alter the replication factor of files dynamically in response to the popularity of the data, keeping more replicas of in-demand data to enhance the overall performance of the system. When data becomes less popular, these schemes reduce the replication factor, which changes the data distribution and leads to an unbalanced data distribution. Such an unbalanced data distribution causes hot spots, low data locality, and excessive network usage in the cluster. In this work, we first confirm that reducing the replication factor causes unbalanced data distribution when using Hadoop’s default replica deletion scheme. Then, we show that even keeping a balanced data distribution using WBRD (the data-distribution-aware replica deletion scheme we proposed in previous work) performs sub-optimally on heterogeneous clusters. To overcome this issue, we propose a heterogeneity-aware replica deletion scheme (HaRD). HaRD considers the nodes’ processing capabilities when deleting replicas; hence it stores more replicas on the more powerful nodes. We implemented HaRD on top of HDFS and conducted a performance evaluation on a 23-node dedicated heterogeneous cluster. Our results show that HaRD reduced execution time by up to 60% and 17% compared to Hadoop and WBRD, respectively.
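The core idea described above — when lowering a block's replication factor, retain replicas on the more capable nodes rather than deleting uniformly — can be sketched as follows. This is a minimal illustrative sketch, not the authors' HDFS implementation; the node names, capability scores, and the `delete_replicas` helper are hypothetical, and real HaRD operates inside the NameNode's block management logic.

```python
# Hypothetical sketch of heterogeneity-aware replica deletion:
# when reducing a block's replication factor, drop replicas from
# the least capable nodes first, so powerful nodes keep more data
# and can serve more local tasks.

def delete_replicas(replica_nodes, capability, target_factor):
    """Return the nodes that keep a replica after deletion.

    replica_nodes: nodes currently holding a replica of the block
    capability:    dict mapping node -> relative processing power
    target_factor: desired replication factor after deletion
    """
    if target_factor >= len(replica_nodes):
        return list(replica_nodes)  # nothing to delete
    # Rank nodes by processing capability; keep replicas on the strongest.
    ranked = sorted(replica_nodes, key=lambda n: capability[n], reverse=True)
    return ranked[:target_factor]

# Illustrative 4-node cluster with two powerful and two weak nodes.
nodes = ["fast-1", "fast-2", "slow-1", "slow-2"]
power = {"fast-1": 4.0, "fast-2": 4.0, "slow-1": 1.0, "slow-2": 1.0}
print(delete_replicas(nodes, power, 2))  # ['fast-1', 'fast-2']
```

Hadoop's default scheme, by contrast, picks deletion targets without regard to node capability, which is what leads to the imbalance the paper measures.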

Updated: 2019-10-21