The Design of Fast Content-Defined Chunking for Data Deduplication based Storage Systems
IEEE Transactions on Parallel and Distributed Systems (IF 5.3) | Pub Date: 2020-09-01 | DOI: 10.1109/tpds.2020.2984632
Wen Xia, Xiangyu Zou, Hong Jiang, Yukun Zhou, Chuanyi Liu, Dan Feng, Yu Hua, Yuchong Hu, Yucheng Zhang

Content-Defined Chunking (CDC) has been playing a key role in data deduplication systems in recent years due to its high redundancy-detection ability. However, existing CDC-based approaches introduce heavy CPU overhead because they determine chunk cut-points by computing and judging a rolling hash of the data stream byte by byte. In this article, we propose FastCDC, a fast and efficient Content-Defined Chunking approach for data deduplication-based storage systems. The key idea behind FastCDC is the combined use of five techniques: a Gear-based fast rolling hash; a simplified and enhanced Gear hash judgment; skipping sub-minimum chunk cut-points; normalizing the chunk-size distribution into a small specified region to counter the decrease in deduplication ratio caused by cut-point skipping; and, last but not least, rolling two bytes at a time to further speed up CDC. Our evaluation results show that, by combining these five techniques, FastCDC is 3-12X faster than state-of-the-art CDC approaches, while achieving a deduplication ratio nearly the same as, and sometimes even higher than, that of the classic Rabin-based CDC. In addition, our study of the deduplication throughput of FastCDC-based Destor (an open-source deduplication project) indicates that FastCDC achieves 1.2-3.0X higher throughput than Destor built on state-of-the-art chunkers.
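To make the mechanics described in the abstract concrete, below is a minimal C sketch of a FastCDC-style chunker combining four of the five techniques: the Gear rolling hash, the simplified zero-bit hash judgment, sub-minimum cut-point skipping, and two-mask chunk-size normalization. The constants (MIN_SIZE, NORMAL_SIZE, MAX_SIZE), the two masks, and the seeded gear-table fill are illustrative assumptions rather than the paper's published values, and the two-bytes-per-iteration optimization is omitted for brevity.

```c
/* Minimal FastCDC-style chunking sketch (illustrative, not the paper's code). */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define MIN_SIZE    2048      /* cut-points below this offset are skipped        */
#define NORMAL_SIZE 8192      /* target "normalized" chunk size                  */
#define MAX_SIZE    65536     /* hard upper bound forces a cut                   */

/* Two masks of different weight implement chunk-size normalization:
 * a stricter mask (more 1-bits) before NORMAL_SIZE, a looser one after. */
#define MASK_S 0xFFFE000000000000ULL  /* 15 high one-bits: harder to match */
#define MASK_L 0xFFE0000000000000ULL  /* 11 high one-bits: easier to match */

static uint64_t gear[256];    /* 256 random 64-bit values (the Gear table) */

static void init_gear(void)
{
    /* A real deduplication system must use a FIXED, precomputed table of
     * random 64-bit constants so chunk boundaries are reproducible across
     * runs and machines; this seeded PRNG fill is only for illustration. */
    srand(0x3f84d5b5);
    for (int i = 0; i < 256; i++) {
        uint64_t v = 0;
        for (int j = 0; j < 4; j++)          /* rand() may yield only 15 bits */
            v = (v << 16) ^ (uint64_t)(rand() & 0xFFFF);
        gear[i] = v;
    }
}

/* Return the length of the next chunk starting at data[0]. */
static size_t fastcdc_cut(const uint8_t *data, size_t len)
{
    uint64_t fp = 0;
    size_t i = MIN_SIZE;                     /* skip sub-minimum cut-points */

    if (len <= MIN_SIZE)
        return len;
    if (len > MAX_SIZE)
        len = MAX_SIZE;

    size_t normal = len < NORMAL_SIZE ? len : NORMAL_SIZE;

    for (; i < normal; i++) {
        fp = (fp << 1) + gear[data[i]];      /* Gear: one shift + one add per byte */
        if (!(fp & MASK_S))                  /* strict judgment before NORMAL_SIZE */
            return i;
    }
    for (; i < len; i++) {
        fp = (fp << 1) + gear[data[i]];
        if (!(fp & MASK_L))                  /* loose judgment after NORMAL_SIZE   */
            return i;
    }
    return i;                                /* no cut-point found: cut at len     */
}

int main(void)
{
    static uint8_t buf[1 << 20];
    for (size_t i = 0; i < sizeof(buf); i++) /* synthetic pseudo-random input */
        buf[i] = (uint8_t)((i * 2654435761u) >> 13);

    init_gear();
    size_t off = 0, n = 0;
    while (off < sizeof(buf)) {
        off += fastcdc_cut(buf + off, sizeof(buf) - off);
        n++;
    }
    printf("produced %zu chunks, average size %zu bytes\n", n, sizeof(buf) / n);
    return 0;
}
```

The zero-bit test `!(fp & mask)` is the simplified judgment: a cut-point is declared when all masked bits of the fingerprint are zero, so the expected chunk size is set by the number of 1-bits in the mask. Switching from the strict mask to the loose mask at NORMAL_SIZE is what squeezes the chunk-size distribution toward that size, compensating for the deduplication-ratio loss introduced by skipping the first MIN_SIZE bytes.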

Updated: 2020-09-01