A feature-based intelligent deduplication compression system with extreme resemblance detection
Connection Science ( IF 3.2 ) Pub Date : 2020-12-21 , DOI: 10.1080/09540091.2020.1862058
Xiaotong Wu 1 , Jiaquan Gao 1 , Genlin Ji 1 , Taotao Wu 2 , Yuan Tian 3 , Najla Al-Nabhan 4

ABSTRACT

With the rapid development of various computing paradigms, the amount of data is increasing quickly, which incurs a huge storage overhead. However, existing data deduplication techniques do not make full use of similarity detection to improve storage efficiency and data transmission rate. In this paper, we study the problem of utilising duplicate and resemblance detection techniques to further compress data. We first present the framework of FIDCS-ERD, a feature-based intelligent deduplication compression system with extreme resemblance detection. We also introduce the main components and the detailed workflow of our compression system. We propose a content-defined chunking algorithm for duplicate detection and a Bloom filter-based resemblance detection algorithm. FIDCS-ERD implements intelligent file chunking and fast duplicate and resemblance detection. Through extensive experiments on real datasets, we demonstrate that FIDCS-ERD achieves better compression and more accurate resemblance detection than existing approaches.
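The abstract names two building blocks, content-defined chunking and Bloom filter-based resemblance detection, without giving their details. The sketch below is a generic illustration of both ideas and is not the FIDCS-ERD algorithm. The shift-add rolling hash, the boundary mask, the chunk size limits, the Bloom filter parameters, and the names `cdc_chunks`, `BloomFilter`, and `resemblance` are all assumptions made for this example.

```python
import hashlib


def cdc_chunks(data: bytes, mask: int = 0x3F,
               min_size: int = 16, max_size: int = 256) -> list[bytes]:
    """Cut data where a rolling hash of recent bytes matches a pattern.

    Boundaries depend on content rather than on fixed offsets, so an
    insertion tends to disturb only the chunks near it. (Illustrative
    parameters, not taken from the paper.)
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shift-add hash: the low bits tested by `mask` depend only on
        # the last few bytes, which acts as a small sliding window.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing chunk may be short
    return chunks


class BloomFilter:
    """Fixed-size bit set with no false negatives and rare false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: bytes):
        # Derive several bit positions by salting one cryptographic hash.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item: bytes) -> bool:
        return all((self.bits >> p) & 1 for p in self._positions(item))


def resemblance(a: bytes, b: bytes) -> float:
    """Fraction of b's chunk fingerprints that the filter built from a reports."""
    bf = BloomFilter()
    for chunk in cdc_chunks(a):
        bf.add(hashlib.sha1(chunk).digest())
    chunks_b = cdc_chunks(b)
    if not chunks_b:
        return 0.0
    hits = sum(hashlib.sha1(c).digest() in bf for c in chunks_b)
    return hits / len(chunks_b)
```

In this sketch, a chunk whose fingerprint is reported by the filter is only a candidate match, because a Bloom filter can return false positives. A real deduplication system would confirm candidates against an exact fingerprint index before discarding data.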




Updated: 2020-12-21