Data Deduplication With Random Substitutions
IEEE Transactions on Information Theory (IF 2.2), Pub Date: 2022-05-25, DOI: 10.1109/tit.2022.3176778
Hao Lou, Farzad Farnoud

Data deduplication saves storage space by identifying and removing repeats in the data stream. Compared with traditional compression methods, data deduplication schemes are more computationally efficient and are thus widely used in large-scale storage systems. In this paper, we provide an information-theoretic analysis of the performance of deduplication algorithms on data streams in which repeats are not exact. We introduce a source model that incorporates probabilistic substitutions: each symbol in a repeated string is substituted with a given edit probability. We study deduplication algorithms under both the fixed-length scheme and the variable-length scheme. The fixed-length deduplication algorithm is shown to be unsuitable for the proposed source model, as it does not take the edit probability into account. Two modifications are proposed and shown to perform within a constant factor of the optimal for a specific class of source models, given knowledge of the model parameters. We also study the conventional variable-length deduplication algorithm and show that as the source entropy becomes smaller, the size of the compressed string vanishes relative to the length of the uncompressed string, yielding high compression ratios.
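To make the source model concrete, below is a minimal sketch (not the paper's construction) of a stream with approximate repeats and a naive fixed-length deduplicator. The binary alphabet, the edit probability DELTA, the chunk length CHUNK, and all function names are illustrative assumptions rather than parameters taken from the paper; the sketch only shows why exact-match fixed-length deduplication degrades once repeats are perturbed by random substitutions.

```python
import random

ALPHABET = "01"   # binary alphabet (assumption, for simplicity)
DELTA = 0.01      # per-symbol substitution ("edit") probability (assumption)
CHUNK = 1024      # fixed chunk length used by the deduplicator (assumption)

def approximate_repeat(block: str) -> str:
    """Copy `block`, substituting each symbol with probability DELTA."""
    return "".join(
        random.choice(ALPHABET) if random.random() < DELTA else c
        for c in block
    )

def generate_source(num_repeats: int, block_len: int) -> str:
    """Emit one random block followed by noisy (approximate) repeats of it."""
    block = "".join(random.choice(ALPHABET) for _ in range(block_len))
    return block + "".join(approximate_repeat(block) for _ in range(num_repeats))

def fixed_length_dedup(data: str):
    """Split `data` into CHUNK-sized pieces and store each distinct piece once.

    The compressed representation is the list of unique chunks plus one
    dictionary index per position in the stream.
    """
    dictionary, index_of, pointers = [], {}, []
    for i in range(0, len(data), CHUNK):
        piece = data[i:i + CHUNK]
        if piece not in index_of:
            index_of[piece] = len(dictionary)
            dictionary.append(piece)
        pointers.append(index_of[piece])
    return dictionary, pointers

if __name__ == "__main__":
    random.seed(0)
    data = generate_source(num_repeats=100, block_len=4096)
    dictionary, pointers = fixed_length_dedup(data)
    # Because DELTA > 0, almost every chunk of a repeat differs somewhere
    # from the original block, so exact matching stores it again in full.
    print(f"{len(data)} symbols -> {len(dictionary)} unique chunks "
          f"out of {len(pointers)} total")
```

With the assumed parameters, each 1024-symbol chunk of a repeat contains several substitutions in expectation, so exact-match lookups fail and the dictionary grows almost linearly with the stream; this is the failure mode that motivates the paper's edit-aware modifications and its analysis of variable-length (content-defined) schemes.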
