PXDedup: Deduplicating Massive Visually Identical JPEG Image Data
Big Data Research (IF 3.5), Pub Date: 2020-11-20, DOI: 10.1016/j.bdr.2020.100171
Hengxiang Xie, Yuhui Deng, Hao Feng, Lei Si

The explosive growth of data poses a major challenge for storage and backup in data centers. Moreover, mobile phone technology has made images one of the main ways of presenting information. Most images are compressed to the JPEG format, and image data accounts for a large part of the data growth. To reduce storage cost, data deduplication was proposed and has become a requisite component of backup systems. However, traditional deduplication techniques that operate on the binary stream are inefficient for JPEG files, because the image compression process breaks redundancy. This paper proposes PXDedup, a deduplication method for JPEG files built on a new perspective: visual redundancy. Unlike traditional deduplication techniques, PXDedup first decompresses a JPEG file to an image and then partitions the image into chunks. Visually identical image chunks are regarded as redundant and are eliminated. In addition, PXDedup recompresses unique image chunks before storing them on disk. This step further reduces the data size by exploiting the property that a high-quality JPEG image can be recompressed with a slightly lower quality parameter without quality loss. Experimental results show that PXDedup achieves a good reduction ratio when backing up a dataset consisting of a large number of JPEG files, especially when the dataset includes many JPEG images that have few visual differences.
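The chunk-level idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions the abstract does not confirm: the image is already decoded to 8-bit grayscale bytes, chunks are fixed-size tiles, "visually identical" is approximated by exact pixel equality detected with a SHA-256 fingerprint, and the JPEG decode and recompression steps are left out. The names `dedup_chunks` and `restore` are hypothetical.

```python
import hashlib


def dedup_chunks(pixels, width, height, tile=8, store=None):
    """Split a decoded image into tiles and keep one copy of each distinct tile.

    pixels: row-major 8-bit grayscale bytes of length width * height (assumed input).
    tile:   tile edge length in pixels (assumed fixed-size chunking, at most 255).
    store:  dict mapping fingerprint -> tile bytes; pass it again to share
            chunks across several images.
    Returns (recipe, store), where recipe lists one fingerprint per tile.
    """
    store = {} if store is None else store
    recipe = []
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            w = min(tile, width - x)   # edge tiles may be narrower
            h = min(tile, height - y)  # or shorter
            rows = [pixels[(y + r) * width + x:(y + r) * width + x + w]
                    for r in range(h)]
            # Prefix the tile with its dimensions so differently shaped
            # tiles never share a fingerprint.
            chunk = bytes([w, h]) + b"".join(rows)
            key = hashlib.sha256(chunk).hexdigest()
            store.setdefault(key, chunk)  # store only the first copy
            recipe.append(key)
    return recipe, store


def restore(recipe, store, width, height, tile=8):
    """Rebuild the pixel buffer from a recipe and the shared chunk store."""
    out = bytearray(width * height)
    i = 0
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            chunk = store[recipe[i]]
            i += 1
            w, h = chunk[0], chunk[1]
            for r in range(h):
                start = (y + r) * width + x
                out[start:start + w] = chunk[2 + r * w:2 + (r + 1) * w]
    return bytes(out)
```

In this sketch the recipe plays the role of per-file metadata and the store holds the unique chunks. A system along the lines the abstract describes would additionally decode each JPEG before chunking and recompress the unique chunks before writing them to disk.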




Updated: 2020-11-23