Big Data Research (IF 2.673), Pub Date: 2020-11-20, DOI: 10.1016/j.bdr.2020.100171
Hengxiang Xie; Yuhui Deng; Hao Feng; Lei Si
The explosive growth of data poses a major challenge for data storage and backup in data centers. Moreover, advances in mobile phone technology have made images one of the main ways of presenting information. Most images are compressed to the JPEG format, and image data accounts for a large part of this data growth. To reduce storage cost, data deduplication was proposed and has become a requisite component of backup systems. However, traditional deduplication techniques that operate on the binary stream are inefficient for JPEG files, because the image compression process destroys the redundancy these techniques rely on. This paper proposes PXDedup, a deduplication method for JPEG files that takes a new perspective: visual redundancy. Unlike traditional deduplication techniques, PXDedup first decompresses a JPEG file into an image and then partitions the image into chunks. Visually identical image chunks are regarded as redundant and are eliminated. In addition, PXDedup recompresses the unique image chunks before storing them on disk. This step further reduces the data size by exploiting the fact that a high-quality JPEG image can be recompressed with a slightly lower quality parameter without quality loss. Experimental results show that PXDedup achieves a good reduction ratio when backing up a dataset consisting of a large number of JPEG files, especially when the dataset includes many JPEG images with few visual differences.
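The chunk-level idea can be illustrated with a minimal Python sketch. It is not the PXDedup implementation: the abstract does not say how chunks are sized or compared, so this sketch assumes fixed-size square tiles and treats two tiles as "visually identical" only when their raw pixel values match exactly (checked through a SHA-256 fingerprint). JPEG decoding and the recompression step are left out, and small lists of grayscale values stand in for decoded images. The names `split_into_tiles`, `dedup_image`, and `store` are hypothetical.

```python
import hashlib


def split_into_tiles(pixels, tile):
    """Yield the raw bytes of each tile x tile block of a decoded grayscale image.

    `pixels` is a list of rows, each a list of ints in 0..255 (an assumed
    stand-in for the output of a JPEG decoder).
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = [pixels[y][left:left + tile]
                    for y in range(top, min(top + tile, height))]
            # Prefix the block shape so edge tiles of different shapes
            # can never produce the same byte string.
            header = f"{len(rows)}x{len(rows[0])}:".encode()
            yield header + b"".join(bytes(row) for row in rows)


def dedup_image(pixels, tile, store):
    """Add unseen tiles to `store` and return the image's recipe.

    The recipe is the ordered list of tile fingerprints needed to rebuild
    the image; `store` maps fingerprint -> tile bytes and is shared by all
    images in the backup.
    """
    recipe = []
    for data in split_into_tiles(pixels, tile):
        fingerprint = hashlib.sha256(data).hexdigest()
        store.setdefault(fingerprint, data)  # keep only the first copy
        recipe.append(fingerprint)
    return recipe


# Two 4x4 images that differ only in the bottom-right 2x2 tile.
image_a = [
    [10, 10, 20, 20],
    [10, 10, 20, 20],
    [30, 30, 40, 40],
    [30, 30, 40, 40],
]
image_b = [
    [10, 10, 20, 20],
    [10, 10, 20, 20],
    [30, 30, 50, 50],
    [30, 30, 50, 50],
]

store = {}
recipe_a = dedup_image(image_a, 2, store)
recipe_b = dedup_image(image_b, 2, store)

total_tiles = len(recipe_a) + len(recipe_b)
print(len(store), "unique tiles out of", total_tiles)  # 5 unique tiles out of 8
```

In this toy case the second image adds only one new tile to the store, which mirrors the abstract's observation that the benefit is largest when a dataset contains many images with few visual differences. Exact pixel matching is the strictest possible reading of "visually identical"; a real system might use a different chunking or similarity test.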