Scalable Hash From Triplet Loss Feature Aggregation For Video De-duplication
Introduction
Modern dynamic adaptive video streaming methods such as MPEG-DASH [1], Apple HLS [2] and Microsoft Smooth Streaming [3] have had a great impact on how content providers store and serve media in the cloud, for example through a content delivery network (CDN). Over-the-top (OTT) content providers are also pushing subscription-based video-on-demand (VoD) services that bring streaming to the television. The processes of creating, sharing and consuming media generate many duplicates that are not necessarily identical at the bit-stream level. De-duplication of media content is therefore a natural use case: if a content identification scheme can recognize duplicates in the network caches of core networks and edge nodes, traffic can be localized and bandwidth saved. This challenges existing CDN and storage de-duplication schemes, such as those based on MD5 [4] hashing of file chunks. New compact, rate-agnostic and coding-scheme-agnostic content identification and hashing solutions are needed to characterize media segments across different representations with entirely different bit streams. Scalable and robust signatures that support de-duplication at fine spatio-temporal segment granularity are essential to reap the full benefits of storage de-duplication.
The massive growth of multimedia data is pushing forward the paradigm of effective storage on cluster servers. Fig. 1 depicts how content at various resolutions and quantization parameters is consumed on highly diverse consumer platforms. In this paper we define a version as a combination of resolution and quantization parameter, abbreviated REQP. Under the current storage scheme, the server side has to hold every REQP version of the media content, which is error-prone and not cost-effective. If users request the same version (REQP) of a video from the server while the content delivery network (CDN) ignores the identical copies it already holds, the delivery and storage pressure on the network becomes substantial. Hence, retrieving and removing duplicated versions of videos is an essential research task.
Although a video de-duplication [5] scheme is both necessary and promising, even small performance improvements are difficult to achieve. First, the multimedia data on cloud clusters and CDNs is a valued product of industry and users, so removing any video is an extremely sensitive operation; the system must therefore achieve high precision and recall. In particular, we focus on the true positive rate (TPR) at a fixed low false positive rate, because a false match would cause content to be deleted by accident. Second, with a tremendous number of videos, an imprecise or inefficient algorithm costs the system too much time to recognize and match video identities; a high-delay method cannot satisfy the real-time requirements of the social media era.
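The evaluation criterion above can be made concrete with a short sketch: given similarity scores for duplicate and non-duplicate pairs, compute the TPR at the threshold that pins the FPR to a chosen value. This is a generic metric sketch, not the paper's exact protocol; the function name, target rate and score arrays are illustrative:

```python
import numpy as np

def tpr_at_fpr(scores_pos, scores_neg, target_fpr=0.01):
    """True-positive rate at the threshold where only `target_fpr`
    of the non-duplicate (negative) scores exceed the threshold."""
    # Threshold = quantile of negative scores so that exactly
    # target_fpr of negatives are (wrongly) accepted.
    thresh = np.quantile(scores_neg, 1.0 - target_fpr)
    return float(np.mean(scores_pos > thresh))
```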
To alleviate these problems, existing video de-duplication methods fall into two groups according to the domain in which content is compared. The first group performs de-duplication directly in the pixel or frequency domain, exploiting spatial correlation within a frame or temporal correlation across a sequence to make decisions from pixel information. The second group replaces pixels with a hashing representation; the most representative line of work derives the hash from deep learning features. Although deep learning brought some performance improvements, the cross-entropy loss function is in essence unsuitable for the video de-duplication task, and the lack of a dedicated dataset makes the reported de-duplication results unconvincing.
We therefore propose a novel deep-learning-based scheme to de-duplicate replicated videos in a cluster. Our method comprises two parts: an offline training model and an online aggregation model. In the offline stage, we use a triplet dataset to train a VGG11 [6] network with an embedded triplet loss function. To acquire hard and valuable training triplets, we apply a binary tree that partitions the samples according to their attributes. We then use the trained triplet VGG11 model to fit Principal Component Analysis (PCA) [7] models and Gaussian mixture models (GMM) [8]. In the online stage, we first aggregate a Fisher vector [9] using the trained triplet VGG11, PCA and GMM models, and then binarize the Fisher vector at different bit lengths to obtain a scalable hash code, a brief and effective representation for video de-duplication.
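The triplet loss at the heart of the offline stage can be sketched in plain NumPy. This is a minimal illustration of the standard hinge-form triplet loss on embeddings; the margin value is an assumption, not the paper's setting:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: pull the anchor toward the positive
    (same content, different REQP version) and push it away from the
    negative (different content) by at least `margin`."""
    d_ap = np.sum((anchor - positive) ** 2, axis=-1)  # anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2, axis=-1)  # anchor-negative distance
    return np.maximum(0.0, d_ap - d_an + margin).mean()
```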
We proposed a de-duplication method in our previous work [10]. In this paper, we extend it into a novel deep-learning-based de-duplication scheme, providing more motivation, analysis, experimental results and comparison with related work. In addition, to validate the efficiency of our algorithm, we conduct further ablation studies. Our method comprises an offline training model and an online aggregation model:
- Offline training model: use the triplet dataset to train a Visual Geometry Group (VGG) network with an embedded triplet loss function, acquiring hard and valuable training triplets by partitioning the samples with a binary tree according to their attributes. The trained triplet VGG11 model is then used to fit Principal Component Analysis (PCA) models and Gaussian mixture models (GMM).
- Online aggregation model: aggregate a Fisher vector (FV) with the trained triplet VGG11, PCA and GMM models obtained offline, then binarize it into the scalable hash.
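The online stage above can be sketched as follows: aggregate per-frame features into a simplified Fisher vector (gradients with respect to the GMM means only, a common reduction of the full FV encoding) and sign-binarize a prefix of it to get a hash of the desired length. All function names and the GMM parameters below are illustrative, not the paper's trained models:

```python
import numpy as np

def fisher_vector_means(X, means, covs, weights):
    """Simplified Fisher vector: gradient w.r.t. GMM means only
    (diagonal covariances), with power and L2 normalization."""
    diff = X[:, None, :] - means[None, :, :]                 # (N, K, D)
    # Log-likelihood of each sample under each diagonal Gaussian.
    log_p = -0.5 * np.sum(diff ** 2 / covs[None], axis=2) \
            - 0.5 * np.sum(np.log(2 * np.pi * covs), axis=1)[None]
    log_p = log_p + np.log(weights)[None]
    log_p -= log_p.max(axis=1, keepdims=True)                # stabilize
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)                # posteriors (N, K)
    n = X.shape[0]
    fv = (gamma[:, :, None] * diff / np.sqrt(covs)[None]).sum(axis=0)
    fv /= n * np.sqrt(weights)[:, None]
    fv = fv.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                   # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                 # L2 normalization

def binary_hash(fv, bits):
    """Scalable hash: sign-binarize the first `bits` FV dimensions."""
    return (fv[:bits] > 0).astype(np.uint8)
```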
Our contributions to video de-duplication are summarized as follows:
- (1) We combine a triplet loss with the Visual Geometry Group (VGG) deep learning network, pre-trained to strong performance on a huge media dataset, to derive the features. A triplet-loss-based network can learn convolutional features that are invariant to coding method and bit rate.
- (2) We propose applying the Fisher vector to these features for aggregation: the proposed algorithm extracts Fisher vectors from the outputs of the triplet-loss VGG. The Fisher vector has a powerful ability to express the main features of a video frame.
- (3) In particular, we propose employing a binary tree to obtain the triplets, which boosts the performance of the triplet-loss-based VGG network.
- (4) We further use the extraction algorithm to generate a scalable binary hash, which offers different trade-offs for different bit-rate requirements.
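The scalability in contribution (4) can be illustrated by how matching works: comparing only a prefix of the hash trades discrimination for rate. The bit lengths and distance bound below are illustrative, not the paper's settings:

```python
import numpy as np

def hamming(a, b):
    """Bitwise mismatch count between two binary hash arrays."""
    return int(np.count_nonzero(a != b))

def is_duplicate(h_query, h_stored, bits, max_dist):
    """Compare only the first `bits` of each scalable hash: fewer bits
    cost less storage/bandwidth, more bits give finer discrimination."""
    return hamming(h_query[:bits], h_stored[:bits]) <= max_dist
```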
The experimental results show that the proposed binary-tree embedded triplet loss network combined with the scalable hash from the Fisher vector (BTF) outperforms the cross-entropy [11] loss with PCA (CP) approach across various hash lengths.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 elaborates the principle of embedding the triplet loss function into the VGG network and the overall network structure. Section 4 introduces in detail the binary-tree algorithm that produces triplets with similar variance attributes. Section 5 presents our experiments on a large-scale video dataset, including the full process and results. Section 6 concludes the paper.
Related work
As mentioned in Section 1, video de-duplication work can be divided into two categories. The first comprises traditional methods that compare information in the pixel or frequency domain. The other comprises deep-learning-based approaches that extract convolutional features as matching evidence.
Among conventional approaches, Katiyar et al. [13] used a two-phase video comparison scheme for localizing a short clip of frames within a long video. Paisitkriangkrai et al. [14] defined a new heuristic
Triplet loss network for binary hashing model
The overall framework of the proposed scalable hash scheme is illustrated in Fig. 2. It consists of two components: (a) triplet loss network feature representation generation in Section 3.1; (b) Fisher vector (FV) feature aggregation for generating the scalable hash in Section 3.2.
Triplets generation
In this section, we introduce how the binary tree divides the dataset and generates the triplets. The training samples are described in Section 4.1, and the binary-tree-based generation process in Section 4.2.
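The partitioning idea can be sketched as a recursive median split along the highest-variance feature dimension at each level, so that each leaf groups samples with similar variance attributes. This is an illustrative stand-in for the paper's attribute-based rule, not its exact criterion:

```python
import numpy as np

def binary_tree_partition(features, depth):
    """Recursively split sample indices at the median of their
    highest-variance dimension; leaves group samples with similar
    variance attributes, from which triplets can be mined."""
    groups = [np.arange(len(features))]
    for _ in range(depth):
        nxt = []
        for idx in groups:
            if len(idx) < 2:
                nxt.append(idx)
                continue
            sub = features[idx]
            d = np.argmax(sub.var(axis=0))       # most informative axis
            med = np.median(sub[:, d])
            left = idx[sub[:, d] <= med]
            right = idx[sub[:, d] > med]
            nxt.extend(g for g in (left, right) if len(g))
        groups = nxt
    return groups
```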
Experimental results
In this section, we will first introduce the experimental results of the overall framework in Section 5.1. Then we will show the influences of the various aggregation parameters in Section 5.2. In Section 5.3, we will illustrate the improvements of the proposed algorithm with a few subjective samples.
Conclusion
The prosperous growth of multimedia big data production, transmission and consumption has occupied massive memory and storage across all kinds of devices, network systems, and cloud data clusters. Improving the theory and algorithms for recognizing duplicated media at every layer is an essential and urgent topic for transmitting and caching media big data quickly and efficiently. In this paper, we propose a distinct video de-duplication framework involving a triplet loss network, binary-tree triplet generation, and scalable hashing from aggregated Fisher vectors.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
The work is partially supported by a grant from NSF under award 1747751.
References (41)
- Robust PCA via principal component pursuit: a review for a comparative evaluation in video surveillance, Comput. Vis. Image Underst., 2014.
- Revisiting the Fisher vector for fine-grained classification, Pattern Recognit. Lett., 2014.
- Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix, Heredity, 2005.
- The MPEG-DASH standard for multimedia streaming over the internet, IEEE MultiMedia, 2011.
- MPEG-DASH vs. Apple HLS vs. Microsoft Smooth Streaming vs. Adobe HDS, 2015.
- IIS Smooth Streaming technical overview, Microsoft Corp., 2009.
- The MD5 Message-Digest Algorithm, Technical Report, 1992.
- Real-time video copy-location detection in large-scale repositories, IEEE MultiMedia, 2011.
- Very deep convolutional networks for large-scale image recognition, 2014.
- Gaussian mixture models, Encycl. Biom., 2015.
- Triplet loss feature aggregation for scalable hash, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning.
- A multiple-comparisons method based on the distribution of the root node distance of a binary tree, J. Agric. Biol. Environ. Stat.
- VideDup: an application-aware framework for video de-duplication.
- Scalable clip-based near-duplicate video detection with ordinal measure.
- Enabling encrypted cloud media center with secure deduplication.
- Toward encrypted cloud media center with secure deduplication, IEEE Trans. Multimedia.
- Proof of storage for video deduplication in the cloud.
- A secure video deduplication scheme in cloud storage environments using H.264 compression.