Scalable Hash From Triplet Loss Feature Aggregation For Video De-duplication

https://doi.org/10.1016/j.jvcir.2020.102908

Highlights

  • A triplet loss network built on the Visual Geometry Group (VGG) architecture is trained and achieves outstanding performance.

  • The Fisher vector is applied for feature aggregation.

  • A binary tree is employed to generate the triplets.

  • A scalable binary hash is used to trade off performance against bit rate.

Abstract

The producing, sharing and consuming life cycle of video content creates a massive number of duplicated video segments, owing to variable-bit-rate representations and fragmentation during playback. The storage and communication inefficiency of these duplicates motivates researchers in both academia and industry to develop computationally efficient video de-duplication solutions for storage and CDN providers. Moreover, the increasing demand for high resolution and quality aggravates the burden on cluster storage and the already constrained bandwidth resources. Hence, video de-duplication in storage and transmission is becoming an important feature for video cloud storage and Content Delivery Network (CDN) service providers. Despite the necessity of optimizing multimedia data de-duplication, it is a challenging task: we should match as many duplicated videos as possible without removing any video by mistake. Current video de-duplication schemes mostly rely on URL-based solutions, which cannot handle non-cacheable content such as video, where the same piece of content may carry entirely different URL identification and fragmentation, and where different quality representations further complicate the problem. In this paper, we propose a novel content-based video segment identification scheme that is invariant to the underlying codec and operational bit rates. It computes robust features from a triplet loss deep learning network that captures the invariance of the same content under different coding tools and strategies, while a scalable hashing solution is developed based on Fisher vector aggregation of the convolutional features from the triplet loss network. Our simulation results demonstrate a substantial improvement in large-scale video repository de-duplication compared with state-of-the-art methods.

Introduction

Modern dynamic adaptive video streaming methods such as MPEG-DASH [1], Apple HLS [2] and Microsoft Smooth Streaming [3] have a great impact on how content providers store and serve media content in the cloud, such as on a content delivery network (CDN). Over-the-top (OTT) content providers are also pushing subscription-based video on demand (VoD) services that offer streaming on television. The media content creation, sharing and consumption process generates many duplicates that are not necessarily identical at the bit-stream level. Media content de-duplication is one motivating use case: if a content identification scheme can identify duplicates in network caches in core networks and edge nodes, then traffic can be localized and bandwidth saved. This creates challenges for existing Content Delivery Network (CDN) and storage de-duplication schemes such as those based on MD5 [4] hashing of file chunks. New compact, rate-agnostic and coding-scheme-agnostic content identification and hashing solutions are needed to characterize media segments across different representations with totally different bit streams. Scalable and robust signatures for media content that support de-duplication at fine spatio-temporal segment granularity are important to reap the full benefits of storage de-duplication.

The massive volume of multimedia data is therefore pushing forward the paradigm of effective storage on cluster servers. Fig. 1 depicts how content at various resolutions and quantization parameters (REQP) is consumed by highly diversified consumer platforms. In this paper we define a version as a particular combination of resolution and quantization parameter, namely an REQP. In the current media content storage scheme, the storage side has to hold media content for every REQP, which is error-prone and not cost-effective. If users request versions (REQPs) of videos from the server while the content delivery network (CDN) ignores that many of them are duplicates, the pressure that video delivery places on the network and on storage becomes very large. Hence, how to retrieve and remove duplicated versions of videos is an essential task for researchers.

Although leveraging a video de-duplication [5] scheme is both necessary and promising, the marginal performance gains reported so far show that it is difficult to develop. First, multimedia data on cloud clusters and CDNs is a valuable product of industry and users, so removing any video is subject to extremely strict requirements; the system must therefore achieve both high precision and high recall. In particular, we focus on the true positive rate (TPR) at a false positive rate of 0, because we cannot tolerate an erroneous judgment that accidentally deletes content. Second, if the algorithm is not precise and efficient, the sheer quantity of videos makes recognizing and matching video identities very time-consuming; a high-latency method cannot satisfy the real-time requirements of the social-media era.
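For clarity, the operating point referred to above can be stated with the standard detection-metric definitions; the decision threshold τ on the hash distance is our notation, not the paper's:

```latex
\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad
\mathrm{FPR} = \frac{\mathrm{FP}}{\mathrm{FP} + \mathrm{TN}}, \qquad
\mathrm{TPR}@\mathrm{FPR{=}0} \;=\; \max_{\tau}\ \bigl\{\, \mathrm{TPR}(\tau) \;:\; \mathrm{FPR}(\tau) = 0 \,\bigr\}.
```

In words, among all thresholds that never flag a non-duplicate as a duplicate, the metric reports the largest fraction of true duplicates that can still be removed.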

To alleviate the problems stated above, existing video de-duplication methods fall into two groups, depending on the comparison domain. The first group performs video de-duplication directly in the pixel or frequency domain, exploiting the spatial correlation within a frame or the temporal correlation within a sequence and deciding by comparing pixel information. The second group replaces pixels with a hashing representation; the most representative work derives the hash from deep learning features. Although this approach obtains some performance improvement from deep learning, its cross-entropy loss function is in essence unsuitable for the video de-duplication task, and the lack of a proper dataset makes its reported de-duplication results less convincing.

Therefore, we propose a novel deep-learning-based scheme to de-duplicate replicated videos in the cluster. Our method comprises two parts: an offline training model and an online aggregation model. In the offline training model, we use a triplets dataset to train a VGG11 network with an embedded triplet loss function. To acquire hard and valuable training triplets, we apply a binary tree that partitions the samples according to their attributes. Afterwards, we use the trained triplet VGG11 [6] model to fit a set of Principal Component Analysis (PCA) [7] models and Gaussian mixture models (GMM) [8]. In the online aggregation model, we first aggregate a Fisher vector [9] using the trained triplet VGG11, PCA and GMM models. We then binarize the Fisher vector at different bit lengths to obtain a scalable hash code, which is a compact and effective representation for video de-duplication.

We proposed a de-duplication method in our previous work [10]. In this paper, we propose a novel deep-learning-based scheme for de-duplication, providing additional motivation, analysis, experimental results and comparisons with related work. In addition, to validate the efficiency of our algorithm, we carry out further ablation studies. Our method comprises both an offline training model and an online aggregation model:

  • Offline training model: use a triplets dataset to train a Visual Geometry Group (VGG) network with an embedded triplet loss function, acquiring hard and valuable training triplets by applying a binary tree that partitions the samples according to their attributes. The trained triplet VGG11 model is then used to fit a set of Principal Component Analysis (PCA) models and Gaussian mixture models (GMM).

  • Online aggregation model: aggregate Fisher vectors (FV) from the features of the trained triplet VGG11, PCA and GMM models obtained offline, and binarize them into hash codes. A minimal sketch of the offline stage is given below; the aggregation-and-hashing stage is sketched after the contribution list.
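The following is a minimal sketch of the offline stage, assuming a PyTorch implementation: a VGG11 backbone fine-tuned with a triplet margin loss, whose pooled convolutional features are later modeled with PCA and a GMM. The embedding dimension, margin, and optimizer settings are illustrative assumptions, not values reported in the paper.

```python
# Sketch of offline triplet training on a VGG11 backbone (illustrative values).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg11

class TripletVGG11(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = vgg11()                    # VGG11 convolutional backbone
        self.features = backbone.features     # conv feature extractor
        self.pool = nn.AdaptiveAvgPool2d(1)   # global pooling -> 512-d vector
        self.embed = nn.Linear(512, embed_dim)

    def forward(self, x):
        f = self.pool(self.features(x)).flatten(1)
        return F.normalize(self.embed(f), dim=1)   # unit-norm embedding

model = TripletVGG11()
criterion = nn.TripletMarginLoss(margin=0.2)       # pulls anchor toward positive, away from negative
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

# One illustrative step on a dummy (anchor, positive, negative) batch;
# in practice the triplets come from the binary-tree mining step.
anchor = torch.randn(8, 3, 224, 224)
positive = torch.randn(8, 3, 224, 224)
negative = torch.randn(8, 3, 224, 224)

loss = criterion(model(anchor), model(positive), model(negative))
optimizer.zero_grad()
loss.backward()
loss_value = loss.item()
optimizer.step()
```

After training, the pooled convolutional features of the training frames would be reduced with PCA and fitted with a Gaussian mixture model, whose parameters drive the Fisher vector aggregation sketched after the contribution list below.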

Our contributions toward video de-duplication are summarized below:

  • (1)

    We combine the triplet loss with a Visual Geometry Group (VGG) deep learning network, trained on a large media dataset, to derive features with outstanding performance. A triplet-loss-based network can learn convolutional features that are invariant to coding methods and bit rates.

  • (2)

    We propose applying the Fisher vector to the features for feature aggregation. The proposed algorithm extracts Fisher vectors from the outputs of the VGG network trained with the triplet loss function. The Fisher vector exhibits a powerful ability to express the main features of a video frame.

  • (3)

    In particular, we propose employing a binary tree to generate the triplets, boosting the performance of the triplet-loss-based VGG network.

  • (4)

    We also extend the extraction algorithm to generate a scalable binary hash, which offers different trade-offs according to different bit-rate requirements (the aggregation-and-hashing step is sketched after this list).
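As a concrete illustration of contributions (2) and (4), the sketch below aggregates PCA-reduced frame features into a Fisher vector under a diagonal-covariance GMM and then binarizes it. The gradients follow the standard improved Fisher vector formulation; truncating the sign pattern to the first n_bits dimensions is only one simple way to make the code scalable and is our assumption, not necessarily the paper's scheme.

```python
# Hedged sketch of the online stage: Fisher vector aggregation + scalable binarization.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_vector(desc, gmm):
    """desc: (T, D) PCA-reduced descriptors; gmm: fitted diagonal-covariance GMM."""
    T, _ = desc.shape
    gamma = gmm.predict_proba(desc)                               # (T, K) posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (desc[:, None, :] - mu[None]) / np.sqrt(var)[None]     # (T, K, D)
    g_mu = (gamma[..., None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_var = (gamma[..., None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                        # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                      # L2 normalization

def scalable_hash(fv, n_bits):
    """Keep the signs of the first n_bits dimensions as a binary code (illustrative choice)."""
    return (fv[:n_bits] > 0).astype(np.uint8)

# Illustrative usage with random stand-ins for triplet-VGG11 pooled features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 512))
pca = PCA(n_components=64).fit(feats)
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(pca.transform(feats))

fv = fisher_vector(pca.transform(feats[:32]), gmm)                # one video segment
code = scalable_hash(fv, n_bits=256)
```

Codes of different lengths produced this way can then be compared with Hamming distance, trading matching accuracy against the bit budget.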

The experimental results show that the proposed binary-tree embedded triplet loss network combined with a scalable hash from Fisher vectors (BTF) outperforms the cross-entropy [11] loss with PCA (CP) approach across the scalable hash lengths.

The remainder of this paper is organized as follows. Related work is reviewed in Section 2. We elaborate on the principle of embedding the triplet loss function into the VGG network and on the overall network structure in Section 3. In Section 4, the binary-tree algorithm that produces triplets with similar variance attributes is introduced in detail. We report experiments on a large-scale video dataset and present the full process and results in Section 5. We conclude the paper in Section 6.

Section snippets

Related work

As mentioned in Section 1, video de-duplication work can be divided into two categories. The first consists of traditional methods that compare information in the pixel or frequency domain. The second consists of deep-learning-based approaches that extract convolutional features as matching evidence.

Among conventional approaches, Katiyar et al. [13] used a two-phase video comparison scheme for localizing a short clip within a long video. Paisitkriangkrai et al. [14] defined a new heuristic

Triplet loss network for binary hashing model

The overall framework of the proposed scalable hash scheme is illustrated in Fig. 2. It consists of two components: (a) triplet loss network feature representation generation in Section 3.1; (b) Fisher vector (FV) feature aggregation for generating the scalable hash in Section 3.2.

Triplets generation

In this section, we introduce how the binary tree divides the dataset and generates the triplets. The training samples are described in Section 4.1, and the binary-tree based generation process is introduced in Section 4.2.
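The binary-tree mining step can be illustrated with a speculative sketch under our own assumptions: samples are recursively split on the median of a variance-like attribute so that leaves group samples with similar statistics; anchors and positives are two versions (REQPs) of the same content, and negatives are drawn from the same leaf so they share similar attributes and are therefore hard. The attribute, leaf size, and sampling policy are illustrative choices, not the paper's settings.

```python
# Speculative sketch of binary-tree triplet mining (assumed attribute and leaf size).
import numpy as np

def build_leaves(samples, attr, max_leaf=64):
    """Recursively split sample indices by the median of attr[i]."""
    if len(samples) <= max_leaf:
        return [samples]
    median = np.median([attr[i] for i in samples])
    left = [i for i in samples if attr[i] <= median]
    right = [i for i in samples if attr[i] > median]
    if not left or not right:              # all attributes equal: stop splitting
        return [samples]
    return build_leaves(left, attr, max_leaf) + build_leaves(right, attr, max_leaf)

def mine_triplets(leaves, content_id, rng):
    """Form (anchor, positive, negative) index triplets inside each leaf."""
    triplets = []
    for leaf in leaves:
        by_content = {}
        for i in leaf:
            by_content.setdefault(content_id[i], []).append(i)
        for cid, idxs in by_content.items():
            others = [i for i in leaf if content_id[i] != cid]
            if len(idxs) < 2 or not others:
                continue
            a, p = rng.choice(idxs, size=2, replace=False)   # two versions of same content
            n = rng.choice(others)                           # hard negative from same leaf
            triplets.append((a, p, n))
    return triplets

# Illustrative usage with random stand-ins for per-sample variance and content ids.
rng = np.random.default_rng(0)
attr = rng.random(1000)                     # e.g. luminance variance per sample
content_id = rng.integers(0, 100, size=1000)
leaves = build_leaves(list(range(1000)), attr)
triplets = mine_triplets(leaves, content_id, rng)
```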

Experimental results

In this section, we first present the experimental results of the overall framework in Section 5.1. We then show the influence of the various aggregation parameters in Section 5.2. In Section 5.3, we illustrate the improvements of the proposed algorithm with a few subjective examples.

Conclusion

The prosperous development of multimedia big data production, transmission and consumption occupies massive memory and storage across all kinds of devices, network systems, and cloud data clusters. Improving the theory and algorithms for recognizing multimedia duplicates at every layer is an essential and urgent topic for transmitting and caching media big data quickly and efficiently. In this paper, we propose a distinct video de-duplication framework involving a triplet loss

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The work is partially supported by a grant from NSF under award 1747751.

References (41)

  • Jia, W. et al.

    Triplet loss feature aggregation for scalable hash

    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    (2020)
  • Rubinstein, Reuven Y. et al.

    The Cross-Entropy Method: A Unified Approach to Combinatorial Optimization, Monte-Carlo Simulation and Machine Learning

    (2013)
  • Di Rienzo, Julio Alejandro et al.

    A multiple-comparisons method based on the distribution of the root node distance of a binary tree

    J. Agric. Biol. Environ. Stat.

    (2002)
  • Katiyar, Atul et al.

    ViDeDup: An application-aware framework for video de-duplication

  • Paisitkriangkrai, Sakrapee et al.

    Scalable clip-based near-duplicate video detection with ordinal measure

  • Spencer Greene, Transparent caching of repeated video content in a network, in: Google Patents, US Patent...
  • Zheng, Yifeng et al.

    Enabling encrypted cloud media center with secure deduplication

  • Zheng, Yifeng et al.

    Toward encrypted cloud media center with secure deduplication

    IEEE Trans. Multimedia

    (2017)
  • Rashid, Fatema et al.

    Proof of storage for video deduplication in the cloud

  • Rashid, Fatema et al.

    A secure video deduplication scheme in cloud storage environments using H.264 compression
