Neural texture transfer assisted video coding with adaptive up-sampling

https://doi.org/10.1016/j.image.2022.116754

Highlights

  • We introduce reference-based SR into down/up-sampling-based video coding, where the target and reference images are not required to be texture-aligned, as they are in existing methods.

  • We propose an adaptive group of pictures (GOP) method to automatically decide the sampling scheme for each frame.

  • The neural texture transfer model for reference-based SR produces realistic up-sampled frames at the decoder.

Abstract

Deep learning techniques have been extensively investigated for the purpose of further increasing the efficiency of traditional video compression. Some deep learning techniques for down/up-sampling-based video coding were found to be especially effective when the bandwidth or storage is limited. Existing works mainly differ in the super-resolution models used. Some works simply use a single image super-resolution model, ignoring the rich information in the correlation between video frames, while others explore the correlation between frames by simply concatenating the features across adjacent frames. This, however, may fail when the textures are not well aligned. In this paper, we propose to utilize neural texture transfer which exploits the semantic correlation between frames and is able to explore the correlated information even when the textures are not aligned. Meanwhile, an adaptive group of pictures (GOP) method is proposed to automatically decide whether a frame should be down-sampled or not. Experimental results show that the proposed method outperforms the standard HEVC and state-of-the-art methods under different compression configurations. When compared to standard HEVC, the BD-rate (PSNR) and BD-rate (SSIM) of the proposed method are up to -19.1% and -26.5%, respectively.

Introduction

According to recent statistics, video traffic will account for about 82% of all Internet traffic by 2022 [1]. Video has become one of the major media for information transmission and communication. At the same time, new video types, including Ultra-High Definition (UHD), Virtual Reality (VR), Wide Color Gamut (WCG), and High Dynamic Range (HDR), are emerging. These new video types provide a better user experience, but at the cost of dramatically increased data volumes. Meanwhile, the number of video cameras in use, such as surveillance cameras, laptops, and smartphones, has grown rapidly in recent years. Consequently, the total amount of global video data doubles every two years, which has become a bottleneck for data processing, storage, and transmission [2]. Therefore, more advanced video compression techniques, which will support more efficient storage and transmission of videos, are of vital importance.

During the last three decades, the development of traditional statistical video compression methods [3], [4], [5], [6], [7] has gradually saturated, and most recent endeavors have turned to deep learning models [8], [9], [10], which have proved their capacity to discover knowledge from massive unstructured data and provide data-driven predictions. Deep learning has the potential to provide new opportunities for further advancing video coding technologies. Inspired by the recent advances in deep learning, many works have been proposed to leverage deep learning in video compression and have achieved significant improvements [11], [12]. Among these, Convolutional Neural Network (CNN) based video enhancement was proposed as a post-processing procedure at the decoder to improve the perceptual quality of the reconstructed video [13], [14], [15]. To further increase the compression ratio, some works propose to down-sample the video prior to encoding and up-sample the decoded video using a CNN-based video super-resolution model, which is known as down/up-sampling-based coding [16], [17], [18]. As reported in [19], [20], the coding of low-resolution video can perform both subjectively and objectively better than the direct coding of the full-resolution version at low bit rates.

Prior work in down/up-sampling-based coding is mainly inspired by CNN-based single image super-resolution (SR), which outperforms traditional methods by a large margin [21]. For example, a CNN-based up-sampling scheme is proposed in [16] for intra-frame coding. A video coding scheme is proposed in [22], where a group of pictures (GOP) is entirely down-sampled and compressed, and each frame is individually up-sampled using trained CNN models. Although these methods have demonstrated the potential of down/up-sampling-based coding with CNNs for improving compression performance, they have not exploited the correlation between neighboring frames.

To this end, several attempts have been made to use the correlation between neighboring frames to further enhance the performance [17], [23], [24]. Lin et al. [25] proposed to adaptively divide frames into keyframes (KFs) and nonkey frames (NKFs), which are encoded at the original resolution and at a reduced resolution, respectively. At the decoder, NKFs are reconstructed with the corresponding motion estimation block in KFs using a CNN. In addition to frame-level down/up-sampling-based coding, block-level down/up-sampling-based coding has also been proposed to improve the performance [17], [26]. In [17], each block in the P/B frame can either be compressed at the original resolution or down-sampled and compressed at a lower resolution. At the decoder, low-resolution blocks are up-sampled by the CNN models. The block-based scheme provides the flexibility to deal with the spatially variant texture and motion characteristics in natural videos. However, block-level processing is computationally intensive.

In this paper, frame-level down/up-sampling-based coding is employed to reduce the computational complexity. Meanwhile, a neural texture transfer-based frame up-sampling is proposed to cope with the spatially variant texture and motion characteristics in natural videos. In the existing frame-level schemes [23], [25], features of the low-resolution frame and the full-resolution reference frame are simply concatenated, so only co-located features can be fused. However, the best matching feature is not necessarily co-located, which leads to sub-optimal results. With neural texture transfer [27], by contrast, multi-level matching is conducted in the neural space rather than the raw pixel space, adaptively transferring texture from the reference images to the target image. This matching scheme facilitates semantic texture transfer, which provides robust results even when irrelevant reference images are provided. To achieve optimized performance, the neural texture transfer model was fine-tuned on HEVC-compressed video sequences. The proposed approach has been compared to both the HEVC anchor [28] and the block-level scheme [17], with results demonstrating consistent improvements on the HEVC common test sequences [29] over different QP ranges. Specifically, the main contributions of our work are as follows:

  • We propose to use the semantic texture transfer for down/up-sampling-based video coding, which exploits the semantic correlation between the reference frames and the target frame, leading to significant enhancement of the frame-level SR performance.

  • We propose a non-uniform compression scheme, where frames are adaptively compressed at the original or reduced resolution. Thus, frames encoded at the original resolution can be used as reference frames for the restoration of other frames.

  • Our model outperforms the state-of-the-art block-level down/up-sampling-based coding scheme while requiring a lower computational complexity.
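The core idea behind the matching step of neural texture transfer described above can be illustrated with a minimal, single-level sketch. The feature maps, patch size, and similarity measure below are illustrative stand-ins; the actual model of [27] performs this matching at multiple levels of a pretrained feature space.

```python
import numpy as np

def match_and_swap(target_feat, ref_feat, patch=3):
    """For each patch of the target feature map, find the most similar
    patch anywhere in the reference feature map (cosine similarity of
    flattened patches) and transfer it. This is a single-level sketch;
    neural texture transfer repeats the matching at several feature
    scales. Feature maps are (channels, height, width) arrays."""
    C, H, W = target_feat.shape
    # Collect all reference patches, L2-normalized, with their positions.
    ref_patches, positions = [], []
    for i in range(ref_feat.shape[1] - patch + 1):
        for j in range(ref_feat.shape[2] - patch + 1):
            p = ref_feat[:, i:i + patch, j:j + patch].ravel()
            ref_patches.append(p / (np.linalg.norm(p) + 1e-8))
            positions.append((i, j))
    ref_patches = np.stack(ref_patches)  # (num_patches, C * patch * patch)

    out = np.zeros_like(target_feat)
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            q = target_feat[:, i:i + patch, j:j + patch].ravel()
            q = q / (np.linalg.norm(q) + 1e-8)
            # Best match may be anywhere in the reference, not co-located.
            best = int(np.argmax(ref_patches @ q))
            bi, bj = positions[best]
            out[:, i:i + patch, j:j + patch] = \
                ref_feat[:, bi:bi + patch, bj:bj + patch]
    return out
```

The search over all reference positions is what distinguishes this scheme from simple feature concatenation, which can only fuse co-located features.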

The rest of the paper is organized as follows. In Section 2, deep learning-based up-sampling methods and down/up-sampling-based video coding methods are reviewed. Section 3 introduces the proposed neural texture transfer assisted frame-level down/up-sampling-based coding scheme, the CNN architecture of the neural texture transfer, and the training strategy. Section 4 describes the experiments and results, and Section 5 presents ablation studies. Finally, Section 6 presents the conclusion and future work.

Section snippets

Related work

In this section, we briefly overview the most related works, including deep-learning-based image/video super-resolution and down/up-sampling-based video coding methods.

Methodology

As shown in Fig. 1, our proposed method first divides the video sequence adaptively into groups of pictures, referred to as adaptive GOPs. The first frame of each GOP (denoted as RF) is kept at full resolution, while the remaining frames in the GOP (denoted as NF) are down-sampled to a reduced resolution (denoted as NF_LR) using bicubic down-sampling (×1/2). During encoding, RF is encoded at full resolution; its reconstructed version RF_D is then bicubic-down-sampled (×1/2) to assist the encoding of NF_LR.
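The GOP partitioning step can be sketched as follows. The per-frame similarity scores and the threshold are hypothetical placeholders for the paper's actual adaptive GOP criterion; the sketch only shows the resulting role assignment (full-resolution RF vs. down-sampled NF).

```python
def partition_adaptive_gops(similarities, threshold=0.9):
    """Assign each frame a role: 'RF' (kept at full resolution, starts a
    new GOP) or 'NF' (bicubic-down-sampled x1/2 before encoding, restored
    at the decoder). similarities[i] is a score in [0, 1] between frame i
    and the RF of the current GOP; when it drops below `threshold`, frame i
    starts a new GOP as its RF. Both the score and the threshold are
    illustrative stand-ins for the adaptive decision in the paper."""
    roles = []
    for i, s in enumerate(similarities):
        if i == 0 or s < threshold:
            roles.append("RF")  # full resolution, serves as reference
        else:
            roles.append("NF")  # down-sampled, up-sampled at the decoder
    return roles
```

For example, a sudden scene change (a low similarity score) ends the current GOP and promotes the new frame to a full-resolution reference, so every NF always has a textually related RF available for texture transfer.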

Experimental settings

The proposed scheme is implemented based on the reference software of HEVC (HM 12.1). The HEVC common test sequences [29] are used to evaluate the performance of the proposed method, with various resolutions, known as Class A, B, C, D, E. None of these sequences was used in training the SR model in Fig. 3. The Low-Delay P
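The BD-rate figures used to compare codecs in this paper follow the standard Bjøntegaard-delta computation, sketched below with numpy. The rate/PSNR points in the test are hypothetical; the sketch fits a cubic polynomial to log-rate as a function of PSNR and integrates the gap over the overlapping quality range.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard-delta rate: average percentage bitrate difference
    between two rate-distortion curves at equal PSNR. Negative values
    mean the test codec saves bitrate relative to the anchor. Each input
    is a list of four or more RD points."""
    lr_a = np.log(np.asarray(rate_anchor, dtype=float))
    lr_t = np.log(np.asarray(rate_test, dtype=float))
    # Cubic fit of log-rate as a function of PSNR for each curve.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    # Integrate over the overlapping PSNR range only.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0
```

A uniform 10% rate saving at every quality level, for instance, yields a BD-rate of −10%.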

Ablation studies

Conclusion

A neural texture transfer-assisted video coding scheme with adaptive up-sampling is proposed in this paper. This scheme adaptively decides whether a frame should be down-sampled or not. At the decoder, the down-sampled frames are restored by exploring their correlations with the frames that are not down-sampled, using neural texture transfer in a multi-scale manner. Experimental results show that, compared with HEVC and the state-of-the-art method [17], our model provides better performance in

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant 62002172, and Grant 61972323; and in part by the Natural Science Foundation of the Jiangsu Higher Education Institutions of China under Grant 19KJB510040; and in part by the Nanjing Scientific Innovation Foundation for the Returned Overseas Chinese Scholars under Grant R2019LZ04; and in part by the Jiangsu Provincial Double-Innovation Doctor Program; and in part by the Open Project Program of the

References (51)

  • V. Cisco, Cisco visual networking index: Forecast and trends, 2017–2022, White Paper 1,...
  • Van X.H. et al., HEVC backward compatible scalability: A low encoding complexity distributed video coding based approach, Signal Process., Image Commun. (2015)
  • Liu D. et al., Deep learning-based video coding: A review and a case study, ACM Comput. Surv. (2020)
  • Ma S. et al., Image and video compression with neural networks: A review, IEEE Trans. Circuits Syst. Video Technol. (2019)
  • Zhang F. et al., Enhancing VVC through CNN-based post-processing
  • Guan Z. et al., MFQE 2.0: A new approach for multi-frame quality enhancement on compressed video, IEEE Trans. Pattern Anal. Mach. Intell. (2019)
  • Yu L. et al., Convolutional neural network for intermediate view enhancement in multiview streaming, IEEE Trans. Multimed. (2017)
  • Li Y. et al., Convolutional neural network-based block up-sampling for intra frame coding, IEEE Trans. Circuits Syst. Video Technol. (2017)
  • Lin J. et al., Convolutional neural network-based block up-sampling for HEVC, IEEE Trans. Circuits Syst. Video Technol. (2018)
  • Li Y. et al., Learning a convolutional neural network for image compact-resolution, IEEE Trans. Image Process. (2018)
  • Bruckstein A.M. et al., Down-scaling for better transform compression, IEEE Trans. Image Process. (2003)
  • Takahashi K. et al., Rate-distortion analysis of super-resolution image/video decoding
  • Dong C. et al., Image super-resolution using deep convolutional networks, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
  • Bull D. et al., Description of SDR video coding technology proposal by University of Bristol (JVET-J0031)
  • Kappeler A. et al., Video super-resolution with convolutional neural networks, IEEE Trans. Comput. Imaging (2016)