An explicit self-attention-based multimodality CNN in-loop filter for versatile video coding

Jia, Menghu; Gao, Yanbo; Li, Shuai; Yue, Jian; Ye, Mao

doi:10.1007/s11042-021-11214-2

1221: Deep Learning for Image/Video Compression and Visual Quality Assessment
Published: 24 July 2021

Volume 81, pages 42497–42511, (2022)
Multimedia Tools and Applications

Menghu Jia¹,
Yanbo Gao²,
Shuai Li ORCID: orcid.org/0000-0002-9938-0917²,
Jian Yue¹ &
…
Mao Ye¹

Abstract

The newest video coding standard, Versatile Video Coding (VVC), was published recently. While it greatly improves on the performance of the preceding High Efficiency Video Coding (HEVC) standard, blocking artifacts remain under its more flexible block partitioning structures. To reduce these blocking artifacts and improve the quality of the reconstructed video frame, this paper proposes an explicit self-attention-based multimodality convolutional neural network (CNN). Because the loss scales of different coding units (CUs) can differ considerably, the network adaptively adjusts the restoration of each CU according to the CU partition structure and the texture of the reconstructed video frame. The proposed method exploits the CU partition map by treating it as a separate modality and combining it with the attention mechanism. Moreover, the unfiltered reconstructed image is also used to enhance the attention branch, which forms an explicit self-attention model. A densely integrated multi-stage fusion is then developed, in which the attention branch is densely fused into the main filtering CNN to adaptively adjust the overall image recovery scale. A thorough analysis of the proposed method is provided, with an ablation study of each module. Experimental results show that the proposed method achieves state-of-the-art performance under the all intra (AI) configuration, with 7.24% BD-rate savings on average compared with the VVC reference software (VTM).
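
The abstract does not specify the network layers, so the following is only a toy NumPy sketch of the general idea, not the authors' CNN. It assumes a CU partition map that marks block boundaries with 1, and it replaces the learned attention branch with a fixed sigmoid over the partition map and the unfiltered frame. The function names, the rectangle layout `(top, left, height, width)`, and the weights `w_map`, `w_img`, and `bias` are all hypothetical choices made for illustration.

```python
import numpy as np


def cu_partition_map(height, width, cus):
    """Return a map with 1.0 on CU boundary pixels and 0.0 inside CUs.

    `cus` is a list of (top, left, cu_height, cu_width) rectangles.
    This layout is an assumption, not taken from the paper.
    """
    pmap = np.zeros((height, width), dtype=np.float32)
    for top, left, cu_h, cu_w in cus:
        bottom, right = top + cu_h - 1, left + cu_w - 1
        pmap[top, left:right + 1] = 1.0      # top edge
        pmap[bottom, left:right + 1] = 1.0   # bottom edge
        pmap[top:bottom + 1, left] = 1.0     # left edge
        pmap[top:bottom + 1, right] = 1.0    # right edge
    return pmap


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def attention_scaled_restoration(unfiltered, residual, pmap,
                                 w_map=2.0, w_img=1.0, bias=-1.0):
    """Scale a predicted residual per pixel with an attention map.

    The attention depends on two inputs: the partition map (a second
    modality) and the unfiltered frame itself. In a real model these
    weights would be learned convolutions; here they are fixed scalars
    chosen only for the demonstration.
    """
    attention = sigmoid(w_map * pmap + w_img * unfiltered + bias)
    restored = unfiltered + attention * residual
    return restored, attention


# Toy 8x8 frame split into four 4x4 CUs.
cus = [(0, 0, 4, 4), (0, 4, 4, 4), (4, 0, 4, 4), (4, 4, 4, 4)]
pmap = cu_partition_map(8, 8, cus)
unfiltered = np.full((8, 8), 0.5, dtype=np.float32)
residual = np.full((8, 8), 0.1, dtype=np.float32)
restored, attention = attention_scaled_restoration(unfiltered, residual, pmap)
```

In this toy setting, pixels on CU boundaries receive a larger attention weight than interior pixels, so the same residual produces a stronger correction where blocking artifacts are expected. The method described in the abstract fuses the attention branch into the main filtering CNN at multiple stages rather than applying it once at the output.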

Acknowledgements

This work was partly supported by the National Key R&D Program of China (2018YFE0203900), the National Natural Science Foundation of China (No. 61901083 and 62001092), Project of Quzhou Municipal Government (2020D011), and SDU QILU Young Scholars program.

Author information

Authors and Affiliations

University of Electronic Science and Technology of China, Chengdu, China
Menghu Jia, Jian Yue & Mao Ye
Shandong University, Shandong, China
Yanbo Gao & Shuai Li

Corresponding author

Correspondence to Yanbo Gao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Menghu Jia and Yanbo Gao are co-first authors.

About this article

Cite this article

Jia, M., Gao, Y., Li, S. et al. An explicit self-attention-based multimodality CNN in-loop filter for versatile video coding. Multimed Tools Appl 81, 42497–42511 (2022). https://doi.org/10.1007/s11042-021-11214-2

Received: 07 April 2021
Revised: 11 June 2021
Accepted: 01 July 2021
Published: 24 July 2021
Issue Date: December 2022
DOI: https://doi.org/10.1007/s11042-021-11214-2
