A temporal attention based appearance model for video object segmentation

Wang, Hui; Liu, Weibin; Xing, Weiwei

doi:10.1007/s10489-021-02547-4

Published: 09 June 2021

Applied Intelligence, volume 52, pages 2290–2300 (2022)

Abstract

Video object segmentation has attracted growing research attention because it is an important building block for numerous computer vision applications. Although many algorithms have advanced the field, open challenges remain: efficient and robust pipelines are needed to handle appearance changes and distraction from similar background objects. This paper proposes a novel neural network that integrates a temporal attention based appearance model and a boundary-aware loss. The appearance model fuses the appearance information of the first frame, the previous frame, and the current frame in the feature space, which helps the proposed method learn a discriminative and robust target representation and avoid the drift problem of traditional propagation schemes. The network is trained with the boundary-aware loss, which yields more accurate segmentation results with clear boundaries. The proposed method is compared with several recent state-of-the-art algorithms on popular benchmark datasets. Comprehensive experiments show that it achieves favorable performance at a high frame rate.
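To make the idea of fusing first-frame, previous-frame, and current-frame appearance information concrete, the following is a minimal sketch of attention-weighted fusion at a single spatial location. It is not the network described in the paper, whose details are not available in this preview. It assumes generic scaled dot-product attention with the current-frame feature as the query, and the names `fuse_temporal`, `softmax`, and `dot` are hypothetical helpers introduced only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_temporal(first, previous, current):
    """Fuse feature vectors from three frames at one spatial location.

    Assumption (not taken from the paper): the current-frame feature is
    the query, each frame's feature is both key and value, and the
    weights come from scaled dot-product attention.
    """
    frames = [first, previous, current]
    scale = math.sqrt(len(current))
    # Similarity of the current feature to each frame's feature.
    scores = [dot(current, f) / scale for f in frames]
    weights = softmax(scores)
    # Weighted sum of the three feature vectors, channel by channel.
    fused = [
        sum(w * f[i] for w, f in zip(weights, frames))
        for i in range(len(current))
    ]
    return fused, weights


# Toy 4-channel features: the previous frame resembles the current one,
# while the first frame does not.
first = [1.0, 0.0, 0.0, 0.0]
previous = [0.0, 1.0, 0.0, 0.0]
current = [0.0, 1.0, 0.0, 0.0]
fused, weights = fuse_temporal(first, previous, current)
```

In this toy case the previous and current frames receive equal weight and the dissimilar first frame receives less, so the fused feature leans toward the recent appearance while still retaining some first-frame information. A real model would apply such fusion densely over learned feature maps.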


Acknowledgements

This research is partially supported by the Beijing Natural Science Foundation (No. 4212025) and the National Natural Science Foundation of China (No. 61876018, No. 61976017).

Author information

Authors and Affiliations

Institute of Information Science, Beijing Jiaotong University, 100044, Beijing, China
Hui Wang & Weibin Liu
School of Software Engineering, Beijing Jiaotong University, 100044, Beijing, China
Weiwei Xing

Corresponding author

Correspondence to Weibin Liu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, H., Liu, W. & Xing, W. A temporal attention based appearance model for video object segmentation. Appl Intell 52, 2290–2300 (2022). https://doi.org/10.1007/s10489-021-02547-4

Accepted: 19 May 2021
Published: 09 June 2021
Issue Date: January 2022
DOI: https://doi.org/10.1007/s10489-021-02547-4
