A temporal attention based appearance model for video object segmentation

Applied Intelligence

Abstract

Video object segmentation has attracted growing research attention because it is an important building block for numerous computer vision applications. Although many algorithms have advanced the field, open challenges remain: efficient and robust pipelines are needed to handle appearance changes and distraction from similar background objects. This paper proposes a novel neural network that integrates a temporal attention based appearance model with a boundary-aware loss. The appearance model fuses the appearance information of the first frame, the previous frame, and the current frame in feature space, which helps the proposed method learn a discriminative and robust target representation and avoid the drift problem of traditional propagation schemes. The network is trained with the boundary-aware loss, which yields more accurate segmentation results with clear boundaries. The proposed method is compared with several recent state-of-the-art algorithms on popular benchmark datasets. Comprehensive experiments show that it achieves favorable performance at a high frame rate.
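
The three-frame fusion described above can be pictured with a minimal sketch. The abstract does not give the attention formulation, so the scaled dot-product attention, the concatenation step, and the name `temporal_attention_fuse` are assumptions made for illustration, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def temporal_attention_fuse(first, previous, current):
    """Fuse per-location features of three frames (hypothetical scheme).

    Each argument is a list of feature vectors, one per spatial location.
    Every current-frame vector attends over all first-frame and
    previous-frame vectors, and the attended reference vector is
    concatenated to the current-frame vector.
    """
    references = first + previous          # keys and values (assumption)
    scale = math.sqrt(len(current[0]))     # standard dot-product scaling
    fused = []
    for query in current:
        weights = softmax([dot(query, ref) / scale for ref in references])
        attended = [
            sum(w * ref[d] for w, ref in zip(weights, references))
            for d in range(len(query))
        ]
        fused.append(query + attended)     # concatenate current + attended
    return fused


# Toy usage: one location per reference frame, two in the current frame.
fused = temporal_attention_fuse(
    first=[[1.0, 0.0]],
    previous=[[0.0, 1.0]],
    current=[[1.0, 0.0], [0.0, 1.0]],
)
```

In this toy setup each current-frame location draws most of its attended feature from the reference frame it resembles, which is the intuition behind using the first frame as a stable anchor against drift while the previous frame tracks recent appearance changes.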

Acknowledgements

This research is partially supported by the Beijing Natural Science Foundation (No. 4212025) and the National Natural Science Foundation of China (No. 61876018, No. 61976017).

Author information

Corresponding author

Correspondence to Weibin Liu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wang, H., Liu, W. & Xing, W. A temporal attention based appearance model for video object segmentation. Appl Intell 52, 2290–2300 (2022). https://doi.org/10.1007/s10489-021-02547-4

  • DOI: https://doi.org/10.1007/s10489-021-02547-4
