
VOSTR: Video Object Segmentation via Transferable Representations

Published in: International Journal of Computer Vision

Abstract

To learn video object segmentation models, conventional methods require a large amount of pixel-wise ground-truth annotations. However, collecting such supervised data is time-consuming and labor-intensive. In this paper, we exploit existing annotations in source images and transfer this visual information to segment videos containing unseen object categories. Without using any annotations in the target video, we propose a method that jointly mines useful segments and learns feature representations that better adapt to the target frames. The entire process is decomposed into three tasks: (1) refining the responses with fully-connected CRFs, (2) solving a submodular function to select object-like segments, and (3) learning a CNN model with a transferable module that adapts seen categories in the source domain to the unseen target video. We present an iterative update scheme among the three tasks to self-learn the final solution for object segmentation. Experimental results on numerous benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art algorithms.
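The second task above, selecting object-like segments by maximizing a submodular function, can be illustrated with a generic greedy routine. The abstract does not specify the paper's exact objective, so the facility-location function below, along with the `similarity` matrix it consumes, is only an assumed stand-in; greedy selection on any monotone submodular objective of this shape carries the classic (1 − 1/e) approximation guarantee.

```python
import numpy as np

def greedy_submodular_select(similarity, k):
    """Greedily pick k segment indices maximizing a facility-location
    objective F(S) = sum_i max_{j in S} similarity[i, j].

    F is monotone submodular, so the greedy rule (always add the segment
    with the largest marginal gain) gives a (1 - 1/e)-approximation.
    """
    n = similarity.shape[0]
    selected = []
    coverage = np.zeros(n)  # best similarity each segment gets from the chosen set
    for _ in range(k):
        # Marginal gain of adding candidate j: how much it raises coverage overall.
        gains = np.maximum(similarity - coverage[:, None], 0.0).sum(axis=0)
        gains[selected] = -np.inf  # never re-pick an already chosen segment
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, similarity[:, j])
    return selected
```

In this reading, each row/column of `similarity` corresponds to a candidate segment mined from the video, and the selected subset serves as the pseudo ground truth that the CNN is then trained on in the next iteration of the update scheme.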



References

  • Caelles, S., Maninis, K. K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., & Gool, L. V. (2017). One-shot video object segmentation. In CVPR.

  • Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2016). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915.

  • Chen, Y., Pont-Tuset, J., Montes, A., & Gool, L. V. (2018a). Blazingly fast video object segmentation with pixel-wise metric learning. In CVPR.

  • Chen, Y. W., Tsai, Y. H., Yang, C. Y., Lin, Y. Y., & Yang, M. H. (2018b). Unseen object segmentation in videos via transferable representations. In ACCV.

  • Cheng, J., Tsai, Y. H., Wang, S., & Yang, M. H. (2017). Segflow: Joint learning for video object segmentation and optical flow. In ICCV.

  • Cheng, J., Tsai, Y. H., Hung, W. C., Wang, S., & Yang, M. H. (2018). Fast and accurate online video object segmentation via tracking parts. In CVPR.

  • Everingham, M., Gool, L. J. V., Williams, C. K. I., Winn, J. M., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. IJCV, 88(2), 303–338.


  • Faktor, A., & Irani, M. (2014). Video segmentation by non-local consensus voting. In BMVC.

  • Ganin, Y., & Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In ICML.

  • Gopalan, R., Li, R., & Chellappa, R. (2011). Domain adaptation for object recognition: An unsupervised approach. In ICCV.

  • Grundmann, M., Kwatra, V., Han, M., & Essa, I. (2010). Efficient hierarchical graph-based video segmentation. In CVPR.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

  • Hoffman, J., Guadarrama, S., Tzeng, E. S., Hu, R., Donahue, J., Girshick, R., Darrell, T., & Saenko, K. (2014). LSDA: Large scale detection through adaptation. In NIPS.

  • Hong, S., Oh, J., Lee, H., Han, B. (2016). Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In CVPR.

  • Hu, R., Dollár, P., He, K., Darrell, T., & Girshick, R. (2018). Learning to segment every thing. In CVPR.

  • Jain, S., Xiong, B., & Grauman, K. (2017). FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In CVPR.

  • Jain, S. D., & Grauman, K. (2014). Supervoxel-consistent foreground propagation in video. In ECCV.

  • Khoreva, A., Perazzi, F., Benenson, R., Schiele, B., & Sorkine-Hornung, A. (2017). Learning video object segmentation from static images. In CVPR.

  • Koh, Y. J., & Kim, C. S. (2017). Primary object segmentation in videos based on region augmentation and reduction. In CVPR.

  • Krähenbühl, P., & Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS.

  • Lazic, N., Givoni, I., Frey, B., & Aarabi, P. (2009). Floss: Facility location for subspace segmentation. In ICCV.

  • Lee, Y. J., Kim, J., & Grauman, K. (2011). Key-segments for video object segmentation. In ICCV.

  • Li, F., Kim, T., Humayun, A., Tsai, D., & Rehg, J. M. (2013). Video segmentation by tracking many figure-ground segments. In ICCV.

  • Li, S., Seybold, B., Vorobyov, A., Lei, X., & Kuo, C. C. J. (2018). Unsupervised video object segmentation with motion-based bilateral networks. In ECCV.

  • Lim, J. J., Salakhutdinov, R., & Torralba A. (2011). Transfer learning by borrowing examples for multiclass object detection. In NIPS.

  • Liu, C. (2009). Beyond pixels: Exploring new representations and applications for motion analysis. PhD thesis, MIT.

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR.

  • Luo, Z., Zou, Y., Hoffman, J., & Fei-Fei, L. (2017). Label efficient learning of transferable representations across domains and tasks. In NIPS.

  • Märki, N., Perazzi, F., Wang, O., & Sorkine-Hornung, A. (2016). Bilateral space video segmentation. In CVPR.

  • Ochs, P., & Brox, T. (2011). Object segmentation in video: A hierarchical variational approach for turning point trajectories into dense regions. In ICCV.

  • Oh, S. W., Lee, J. Y., Sunkavalli, K., & Kim, S. J. (2018). Fast video object segmentation by reference-guided mask propagation. In CVPR.

  • Papazoglou, A., & Ferrari, V. (2013). Fast object segmentation in unconstrained video. In ICCV.

  • Patricia, N., & Caputo, B. (2014). Learning to learn, from transfer learning to domain adaptation: A unifying perspective. In CVPR.

  • Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. In EMNLP (pp. 1532–1543).

  • Perazzi, F., Pont-Tuset, J., McWilliams, B., Gool L. V., Gross M., & Sorkine-Hornung, A. (2016). A benchmark dataset and evaluation methodology for video object segmentation. In CVPR.

  • Perazzi, F., Wang, O., Gross, M., & Sorkine-Hornung, A. (2015). Fully connected object proposals for video segmentation. In CVPR.

  • Prest, A., Leistner, C., Civera, J., Schmid, C., & Ferrari, V. (2012). Learning object class detectors from weakly annotated video. In CVPR.

  • Rochan, M., & Wang, Y. (2015). Weakly supervised localization of novel objects using appearance transfer. In CVPR.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. IJCV, 115(3), 211–252.


  • Saenko, K., Kulis, B., Fritz, M., & Darrell, T. (2010). Adapting visual category models to new domains. In ECCV.

  • Saleh, F. S., Aliakbarian, M. S., Salzmann, M., Petersson, L., & Alvarez, J. M. (2017). Bringing background into the foreground: Making all classes equal in weakly-supervised video semantic segmentation. In ICCV.

  • Shi, Z., Yang, Y., Hospedales, T. M., & Xiang, T. (2017). Weakly-supervised image annotation and segmentation with objects and attributes. PAMI, 39(12), 2525–2538.


  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.

  • Strand, R., Ciesielski, K. C., Malmberg, F., & Saha, P. K. (2013). The minimum barrier distance. CVIU, 117(4), 429–437.


  • Tang, K., Sukthankar, R., Yagnik, J., & Fei-Fei, L. (2013). Discriminative segment annotation in weakly labeled video. In CVPR.

  • Taylor, B., Karasev, V., & Soatto, S. (2015). Causal video object segmentation from persistence of occlusions. In CVPR.

  • Tokmakov, P., Alahari, K., & Schmid, C. (2017a). Learning motion patterns in videos. In CVPR.

  • Tokmakov, P., Alahari, K., & Schmid, C. (2017b). Learning video object segmentation with visual memory. In ICCV.

  • Tommasi, T., Orabona, F., & Caputo, B. (2014). Learning categories from few examples with multi model knowledge transfer. PAMI, 36, 928–941.


  • Tsai, Y. H., Hung, W. C., Schulter, S., Sohn, K., Yang, M. H., & Chandraker, M. (2018). Learning to adapt structured output space for semantic segmentation. In CVPR.

  • Tsai, Y. H., Yang, M. H., & Black, M. J. (2016a). Video segmentation via object flow. In CVPR.

  • Tsai, Y. H., Zhong, G., & Yang, M. H. (2016b). Semantic co-segmentation in videos. In ECCV.

  • Yan, Y., Xu, C., Cai, D., & Corso, J. J. (2017). Weakly supervised actor-action segmentation via robust multi-task ranking. In CVPR.

  • Yang, L., Wang, Y., Xiong, X., Yang, J., & Katsaggelos, A. K. (2018). Efficient video object segmentation via network modulation. In CVPR.

  • Zhang, D., Yang, L., Meng, D., Xu, D., & Han, J. (2017). SPFTN: A self-paced fine-tuning network for segmenting objects in weakly labelled videos. In CVPR.

  • Zhang, J., Sclaroff, S., Lin, Z., Shen, X., Price, B., & Mech, R. (2015a). Minimum barrier salient object detection at 80 fps. In ICCV.

  • Zhang, Y., Chen, X., Li, J., Wang, C., & Xia, C. (2015b). Semantic object segmentation via detection in weakly labeled video. In CVPR.

  • Zhong, G., Tsai, Y. H., & Yang, M. H. (2016). Weakly-supervised video scene co-parsing. In ACCV.

  • Zhu, F., Jiang, Z., & Shao, L. (2014). Submodular object recognition. In CVPR.


Acknowledgements

Funding was provided by the Ministry of Science and Technology (Grant Nos. MOST 107-2628-E-001-005-MY3 and MOST 108-2634-F-007-009).

Author information

Corresponding author

Correspondence to Yen-Yu Lin.

Additional information

Communicated by C.V. Jawahar, Hongdong Li, Greg Mori, Konrad Schindler.



About this article


Cite this article

Chen, YW., Tsai, YH., Lin, YY. et al. VOSTR: Video Object Segmentation via Transferable Representations. Int J Comput Vis 128, 931–949 (2020). https://doi.org/10.1007/s11263-019-01224-x
