
VOSTR: Video Object Segmentation via Transferable Representations

Published in: International Journal of Computer Vision

Abstract

To learn video object segmentation models, conventional methods require a large amount of pixel-wise ground-truth annotations. However, collecting such supervised data is time-consuming and labor-intensive. In this paper, we exploit existing annotations in source images and transfer this visual information to segment videos containing unseen object categories. Without using any annotations in the target video, we propose a method that jointly mines useful segments and learns feature representations that better adapt to the target frames. The entire process is decomposed into three tasks: (1) refining the responses with fully-connected CRFs, (2) solving a submodular function to select object-like segments, and (3) learning a CNN model with a transferable module that adapts seen categories in the source domain to the unseen target video. We present an iterative update scheme among the three tasks to self-learn the final solution for object segmentation. Experimental results on numerous benchmark datasets demonstrate that the proposed method performs favorably against state-of-the-art algorithms.
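The second task above, selecting object-like segments by maximizing a submodular function, can be illustrated with a generic greedy routine. The abstract does not specify the paper's exact objective, so the facility-location function below, along with the `similarity` matrix it consumes, is only an assumed stand-in; greedy selection on any monotone submodular objective of this shape carries the classic (1 − 1/e) approximation guarantee.

```python
import numpy as np

def greedy_submodular_select(similarity, k):
    """Greedily pick k segment indices maximizing a facility-location
    objective F(S) = sum_i max_{j in S} similarity[i, j].

    F is monotone submodular, so the greedy rule (always add the segment
    with the largest marginal gain) gives a (1 - 1/e)-approximation.
    """
    n = similarity.shape[0]
    selected = []
    coverage = np.zeros(n)  # best similarity each segment gets from the chosen set
    for _ in range(k):
        # Marginal gain of adding candidate j: how much it raises coverage overall.
        gains = np.maximum(similarity - coverage[:, None], 0.0).sum(axis=0)
        gains[selected] = -np.inf  # never re-pick an already chosen segment
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, similarity[:, j])
    return selected
```

In this reading, each row/column of `similarity` corresponds to a candidate segment mined from the video, and the selected subset serves as the pseudo ground truth that the CNN is then trained on in the next iteration of the update scheme.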



References

  • Caelles, S., Maninis, K. K., Pont-Tuset, J., Leal-Taixé, L., Cremers, D., & Gool, L. V. (2017). One-shot video object segmentation. In CVPR.

  • Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2016). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915.

  • Chen, Y., Pont-Tuset, J., Montes, A., & Gool, L. V. (2018a). Blazingly fast video object segmentation with pixel-wise metric learning. In CVPR.

  • Chen, Y. W., Tsai, Y. H., Yang, C. Y., Lin, Y. Y., & Yang, M. H. (2018b). Unseen object segmentation in videos via transferable representations. In ACCV.

  • Cheng, J., Tsai, Y. H., Wang, S., & Yang, M. H. (2017). Segflow: Joint learning for video object segmentation and optical flow. In ICCV.

  • Cheng, J., Tsai, Y. H., Hung, W. C., Wang, S., & Yang, M. H. (2018). Fast and accurate online video object segmentation via tracking parts. In CVPR.

  • Everingham, M., Gool, L. J. V., Williams, C. K. I., Winn, J. M., & Zisserman, A. (2010). The pascal visual object classes (VOC) challenge. IJCV, 88(2), 303–338.


  • Faktor, A., & Irani, M. (2014). Video segmentation by non-local consensus voting. In BMVC.

  • Ganin, Y., & Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In ICML.

  • Gopalan, R., Li, R., & Chellappa, R. (2011). Domain adaptation for object recognition: An unsupervised approach. In ICCV.

  • Grundmann, M., Kwatra, V., Han, M., & Essa, I. (2010). Efficient hierarchical graph-based video segmentation. In CVPR.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

  • Hoffman, J., Guadarrama, S., Tzeng, E. S., Hu, R., Donahue, J., Girshick, R., Darrell, T., & Saenko, K. (2014). LSDA: Large scale detection through adaptation. In NIPS.

  • Hong, S., Oh, J., Lee, H., Han, B. (2016). Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In CVPR.

  • Hu, R., Dollár, P., He, K., Darrell, T., & Girshick, R. (2018). Learning to segment every thing. In CVPR.

  • Jain, S., Xiong, B., & Grauman, K. (2017). FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In CVPR.

  • Jain, S. D., & Grauman, K. (2014). Supervoxel-consistent foreground propagation in video. In ECCV.

  • Khoreva, A., Perazzi, F., Benenson, R., Schiele, B., & Sorkine-Hornung, A. (2017). Learning video object segmentation from static images. In CVPR.

  • Koh, Y. J., & Kim, C. S. (2017). Primary object segmentation in videos based on region augmentation and reduction. In CVPR.

  • Krähenbühl, P., & Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS.

  • Lazic, N., Givoni, I., Frey, B., & Aarabi, P. (2009). Floss: Facility location for subspace segmentation. In ICCV.

  • Lee, Y. J., Kim, J., & Grauman, K. (2011). Key-segments for video object segmentation. In ICCV.

  • Li, F., Kim, T., Humayun, A., Tsai, D., & Rehg, J. M. (2013). Video segmentation by tracking many figure-ground segments. In ICCV.

  • Li, S., Seybold, B., Vorobyov, A., Lei, X., & Kuo, C. C. J. (2018). Unsupervised video object segmentation with motion-based bilateral networks. In ECCV.

  • Lim, J. J., Salakhutdinov, R., & Torralba A. (2011). Transfer learning by borrowing examples for multiclass object detection. In NIPS.

  • Liu, C. (2009). Beyond pixels: Exploring new representations and applications for motion analysis. PhD thesis, MIT.

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In CVPR.

  • Luo, Z., Zou, Y., Hoffman, J., & Fei-Fei, L. (2017). Label efficient learning of transferable representations across domains and tasks. In NIPS.

  • Märki, N., Perazzi, F., Wang, O., & Sorkine-Hornung, A. (2016). Bilateral space video segmentation. In CVPR.

  • Ochs, P., & Brox, T. (2011). Object segmentation in video: A hierarchical variational approach for turning point trajectories into dense regions. In ICCV.

  • Oh, S. W., Lee, J. Y., Sunkavalli, K., & Kim, S. J. (2018). Fast video object segmentation by reference-guided mask propagation. In CVPR.

  • Papazoglou, A., & Ferrari, V. (2013). Fast object segmentation in unconstrained video. In ICCV.

  • Patricia, N., & Caputo, B. (2014). Learning to learn, from transfer learning to domain adaptation: A unifying perspective. In CVPR.

  • Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word representation. In EMNLP (pp. 1532–1543).

  • Perazzi, F., Pont-Tuset, J., McWilliams, B., Gool L. V., Gross M., & Sorkine-Hornung, A. (2016). A benchmark dataset and evaluation methodology for video object segmentation. In CVPR.

  • Perazzi, F., Wang, O., Gross, M., & Sorkine-Hornung, A. (2015). Fully connected object proposals for video segmentation. In CVPR.

  • Prest, A., Leistner, C., Civera, J., Schmid, C., & Ferrari, V. (2012). Learning object class detectors from weakly annotated video. In CVPR.

  • Rochan, M., & Wang, Y. (2015). Weakly supervised localization of novel objects using appearance transfer. In CVPR.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. IJCV, 115(3), 211–252.


  • Saenko, K., Kulis, B., Fritz, M., & Darrell, T. (2010). Adapting visual category models to new domains. In ECCV.

  • Saleh, F. S., Aliakbarian, M. S., Salzmann, M., Petersson, L., & Alvarez, J. M. (2017). Bringing background into the foreground: Making all classes equal in weakly-supervised video semantic segmentation. In ICCV.

  • Shi, Z., Yang, Y., Hospedales, T. M., & Xiang, T. (2017). Weakly-supervised image annotation and segmentation with objects and attributes. PAMI, 39(12), 2525–2538.


  • Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.

  • Strand, R., Ciesielski, K. C., Malmberg, F., & Saha, P. K. (2013). The minimum barrier distance. CVIU, 117(4), 429–437.


  • Tang, K., Sukthankar, R., Yagnik, J., & Fei-Fei, L. (2013). Discriminative segment annotation in weakly labeled video. In CVPR.

  • Taylor, B., Karasev, V., & Soatto, S. (2015). Causal video object segmentation from persistence of occlusions. In CVPR.

  • Tokmakov, P., Alahari, K., & Schmid, C. (2017a). Learning motion patterns in videos. In CVPR.

  • Tokmakov, P., Alahari, K., & Schmid, C. (2017b). Learning video object segmentation with visual memory. In ICCV.

  • Tommasi, T., Orabona, F., & Caputo, B. (2014). Learning categories from few examples with multi model knowledge transfer. PAMI, 36, 928–941.


  • Tsai, Y. H., Hung, W. C., Schulter, S., Sohn, K., Yang, M. H., & Chandraker, M. (2018). Learning to adapt structured output space for semantic segmentation. In CVPR.

  • Tsai, Y. H., Yang, M. H., & Black, M. J. (2016a). Video segmentation via object flow. In CVPR.

  • Tsai, Y. H., Zhong, G., & Yang, M. H. (2016b). Semantic co-segmentation in videos. In ECCV.

  • Yan, Y., Xu, C., Cai, D., & Corso, J. J. (2017). Weakly supervised actor-action segmentation via robust multi-task ranking. In CVPR.

  • Yang, L., Wang, Y., Xiong, X., Yang, J., & Katsaggelos, A. K. (2018). Efficient video object segmentation via network modulation. In CVPR.

  • Zhang, D., Yang, L., Meng, D., Xu, D., & Han, J. (2017). SPFTN: A self-paced fine-tuning network for segmenting objects in weakly labelled videos. In CVPR.

  • Zhang, J., Sclaroff, S., Lin, Z., Shen, X., Price, B., & Mech, R. (2015a). Minimum barrier salient object detection at 80 fps. In ICCV.

  • Zhang, Y., Chen, X., Li, J., Wang, C., & Xia, C. (2015b). Semantic object segmentation via detection in weakly labeled video. In CVPR.

  • Zhong, G., Tsai, Y. H., & Yang, M. H. (2016). Weakly-supervised video scene co-parsing. In ACCV.

  • Zhu, F., Jiang, Z., & Shao, L. (2014). Submodular object recognition. In CVPR.


Acknowledgements

Funding was provided by the Ministry of Science and Technology (Grant Nos. MOST 107-2628-E-001-005-MY3 and MOST 108-2634-F-007-009).

Author information

Corresponding author

Correspondence to Yen-Yu Lin.

Additional information

Communicated by C.V. Jawahar, Hongdong Li, Greg Mori, Konrad Schindler.



About this article


Cite this article

Chen, YW., Tsai, YH., Lin, YY. et al. VOSTR: Video Object Segmentation via Transferable Representations. Int J Comput Vis 128, 931–949 (2020). https://doi.org/10.1007/s11263-019-01224-x
