
I-ME: iterative model evolution for learning from weakly labeled images and videos

Original Paper · Published in Machine Vision and Applications (2020)

Abstract

A significant bottleneck in building large-scale systems for image and video categorization is the need for labeled data. Manual labeling effort can be avoided by exploiting the massive amount of data available on the web. However, such data are collected by searching on category names and are therefore likely to be noisy. In this study, our primary objective is to better utilize weakly labeled data without any manual intervention. To this end, we introduce a simple but effective method called “Iterative Model Evolution (I-ME)”, which discovers representative instances by eliminating irrelevant items so that the purified set can be used directly to train a model. In I-ME, elimination is guided by the scores of two logistic regressors whose models are refined over iterations. We first apply our method to recognizing complex human activities in images and videos, and then to a large-scale noisy web dataset, Clothing1M. Our results are comparable to or better than the presented baselines on the benchmark video datasets UCF-101, ActivityNet, and FCVID and the image dataset Action40. After purifying with I-ME, we retain only 40% of the noisy Clothing1M and train the DNN with less but more representative training data, without changing the network structure. The success of I-ME on top of deep features suggests that there is still room for improvement in exploiting large-scale weakly labeled data by mining a smaller but more distinctive subset, without increasing the complexity of the process.
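To make the abstract's description concrete, the following is a minimal sketch of the cross-scoring idea behind I-ME: two logistic regressors are trained on disjoint halves of the weakly labeled set, each scores the instances in the other half, and low-scoring instances are eliminated before the next iteration. This is an illustration under stated assumptions, not the authors' exact procedure; the function name `ime_purify`, the 0.5 score threshold, and the iteration count are hypothetical choices, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ime_purify(X, y, n_iters=5, threshold=0.5, seed=0):
    """Iteratively shrink a weakly labeled set (X, y) by cross-scoring
    with two logistic regressors. A sketch of the I-ME idea; the
    threshold and iteration count are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    keep = np.arange(len(X))  # indices still considered clean
    for _ in range(n_iters):
        # Randomly split the surviving instances into two halves.
        perm = rng.permutation(keep)
        half_a, half_b = perm[: len(perm) // 2], perm[len(perm) // 2:]
        model_a = LogisticRegression(max_iter=1000).fit(X[half_a], y[half_a])
        model_b = LogisticRegression(max_iter=1000).fit(X[half_b], y[half_b])

        def own_label_prob(model, idx):
            # Probability the model assigns to each instance's own
            # (weak) label; assumes every class occurs in both halves.
            proba = model.predict_proba(X[idx])
            cols = np.searchsorted(model.classes_, y[idx])
            return proba[np.arange(len(idx)), cols]

        # Score each half with the model trained on the *other* half.
        score_a = own_label_prob(model_b, half_a)
        score_b = own_label_prob(model_a, half_b)
        # Keep only instances whose cross-model score clears the threshold.
        keep = np.concatenate([half_a[score_a >= threshold],
                               half_b[score_b >= threshold]])
        if len(keep) == len(perm):  # nothing eliminated: converged
            break
    return keep
```

In use, `X` would hold deep features of the weakly labeled images or video clips and `y` their search-query labels; the surviving indices then feed the final classifier, e.g. `idx = ime_purify(X, y); clf.fit(X[idx], y[idx])`.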



Acknowledgements

This work was supported in part by TUBITAK under project 116E685 and by the Science Academy’s Young Scientist Award (BAGEP), given in 2015.

Author information


Correspondence to Ozge Yalcinkaya.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Yalcinkaya, O., Golge, E. & Duygulu, P. I-ME: iterative model evolution for learning from weakly labeled images and videos. Machine Vision and Applications 31, 40 (2020). https://doi.org/10.1007/s00138-020-01079-0

