
I-ME: iterative model evolution for learning from weakly labeled images and videos

Original Paper · Published in Machine Vision and Applications (2020)

Abstract

A significant bottleneck in building large-scale systems for image and video categorization is the need for labeled data. Manual labeling effort can be avoided by exploiting the massive amount of data available on the web. However, such data are collected by searching on category names and are therefore likely to be noisy. In this study, our primary objective is to better utilize weakly labeled data without any manual intervention. To this end, we introduce a simple but effective method called “Iterative Model Evolution (I-ME)”, which discovers representative instances by eliminating irrelevant items so that the purified set can be used directly to train a model. In I-ME, elimination is guided by the scores of two logistic regressors whose models are refined over iterations. We first apply our method to recognizing complex human activities in images and videos, and then to a large-scale noisy web dataset, Clothing1M. Our results are comparable to or better than the presented baselines on the benchmark video datasets UCF-101, ActivityNet, and FCVID and the image dataset Action40. After purifying with I-ME, we retain only 40% of the noisy Clothing1M and train the DNN with less but more representative training data, without changing the network structure. The success of I-ME on top of deep features suggests that there is still room for improvement in exploiting large-scale weakly labeled data by mining a smaller but more distinctive subset, without increasing the complexity of the process.
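To make the abstract's description concrete, the following is a minimal sketch of the cross-scoring idea behind I-ME: two logistic regressors are trained on disjoint halves of the weakly labeled set, each scores the instances in the other half, and low-scoring instances are eliminated before the next iteration. This is an illustration under stated assumptions, not the authors' exact procedure; the function name `ime_purify`, the 0.5 score threshold, and the iteration count are hypothetical choices, not values from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ime_purify(X, y, n_iters=5, threshold=0.5, seed=0):
    """Iteratively shrink a weakly labeled set (X, y) by cross-scoring
    with two logistic regressors. A sketch of the I-ME idea; the
    threshold and iteration count are illustrative assumptions."""
    rng = np.random.default_rng(seed)
    keep = np.arange(len(X))  # indices still considered clean
    for _ in range(n_iters):
        # Randomly split the surviving instances into two halves.
        perm = rng.permutation(keep)
        half_a, half_b = perm[: len(perm) // 2], perm[len(perm) // 2:]
        model_a = LogisticRegression(max_iter=1000).fit(X[half_a], y[half_a])
        model_b = LogisticRegression(max_iter=1000).fit(X[half_b], y[half_b])

        def own_label_prob(model, idx):
            # Probability the model assigns to each instance's own
            # (weak) label; assumes every class occurs in both halves.
            proba = model.predict_proba(X[idx])
            cols = np.searchsorted(model.classes_, y[idx])
            return proba[np.arange(len(idx)), cols]

        # Score each half with the model trained on the *other* half.
        score_a = own_label_prob(model_b, half_a)
        score_b = own_label_prob(model_a, half_b)
        # Keep only instances whose cross-model score clears the threshold.
        keep = np.concatenate([half_a[score_a >= threshold],
                               half_b[score_b >= threshold]])
        if len(keep) == len(perm):  # nothing eliminated: converged
            break
    return keep
```

In use, `X` would hold deep features of the weakly labeled images or video clips and `y` their search-query labels; the surviving indices then feed the final classifier, e.g. `idx = ime_purify(X, y); clf.fit(X[idx], y[idx])`.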



Acknowledgements

This work was supported in part by TUBITAK under project 116E685 and by the Science Academy’s Young Scientist Award (BAGEP), given in 2015.

Author information


Correspondence to Ozge Yalcinkaya.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Yalcinkaya, O., Golge, E. & Duygulu, P. I-ME: iterative model evolution for learning from weakly labeled images and videos. Machine Vision and Applications 31, 40 (2020). https://doi.org/10.1007/s00138-020-01079-0

