Discriminative Dictionary Design for Action Classification in Still Images and Videos

Cognitive Computation

Abstract

In this paper, we address the problem of action recognition from still images and videos. Traditional local features such as SIFT and STIP pose two potential problems: 1) they are not evenly distributed across the different entities of a given category, and 2) many such features are not exclusive to the visual concept the entities represent. To build a dictionary that accounts for both issues, we propose a novel discriminative method for identifying robust, category-specific local features that maximize class separability. Specifically, we cast the selection of potent local descriptors as a filter-based feature selection problem that ranks the local features per category by a novel measure of distinctiveness. The underlying visual entities are then represented with respect to the learned dictionary, after which actions are classified with a random forest model and the predictions are refined by label propagation. The framework is validated on action recognition datasets of still images (Stanford-40) and videos (UCF-50), where it achieves recognition accuracies of 51.2% and 66.7%, respectively. Compared with representative methods from the literature, our approach performs better, which demonstrates the effectiveness of the adaptive ranking methodology presented in this work.

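The abstract outlines a four-stage pipeline: rank local descriptors per category by a distinctiveness measure, keep the top-ranked descriptors as dictionary atoms, encode each image against that dictionary, then classify with a random forest and refine the predictions by label propagation. The paper's actual distinctiveness measure is not reproduced here, so the minimal sketch below (using NumPy and scikit-learn) substitutes a simple k-nearest-neighbour class-purity score for it; every function name (rank_descriptors, build_dictionary, encode) and the toy data are hypothetical illustrations, not the authors' implementation.

```python
# Hedged sketch of the pipeline described in the abstract; the real paper's
# distinctiveness measure is unknown here, so k-NN class purity stands in.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors
from sklearn.semi_supervised import LabelSpreading


def rank_descriptors(descs, owner_class, target_class, k=10):
    """Score each descriptor of `target_class` by how exclusively its
    k nearest neighbours (over all classes) share that class."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(descs)
    idx = np.where(owner_class == target_class)[0]
    _, neigh = nn.kneighbors(descs[idx])
    # drop the query point itself (column 0), then measure class purity
    purity = (owner_class[neigh[:, 1:]] == target_class).mean(axis=1)
    return idx[np.argsort(purity)[::-1]]  # most distinctive first


def build_dictionary(descs, owner_class, atoms_per_class=50):
    """Keep the top-ranked descriptors of every class as dictionary atoms."""
    atoms = [descs[rank_descriptors(descs, owner_class, c)[:atoms_per_class]]
             for c in np.unique(owner_class)]
    return np.vstack(atoms)


def encode(image_descs, dictionary):
    """Hard-assignment bag-of-words histogram over the dictionary atoms."""
    nn = NearestNeighbors(n_neighbors=1).fit(dictionary)
    _, assign = nn.kneighbors(image_descs)
    hist = np.bincount(assign.ravel(), minlength=len(dictionary))
    return hist / max(hist.sum(), 1)


# --- toy usage (random data only; in practice the pooled descriptors
# --- come from the training images themselves) ------------------------
rng = np.random.default_rng(0)
descs = rng.normal(size=(600, 64))    # pooled local descriptors
owner = rng.integers(0, 3, size=600)  # class of each descriptor's image
D = build_dictionary(descs, owner)

train_X = np.stack([encode(rng.normal(size=(80, 64)), D) for _ in range(30)])
train_y = rng.integers(0, 3, size=30)
clf = RandomForestClassifier(n_estimators=100).fit(train_X, train_y)

test_X = np.stack([encode(rng.normal(size=(80, 64)), D) for _ in range(10)])
initial = clf.predict(test_X)  # random-forest predictions

# label-propagation refinement: spread labels over the joint code graph,
# letting the soft clamping (alpha) revise uncertain test predictions
lp = LabelSpreading(kernel='rbf', alpha=0.2).fit(
    np.vstack([train_X, test_X]),
    np.concatenate([train_y, initial]))
refined = lp.transduction_[len(train_X):]  # refined test predictions
```

The refinement step treats the random forest outputs as soft initial labels: LabelSpreading's alpha parameter controls how far the graph structure of the encoded features is allowed to overturn them, which is one plausible reading of "classification followed by label propagation refinement".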

Author information

Corresponding author

Correspondence to Soujanya Poria.

Ethics declarations

Conflicts of Interest

The authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.

About this article

Cite this article

Roy, A., Banerjee, B., Hussain, A. et al. Discriminative Dictionary Design for Action Classification in Still Images and Videos. Cogn Comput 13, 698–708 (2021). https://doi.org/10.1007/s12559-021-09851-8
