Abstract
In this paper, we address the problem of action recognition from still images and videos. Traditional local features such as SIFT and STIP pose two potential problems: 1) they are not evenly distributed across different instances of a given category, and 2) many of them are not exclusive to the visual concept those instances represent. To build a dictionary that accounts for both issues, we propose a novel discriminative method for identifying robust, category-specific local features that maximize class separability. Specifically, we cast the selection of potent local descriptors as a filtering-based feature selection problem, which ranks the local features per category according to a novel measure of distinctiveness. The underlying visual entities are then represented with respect to the learned dictionary, after which actions are classified with a random forest model and the predictions are refined by label propagation. The framework is validated on action recognition datasets of still images (Stanford-40) and videos (UCF-50), achieving recognition accuracies of 51.2% and 66.7%, respectively. Compared to other representative methods from the literature, our approach exhibits superior performance, demonstrating the effectiveness of the adaptive ranking methodology presented in this work.
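The pipeline summarized in the abstract (rank local descriptors per category by distinctiveness, keep the top-ranked ones as dictionary atoms, encode each image against the dictionary, then classify with a random forest) can be sketched as follows. This is a minimal illustration on synthetic descriptors, not the paper's implementation: the distinctiveness score below (mean inter-class distance minus mean intra-class distance) is a hypothetical stand-in for the paper's measure, and the label-propagation refinement stage is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic local descriptors: 3 classes, 60 descriptors each, 16-D.
n_classes, n_per_class, dim = 3, 60, 16
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(n_per_class, dim))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

def distinctiveness(X, y, c):
    """Illustrative score: mean distance to other-class descriptors minus
    mean distance to same-class descriptors (higher = more category-specific)."""
    same, other = X[y == c], X[y != c]
    scores = []
    for d in same:
        d_same = np.linalg.norm(same - d, axis=1).mean()
        d_other = np.linalg.norm(other - d, axis=1).mean()
        scores.append(d_other - d_same)
    return np.array(scores)

# Rank descriptors per category; keep the top-k as dictionary atoms.
k = 10
atoms = []
for c in range(n_classes):
    same = X[y == c]
    top = np.argsort(distinctiveness(X, y, c))[::-1][:k]
    atoms.append(same[top])
dictionary = np.vstack(atoms)          # shape: (n_classes * k, dim)

def encode(descriptors, dictionary):
    """Hard-assignment bag-of-words histogram over dictionary atoms."""
    d = np.linalg.norm(descriptors[:, None] - dictionary[None], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(dictionary))
    return hist / hist.sum()

# Treat each consecutive group of 10 descriptors as one "image" and classify.
imgs = np.array([encode(X[i:i + 10], dictionary) for i in range(0, len(X), 10)])
img_labels = y[::10]
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(imgs, img_labels)
print("train accuracy:", clf.score(imgs, img_labels))
```

Because the dictionary keeps only the most category-specific atoms, the resulting histograms are more discriminative than ones built from all descriptors uniformly, which is the intuition behind the adaptive ranking.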
Ethics declarations
Conflicts of Interest
The authors declare that they have no conflict of interest.
Ethical Approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Cite this article
Roy, A., Banerjee, B., Hussain, A. et al. Discriminative Dictionary Design for Action Classification in Still Images and Videos. Cogn Comput 13, 698–708 (2021). https://doi.org/10.1007/s12559-021-09851-8