
Action recognition in still images using a multi-attention guided network with weakly supervised saliency detection

Published in Multimedia Tools and Applications (2021)

Abstract

Action recognition in still images is a challenging subject in computer vision. One of the most important problems in still image-based action recognition is the lack of temporal information; at the same time, cluttered backgrounds and diverse objects make the recognition task even more challenging. However, each action image may contain several salient regions, and exploiting them can improve recognition performance. Moreover, since no unique and clear definition exists for salient regions in action images, obtaining reliable ground-truth salient regions is itself highly challenging. This paper presents a multi-attention guided network with weakly supervised detection of multiple salient regions for action recognition. A teacher-student structure guides the attention of the student model toward the salient regions. During training, the teacher network, equipped with a Salient Region Proposal (SRP) module, generates weakly supervised data for the student network. During evaluation, the student network, equipped with a Multi-ATtention (MAT) module, proposes multiple salient regions and predicts actions from the information found in them. The proposed method obtains mean Average Precision (mAP) values of 94.2% and 93.80% on the Stanford-40 Actions and PASCAL VOC2012 datasets, respectively. The experimental results, based on the ResNet-50 architecture, show the superiority of the proposed method over existing ones on Stanford-40 and VOC2012. We have also made a major modification to the BU101 dataset, which is now publicly available; the proposed method achieves an mAP of 90.16% on this new BU101 dataset.
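The abstract's training scheme lends itself to a short illustration. Below is a minimal PyTorch sketch of teacher-student attention guidance in that spirit: a frozen teacher's spatial attention map supervises the student's attention while the student also learns the action labels. The ResNet-50 backbones match the paper's architecture, but the attention_map pooling, the train_step helper, and the loss weight beta are illustrative assumptions, not the authors' SRP and MAT modules.

```python
# Minimal teacher-student attention-guidance sketch (assumptions, not the
# paper's SRP/MAT modules): a frozen teacher's spatial attention supervises
# the student's attention alongside the classification loss.
import torch
import torch.nn.functional as F
from torchvision.models import resnet50


def attention_map(features: torch.Tensor) -> torch.Tensor:
    """Collapse a (B, C, H, W) feature map into an L2-normalized spatial
    attention vector of shape (B, H*W) via channel-wise energy pooling."""
    att = features.pow(2).mean(dim=1)   # (B, H, W)
    return F.normalize(att.flatten(1), p=2, dim=1)


num_actions = 40                        # e.g. Stanford-40 Actions
teacher = resnet50(weights="IMAGENET1K_V1")
student = resnet50(weights="IMAGENET1K_V1")
student.fc = torch.nn.Linear(student.fc.in_features, num_actions)

teacher.eval()                          # teacher is frozen
for p in teacher.parameters():
    p.requires_grad_(False)
student.train()

# Forward hooks capture the last conv-stage features of both networks.
feats = {}
teacher.layer4.register_forward_hook(lambda m, i, o: feats.update(t=o))
student.layer4.register_forward_hook(lambda m, i, o: feats.update(s=o))

optimizer = torch.optim.SGD(student.parameters(), lr=1e-3, momentum=0.9)


def train_step(images, labels, beta=100.0):
    """One step: cross-entropy on action labels plus an attention-transfer
    term pulling the student's saliency toward the teacher's (beta is a guess)."""
    with torch.no_grad():
        teacher(images)                 # fills feats["t"]
    logits = student(images)            # fills feats["s"]
    cls_loss = F.cross_entropy(logits, labels)
    att_loss = (attention_map(feats["s"])
                - attention_map(feats["t"])).pow(2).sum(dim=1).mean()
    loss = cls_loss + beta * att_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Smoke test with a dummy batch:
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, num_actions, (4,))
print(train_step(images, labels))
```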



Notes

  1. The modified BU101 dataset is available at https://github.com/seyedsajadashrafi/bu101plus-action-recognition-dataset
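Note 1 hosts the modified BU101 dataset on which the 90.16% mAP is reported. For readers unfamiliar with the metric, the sketch below shows one plausible way to compute mean Average Precision with scikit-learn, assuming per-class confidence scores and binary ground-truth labels; it mirrors the common PASCAL VOC-style protocol, not the authors' exact evaluation code.

```python
# One plausible mAP computation (assumption: standard per-class average
# precision averaged over classes, not the authors' evaluation script).
import numpy as np
from sklearn.metrics import average_precision_score


def mean_average_precision(scores: np.ndarray, labels: np.ndarray) -> float:
    """scores: (N, C) classifier confidences; labels: (N, C) binary ground
    truth. Returns the mean over classes of per-class average precision."""
    aps = [average_precision_score(labels[:, c], scores[:, c])
           for c in range(scores.shape[1])]
    return float(np.mean(aps))


# Example with random scores over 5 hypothetical action classes:
rng = np.random.default_rng(0)
labels = np.eye(5)[rng.integers(0, 5, size=200)]   # one-hot ground truth
scores = rng.random((200, 5))                      # dummy confidences
print(mean_average_precision(scores, labels))      # ~0.2 for random scores
```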


Funding

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author information


Corresponding author

Correspondence to Shahriar B. Shokouhi.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ashrafi, S.S., Shokouhi, S.B. & Ayatollahi, A. Action recognition in still images using a multi-attention guided network with weakly supervised saliency detection. Multimed Tools Appl 80, 32567–32593 (2021). https://doi.org/10.1007/s11042-021-11215-1

