Abstract
Action recognition in still images is an active topic in computer vision. A key difficulty in still image-based action recognition is the lack of temporal information, while cluttered backgrounds and diverse objects make the task even more challenging. However, each action image may contain several salient regions whose exploitation can improve recognition performance. Since no unique, clear definition exists for salient regions in action recognition images, obtaining reliable ground-truth salient regions is highly challenging. This paper presents a multi-attention guided network with weakly supervised detection of multiple salient regions for action recognition. A teacher-student structure guides the attention of the student model toward the salient regions. During training, the teacher network, equipped with a Salient Region Proposal (SRP) module, generates weakly supervised data for the student network. In the evaluation phase, the student network, with its Multi-ATtention (MAT) module, proposes multiple salient regions and predicts the actions from the extracted information. The proposed method achieves mean Average Precision (mAP) values of 94.20% and 93.80% on the Stanford-40 Actions and PASCAL VOC 2012 datasets, respectively. Experimental results based on the ResNet-50 architecture show the superiority of the proposed method over existing ones on the Stanford-40 and VOC 2012 datasets. In addition, we have made a major modification to the BU101 dataset, which is now publicly available; the proposed method achieves an mAP of 90.16% on this new BU101 dataset.
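The paper itself does not specify how the teacher's SRP module is implemented, but weakly supervised salient-region proposal is commonly realized by thresholding a class activation map (CAM) from the teacher's last convolutional layer. The following is a minimal NumPy sketch of that general idea, not the authors' method; the function names and the fixed threshold are illustrative assumptions.

```python
import numpy as np

def class_activation_map(features, weights):
    """Weight conv feature maps by the classifier weights of one class.

    features: (C, H, W) array of last-layer conv activations.
    weights:  (C,) classifier weights for the predicted class.
    Returns a (H, W) map normalized to [0, 1].
    """
    cam = np.tensordot(weights, features, axes=([0], [0]))  # (H, W)
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()
    return cam

def propose_salient_region(cam, thresh=0.5):
    """Bounding box (x1, y1, x2, y2) of activations above `thresh`.

    Falls back to the full map when nothing exceeds the threshold.
    """
    ys, xs = np.nonzero(cam >= thresh)
    if len(ys) == 0:
        h, w = cam.shape
        return (0, 0, w - 1, h - 1)
    return (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

# Toy example: one channel activates on a 2x2 patch of a 4x4 grid.
features = np.zeros((2, 4, 4))
features[0, 1:3, 1:3] = 1.0
weights = np.array([1.0, 0.0])
cam = class_activation_map(features, weights)
print(propose_salient_region(cam))  # (1, 1, 2, 2)
```

A box obtained this way could serve as a weak pseudo-label for training a student's attention maps, which is the spirit of the teacher-student guidance the abstract describes.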
Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Cite this article
Ashrafi, S.S., Shokouhi, S.B. & Ayatollahi, A. Action recognition in still images using a multi-attention guided network with weakly supervised saliency detection. Multimed Tools Appl 80, 32567–32593 (2021). https://doi.org/10.1007/s11042-021-11215-1