
Human action interpretation using convolutional neural network: a survey

  • Original Paper
  • Published:
Machine Vision and Applications

Abstract

Human action interpretation (HAI) is one of the trending research areas in computer vision. It can be further divided into human action recognition (HAR) and human action detection (HAD). HAR analyzes frames and assigns label(s) to the video as a whole, whereas HAD first localizes the actor in each frame and then estimates an action score for the detected region. The effectiveness of an HAI model depends heavily on the representation of spatiotemporal features and on the model's architectural design. Various studies have addressed the effective representation of these features, and different deep architectures have been proposed to learn them and to derive action scores from them. Among deep architectures, the convolutional neural network (CNN) has been explored relatively more for HAI due to its lower computational cost. Several surveys of these efforts have been published to date; however, none of them examines the representation of features and the design of the proposed architectures in detail, and none covers pose-assisted HAI techniques. This study provides a more detailed survey of existing CNN-based HAI techniques, covering both frame-level and pose-level spatiotemporal feature-based approaches. It also offers a comparative study of the publicly available datasets used to evaluate HAI models built on various spatiotemporal feature representations. Finally, it discusses the limitations and challenges of HAI and concludes that interpreting human action from visual data still falls far short of true interpretation in realistic videos, which are continuous in nature and may contain multiple people performing multiple actions sequentially or in parallel.
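
To make the HAR/HAD distinction concrete, the following minimal PyTorch sketch contrasts the two problem settings: HAR pools spatiotemporal features over a whole clip and emits one label vector per video, whereas HAD regresses an actor bounding box for each per-frame region and then scores actions for that region. This is an illustrative sketch under assumed interfaces, not code from the survey or from any surveyed model; all names (HARHead, HADHead, clip_feats, region_feats) are hypothetical.

import torch
import torch.nn as nn

class HARHead(nn.Module):
    """HAR: classify the whole video from temporally pooled CNN features."""
    def __init__(self, feat_dim: int, num_actions: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_actions)

    def forward(self, clip_feats: torch.Tensor) -> torch.Tensor:
        # clip_feats: (batch, time, feat_dim) backbone features for one clip
        pooled = clip_feats.mean(dim=1)   # average over the temporal axis
        return self.classifier(pooled)    # one action-score vector per video

class HADHead(nn.Module):
    """HAD: localize the actor first, then score actions for that region."""
    def __init__(self, feat_dim: int, num_actions: int):
        super().__init__()
        self.box_regressor = nn.Linear(feat_dim, 4)  # (x, y, w, h) per region
        self.action_scorer = nn.Linear(feat_dim, num_actions)

    def forward(self, region_feats: torch.Tensor):
        # region_feats: (num_regions, feat_dim) features of per-frame regions
        boxes = self.box_regressor(region_feats)   # where is the actor?
        scores = self.action_scorer(region_feats)  # what is the actor doing?
        return boxes, scores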



Author information

Corresponding author

Correspondence to Zainab Malik.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Malik, Z., Shapiai, M.I.B. Human action interpretation using convolutional neural network: a survey. Machine Vision and Applications 33, 37 (2022). https://doi.org/10.1007/s00138-022-01291-0

