
Fusion of spatial and dynamic CNN streams for action recognition

  • Regular Paper
  • Published in: Multimedia Systems

Abstract

Human action recognition is widely explored because it finds varied applications, including visual navigation, surveillance, video indexing, biometrics, human–computer interaction, and ambient assisted living. This paper designs and analyzes the performance of spatial and temporal CNN streams for action recognition from videos. An action video is fragmented into a predefined number of segments called snippets. For each segment, the atomic poses portrayed by the individual are effectively captured by a representative frame, and the dynamics of the action are well described by a dynamic image. The representative frames and dynamic images are trained by separate Convolutional Neural Networks for further analysis. The results attained on the KTH, Weizmann, UCF Sports, and UCF101 datasets ascertain the efficiency of the proposed architecture for action recognition.
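As a concrete illustration of the pipeline the abstract describes, the sketch below segments a video into snippets, takes one representative frame per snippet for the spatial stream, and collapses each snippet into a dynamic image for the temporal stream. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: picking the middle frame as the representative is an assumption, and the dynamic image uses the closed-form approximate rank-pooling weights α_t = 2t − T − 1 from Bilen et al. (Dynamic Image Networks, CVPR 2016), a standard construction for dynamic images.

```python
import numpy as np

def split_into_snippets(frames, num_snippets):
    """Split a (num_frames, H, W, C) video into roughly equal snippets."""
    return np.array_split(np.asarray(frames), num_snippets)

def representative_frame(snippet):
    """Assumption: use the middle frame of the snippet as its spatial representative."""
    return snippet[len(snippet) // 2]

def dynamic_image(snippet):
    """Collapse a snippet into one image via approximate rank pooling:
    a weighted sum of frames with alpha_t = 2t - T - 1, t = 1..T."""
    T = len(snippet)
    alpha = 2.0 * np.arange(1, T + 1) - T - 1
    di = np.tensordot(alpha, snippet.astype(np.float32), axes=1)
    di -= di.min()                      # rescale to [0, 255] so the result
    di *= 255.0 / max(di.max(), 1e-6)   # can be fed to a standard image CNN
    return di.astype(np.uint8)

# Hypothetical usage with a random stand-in video of 120 RGB frames.
video = np.random.randint(0, 256, (120, 224, 224, 3), dtype=np.uint8)
for snippet in split_into_snippets(video, num_snippets=5):
    rgb = representative_frame(snippet)  # input to the spatial CNN stream
    dyn = dynamic_image(snippet)         # input to the dynamic CNN stream
```

Feeding `rgb` to one CNN and `dyn` to another, then fusing the two streams' class scores (for instance by averaging softmax outputs), yields a late-fusion two-stream classifier in the spirit of the architecture studied here.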

Author information

Corresponding author: Newlin Shebiah Russel.

Communicated by Y. Kong.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Russel, N., Selvaraj, A. Fusion of spatial and dynamic CNN streams for action recognition. Multimedia Systems 27, 969–984 (2021). https://doi.org/10.1007/s00530-021-00773-x
