Abstract
Human action recognition is widely explored as it finds varied applications including visual navigation, surveillance, video indexing, biometrics, human–computer interaction, ambient assisted living, etc. This paper aims to design and analyze the performance of spatial and temporal CNN streams for action recognition from videos. An action video is fragmented into a predefined number of segments called snippets. For each segment, the atomic poses portrayed by the individual are effectively captured by a representative frame, and the dynamics of the action are well described by a dynamic image. The representative frames and dynamic images are used to train separate Convolutional Neural Networks for further analysis. The results attained on the KTH, Weizmann, UCF Sports, and UCF101 datasets ascertain the efficiency of the proposed architecture for action recognition.
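The pipeline described above — splitting a video into snippets, taking a representative frame per snippet, and summarizing each snippet's motion as a dynamic image — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the middle-frame choice for the representative frame and the simplified linear coefficients for the dynamic image (a common approximation of the rank pooling of Bilen et al.) are assumptions for the sketch.

```python
import numpy as np

def snippets(frames, k):
    """Split a frame array of shape (T, H, W[, C]) into k contiguous segments."""
    return np.array_split(frames, k)

def representative_frame(segment):
    """Middle frame of the segment as a simple spatial representative (assumed choice)."""
    return segment[len(segment) // 2]

def dynamic_image(segment):
    """Approximate rank pooling: a weighted sum of frames that favors later frames,
    so the temporal evolution of the action is baked into one image."""
    t = len(segment)
    # simplified coefficients alpha_i = 2i - t - 1 for i = 1..t
    alpha = 2 * np.arange(1, t + 1) - t - 1
    di = np.tensordot(alpha, segment.astype(np.float64), axes=1)
    # rescale to [0, 255] so the result can feed a standard image CNN
    di -= di.min()
    if di.max() > 0:
        di /= di.max()
    return (di * 255).astype(np.uint8)

# usage: a toy 12-frame grayscale "video" split into 4 snippets
video = np.random.rand(12, 8, 8).astype(np.float32)
segs = snippets(video, 4)
reps = [representative_frame(s) for s in segs]   # inputs to the spatial stream
dyns = [dynamic_image(s) for s in segs]          # inputs to the temporal stream
```

Each snippet thus yields one still image per stream; the two CNN streams are then trained on these inputs independently, as the abstract describes.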
Additional information
Communicated by Y. Kong.
Cite this article
Russel, N., Selvaraj, A. Fusion of spatial and dynamic CNN streams for action recognition. Multimedia Systems 27, 969–984 (2021). https://doi.org/10.1007/s00530-021-00773-x