Abstract
Human action recognition is widely explored as it finds varied applications including visual navigation, surveillance, video indexing, biometrics, human–computer interaction, ambient assisted living, etc. This paper aims to design and analyze the performance of spatial and temporal CNN streams for action recognition from videos. An action video is fragmented into a predefined number of segments called snippets. For each segment, the atomic poses portrayed by the individual are effectively captured by a representative frame, and the dynamics of the action are well described by a dynamic image. The representative frames and dynamic images are used to train separate Convolutional Neural Networks for further analysis. The results attained on the KTH, Weizmann, UCF Sports, and UCF101 datasets ascertain the efficiency of the proposed architecture for action recognition.
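The pipeline described above — splitting a video into snippets, taking a representative frame per snippet, and summarizing each snippet's motion as a dynamic image — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the middle-frame choice for the representative frame and the simplified linear coefficients for the dynamic image (a common approximation of the rank pooling of Bilen et al.) are assumptions for the sketch.

```python
import numpy as np

def snippets(frames, k):
    """Split a frame array of shape (T, H, W[, C]) into k contiguous segments."""
    return np.array_split(frames, k)

def representative_frame(segment):
    """Middle frame of the segment as a simple spatial representative (assumed choice)."""
    return segment[len(segment) // 2]

def dynamic_image(segment):
    """Approximate rank pooling: a weighted sum of frames that favors later frames,
    so the temporal evolution of the action is baked into one image."""
    t = len(segment)
    # simplified coefficients alpha_i = 2i - t - 1 for i = 1..t
    alpha = 2 * np.arange(1, t + 1) - t - 1
    di = np.tensordot(alpha, segment.astype(np.float64), axes=1)
    # rescale to [0, 255] so the result can feed a standard image CNN
    di -= di.min()
    if di.max() > 0:
        di /= di.max()
    return (di * 255).astype(np.uint8)

# usage: a toy 12-frame grayscale "video" split into 4 snippets
video = np.random.rand(12, 8, 8).astype(np.float32)
segs = snippets(video, 4)
reps = [representative_frame(s) for s in segs]   # inputs to the spatial stream
dyns = [dynamic_image(s) for s in segs]          # inputs to the temporal stream
```

Each snippet thus yields one still image per stream; the two CNN streams are then trained on these inputs independently, as the abstract describes.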
Additional information
Communicated by Y. Kong.
Cite this article
Russel, N., Selvaraj, A. Fusion of spatial and dynamic CNN streams for action recognition. Multimedia Systems 27, 969–984 (2021). https://doi.org/10.1007/s00530-021-00773-x