Abstract
This work is motivated by the remarkable achievements of deep learning models in computer vision, particularly in human activity recognition. The task is gaining attention due to its numerous real-life applications, such as smart surveillance systems, human–computer interaction, sports action analysis, and elderly healthcare. In recent years, the acquisition and interfacing of multimodal data have become straightforward owing to low-cost depth sensors. Several approaches have been developed using RGB-D (depth) evidence, but at the cost of additional equipment setup and high complexity. Conversely, methods that rely only on RGB frames yield inferior performance due to the absence of depth evidence, yet they require less hardware, are simpler, and generalize easily using only color cameras. In this work, a deeply coupled ConvNet for human activity recognition is proposed that processes RGB frames in the top stream with a bi-directional long short-term memory (Bi-LSTM) network, while the bottom stream trains a CNN on a single dynamic motion image. For the RGB frames, the CNN-Bi-LSTM model is trained end-to-end to refine the features of the pre-trained CNN, while the dynamic-image stream fine-tunes the top layers of the pre-trained model to capture the temporal information in videos. The scores from the two streams are fused at the decision level after the softmax layer using different late-fusion techniques, with max fusion achieving the highest accuracy. The performance of the model is assessed on four standard single- and multi-person RGB-D (depth) activity datasets. The highest classification accuracies achieved exceed comparable state-of-the-art results by significant margins: 2% on SBU Interaction, 4% on MIVIA Action, 1% on MSR Action Pair, and 4% on MSR Daily Activity.
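The two core ingredients named in the abstract, a single dynamic image summarizing a clip's motion and decision-level max fusion of the two streams' softmax scores, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dynamic-image weights follow the approximate rank pooling of Bilen et al. (2016), and the array shapes and rescaling choices here are assumptions.

```python
import numpy as np

def dynamic_image(frames):
    """Collapse a list of video frames (each an H x W array) into one
    'dynamic image' via approximate rank pooling: frame t (1-indexed)
    is weighted by 2*t - T - 1, so later frames contribute positively
    and earlier frames negatively."""
    T = len(frames)
    coeffs = np.array([2 * t - T - 1 for t in range(1, T + 1)], dtype=np.float64)
    di = np.tensordot(coeffs, np.stack(frames).astype(np.float64), axes=1)
    # Rescale to [0, 255] so the result can be fed to an image CNN.
    di = (di - di.min()) / (np.ptp(di) + 1e-8) * 255.0
    return di

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def max_fusion(logits_rgb, logits_dyn):
    """Decision-level (late) max fusion: take the element-wise maximum
    of the two streams' softmax scores, then the argmax over classes."""
    scores = np.maximum(softmax(logits_rgb), softmax(logits_dyn))
    return int(np.argmax(scores))
```

As a usage example, fusing a hypothetical RGB-stream output `[2.0, 0.0]` with a dynamic-stream output `[0.0, 3.0]` picks class 1, since the dynamic stream's confidence for class 1 exceeds the RGB stream's confidence for class 0 after softmax.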
Acknowledgements
The authors acknowledge that the computational work was supported by the Biometric Research Laboratory, Department of Information Technology, Delhi Technological University, New Delhi, India.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Singh, T., Vishwakarma, D.K. A deeply coupled ConvNet for human activity recognition using dynamic and RGB images. Neural Comput & Applic 33, 469–485 (2021). https://doi.org/10.1007/s00521-020-05018-y