
Deep learning-based multi-modal approach using RGB and skeleton sequences for human activity recognition

  • Regular Paper
  • Multimedia Systems

Abstract

Deep learning techniques have achieved great success in human activity recognition (HAR). In this paper, we propose a HAR technique that exploits both RGB and skeleton information using a convolutional neural network (ConvNet) and a long short-term memory (LSTM) recurrent neural network. The proposed method has two parts. First, motion representation images, namely the motion history image (MHI) and motion energy image (MEI), are computed from the RGB videos, and a ConvNet is trained on these images with feature-level fusion. Second, the skeleton data are processed by a proposed algorithm that generates skeleton intensity images for three views (top, front, and side). Each view is first analyzed by a ConvNet that produces a set of feature maps, which are fused for further analysis; on top of these ConvNet sub-networks, an LSTM exploits the temporal dependency. The softmax scores from the two independent parts are then combined at the decision level. Beyond this HAR approach, the paper also presents a strategy based on cyclic learning rates that builds a multi-modal neural network in a single training run, making the system more efficient. The suggested approach makes full use of the RGB and skeleton data available from an RGB-D sensor. It has been evaluated on three well-known and challenging multimodal datasets: UTD-MHAD, CAD-60, and NTU RGB+D 120. Results show that the proposed method performs favorably compared with other state-of-the-art systems.
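The two building blocks named in the abstract can be sketched compactly. The snippet below is a minimal illustration, not the authors' implementation: `motion_images` builds an MHI (pixels that moved are refreshed to a maximum value `tau` and decay over time) and an MEI (the union of all motion regions) from grayscale frames, and `fuse_scores` shows decision-level fusion as a weighted sum of the two streams' softmax scores. The function names, the difference threshold, and the equal fusion weight are assumptions for illustration.

```python
import numpy as np

def motion_images(frames, tau=None, threshold=30):
    """Compute a motion history image (MHI) and motion energy image (MEI)
    from a sequence of grayscale frames (2-D uint8 arrays).

    tau defaults to the number of frame differences in the clip."""
    # Binary motion masks from thresholded absolute frame differences.
    diffs = [np.abs(frames[i + 1].astype(np.int16)
                    - frames[i].astype(np.int16)) > threshold
             for i in range(len(frames) - 1)]
    tau = tau or len(diffs)
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for d in diffs:
        # Moving pixels are set to tau; static pixels decay by 1 (floor 0),
        # so brighter MHI pixels correspond to more recent motion.
        mhi = np.where(d, float(tau), np.maximum(mhi - 1.0, 0.0))
    # MEI: where motion occurred at any time in the clip.
    mei = np.logical_or.reduce(diffs).astype(np.uint8)
    return mhi, mei

def fuse_scores(softmax_rgb, softmax_skel, w=0.5):
    """Decision-level fusion: weighted sum of the two streams' softmax scores."""
    return w * np.asarray(softmax_rgb) + (1.0 - w) * np.asarray(softmax_skel)
```

In practice the MHI/MEI pair would be stacked (feature-level fusion) and fed to the RGB-stream ConvNet, while `fuse_scores` combines that stream's class probabilities with those of the skeleton ConvNet+LSTM stream before taking the argmax.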




Author information


Corresponding author

Correspondence to Pratishtha Verma.

Additional information

Communicated by Y. Kong.


About this article

Cite this article

Verma, P., Sah, A. & Srivastava, R. Deep learning-based multi-modal approach using RGB and skeleton sequences for human activity recognition. Multimedia Systems 26, 671–685 (2020). https://doi.org/10.1007/s00530-020-00677-2
