Second-order motion descriptors for efficient action recognition

Original Article · Pattern Analysis and Applications

Abstract

Human action recognition from realistic video data constitutes a challenging and relevant research area. The state of the art is led by methods based on convolutional neural networks (CNNs), and especially two-stream CNNs. In this family of deep architectures, the appearance channel learns from the RGB images and the motion channel learns from a motion representation, usually the optical flow. Since action recognition requires extracting descriptors of complex motion patterns from image sequences, we introduce a new set of second-order motion representations that capture both geometric and kinematic properties of the motion: curl, divergence, curvature, and acceleration. In addition, we present a new and effective training strategy for the I3D two-stream CNN that reduces training time without sacrificing performance and is robust to weaknesses in a single channel. The experiments presented in this paper were carried out on two of the most challenging datasets for action recognition: UCF101 and HMDB51. The reported results show an improvement in accuracy on the UCF101 dataset, where an accuracy of 98.45% is achieved when curvature and acceleration are combined as the motion representation. On HMDB51, our approach shows competitive performance, achieving an accuracy of 80.19%. On both datasets, our approach considerably reduces the time spent in the preprocessing and training phases: preprocessing time is reduced to one sixth, and the motion stream can be trained in one third of the time usually required.
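To make the descriptors concrete: for a dense flow field w = (u, v), the second-order quantities named above follow from standard vector calculus, with curl = ∂v/∂x − ∂u/∂y, div = ∂u/∂x + ∂v/∂y, and the streamline curvature and material acceleration derived from the convective term (w·∇)w. The NumPy sketch below illustrates one plausible discretization of these definitions from two consecutive optical flow frames; the function name, the finite-difference scheme, and the exact temporal term are our assumptions, not the authors' formulation.

```python
import numpy as np

def second_order_descriptors(flow_t, flow_t1, dt=1.0, eps=1e-8):
    """Curl, divergence, streamline curvature, and acceleration magnitude
    from two consecutive dense optical flow fields of shape (H, W, 2),
    where channel 0 is the horizontal (u) and channel 1 the vertical (v)
    component."""
    u, v = flow_t[..., 0], flow_t[..., 1]

    # Spatial derivatives via central differences; np.gradient returns
    # the derivative along axis 0 (rows, y) first, then axis 1 (cols, x).
    u_y, u_x = np.gradient(u)
    v_y, v_x = np.gradient(v)

    curl = v_x - u_y   # local rotation of the flow
    div = u_x + v_y    # local expansion / contraction

    # Convective acceleration (w . grad) w, component-wise
    conv_u = u * u_x + v * u_y
    conv_v = u * v_x + v * v_y

    # Material acceleration adds the temporal derivative dw/dt,
    # approximated from the next flow frame
    acc_u = (flow_t1[..., 0] - u) / dt + conv_u
    acc_v = (flow_t1[..., 1] - v) / dt + conv_v
    acc_mag = np.sqrt(acc_u**2 + acc_v**2)

    # Streamline curvature: kappa = (u * a_v - v * a_u) / |w|^3
    speed = np.sqrt(u**2 + v**2)
    curvature = (u * conv_v - v * conv_u) / (speed**3 + eps)

    return curl, div, curvature, acc_mag
```

Each returned map is a per-pixel scalar field; in a setup like the one described in the abstract, such maps would be stacked over time and supplied to the motion stream in place of the raw optical flow.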

Notes

  1. We could not train the second-order descriptors on the Kinetics dataset, on which the optical flow stream was originally trained, due to a lack of computational resources.

Author information

Correspondence to Reinier Oves García.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Oves García, R., Morales, E.F. & Sucar, L.E. Second-order motion descriptors for efficient action recognition. Pattern Anal Applic 24, 473–482 (2021). https://doi.org/10.1007/s10044-020-00924-2
