Abstract
Human action recognition from realistic video data is a challenging and relevant research area. The state of the art is led by methods based on convolutional neural networks (CNNs), especially two-stream CNNs. In this family of deep architectures, the appearance channel learns from RGB images, while the motion channel learns from a motion representation, usually the optical flow. Since action recognition requires extracting complex motion-pattern descriptors from image sequences, we introduce a new set of second-order motion representations that capture both geometrical and kinematic properties of the motion: curl, divergence, curvature, and acceleration. In addition, we present a new and effective strategy for the I3D two-stream CNN that reduces training time without sacrificing performance and is robust to the weakness of a single channel. The experiments in this paper were carried out on two of the most challenging datasets for action recognition: UCF101 and HMDB51. On UCF101, our approach improves accuracy, reaching 98.45% when curvature and acceleration are combined as a motion representation. On HMDB51, our approach shows competitive performance, achieving an accuracy of 80.19%. On both datasets, our approach considerably reduces preprocessing and training time: preprocessing time drops to a sixth, and the motion stream can be trained in a third of the time usually required.
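To illustrate the kind of quantities involved, the sketch below computes curl, divergence, streamline curvature, and convective acceleration from a dense optical-flow field using the standard vector-calculus definitions. This is a minimal NumPy illustration under the assumption of a flow array of shape (H, W, 2) with finite-difference derivatives; it is not the authors' implementation, whose exact formulation may differ.

```python
import numpy as np

def second_order_descriptors(flow):
    """Second-order motion maps from a dense optical-flow field.

    flow: array of shape (H, W, 2) holding the (u, v) displacement
    per pixel, with axis 0 as y and axis 1 as x.
    Returns per-pixel curl, divergence, streamline curvature, and
    acceleration magnitude.
    """
    u, v = flow[..., 0], flow[..., 1]
    # Spatial derivatives via finite differences (y-axis first).
    u_y, u_x = np.gradient(u)
    v_y, v_x = np.gradient(v)
    curl = v_x - u_y          # local rotation of the field
    div = u_x + v_y           # local expansion/contraction
    # Convective acceleration a = (w . grad) w for w = (u, v).
    a_x = u * u_x + v * u_y
    a_y = u * v_x + v * v_y
    accel = np.hypot(a_x, a_y)
    # Streamline curvature: kappa = (u*a_y - v*a_x) / |w|^3,
    # guarded against zero speed.
    speed3 = np.maximum(np.hypot(u, v), 1e-8) ** 3
    curvature = (u * a_y - v * a_x) / speed3
    return curl, div, curvature, accel
```

As a sanity check, a rigid rotation flow (u, v) = (-y, x) yields a constant curl of 2, zero divergence, and curvature 1/r at radius r, matching the circular streamlines.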
Notes
Due to a lack of computational resources, we could not train the second-order descriptors on the Kinetics dataset, on which the optical-flow stream was originally trained.
References
García RO, Morales EF, Sucar LE (2019) A novel scheme for training two-stream CNNs for action recognition. In: Iberoamerican congress on pattern recognition. Springer, Cham, pp 729–739
Ahad MAR, Tan JK, Kim H, Ishikawa S (2012) Motion history image: its variants and applications. Mach Vis Appl 23(2):255–281
Herath S, Harandi M, Porikli F (2017) Going deeper into action recognition: a survey. Image Vis Comput 1(60):4–21
Nanni L, Ghidoni S, Brahnam S (2017) Handcrafted vs. non-handcrafted features for computer vision classification. Pattern Recogn 71:158–172
Choutas V, Weinzaepfel P, Revaud J, Schmid C (2018) Potion: Pose motion representation for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 7024–7033
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In: Advances in neural information processing systems, pp 568–576
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, pp 3551–3558
Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6299–6308
Diba A, Fayyaz M, Sharma V, Karami AH, Arzani MM, Yousefzadeh R, Van Gool L (2017) Temporal 3d convnets: new architecture and transfer learning for video classification. arXiv preprint arXiv:1711.08200
Zach C, Pock T, Bischof H (2007) A duality based approach for realtime TV-L1 optical flow. In: Joint pattern recognition symposium. Springer, Berlin, pp 214–223
García RO, Valentin L, Risquet CP, Sucar LE (2017) A pathline-based background subtraction algorithm. In: 9th Mexican conference on pattern recognition, pp 179–188
Soomro K, Zamir AR, Shah M (2012) UCF101: A dataset of 101 human action classes from videos in the wild. arXiv preprint arXiv:1212.0402
Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In: 2011 international conference on computer vision. IEEE, pp 2556–2563
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. In: European conference on computer vision. Springer, Cham, pp 630–645
Ji S, Xu W, Yang M, Yu K (2012) 3D convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167
Diba A, Pazandeh AM, Van Gool L (2016) Efficient two-stream motion and appearance 3d CNNs for video classification. arXiv preprint arXiv:1608.08851
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Tran D, Ray J, Shou Z, Chang SF, Paluri M (2017) Convnet architecture search for spatiotemporal feature learning. arXiv preprint arXiv:1708.05038
Yue-Hei Ng J, Hausknecht M, Vijayanarasimhan S, Vinyals O, Monga R, Toderici G (2015) Beyond short snippets: deep networks for video classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4694–4702
Donahue J, Anne Hendricks L, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2625–2634
Wang Y, Song J, Wang L, Van Gool L, Hilliges O (2016) Two-stream SR-CNNs for action recognition in videos. In: BMVC
Peng X, Schmid C (2016) Multi-region two-stream R-CNN for action detection. In: European conference on computer vision. Springer, Cham, pp 744–759
Saha S, Singh G, Sapienza M, Torr PH, Cuzzolin F (2016) Deep learning for detecting multiple space–time action tubes in videos. arXiv preprint arXiv:1608.01529
Hara K, Kataoka H, Satoh Y (2018) Can spatiotemporal 3d CNNs retrace the history of 2d CNNs and imagenet? In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6546–6555
Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal LSTM with trust gates for 3d human action recognition. In: European conference on computer vision. Springer, Cham, pp 816–833
Baccouche M, Mamalet F, Wolf C, Garcia C, Baskurt A (2011) Sequential deep learning for human action recognition. In: International workshop on human behavior understanding. Springer, Berlin, pp 29–39
Ma S, Sigal L, Sclaroff S (2016) Learning activity progression in LSTMs for activity detection and early detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1942–1950
Du Y, Wang W, Wang L (2015) Hierarchical recurrent neural network for skeleton based action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 1110–1118
Carreira J, Noland E, Hillier C, Zisserman A (2019) A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2818–2826
Wang H, Kläser A, Schmid C, Liu CL (2013) Dense trajectories and motion boundary descriptors for action recognition. Int J Comput Vis 103(1):60–79
Rao C, Shah M (2001) View-invariance in action recognition. In: Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition (CVPR 2001). IEEE, vol 2, pp II-II
Bashir FI, Khokhar AA, Schonfeld D (2006) View-invariant motion trajectory-based activity classification and recognition. Multimed Syst 12(1):45–54
Chen H, Chirikjian GS (2019) Curvature: a signature for action recognition in video sequences. arXiv preprint arXiv:1904.13003
Weinkauf T, Theisel H (2002) Curvature measures of 3D vector fields and their applications. J WSCG, pp 507–514
Jain M, Jegou H, Bouthemy P (2013) Better exploiting motion for better action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2555–2562
Wang X, Qi C (2016) Action recognition using edge trajectories and motion acceleration descriptor. Mach Vis Appl 27(6):861–875
Canny JF (1983) Finding edges and lines in images. Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Cambridge
Kroeger T, Timofte R, Dai D, Van Gool L (2016) Fast optical flow using dense inverse search. In: European conference on computer vision. Springer, Cham, pp 471–488
Farnebäck G (2003) Two-frame motion estimation based on polynomial expansion. In: Scandinavian conference on image analysis. Springer, Berlin, pp 363–370
Theisel H (1998) Visualizing the curvature of unsteady 2D flow fields. In: Proceedings of the 9th EG workshop on visualization in science computing, pp 47–56
Suter D (1994) Motion estimation and vector splines. In: CVPR, Vol 94, pp 939–942
Vetterling WT, Teukolsky SA, Press WH, Flannery BP (1989) Numerical recipes. University Press, Cambridge
Cruz C, Sucar LE, Morales EF (2008) Real-time face recognition for human–robot interaction. In: 2008 8th IEEE international conference on automatic face and gesture recognition. IEEE, pp 1–6
Reddy KK, Shah M (2013) Recognizing 50 human action categories of web videos. Mach Vis Appl 24(5):971–981
Yao Y, Rosasco L, Caponnetto A (2007) On early stopping in gradient descent learning. Construct Approx 26(2):289–315
Cite this article
Oves García, R., Morales, E.F. & Sucar, L.E. Second-order motion descriptors for efficient action recognition. Pattern Anal Applic 24, 473–482 (2021). https://doi.org/10.1007/s10044-020-00924-2