Abstract
In this work, we propose a supervised probabilistic approach, called HCRF+, that integrates the learning using privileged information (LUPI) paradigm into a hidden conditional random field (HCRF) model for human action recognition. The proposed model employs a self-training technique to automatically estimate the regularization parameters of the objective function. Moreover, the method is robust to outliers because it models the conditional distribution of the privileged information with a Student’s t-density function, which integrates naturally into the HCRF+ framework. The proposed method was evaluated using different forms of privileged information on four publicly available datasets. The experimental results demonstrate its effectiveness relative to the state of the art in the LUPI framework, using both hand-crafted features and deep learning-based features extracted from a convolutional neural network.
References
Bilen H, Fernando B, Gavves E, Vedaldi A, Gould S (2016) Dynamic image networks for action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Las Vegas, NV
Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin
Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Honolulu, Hawaii
Choutas V, Weinzaepfel P, Revaud J, Schmid C (2018) PoTion: pose motion representation for action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Salt Lake City, UT
Cohen I, Cozman FG, Sebe N, Cirelo MC, Huang TS (2004) Semisupervised learning of classifiers: theory, algorithms, and their application to human–computer interaction. IEEE Trans Pattern Anal Mach Intell 26(12):1553–1566
Crasto N, Weinzaepfel P, Alahari K, Schmid C (2019) Mars: motion-augmented RGB stream for action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Long Beach, CA
De Geest R, Tuytelaars T (2018) Modeling temporal structure with LSTM for online action detection. In: Proceedings of the IEEE winter conference on applications of computer vision, Lake Tahoe, NV/CA
Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Boston, MA
Feichtenhofer C, Pinz A, Wildes RP, Zisserman A (2018) What have we learned from deep representations for action recognition? In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Salt Lake City, UT
Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Las Vegas, NV
Fouad S, Tino P, Raychaudhury S, Schneider P (2013) Incorporating privileged information through metric learning. IEEE Trans Neural Netw Learn Syst 24(7):1086–1098
Fu Y, Hospedales TM, Xiang T, Gong S (2012) Attribute learning for understanding unstructured social activity. In: Proceedings of the 12th European conference on computer vision, lecture notes in computer science, Florence, Italy, vol 7575
Gao Z, Li S, Zhu Y, Wang C, Zhang H (2017) Collaborative sparse representation leaning model for RGBD action recognition. J Vis Commun Image Represent 48:442–452
Gao Z, Xuan H, Zhang H, Wan S, Choo KR (2019) Adaptive fusion and category-level dictionary learning model for multiview human action recognition. IEEE Internet Things J 6(6):9280–9293
Garcia NC, Morerio P, Murino V (2018) Modality distillation with multiple stream networks for action recognition. In: Proceedings of the European conference on computer vision, Munich, Germany
Guadarrama S, Krishnamoorthy N, Malkarnenkar G, Venugopalan S, Mooney RJ, Darrell T, Saenko K (2013) Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE international conference on computer vision, Sydney, Australia
Hardoon DR, Szedmak SR, Shawe-Taylor JR (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664
Hastie T, Rosset S, Tibshirani R, Zhu J (2004) The entire regularization path for the support vector machine. J Mach Learn Res 5:1391–1415
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Las Vegas, NV, pp 770–778
Hoai M, Zisserman A (2014) Talking heads: detecting humans and recognizing their interactions. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Columbus, OH
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Jin L, Li Z, Tang J (2020) Deep semantic multimodal hashing network for scalable image-text and video-text retrievals. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2020.2997020
Kakadiaris I, Sarafianos N, Nikou C (2016) Show me your body: gender classification from still images. In: Proceedings of the IEEE international conference on image processing, Phoenix, AZ
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Columbus, OH
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980
Kläser A, Marszałek M, Schmid C (2008) A spatio-temporal descriptor based on 3D-gradients. In: Proceedings of the British machine vision conference. University of Leeds, Leeds, UK
Komodakis N, Tziritas G (2007) Image completion using efficient belief propagation via priority scheduling and dynamic pruning. IEEE Trans Image Process 16(11):2649–2661
Kotz S, Nadarajah S (2004) Multivariate t distributions and their applications. Cambridge University Press, Cambridge
Laptev I (2005) On space-time interest points. Int J Comput Vis 64(2–3):107–123
Li C, Zhong Q, Xie D, Pu S (2019) Collaborative spatiotemporal feature learning for video action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Long Beach, CA
Li Y, Li Y, Vasconcelos N (2018) RESOUND: towards action recognition without representation bias. In: Proceedings of the European conference on computer vision, Munich, Germany
Liu A, Su Y, Jia P, Gao Z, Hao T, Yang Z (2015) Multiple/single-view human action recognition via part-induced multitask structural learning. IEEE Trans Cybern 45(6):1194–1208
Liu J, Kuipers B, Savarese S (2011) Recognizing human actions by attributes. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Colorado Springs, CO
Lopez-Paz D, Bottou L, Schölkopf B, Vapnik V (2016) Unifying distillation and privileged information. In: Proceedings of the 5th international conference on learning representations, San Juan, Puerto Rico
Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110
Luo Z, Hsieh JT, Jiang L, Carlos Niebles J, Fei-Fei L (2018) Graph distillation for action detection with privileged modalities. In: Proceedings of the European conference on computer vision, Munich, Germany
Luvizon DC, Picard D, Tabia H (2018) 2D/3D pose estimation and action recognition using multitask deep learning. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Salt Lake City, UT
Marín-Jiménez MJ, Muñoz-Salinas R, Yeguas-Bolivar E, de la Blanca NP (2014) Human interaction categorization by using audio-visual cues. Mach Vis Appl 25(1):71–84
Nocedal J, Wright SJ (2006) Numerical optimization. Springer series in operations research and financial engineering, 2nd edn. Springer, New York
Palatucci M, Pomerleau D, Hinton GE, Mitchell TM (2009) Zero-shot learning with semantic output codes. In: Proceedings of the advances in neural information processing systems, Vancouver, British Columbia, Canada
Patron-Perez A, Marszalek M, Reid I, Zisserman A (2012) Structured learning of human interactions in TV shows. IEEE Trans Pattern Anal Mach Intell 34(12):2441–2453
Pechyony D, Vapnik V (2010) On the theory of learning with privileged information. In: Proceedings of the annual conference on neural information processing systems, Vancouver, British Columbia, Canada
Peel D, McLachlan GJ (2000) Robust mixture modelling using the t distribution. Stat Comput 10:339–348
Perrett T, Damen D (2019) DDLSTM: dual-domain LSTM for cross-dataset action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Long Beach, CA
Quattoni A, Wang S, Morency LP, Collins M, Darrell T (2007) Hidden conditional random fields. IEEE Trans Pattern Anal Mach Intell 29(10):1848–1852
Rabiner L, Juang BH (1993) Fundamentals of speech recognition. Prentice-Hall, Upper Saddle River
Ramanathan V, Liang P, Fei-Fei L (2013) Video event understanding using natural language descriptions. In: Proceedings of the IEEE international conference on computer vision, Sydney, Australia
Ramanathan V, Yao B, Fei-Fei L (2013) Social role discovery in human events. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Portland, OR
Serra-Toro C, Traver VJ, Pla F (2014) Exploring some practical issues of svm+: is really privileged information that helps? Pattern Recognit Lett 42:40–46
Shao J, Kang K, Loy CC, Wang X (2015) Deeply learned attributes for crowded scene understanding. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Boston, MA
Sharmanska V, Quadrianto N, Lampert CH (2013) Learning to rank using privileged information. In: Proceedings of the IEEE international conference on computer vision, Sydney, Australia
Smailis C, Vrigkas M, Kakadiaris IA (2019) Recaspia: recognizing carrying actions in single images using privileged information. In: Proceedings of the 26th IEEE international conference on image processing, Taipei, Taiwan, pp 26–30
Smeulders AWM, Chu DM, Cucchiara R, Calderara S, Dehghan A, Shah M (2014) Visual tracking: an experimental survey. IEEE Trans Pattern Anal Mach Intell 36(7):1–1
Teo CH, Smola AJ, Vishwanathan SVN, Le QV (2007) A scalable modular convex solver for regularized risk minimization. In: Proceedings of the ACM international conference on knowledge discovery and data mining, San Jose, CA
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE international conference on computer vision, Santiago, Chile, pp 4489–4497
Tsai YHH, Divvala S, Morency LP, Salakhutdinov R, Farhadi A (2019) Video relationship reasoning using gated spatio-temporal energy graph. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Long Beach, CA
Vapnik V, Vashist A (2009) A new learning paradigm: learning using privileged information. Neural Netw 22(5–6):544–557
Vrigkas M, Kazakos E, Nikou C, Kakadiaris IA (2017) Inferring human activities using robust privileged probabilistic learning. In: Proceedings of the IEEE international conference on computer vision workshops, Venice, Italy
Vrigkas M, Mastora E, Nikou C, Kakadiaris IA (2018) Robust incremental hidden conditional random fields for human action recognition. In: Proceedings of the 13th international symposium on visual computing, Las Vegas, NV, pp 126–136
Vrigkas M, Nikou C, Kakadiaris IA (2014) Classifying behavioral attributes using conditional random fields. In: Proceedings of the 8th hellenic conference on artificial intelligence, lecture notes in computer science, Ioannina, Greece, vol 8445
Vrigkas M, Nikou C, Kakadiaris IA (2015) A review of human activity recognition methods. Front Robot Artif Intell 2(28):1–26. https://doi.org/10.3389/frobt.2015.00028
Vrigkas M, Nikou C, Kakadiaris IA (2016) Active privileged learning of human activities from weakly labeled samples. In: Proceedings of the 23rd IEEE international conference on image processing, Phoenix, AZ
Vrigkas M, Nikou C, Kakadiaris IA (2016) Exploiting privileged information for facial expression recognition. In: Proceedings of the IEEE international conference on biometrics, Halmstad, Sweden
Vrigkas M, Nikou C, Kakadiaris IA (2017) Identifying human behaviors using synchronized audio-visual cues. IEEE Trans Affect Comput 8(1):54–66
Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, Sydney, Australia
Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Boston, MA
Wang S, He M, Zhu Y, He S, Liu Y, Ji Q (2015) Learning with privileged information using Bayesian networks. Front Comput Sci 9(2):185–199
Wang Y, Mori G (2011) Hidden part models for human action recognition: probabilistic versus max margin. IEEE Trans Pattern Anal Mach Intell 33(7):1310–1323
Wang Z, Gao T, Ji Q (2014) Learning with hidden information using a max-margin latent variable model. In: Proceedings of the international conference on pattern recognition, Stockholm, Sweden
Wang Z, Ji Q (2015) Classifier learning with hidden information. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Boston, MA
Yan A, Wang Y, Li Z, Qiao Y (2019) PA3D: pose-action 3D machine for video recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Long Beach, CA
Yuan S, Stenger B, Kim TK (2019) 3D hand pose estimation from RGB using privileged learning with depth data. In: Proceedings of the IEEE/CVF international conference on computer vision workshops, Seoul, Korea
Yun K, Honorio J, Chattopadhyay D, Berg TL, Samaras D (2012) Two-person interaction detection using body-pose features and multiple instance learning. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition workshops, Rhode Island
Zhu W, Hu J, Sun G, Cao X, Qiao Y (2016) A key volume mining deep framework for action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Las Vegas, NV
Zhu Y, Long Y, Guan Y, Newsam S, Shao L (2018) Towards universal representation for unseen action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Salt Lake City, UT
Acknowledgements
The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of the sponsors.
Funding
This work has been co-funded by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH-CREATE-INNOVATE (Project Code: T1EDK-04517) and by the UH Hugh Roy and Lillie Cranz Cullen Endowment Fund.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix: Conditional distribution of the privileged information
Recall that \(\mathbf {x} \in \mathbb {R}^{M_{\mathbf {x}} \times T}\) is an observation sequence of length T and \(\mathbf {x}^{*} \in \mathbb {R}^{M_{\mathbf {x}^{*}} \times T}\) is the corresponding privileged information of the same length. We partition the stacked vector \(\left( \mathbf {x}^{*}, \mathbf {x}\right) ^{T} \in \mathbb {R}^{M \times T}\), with \(M = M_{\mathbf {x}^{*}} + M_{\mathbf {x}}\), into two disjoint subsets: \(\mathbf {x}^{*}\) forms the first \(M_{{\mathbf {x}}^{*}}\) components and \(\mathbf {x}\) comprises the remaining \(M - M_{\mathbf {x}^{*}} = M_{\mathbf {x}}\) components. Suppose the joint distribution \(p(\mathbf {x}^{*},\mathbf {x};\mathbf {w})\) follows a Student’s t-law with mean vector \(\mu =\left( \mu _{\mathbf {x}^{*}}, \mu _{\mathbf {x}}\right) ^{T}\), a real, symmetric, positive definite \(M \times M\) covariance matrix \(\Sigma = \begin{pmatrix} \Sigma _{\mathbf {x}^{*}\mathbf {x}^{*}} & \Sigma _{\mathbf {x}^{*}\mathbf {x}} \\ \Sigma _{\mathbf {x}\mathbf {x}^{*}} & \Sigma _{\mathbf {x}\mathbf {x}} \end{pmatrix}\), and \(\nu \in (0, \infty )\) degrees of freedom [28]. Then the conditional distribution \(p(\mathbf {x}^{*}|\mathbf {x};\mathbf {w})\) is also a Student’s t-distribution.
The mean \(\mu ^{*}\), the covariance matrix \(\Sigma ^{*}\), and the degrees of freedom \(\nu ^{*}\) of the conditional distribution \(p(\mathbf {x}^{*}|\mathbf {x};\mathbf {w})\) are computed from the corresponding blocks of \(\mu \) and \(\Sigma \).
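For reference, conditioning a jointly Student’s t-distributed vector on one of its blocks has a standard closed form (see, e.g., [28]). Written in the block notation of this appendix, with \(\mathbf {x}\) denoting a single observed column and \(\Sigma \) read as the scale matrix of the joint density, it can be stated as follows. The shorthand \(\mathrm {St}(\cdot )\) for the Student’s t-density is introduced here for compactness, and the correspondence with the numbered equations cited in the text is an assumption.

\[
\begin{aligned}
p(\mathbf {x}^{*}|\mathbf {x};\mathbf {w}) &= \mathrm {St}\left( \mathbf {x}^{*};\, \mu ^{*}, \Sigma ^{*}, \nu ^{*}\right) , \\
\mu ^{*} &= \mu _{\mathbf {x}^{*}} + \Sigma _{\mathbf {x}^{*}\mathbf {x}}\, \Sigma _{\mathbf {x}\mathbf {x}}^{-1} \left( \mathbf {x} - \mu _{\mathbf {x}}\right) , \\
\Sigma ^{*} &= \frac{\nu + \left( \mathbf {x} - \mu _{\mathbf {x}}\right) ^{T} \Sigma _{\mathbf {x}\mathbf {x}}^{-1} \left( \mathbf {x} - \mu _{\mathbf {x}}\right) }{\nu + M_{\mathbf {x}}} \left( \Sigma _{\mathbf {x}^{*}\mathbf {x}^{*}} - \Sigma _{\mathbf {x}^{*}\mathbf {x}}\, \Sigma _{\mathbf {x}\mathbf {x}}^{-1}\, \Sigma _{\mathbf {x}\mathbf {x}^{*}}\right) , \\
\nu ^{*} &= \nu + M_{\mathbf {x}}.
\end{aligned}
\]

The conditional mean and the Schur complement are the same as in the Gaussian case. The only differences are the data-dependent scaling of \(\Sigma ^{*}\) by the Mahalanobis distance of \(\mathbf {x}\) and the increase of the degrees of freedom by \(M_{\mathbf {x}}\).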
The parameters \((\mu ,\Sigma ,\nu )\) of the joint Student’s t-distribution \(p(\mathbf {x}^{*},\mathbf {x};\mathbf {w})\), which are defined by the corresponding partition of the vector \(\left( \mathbf {x}^{*}, \mathbf {x}\right) ^{T}\), are estimated using the expectation-maximization (EM) algorithm [28]. Then, the parameters of the conditional distribution \(p(\mathbf {x}^{*}|\mathbf {x};\mathbf {w})\) are computed using Eqs. (25)–(27).
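As an illustration of this second step, the following minimal sketch computes the conditional parameters from already-estimated joint parameters using the standard conditioning result for the multivariate Student’s t-distribution. It is not the authors’ implementation: the function name `conditional_student_t`, its argument names, and the use of NumPy are assumptions made for the example, and \(\Sigma \) is treated as the scale matrix of the joint density.

```python
import numpy as np


def conditional_student_t(mu, sigma, nu, m_priv, x):
    """Parameters (mu*, Sigma*, nu*) of p(x* | x) for a joint Student's t.

    Hypothetical helper for illustration only.
    mu     : (M,) joint mean, privileged block x* first, then x.
    sigma  : (M, M) joint scale matrix with the same block ordering.
    nu     : degrees of freedom of the joint distribution.
    m_priv : number of privileged components (size of the x* block).
    x      : (M - m_priv,) observed regular feature vector.
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    x = np.asarray(x, dtype=float)

    # Split the joint parameters into the blocks of the partition (x*, x).
    mu_p, mu_x = mu[:m_priv], mu[m_priv:]
    s_pp = sigma[:m_priv, :m_priv]
    s_px = sigma[:m_priv, m_priv:]
    s_xx = sigma[m_priv:, m_priv:]
    m_x = mu_x.size

    diff = x - mu_x
    # Solve linear systems instead of forming the explicit inverse of s_xx.
    sol_diff = np.linalg.solve(s_xx, diff)      # Sigma_xx^{-1} (x - mu_x)
    sol_cross = np.linalg.solve(s_xx, s_px.T)   # Sigma_xx^{-1} Sigma_{x x*}

    mu_star = mu_p + s_px @ sol_diff
    maha = float(diff @ sol_diff)               # squared Mahalanobis distance of x
    schur = s_pp - s_px @ sol_cross             # Schur complement of Sigma_xx
    sigma_star = (nu + maha) / (nu + m_x) * schur
    nu_star = nu + m_x
    return mu_star, sigma_star, nu_star
```

For a sequence, the function would be applied to each column of \(\mathbf {x}\) in turn. Solving linear systems with `np.linalg.solve` avoids an explicit matrix inverse, which is usually the more numerically stable choice.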
It is worth noting that, by letting the degrees of freedom \(\nu ^{*}\) go to infinity, we recover the Gaussian distribution with the same parameters. If the data contain outliers, the degrees of freedom parameter \(\nu ^{*}\) is small, and the mean and covariance of the data are weighted so that the outliers contribute little to their estimates.
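The weighting effect can be sketched with the textbook EM updates for a multivariate Student’s t-distribution in which the degrees of freedom are held fixed. This is a simplification made for brevity and an assumption of the example, since the procedure described in this appendix also estimates \(\nu \). Each sample receives the weight \(u_{i} = (\nu + M)/(\nu + \delta _{i})\), where \(\delta _{i}\) is its squared Mahalanobis distance, so distant samples are down-weighted and a very large \(\nu \) gives weights close to one, which is the Gaussian case. The function name `fit_student_t_fixed_nu` is hypothetical.

```python
import numpy as np


def fit_student_t_fixed_nu(data, nu, n_iter=50):
    """EM estimate of the mean and scale matrix of a Student's t with fixed nu.

    Hypothetical helper for illustration only.
    data : (N, M) array with one sample per row.
    nu   : fixed degrees of freedom (not estimated in this sketch).
    Returns the mean, the scale matrix, and the weights of the last E-step.
    """
    data = np.asarray(data, dtype=float)
    n, m = data.shape

    # Initialise with the sample mean and a slightly regularised covariance.
    mu = data.mean(axis=0)
    sigma = np.cov(data, rowvar=False).reshape(m, m) + 1e-6 * np.eye(m)

    for _ in range(n_iter):
        # E-step: squared Mahalanobis distances and per-sample weights.
        diff = data - mu
        delta = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
        u = (nu + m) / (nu + delta)

        # M-step: weighted mean and weighted scatter matrix.
        mu = (u[:, None] * data).sum(axis=0) / u.sum()
        diff = data - mu
        sigma = (u[:, None] * diff).T @ diff / n

    return mu, sigma, u
```

With a small `nu`, a sample that lies far from the bulk of the data receives a weight well below that of the remaining samples. With a very large `nu`, the estimates approach the ordinary sample mean and the maximum-likelihood sample covariance.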
Cite this article
Vrigkas, M., Kazakos, E., Nikou, C. et al. Human activity recognition using robust adaptive privileged probabilistic learning. Pattern Anal Applic 24, 915–932 (2021). https://doi.org/10.1007/s10044-020-00953-x