
Human activity recognition using robust adaptive privileged probabilistic learning

  • Theoretical Advances
Pattern Analysis and Applications

Abstract

In this work, a supervised probabilistic approach is proposed that integrates the learning using privileged information (LUPI) paradigm into a hidden conditional random field (HCRF) model, called HCRF+, for human action recognition. The proposed model employs a self-training technique for automatic estimation of the regularization parameters of the objective function. Moreover, the method provides robustness to outliers by modeling the conditional distribution of the privileged information by a Student’s t-density function, which is naturally integrated into the HCRF+ framework. The proposed method was evaluated using different forms of privileged information on four publicly available datasets. The experimental results demonstrate its effectiveness compared with state-of-the-art methods in the LUPI framework, using both hand-crafted features and deep learning-based features extracted from a convolutional neural network.



References

  1. Bilen H, Fernando B, Gavves E, Vedaldi A, Gould S (2016) Dynamic image networks for action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Las Vegas, NV

  2. Bishop CM (2006) Pattern recognition and machine learning. Springer, Berlin


  3. Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Honolulu, Hawaii

  4. Choutas V, Weinzaepfel P, Revaud J, Schmid C (2018) PoTion: pose motion representation for action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Salt Lake City, UT

  5. Cohen I, Cozman FG, Sebe N, Cirelo MC, Huang TS (2004) Semisupervised learning of classifiers: theory, algorithms, and their application to human–computer interaction. IEEE Trans Pattern Anal Mach Intell 26(12):1553–1566


  6. Crasto N, Weinzaepfel P, Alahari K, Schmid C (2019) Mars: motion-augmented RGB stream for action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Long Beach, CA

  7. De Geest R, Tuytelaars T (2018) Modeling temporal structure with LSTM for online action detection. In: Proceedings of the IEEE winter conference on applications of computer vision, Lake Tahoe, NV/CA

  8. Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Boston, MA

  9. Feichtenhofer C, Pinz A, Wildes RP, Zisserman A (2018) What have we learned from deep representations for action recognition? In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Salt Lake City, UT

  10. Feichtenhofer C, Pinz A, Zisserman A (2016) Convolutional two-stream network fusion for video action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Las Vegas, NV

  11. Fouad S, Tino P, Raychaudhury S, Schneider P (2013) Incorporating privileged information through metric learning. IEEE Trans Neural Netw Learn Syst 24(7):1086–1098


  12. Fu Y, Hospedales TM, Xiang T, Gong S (2012) Attribute learning for understanding unstructured social activity. In: Proceedings of the 12th European conference on computer vision, lecture notes in computer science, Florence, Italy, vol 7575

  13. Gao Z, Li S, Zhu Y, Wang C, Zhang H (2017) Collaborative sparse representation leaning model for RGBD action recognition. J Vis Commun Image Represent 48:442–452


  14. Gao Z, Xuan H, Zhang H, Wan S, Choo KR (2019) Adaptive fusion and category-level dictionary learning model for multiview human action recognition. IEEE Internet Things J 6(6):9280–9293


  15. Garcia NC, Morerio P, Murino V (2018) Modality distillation with multiple stream networks for action recognition. In: Proceedings of the European conference on computer vision, Munich, Germany

  16. Guadarrama S, Krishnamoorthy N, Malkarnenkar G, Venugopalan S, Mooney RJ, Darrell T, Saenko K (2013) Youtube2text: recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In: Proceedings of the IEEE international conference on computer vision, Sydney, Australia

  17. Hardoon DR, Szedmak SR, Shawe-Taylor JR (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664


  18. Hastie T, Rosset S, Tibshirani R, Zhu J (2004) The entire regularization path for the support vector machine. J Mach Learn Res 5:1391–1415


  19. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Las Vegas, NV, pp 770–778

  20. Hoai M, Zisserman A (2014) Talking heads: detecting humans and recognizing their interactions. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Columbus, OH

  21. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780


  22. Jin L, Li Z, Tang J (2020) Deep semantic multimodal hashing network for scalable image-text and video-text retrievals. IEEE Trans Neural Netw Learn Syst. https://doi.org/10.1109/TNNLS.2020.2997020


  23. Kakadiaris I, Sarafianos N, Nikou C (2016) Show me your body: gender classification from still images. In: Proceedings of the IEEE international conference on image processing, Phoenix, AZ

  24. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Columbus, OH

  25. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. CoRR abs/1412.6980

  26. Kläser A, Marszałek M, Schmid C (2008) A spatio-temporal descriptor based on 3D-gradients. In: Proceedings of the British machine vision conference. University of Leeds, Leeds, UK

  27. Komodakis N, Tziritas G (2007) Image completion using efficient belief propagation via priority scheduling and dynamic pruning. IEEE Trans Image Process 16(11):2649–2661


  28. Kotz S, Nadarajah S (2004) Multivariate t distributions and their applications. Cambridge University Press, Cambridge


  29. Laptev I (2005) On space-time interest points. Int J Comput Vis 64(2–3):107–123


  30. Li C, Zhong Q, Xie D, Pu S (2019) Collaborative spatiotemporal feature learning for video action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Long Beach, CA

  31. Li Y, Li Y, Vasconcelos N (2018) RESOUND: towards action recognition without representation bias. In: Proceedings of the European conference on computer vision, Munich, Germany

  32. Liu A, Su Y, Jia P, Gao Z, Hao T, Yang Z (2015) Multiple/single-view human action recognition via part-induced multitask structural learning. IEEE Trans Cybern 45(6):1194–1208


  33. Liu J, Kuipers B, Savarese S (2011) Recognizing human actions by attributes. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Colorado Springs, CO

  34. Lopez-Paz D, Bottou L, Schölkopf B, Vapnik V (2016) Unifying distillation and privileged information. In: Proceedings of the 5th international conference on learning representations, San Juan, Puerto Rico

  35. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110


  36. Luo Z, Hsieh JT, Jiang L, Carlos Niebles J, Fei-Fei L (2018) Graph distillation for action detection with privileged modalities. In: Proceedings of the European conference on computer vision, Munich, Germany

  37. Luvizon DC, Picard D, Tabia H (2018) 2D/3D pose estimation and action recognition using multitask deep learning. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Salt Lake City, UT

  38. Marín-Jiménez MJ, Muñoz-Salinas R, Yeguas-Bolivar E, de la Blanca NP (2014) Human interaction categorization by using audio-visual cues. Mach Vis Appl 25(1):71–84

  39. Nocedal J, Wright SJ (2006) Numerical optimization. Springer series in operations research and financial engineering, 2nd edn. Springer, New York


  40. Palatucci M, Pomerleau D, Hinton GE, Mitchell TM (2009) Zero-shot learning with semantic output codes. In: Proceedings of the advances in neural information processing systems, Vancouver, British Columbia, Canada

  41. Patron-Perez A, Marszalek M, Reid I, Zisserman A (2012) Structured learning of human interactions in TV shows. IEEE Trans Pattern Anal Mach Intell 34(12):2441–2453


  42. Pechyony D, Vapnik V (2010) On the theory of learning with privileged information. In: Proceedings of the annual conference on neural information processing systems, Vancouver, British Columbia, Canada

  43. Peel D, Mclachlan GJ (2000) Robust mixture modelling using the t distribution. Stat Comput 10:339–348


  44. Perrett T, Damen D (2019) DDLSTM: dual-domain LSTM for cross-dataset action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Long Beach, CA

  45. Quattoni A, Wang S, Morency LP, Collins M, Darrell T (2007) Hidden conditional random fields. IEEE Trans Pattern Anal Mach Intell 29(10):1848–1852


  46. Rabiner L, Juang BH (1993) Fundamentals of speech recognition. Prentice-Hall, Upper Saddle River


  47. Ramanathan V, Liang P, Fei-Fei L (2013) Video event understanding using natural language descriptions. In: Proceedings of the IEEE international conference on computer vision, Sydney, Australia

  48. Ramanathan V, Yao B, Fei-Fei L (2013) Social role discovery in human events. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Portland, OR

  49. Serra-Toro C, Traver VJ, Pla F (2014) Exploring some practical issues of svm+: is really privileged information that helps? Pattern Recognit Lett 42:40–46


  50. Shao J, Kang K, Loy CC, Wang X (2015) Deeply learned attributes for crowded scene understanding. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Boston, MA

  51. Sharmanska V, Quadrianto N, Lampert CH (2013) Learning to rank using privileged information. In: Proceedings of the IEEE international conference on computer vision, Sydney, Australia

  52. Smailis C, Vrigkas M, Kakadiaris IA (2019) Recaspia: recognizing carrying actions in single images using privileged information. In: Proceedings of the 26th IEEE international conference on image processing, Taipei, Taiwan, pp 26–30

  53. Smeulders AWM, Chu DM, Cucchiara R, Calderara S, Dehghan A, Shah M (2014) Visual tracking: an experimental survey. IEEE Trans Pattern Anal Mach Intell 36(7):1–1


  54. Teo CH, Smola AJ, Vishwanathan SVN, Le QV (2007) A scalable modular convex solver for regularized risk minimization. In: Proceedings of the ACM international conference on knowledge discovery and data mining, San Jose, CA

  55. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3D convolutional networks. In: Proceedings of the IEEE international conference on computer vision, Santiago, Chile, pp 4489–4497

  56. Tsai YHH, Divvala S, Morency LP, Salakhutdinov R, Farhadi A (2019) Video relationship reasoning using gated spatio-temporal energy graph. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Long Beach, CA

  57. Vapnik V, Vashist A (2009) A new learning paradigm: learning using privileged information. Neural Netw 22(5–6):544–557


  58. Vrigkas M, Kazakos E, Nikou C, Kakadiaris IA (2017) Inferring human activities using robust privileged probabilistic learning. In: Proceedings of the IEEE international conference on computer vision workshops, Venice, Italy

  59. Vrigkas M, Mastora E, Nikou C, Kakadiaris IA (2018) Robust incremental hidden conditional random fields for human action recognition. In: Proceedings of the 13th international symposium on visual computing, Las Vegas, NV, pp 126–136

  60. Vrigkas M, Nikou C, Kakadiaris IA (2014) Classifying behavioral attributes using conditional random fields. In: Proceedings of the 8th hellenic conference on artificial intelligence, lecture notes in computer science, Ioannina, Greece, vol 8445

  61. Vrigkas M, Nikou C, Kakadiaris IA (2015) A review of human activity recognition methods. Front Robot Artif Intell 2(28):1–26. https://doi.org/10.3389/frobt.2015.00028


  62. Vrigkas M, Nikou C, Kakadiaris IA (2016) Active privileged learning of human activities from weakly labeled samples. In: Proceedings of the 23rd IEEE international conference on image processing, Phoenix, AZ

  63. Vrigkas M, Nikou C, Kakadiaris IA (2016) Exploiting privileged information for facial expression recognition. In: Proceedings of the IEEE international conference on biometrics, Halmstad, Sweden

  64. Vrigkas M, Nikou C, Kakadiaris IA (2017) Identifying human behaviors using synchronized audio-visual cues. IEEE Trans Affect Comput 8(1):54–66


  65. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings of the IEEE international conference on computer vision, Sydney, Australia

  66. Wang L, Qiao Y, Tang X (2015) Action recognition with trajectory-pooled deep-convolutional descriptors. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Boston, MA

  67. Wang S, He M, Zhu Y, He S, Liu Y, Ji Q (2015) Learning with privileged information using Bayesian networks. Front Comput Sci 9(2):185–199


  68. Wang Y, Mori G (2011) Hidden part models for human action recognition: probabilistic versus max margin. IEEE Trans Pattern Anal Mach Intell 33(7):1310–1323


  69. Wang Z, Gao T, Ji Q (2014) Learning with hidden information using a max-margin latent variable model. In: Proceedings of the international conference on pattern recognition, Stockholm, Sweden

  70. Wang Z, Ji Q (2015) Classifier learning with hidden information. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Boston, MA

  71. Yan A, Wang Y, Li Z, Qiao Y (2019) PA3D: pose-action 3D machine for video recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Long Beach, CA

  72. Yuan S, Stenger B, Kim TK (2019) 3D hand pose estimation from RGB using privileged learning with depth data. In: Proceedings of the IEEE/CVF international conference on computer vision workshops, Seoul, Korea

  73. Yun K, Honorio J, Chattopadhyay D, Berg TL, Samaras D (2012) Two-person interaction detection using body-pose features and multiple instance learning. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition workshops, Rhode Island

  74. Zhu W, Hu J, Sun G, Cao X, Qiao Y (2016) A key volume mining deep framework for action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Las Vegas, NV

  75. Zhu Y, Long Y, Guan Y, Newsam S, Shao L (2018) Towards universal representation for unseen action recognition. In: Proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, Salt Lake City, UT


Acknowledgements

The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. All statements of fact, opinion or conclusions contained herein are those of the authors and should not be construed as representing the official views or policies of the sponsors.

Funding

This work has been co-funded by the European Union and Greek national funds through the Operational Program Competitiveness, Entrepreneurship and Innovation, under the call RESEARCH-CREATE-INNOVATE (Project Code: T1EDK-04517) and by the UH Hugh Roy and Lillie Cranz Cullen Endowment Fund.

Author information


Corresponding author

Correspondence to Michalis Vrigkas.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Conditional distribution of the privileged information


Recall that \(\mathbf{x} \in \mathbb{R}^{M_{\mathbf{x}} \times T}\) is an observation sequence of length T and \(\mathbf{x}^{*} \in \mathbb{R}^{M_{\mathbf{x}^{*}} \times T}\) is the corresponding privileged information of the same length. We partition the stacked vector \(\left( \mathbf{x}^{*}, \mathbf{x}\right)^{T} \in \mathbb{R}^{M \times T}\), with \(M = M_{\mathbf{x}^{*}} + M_{\mathbf{x}}\), into two disjoint subsets, where \(\mathbf{x}^{*}\) forms the first \(M_{\mathbf{x}^{*}}\) components and \(\mathbf{x}\) comprises the remaining \(M - M_{\mathbf{x}^{*}} = M_{\mathbf{x}}\) components. If the joint distribution \(p(\mathbf{x}^{*},\mathbf{x};\mathbf{w})\) follows a Student’s t-law with mean vector \(\mu = \left( \mu_{\mathbf{x}^{*}}, \mu_{\mathbf{x}}\right)^{T}\), a real, symmetric, and positive definite \(M \times M\) covariance matrix \(\Sigma = \begin{pmatrix} \Sigma_{\mathbf{x}^{*}\mathbf{x}^{*}} & \Sigma_{\mathbf{x}^{*}\mathbf{x}} \\ \Sigma_{\mathbf{x}\mathbf{x}^{*}} & \Sigma_{\mathbf{x}\mathbf{x}} \end{pmatrix}\), and \(\nu \in (0, \infty)\) degrees of freedom [28], then the conditional distribution \(p(\mathbf{x}^{*}|\mathbf{x};\mathbf{w})\) is also a Student’s t-distribution:

$$\begin{aligned} p(\mathbf{x}^{*}|\mathbf{x};\mathbf{w}) &= \text{St}(\mathbf{x}^{*};\mu^{*},\Sigma^{*},\nu^{*}) \\ &= \frac{ \Gamma\left( \left( \nu + M\right) /2\right) |\Sigma_{\mathbf{x}\mathbf{x}}|^{1/2} }{ \left( \pi \nu\right)^{M_{\mathbf{x}^{*}}/2} \Gamma\left( \left( \nu + M_{\mathbf{x}}\right) /2\right) |\Sigma|^{1/2} } \\ &\quad\quad \times \frac{ \left[ 1+\frac{1}{\nu} \left( \mathbf{x} - \mu_{\mathbf{x}}\right)^{T}\Sigma_{\mathbf{x}\mathbf{x}}^{-1}\left( \mathbf{x} - \mu_{\mathbf{x}}\right) \right]^{\frac{\nu + M_{\mathbf{x}}}{2}} }{ \left[ 1+\frac{1}{\nu} Z^{T}\Sigma^{-1}Z \right]^{\frac{\nu + M}{2}} } \, . \end{aligned}$$
(24)

Here, \(Z = \left( \mathbf{x}^{*}, \mathbf{x}\right)^{T} - \mu\) denotes the centered joint vector, and \(\nu\) and \(\Sigma\) are the degrees of freedom and covariance matrix of the joint distribution, so that the second line is the ratio of the joint density to the marginal density of \(\mathbf{x}\).

The mean \(\mu^{*}\), the covariance matrix \(\Sigma^{*}\), and the degrees of freedom \(\nu^{*}\) of the conditional distribution \(p(\mathbf{x}^{*}|\mathbf{x};\mathbf{w})\) are computed from the corresponding blocks of \(\mu\) and \(\Sigma\):

$$\begin{aligned} \mu^{*} &= \mu_{\mathbf{x}^{*}} + \Sigma_{\mathbf{x}^{*}\mathbf{x}} \Sigma_{\mathbf{x}\mathbf{x}}^{-1} \left( \mathbf{x} - \mu_{\mathbf{x}}\right) \, , \end{aligned}$$
(25)
$$\begin{aligned} \Sigma^{*} &= \frac{\nu + \left( \mathbf{x} - \mu_{\mathbf{x}}\right)^{T} \Sigma_{\mathbf{x}\mathbf{x}}^{-1}\left( \mathbf{x} - \mu_{\mathbf{x}}\right) }{\nu + M_{\mathbf{x}}} \nonumber \\ &\quad\quad \times \left( \Sigma_{\mathbf{x}^{*}\mathbf{x}^{*}} - \Sigma_{\mathbf{x}^{*}\mathbf{x}}\Sigma_{\mathbf{x}\mathbf{x}}^{-1} \Sigma_{\mathbf{x}\mathbf{x}^{*}}\right) \, , \end{aligned}$$
(26)
$$\begin{aligned} \nu^{*} &= \nu + M_{\mathbf{x}} \, . \end{aligned}$$
(27)
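
The conditioning step can be checked numerically. The following is a minimal NumPy sketch, assuming the standard conditioning result for the multivariate Student’s t-distribution [28] and treating a single time step (one column of the sequences). The function names `student_t_logpdf` and `conditional_student_t`, the argument `m_priv` (standing for \(M_{\mathbf{x}^{*}}\)), and the random test values are illustrative assumptions, not part of the original method description. The sketch verifies that the density built from the conditional parameters agrees with the ratio of the joint density to the marginal density of \(\mathbf{x}\).

```python
import math
import numpy as np


def student_t_logpdf(z, mu, sigma, nu):
    """Log-density of a multivariate Student's t with scale matrix sigma."""
    d = mu.shape[0]
    diff = z - mu
    delta = diff @ np.linalg.solve(sigma, diff)  # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(sigma)
    return (math.lgamma((nu + d) / 2) - math.lgamma(nu / 2)
            - 0.5 * d * math.log(nu * math.pi) - 0.5 * logdet
            - 0.5 * (nu + d) * math.log1p(delta / nu))


def conditional_student_t(x, mu, sigma, nu, m_priv):
    """Parameters of p(x* | x) when (x*, x) is jointly Student's t.

    The first m_priv components of mu and sigma belong to x* (assumed layout).
    """
    mu_p, mu_x = mu[:m_priv], mu[m_priv:]
    s_pp = sigma[:m_priv, :m_priv]
    s_px = sigma[:m_priv, m_priv:]
    s_xx = sigma[m_priv:, m_priv:]
    m_x = mu_x.shape[0]
    diff = x - mu_x
    gain = s_px @ np.linalg.inv(s_xx)
    delta_x = diff @ np.linalg.solve(s_xx, diff)
    mu_star = mu_p + gain @ diff
    schur = s_pp - gain @ s_px.T  # Schur complement of the x block
    sigma_star = (nu + delta_x) / (nu + m_x) * schur
    nu_star = nu + m_x
    return mu_star, sigma_star, nu_star


# Illustrative joint parameters (random, for the check only).
rng = np.random.default_rng(0)
m_priv, m_obs = 2, 3
a = rng.normal(size=(m_priv + m_obs, m_priv + m_obs))
sigma = a @ a.T + np.eye(m_priv + m_obs)  # symmetric positive definite
mu = rng.normal(size=m_priv + m_obs)
nu = 5.0
x_priv = rng.normal(size=m_priv)
x_obs = rng.normal(size=m_obs)

mu_star, sigma_star, nu_star = conditional_student_t(x_obs, mu, sigma, nu, m_priv)
log_direct = student_t_logpdf(x_priv, mu_star, sigma_star, nu_star)
log_ratio = (student_t_logpdf(np.concatenate([x_priv, x_obs]), mu, sigma, nu)
             - student_t_logpdf(x_obs, mu[m_priv:], sigma[m_priv:, m_priv:], nu))
print(abs(log_direct - log_ratio))  # should be at floating-point noise level
```

The check relies only on the definition of the t-density, so it is a convenient way to catch a sign or index slip in an implementation of the conditional parameters.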

The parameters \((\mu ,\Sigma ,\nu )\) of the joint Student’s t-distribution \(p(\mathbf {x}^{*},\mathbf {x};\mathbf {w})\), which are defined by the corresponding partition of the vector \(\left( \mathbf {x}^{*}, \mathbf {x}\right) ^{T}\), are estimated using the expectation-maximization (EM) algorithm [28]. Then, the parameters of the conditional distribution \(p(\mathbf {x}^{*}|\mathbf {x};\mathbf {w})\) are computed using Eqs. (25)–(27).
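
The estimation step can be sketched as follows. This is a minimal NumPy illustration of the standard EM updates for a multivariate Student’s t-distribution, under the simplifying assumption that the degrees of freedom \(\nu\) are held fixed rather than estimated; the text does not spell out its exact updates, so the function name `fit_student_t_em`, the default values, and the synthetic data are hypothetical.

```python
import numpy as np


def fit_student_t_em(data, nu=3.0, n_iter=100):
    """EM for the mean and scale matrix of a multivariate Student's t.

    Assumption: nu is fixed; a full implementation would also update it.
    """
    n, dim = data.shape
    mu = data.mean(axis=0)
    sigma = np.cov(data, rowvar=False)
    for _ in range(n_iter):
        # E-step: weights shrink as the Mahalanobis distance grows.
        diff = data - mu
        delta = np.einsum("ij,ij->i", diff @ np.linalg.inv(sigma), diff)
        weights = (nu + dim) / (nu + delta)
        # M-step: weighted mean and weighted scatter matrix.
        mu = weights @ data / weights.sum()
        diff = data - mu
        sigma = (weights[:, None] * diff).T @ diff / n
    return mu, sigma, weights


# Synthetic example: 200 standard-normal inliers plus 10 gross outliers.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2))
outliers = np.full((10, 2), 30.0)
data = np.vstack([inliers, outliers])

mu_t, sigma_t, weights = fit_student_t_em(data)
sample_mean = data.mean(axis=0)
print("sample mean:", sample_mean)
print("Student's t mean:", mu_t)
print("largest outlier weight:", weights[200:].max())
```

In this sketch the outliers receive weights far below those of typical inliers, so the estimated mean stays close to the bulk of the data, whereas the plain sample mean is pulled toward the outliers.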

It is worth noting that as the degrees of freedom \(\nu^{*}\) tend to infinity, the Student’s t-distribution converges to a Gaussian distribution with the same mean and covariance. If the data contain outliers, the estimated degrees of freedom \(\nu^{*}\) are small, the tails of the distribution are heavy, and the observations are weighted so that the outliers contribute little to the estimates of the mean and covariance.
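
The Gaussian limit can be illustrated in one dimension with the Python standard library only. The sketch below compares the log-density of a univariate Student’s t with that of a Gaussian with the same location and scale at a point three standard deviations from the mean; the function names and the chosen values of the degrees of freedom are illustrative assumptions.

```python
import math


def student_t_logpdf_1d(x, mu, var, nu):
    """Log-density of a univariate Student's t with location mu and scale var."""
    delta = (x - mu) ** 2 / var
    return (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
            - 0.5 * math.log(nu * math.pi * var)
            - 0.5 * (nu + 1) * math.log1p(delta / nu))


def gaussian_logpdf_1d(x, mu, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mu) ** 2 / var


# Gap between the two log-densities at x = 3 for increasing degrees of freedom.
nus = (2.0, 20.0, 200.0, 2000.0)
gaps = [abs(student_t_logpdf_1d(3.0, 0.0, 1.0, nu) - gaussian_logpdf_1d(3.0, 0.0, 1.0))
        for nu in nus]
for nu, gap in zip(nus, gaps):
    print(f"nu = {nu:7.1f}  |log St - log N| = {gap:.4f}")
```

The gap shrinks as the degrees of freedom grow, while for small values the t-density assigns noticeably more mass to the distant point, which is the heavy-tail behavior that makes the model tolerant of outliers.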


Cite this article

Vrigkas, M., Kazakos, E., Nikou, C. et al. Human activity recognition using robust adaptive privileged probabilistic learning. Pattern Anal Applic 24, 915–932 (2021). https://doi.org/10.1007/s10044-020-00953-x


  • DOI: https://doi.org/10.1007/s10044-020-00953-x
