
JSE: Joint Semantic Encoder for zero-shot gesture learning

Original Article
Pattern Analysis and Applications

Abstract

Zero-shot learning (ZSL) is a transfer learning paradigm that aims to recognize unseen categories given only a high-level description of them. While deep learning has greatly pushed the limits of ZSL for object classification, ZSL for gesture recognition (ZSGL) remains largely unexplored. Previous attempts to address ZSGL focused on the creation of gesture attributes and on algorithmic improvements, and there is little or no research concerned with feature selection for ZSGL. Deep learning has largely obviated the need for feature engineering in problems with large datasets; however, when data are scarce, it is critical to leverage domain information to create discriminative input features. The main goal of this work is to study the effect of three feature extraction techniques (velocity, heuristic and latent features) on ZSGL performance. In addition, we propose a bilinear auto-encoder approach, referred to as the Joint Semantic Encoder (JSE), that jointly minimizes the reconstruction, semantic and classification losses. We conducted extensive experiments to compare and contrast the feature extraction techniques and to evaluate the performance of JSE against existing ZSL methods. In the attribute-based classification scenario, irrespective of the feature type, JSE outperforms other approaches by 5% (p<0.01). When trained with heuristic features in the across-category condition, JSE significantly outperforms other methods by 5% (p<0.01).
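To make the abstract's description concrete, the sketch below shows one way a bilinear semantic encoder could jointly minimize the three losses named above. This is a minimal PyTorch illustration, not the authors' released implementation: the single projection matrix W (with its transpose reused as the decoder, in the style of semantic auto-encoders), the loss weights lam_sem and lam_cls, and the compatibility-based classification term are all assumptions made for exposition.

```python
# Hypothetical sketch (not the paper's code) of a joint semantic encoder that
# combines reconstruction, semantic and classification losses, assuming a
# single bilinear projection W whose transpose acts as the decoder.
import torch
import torch.nn.functional as F


class JointSemanticEncoder(torch.nn.Module):
    def __init__(self, feat_dim: int, attr_dim: int):
        super().__init__()
        # One projection matrix: W maps features to attributes, W.T maps back.
        self.W = torch.nn.Parameter(0.01 * torch.randn(attr_dim, feat_dim))

    def forward(self, x):
        s_hat = x @ self.W.T    # encode: gesture features -> attribute space
        x_hat = s_hat @ self.W  # decode: attributes -> feature space
        return s_hat, x_hat


def jse_loss(model, x, s, class_attrs, labels, lam_sem=1.0, lam_cls=1.0):
    """x: (B, feat_dim) gesture features; s: (B, attr_dim) target attributes;
    class_attrs: (C, attr_dim) per-class attribute signatures; labels: (B,)."""
    s_hat, x_hat = model(x)
    loss_rec = F.mse_loss(x_hat, x)   # reconstruction loss
    loss_sem = F.mse_loss(s_hat, s)   # semantic (attribute regression) loss
    # Classification loss: compatibility of the predicted attributes with
    # each seen class's attribute signature, treated as logits.
    logits = s_hat @ class_attrs.T    # shape (B, C)
    loss_cls = F.cross_entropy(logits, labels)
    return loss_rec + lam_sem * loss_sem + lam_cls * loss_cls
```

At test time, a standard zero-shot protocol would project an unseen gesture into attribute space with the learned W and assign it to the unseen class whose attribute signature is nearest, e.g., by cosine or Euclidean distance.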



Funding

This work is supported by the Agency for Healthcare Research and Quality (AHRQ) and the National Institutes of Health (NIH) under Project No. 1R18HS024887-01. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the NIH.

Author information


Corresponding author

Correspondence to Juan Wachs.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Informed consent

As this article does not involve human participants, informed consent was not required.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Madapana, N., Wachs, J. JSE: Joint Semantic Encoder for zero-shot gesture learning. Pattern Anal Applic 25, 679–692 (2022). https://doi.org/10.1007/s10044-021-00992-y

