Abstract
Zero-shot learning (ZSL) is a transfer learning paradigm that aims to recognize unseen categories solely from a high-level description of them. While deep learning has greatly pushed the limits of ZSL for object classification, ZSL for gesture recognition (ZSGL) remains largely unexplored. Previous attempts to address ZSGL focused on the creation of gesture attributes and on algorithmic improvements; there is little or no research concerned with feature selection for ZSGL. Deep learning has largely obviated the need for feature engineering on problems with large datasets, but when data are scarce, it is critical to leverage domain knowledge to create discriminative input features. The main goal of this work is to study the effect of three different feature extraction techniques (velocity, heuristical and latent features) on the performance of ZSGL. In addition, we propose a bilinear auto-encoder approach, referred to as the Joint Semantic Encoder (JSE), for ZSGL that jointly minimizes the reconstruction, semantic and classification losses. We conducted extensive experiments to compare and contrast the feature extraction techniques and to evaluate the performance of JSE with respect to existing ZSL methods. In the attribute-based classification scenario, irrespective of the feature type, JSE outperformed the other approaches by 5% (p<0.01). When trained with heuristical features in the across-category condition, JSE again significantly outperformed the other methods by 5% (p<0.01).
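The abstract describes JSE as a bilinear auto-encoder that jointly minimizes three losses. The following is a minimal numpy sketch of such a three-term objective, not the paper's exact formulation: the dimensions, the tied-weight decoder, and the softmax classifier over class attribute prototypes are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, n, c = 16, 5, 8, 3                # feature dim, attribute dim, batch size, classes
X = rng.normal(size=(n, d))             # gesture feature vectors (hypothetical)
S = rng.normal(size=(n, k))             # per-sample semantic (attribute) targets
y = rng.integers(0, c, size=n)          # class labels
P = rng.normal(size=(c, k))             # class attribute prototypes

W = rng.normal(size=(d, k)) * 0.1       # bilinear encoder: feature space -> semantic space


def joint_loss(W, lam_sem=1.0, lam_cls=1.0):
    """Sum of reconstruction, semantic, and classification losses."""
    Z = X @ W                                      # encode into semantic space
    X_hat = Z @ W.T                                # decode with tied weights (assumption)
    rec = np.mean((X - X_hat) ** 2)                # reconstruction loss
    sem = np.mean((Z - S) ** 2)                    # semantic regression loss
    logits = Z @ P.T                               # similarity to class prototypes
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    cls = -np.mean(np.log(probs[np.arange(n), y]))  # cross-entropy classification loss
    return rec + lam_sem * sem + lam_cls * cls


print(joint_loss(W))
```

At test time, a zero-shot prediction would assign an unseen gesture to the nearest unseen-class attribute prototype in the semantic space; the weights `lam_sem` and `lam_cls` trading off the three terms are hyperparameters in this sketch.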
Funding
This work is supported by the Agency for Healthcare Research and Quality (AHRQ), National Institutes of Health (NIH), under Project No. 1R18HS024887-01. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the NIH.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent
As this article does not involve human participants, informed consent was not required.
About this article
Cite this article
Madapana, N., Wachs, J. JSE: Joint Semantic Encoder for zero-shot gesture learning. Pattern Anal Applic 25, 679–692 (2022). https://doi.org/10.1007/s10044-021-00992-y