Abstract
Mental disorder is a serious public health concern that affects the lives of millions of people throughout the world. Early diagnosis is essential to ensure timely treatment and to improve the well-being of those affected by a mental disorder. In this paper, we present a novel multimodal framework to perform mental disorder recognition from videos. The proposed approach employs a combination of audio, video and textual modalities. Using recurrent neural network architectures, we incorporate temporal information into the learning process and model the dynamic evolution of the features extracted for each patient. For multimodal fusion, we propose an efficient late fusion strategy based on a simple feed-forward neural network that we call the adaptive nonlinear judge classifier. We evaluate the proposed framework on two mental disorder datasets. On both, the experimental results demonstrate that the proposed framework outperforms state-of-the-art approaches. We also study the importance of each modality for mental disorder recognition and draw interesting conclusions about the temporal nature of each modality. Our findings demonstrate that careful consideration of the temporal evolution of each modality is of crucial importance for accurate mental disorder recognition.
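The paper's exact architecture and hyperparameters are not reproduced here, but the late-fusion idea behind the judge classifier can be illustrated with a minimal sketch: each modality (audio, video, text) produces its own class-probability predictions, and a small feed-forward network is trained on the concatenation of those predictions to output the fused decision. All names, layer sizes and initialisations below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class JudgeClassifier:
    """Hypothetical late-fusion 'judge': a one-hidden-layer feed-forward
    net over the concatenated per-modality class probabilities."""

    def __init__(self, n_modalities, n_classes, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        d_in = n_modalities * n_classes
        # Illustrative random initialisation; in practice the weights
        # would be learned on held-out per-modality predictions.
        self.W1 = rng.normal(0.0, 0.1, (d_in, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0.0, 0.1, (hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def forward(self, modality_probs):
        # modality_probs: list of (n_samples, n_classes) arrays,
        # one array per modality.
        x = np.concatenate(modality_probs, axis=1)
        h = relu(x @ self.W1 + self.b1)
        return softmax(h @ self.W2 + self.b2)

# Example: fuse audio, video and text predictions for 4 samples, 3 classes.
audio = softmax(np.random.default_rng(1).normal(size=(4, 3)))
video = softmax(np.random.default_rng(2).normal(size=(4, 3)))
text = softmax(np.random.default_rng(3).normal(size=(4, 3)))

judge = JudgeClassifier(n_modalities=3, n_classes=3)
fused = judge.forward([audio, video, text])
print(fused.shape)  # (4, 3)
```

Because the judge operates only on each modality's output probabilities, any unimodal model can be swapped in or out without retraining the others, which is the usual appeal of late fusion over early (feature-level) fusion.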
Data availability
The Bipolar Disorder Corpus was part of the AVEC 2018 challenge and can be accessed by contacting the authors. The Well-Being dataset is a private dataset collected at the University of Cambridge by Dr Marwa Mahmoud.
Code availability
Code can be found at https://github.com/cecca46/MentalDisorderRecogntion
Notes
The test set is not accessible, as it is reserved for evaluation in the AVEC 2018 Workshop and Challenge [14].
Analogously to the concept of thin-slice, we use the word "action" to refer to a piece of relevant information—or change—in a modality (for instance, an eyebrow raise or a head shake) that can be captured in a small fragment of a video. We use "temporal interval" to refer to the (minimum) amount of time the video fragment must last in order to capture that information.
References
Ritchie Hannah, Roser Max (2020) Mental health. Our World in Data. https://ourworldindata.org/mental-health
Dixon Lisa, Postrado Leticia, Delahanty Janine, Fischer Pamela J, Lehman Anthony (1999) The association of medical comorbidity in schizophrenia with poor physical and mental health. J Nerv Ment Dis 187(8):496–502
Cournos Francine, McKinnon Karen M, Sullivan Greer (2005) Schizophrenia and comorbid human immunodeficiency virus or hepatitis C virus. J Clin Psychiatry 66:2005
Ferrari Alize J, Charlson Fiona J, Norman Rosana E, Patten Scott B, Freedman Greg, Murray Christopher JL, Vos Theo, Whiteford Harvey A (2013) Burden of depressive disorders by country, sex, age, and year: findings from the global burden of disease study 2010. PLoS Med 10(11)
Ghio Lucio, Gotelli Simona, Marcenaro Maurizio, Amore Mario, Natta Werner (2014) Duration of untreated illness and outcomes in unipolar depression: a systematic review and meta-analysis. J Affect Disord 152:45–51
Altamura Carlo A, Dell’Osso Bernardo, Berlin Heather A, Buoli Massimiliano, Bassetti Roberta, Mundo Emanuela (2010) Duration of untreated illness and suicide in bipolar disorder: a naturalistic study. Eur Arch Psychiatry Clin Neurosci 260(5):385–391
Cheung Ricky, O’Donnell Siobhan, Madi Nawaf, et al (2017) Factors associated with delayed diagnosis of mood and/or anxiety disorders. Health Promot Chronic Dis Prev Can 37(5):137
Kazdin Alan E, Blase Stacey L (2011) Rebooting psychotherapy research and practice to reduce the burden of mental illness. Perspectives on psychological science 6(1):21–37
Wang Philip S, Patricia Berglund, Mark Olfson, Pincus Harold A, Wells Kenneth B, Kessler Ronald C (2005) Failure and delay in initial treatment contact after first onset of mental disorders in the national comorbidity survey replication. Arch Gen Psychiatry 62(6):603–613
Williamson James R, Quatieri Thomas F, Helfer Brian S, Horwitz Rachelle, Yu Bea, Mehta Daryush D (2013) Vocal biomarkers of depression based on motor incoordination. In Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge, pages 41–48
Kaya Heysem, Salah Albert Ali (2014) Eyes whisper depression: a CCA based multimodal approach. In Proceedings of the 22nd ACM international conference on Multimedia, pages 961–964
Çiftçi Elvan, Kaya Heysem, Güleç Hüseyin, Salah Albert Ali (2018) The Turkish audio-visual bipolar disorder corpus. In 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia), pages 1–6. IEEE
Yang Le, Li Yan, Chen Haifeng, Jiang Dongmei, Oveneke Meshia Cédric, Sahli Hichem (2018) Bipolar disorder recognition with histogram features of arousal and body gestures. In Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, pages 15–21
Ringeval Fabien, Schuller Björn, Valstar Michel, Cowie Roddy, Kaya Heysem, Schmitt Maximilian, Amiriparian Shahin, Cummins Nicholas, Lalanne Denis, Michaud Adrien, et al (2018) AVEC 2018 workshop and challenge: bipolar disorder and cross-cultural affect recognition. In Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, pages 3–13. ACM
Baltrušaitis Tadas, Ahuja Chaitanya, Morency Louis-Philippe (2018) Multimodal machine learning: a survey and taxonomy. IEEE Trans Pattern Anal Mach Intell 41(2):423–443
Ngiam Jiquan, Khosla Aditya, Kim Mingyu, Nam Juhan, Lee Honglak, Ng Andrew Y (2011) Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 689–696
Snoek Cees GM, Worring Marcel, Smeulders Arnold WM (2005) Early versus late fusion in semantic video analysis. In Proceedings of the 13th annual ACM international conference on Multimedia, pages 399–402
Song Yale, Morency Louis-Philippe, Davis Randall (2013) Learning a sparse codebook of facial and body microexpressions for emotion recognition. In Proceedings of the 15th ACM on International conference on multimodal interaction, pages 237–244
Hardoon David R, Sandor Szedmak, John Shawe-Taylor (2004) Canonical correlation analysis: An overview with application to learning methods. Neural Comput 16(12):2639–2664
Dibeklioğlu Hamdi, Hammal Zakia, Yang Ying, Cohn Jeffrey F (2015) Multimodal detection of depression in clinical interviews. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, pages 307–310
Peng Hanchuan, Long Fuhui, Ding Chris (2005) Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238
Alghowinem Sharifa, Goecke Roland, Cohn Jeffrey F, Wagner Michael, Parker Gordon, Breakspear Michael (2015) Cross-cultural detection of depression from nonverbal behaviour. In 2015 11th IEEE International conference and workshops on automatic face and gesture recognition (FG), volume 1, pages 1–8. IEEE
Cortes Corinna, Vapnik Vladimir (1995) Support-vector networks. Mach Learn 20(3):273–297
Huang Jian, Li Ya, Tao Jianhua, Lian Zheng, Wen Zhengqi, Yang Minghao, Yi Jiangyan (2017) Continuous multimodal emotion prediction based on long short term memory recurrent neural network. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, pages 11–18
Awad Mariette, Khanna Rahul (2015) Support Vector Regression, pages 67–80. Apress, Berkeley, CA
Ringeval Fabien, Schuller Björn, Valstar Michel, Gratch Jonathan, Cowie Roddy, Scherer Stefan, Mozgai Sharon, Cummins Nicholas, Schmitt Maximilian, Pantic Maja (2017) AVEC 2017: real-life depression, and affect recognition workshop and challenge. In Proceedings of the 7th Annual Workshop on Audio/Visual Emotion Challenge, pages 3–9
Ma Xingchen, Yang Hongyu, Chen Qiang, Huang Di, Wang Yunhong (2016) DepAudioNet: an efficient deep model for audio based depression classification. In Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, pages 35–42
Szegedy Christian, Ioffe Sergey, Vanhoucke Vincent, Alemi Alexander A (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-first AAAI conference on artificial intelligence
Szegedy Christian, Vanhoucke Vincent, Ioffe Sergey, Shlens Jon, Wojna Zbigniew (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826
Du Zhengyin, Li Weixin, Huang Di, Wang Yunhong (2018) Bipolar disorder recognition via multi-scale discriminative audio temporal representation. In Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, pages 23–30
Xing Xiaofen, Cai Bolun, Zhao Yinhu, Li Shuzhen, He Zhiwei, Fan Weiquan (2018) Multi-modality hierarchical recall based on GBDTs for bipolar disorder classification. In Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, pages 31–37
Friedman Jerome H (2001) Greedy function approximation: a gradient boosting machine. Ann Stat 29(5):1189–1232
Syed Zafi Sherhan, Sidorov Kirill, Marshall David (2018) Automated screening for bipolar disorder from audio/visual modalities. In Proceedings of the 2018 on Audio/Visual Emotion Challenge and Workshop, pages 39–45
Zong Weiwei, Huang Guang-Bin, Chen Yiqiang (2013) Weighted extreme learning machine for imbalance learning. Neurocomputing 101:229–242
Zhang Ziheng, Lin Weizhe, Liu Mingyu, Mahmoud Marwa (2020) Multimodal deep learning framework for mental disorder recognition. In 2020 15th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2020). IEEE
Orton Indigo JD (2020) Vision based body gesture meta features for affective computing. arXiv preprint arXiv:2003.00809
Young Robert C, Biggs Jeffery T, Ziegler Veronika E, Meyer Dolores A (1978) A rating scale for mania: reliability, validity and sensitivity. Br J Psychiatry 133(5):429–435
Kroenke Kurt, Strine Tara W, Spitzer Robert L, Williams Janet BW, Berry Joyce T, Mokdad Ali H (2009) The PHQ-8 as a measure of current depression in the general population. J Affect Disord 114(1-3):163–173
Kroenke Kurt, Spitzer Robert L, Williams Janet BW (2001) The PHQ-9: validity of a brief depression severity measure. J Gen Intern Med 16(9):606–613
Spitzer Robert L, Kroenke Kurt, Williams Janet BW, Löwe Bernd (2006) A brief measure for assessing generalized anxiety disorder: the GAD-7. Arch Intern Med 166(10):1092–1097
Gierk Benjamin, Kohlmann Sebastian, Kroenke Kurt, Spangenberg Lena, Zenger Markus, Brähler Elmar, Löwe Bernd (2014) The somatic symptom scale-8 (SSS-8): a brief measure of somatic symptom burden. JAMA Intern Med 174(3):399–407
Cohen Sheldon, Kamarck T, Mermelstein R, et al (1994) Perceived stress scale. Measuring stress: a guide for health and social scientists, 10
Dibeklioğlu Hamdi, Hammal Zakia, Cohn Jeffrey F (2017) Dynamic multimodal measurement of depression severity using deep autoencoding. IEEE J Biomed Health Inform 22(2):525–536
Ambady Nalini, Rosenthal Robert (1992) Thin slices of expressive behavior as predictors of interpersonal consequences: a meta-analysis. Psychol Bull 111(2):256
Ambady Nalini, Gray Heather M (2002) On being sad and mistaken: mood effects on the accuracy of thin-slice judgments. J Pers Soc Psychol 83(4):947
Ambady Nalini, Hallahan Mark, Conner Brett (1999) Accuracy of judgments of sexual orientation from thin slices of behavior. J Pers Soc Psychol 77(3):538
Friedman Jacqueline NW, Oltmanns Thomas F, Turkheimer Eric (2007) Interpersonal perception and personality disorders: utilization of a thin slice approach. J Res Pers 41(3):667–688
Sánchez Jorge, Perronnin Florent, Mensink Thomas, Verbeek Jakob (2013) Image classification with the Fisher vector: theory and practice. Int J Comput Vision 105(3):222–245
Perronnin Florent, Sánchez Jorge, Mensink Thomas (2010) Improving the Fisher kernel for large-scale image classification. In European conference on computer vision, pages 143–156. Springer
Le Quoc, Mikolov Tomas (2014) Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196
Baltrusaitis Tadas, Zadeh Amir, Lim Yao Chong, Morency Louis-Philippe (2018) OpenFace 2.0: facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 59–66. IEEE
Eyben Florian, Wöllmer Martin, Schuller Björn (2010) openSMILE: the Munich versatile and fast open-source audio feature extractor. In Proceedings of the 18th ACM international conference on Multimedia, pages 1459–1462
Foote Jonathan T (1997) Content-based retrieval of music and audio. In Multimedia Storage and Archiving Systems II, volume 3229, pages 138–147. International Society for Optics and Photonics
Logan Beth et al (2000) Mel frequency cepstral coefficients for music modeling. ISMIR 270:1–11
Eyben Florian, Scherer Klaus R, Schuller Björn W, Sundberg Johan, André Elisabeth, Busso Carlos, Devillers Laurence Y, Epps Julien, Laukka Petri, Narayanan Shrikanth S, et al (2015) The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans Affect Comput 7(2):190–202
Lau Jey Han, Baldwin Timothy (2016) An empirical evaluation of doc2vec with practical insights into document embedding generation. arXiv preprint arXiv:1607.05368
Lin Weizhe, Orton Indigo, Liu Mingyu, Mahmoud Marwa (2020) Automatic detection of self-adaptors for psychological distress. In 2020 15th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2020). IEEE
Funding
Part of this research is funded by King’s College Cambridge.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Ethical approval
Not applicable.
About this article
Cite this article
Ceccarelli, F., Mahmoud, M. Multimodal temporal machine learning for Bipolar Disorder and Depression Recognition. Pattern Anal Applic 25, 493–504 (2022). https://doi.org/10.1007/s10044-021-01001-y