Abstract
Automatic emotion recognition from speech is a demanding and challenging problem, since the emotional states of humans are difficult to differentiate. With hand-crafted features, the main difficulty lies in extracting the features from speech that actually matter. Recognition accuracy can be increased using deep learning approaches, which learn high-level features of the speech signal. In this work, a deep-learning algorithm is proposed that extracts high-level features from raw data with high accuracy, irrespective of the language and speakers (male/female) of the speech corpora. For this, the .wav files are converted into RGB spectrogram images and normalized to size 224x224x3 for fine-tuning a Deep Convolutional Neural Network (DCNN) to recognize emotions. The DCNN model is trained in two stages: in stage 1, the optimal learning rate is identified using the Learning Rate (LR) range test, and in stage 2 the model is retrained with this optimal learning rate. Special strides are used to down-sample the features while reducing model size. The emotions considered are happiness, sadness, anger, fear, disgust, boredom/surprise, and neutral. The proposed algorithm is tested on three popular public speech corpora: EMODB (German), EMOVO (Italian), and SAVEE (British English). The reported emotion-recognition accuracy compares favourably with existing studies across different languages and speakers.
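The preprocessing step described above (.wav audio to a 224x224x3 RGB spectrogram image for the DCNN) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name, FFT size, hop length, and the intensity-to-RGB mapping are all assumptions chosen for clarity.

```python
import numpy as np

def wav_to_rgb_spectrogram(signal, n_fft=512, hop=128, size=224):
    """Turn a 1-D audio signal into a (size, size, 3) RGB
    log-magnitude spectrogram; parameters are illustrative."""
    # Short-time Fourier transform via windowed, framed FFT
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, axis=1)).T   # (freq, time)
    log_mag = np.log1p(mag)                       # compress dynamic range

    # Nearest-neighbour resize to the square input the DCNN expects
    fi = np.arange(size) * log_mag.shape[0] // size
    ti = np.arange(size) * log_mag.shape[1] // size
    img = log_mag[np.ix_(fi, ti)]

    # Normalise to [0, 1] and map intensity to a crude 3-channel colormap
    img = (img - img.min()) / (np.ptp(img) + 1e-9)
    rgb = np.stack([img, img ** 2, 1.0 - img], axis=-1)
    return rgb

# Example: 1 second of a 440 Hz tone sampled at 16 kHz
t = np.linspace(0, 1, 16000, endpoint=False)
spec = wav_to_rgb_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (224, 224, 3)
```

In practice one would read the .wav file with a library such as `scipy.io.wavfile` and render the spectrogram with a standard colormap (e.g. matplotlib's `viridis`) before feeding it to the fine-tuned network; the hand-rolled colormap above merely stands in for that step.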
Cite this article
Singh, Y.B., Goel, S. An efficient algorithm for recognition of emotions from speaker and language independent speech using deep learning. Multimed Tools Appl 80, 14001–14018 (2021). https://doi.org/10.1007/s11042-020-10399-2