
An efficient algorithm for recognition of emotions from speaker and language independent speech using deep learning

Published in Multimedia Tools and Applications

Abstract

Automatic emotion recognition from speech is a demanding and challenging problem, since human emotional states are difficult to differentiate. The main difficulty lies in extracting the important features from speech when hand-crafted features are used. Recognition accuracy can be increased with deep learning approaches, which learn high-level features of the speech signal. In this work, a deep learning algorithm is proposed that extracts high-level features from raw data with high accuracy, irrespective of the language and speakers (male/female) of the speech corpora. For this, the .wav files are converted into RGB spectrogram images, normalized to size 224×224×3, and used to fine-tune a Deep Convolutional Neural Network (DCNN) to recognize emotions. The DCNN model is trained in two stages: in stage 1, the optimal learning rate is identified using the Learning Rate (LR) range test, and in stage 2 the model is trained again with this optimal learning rate. Special strides are used to down-sample the features while reducing the model size. The emotions considered are happiness, sadness, anger, fear, disgust, boredom/surprise, and neutral. The proposed algorithm is tested on three popular public speech corpora: EMODB (German), EMOVO (Italian), and SAVEE (British English). The reported emotion-recognition accuracy is better than that of existing studies across different languages and speakers.
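The first preprocessing step the abstract describes, converting a .wav waveform into an RGB spectrogram image normalized to 224×224×3, can be sketched as follows. This is a minimal numpy-only illustration, not the authors' implementation: the FFT size, hop length, log scaling, nearest-neighbour resizing, and the grey-to-RGB channel replication are all assumptions chosen for simplicity (a real pipeline would more likely use a library such as librosa and a perceptual colour map).

```python
import numpy as np

def wav_to_rgb_spectrogram(signal, n_fft=512, hop=128, out_size=224):
    """Turn a 1-D audio signal into an out_size x out_size x 3 image in [0, 1]."""
    # Short-time Fourier transform via a sliding Hann window.
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft, hop)]
    stft = np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq, time)
    # Log-magnitude, normalised to [0, 1].
    log_mag = np.log1p(stft)
    log_mag = (log_mag - log_mag.min()) / (np.ptp(log_mag) + 1e-9)
    # Nearest-neighbour resize to out_size x out_size.
    fi = np.arange(out_size) * log_mag.shape[0] // out_size
    ti = np.arange(out_size) * log_mag.shape[1] // out_size
    resized = log_mag[fi][:, ti]
    # Grey-to-RGB: replicate the intensity across three channels.
    return np.repeat(resized[:, :, None], 3, axis=2)

# Example on a synthetic one-second 440 Hz tone sampled at 16 kHz.
sig = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
img = wav_to_rgb_spectrogram(sig)
print(img.shape)  # (224, 224, 3)
```

The resulting 224×224×3 array matches the input shape expected by common ImageNet-pretrained CNN backbones, which is presumably why the abstract fixes this size for fine-tuning.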
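The stage-1 step, identifying an optimal learning rate with the LR range test, can be illustrated with a toy sketch. The idea (from Smith's LR range test) is to ramp the learning rate exponentially over a short training run, record the loss at each step, and pick a rate in the region where the loss still drops steeply. The quadratic objective, the step count, and the `steepest drop` selection heuristic below are illustrative assumptions standing in for the paper's DCNN training loss.

```python
import numpy as np

def lr_range_test(grad_fn, w0, lr_min=1e-6, lr_max=1.0, steps=100):
    """Ramp the learning rate exponentially and record the loss at each step."""
    lrs = np.geomspace(lr_min, lr_max, steps)
    w = w0.copy()
    losses = []
    for lr in lrs:
        loss, grad = grad_fn(w)
        losses.append(loss)
        w -= lr * grad  # one gradient-descent step at the current rate
    return lrs, np.array(losses)

# Toy quadratic objective standing in for the DCNN training loss.
target = np.array([3.0, -2.0])
def grad_fn(w):
    diff = w - target
    return float(diff @ diff), 2 * diff

lrs, losses = lr_range_test(grad_fn, w0=np.zeros(2))
# Illustrative heuristic: the rate where the loss falls fastest per step.
best = lrs[np.argmin(np.gradient(losses))]
print(f"suggested learning rate ~ {best:.2e}")
```

In stage 2, per the abstract, the model is then retrained from the start using the rate selected this way.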



Author information


Correspondence to Youddha Beer Singh.


About this article


Cite this article

Singh, Y.B., Goel, S. An efficient algorithm for recognition of emotions from speaker and language independent speech using deep learning. Multimed Tools Appl 80, 14001–14018 (2021). https://doi.org/10.1007/s11042-020-10399-2

