Human emotion recognition by optimally fusing facial expression and speech feature

https://doi.org/10.1016/j.image.2020.115831

Highlights

  • We leverage the MFCC to convert speech signals into images.

  • We utilize a weighted decision fusion method to fuse facial expression and speech information for emotion recognition.

  • Comprehensive experimental results demonstrate that bimodal feature-based emotion recognition outperforms uni-modal emotion recognition.

Abstract

Emotion recognition is a hot research topic in modern intelligent systems. The technique is pervasively used in autonomous vehicles, remote medical services, and human–computer interaction (HCI). Traditional speech emotion recognition algorithms generalize poorly because they assume that training and testing data come from the same domain and share the same data distribution. In practice, however, speech data are acquired from different devices and recording environments, and may therefore differ significantly in language, emotion types, and labels. To address this problem, we propose a bimodal fusion algorithm for emotion recognition in which facial expression and speech information are optimally fused. We first combine a CNN and an RNN to achieve facial emotion recognition. We then leverage the MFCC to convert speech signals into images, so that an LSTM and a CNN can be used to recognize speech emotion. Finally, we apply a weighted decision fusion method to fuse the facial expression and speech predictions. Comprehensive experimental results demonstrate that, compared with uni-modal emotion recognition, bimodal feature-based emotion recognition achieves better performance.
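To make the MFCC step above concrete, the following Python sketch (ours, not the authors' released code) converts a speech clip into a fixed-size MFCC matrix that a CNN can consume as a single-channel image; the 16 kHz sampling rate, 40 coefficients, and 300-frame length are illustrative assumptions.

    import numpy as np
    import librosa

    def speech_to_mfcc_image(wav_path, sr=16000, n_mfcc=40, target_frames=300):
        """Load a speech clip and return an (n_mfcc, target_frames) MFCC matrix."""
        signal, sr = librosa.load(wav_path, sr=sr)                   # mono, resampled
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        # Pad or crop along the time axis so every clip yields the same "image" size.
        if mfcc.shape[1] < target_frames:
            mfcc = np.pad(mfcc, ((0, 0), (0, target_frames - mfcc.shape[1])))
        else:
            mfcc = mfcc[:, :target_frames]
        # Min-max normalize so the matrix can be treated as a grayscale image.
        return (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min() + 1e-8)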

Introduction

Emotion recognition plays a significant role in modern intelligent systems, such as autonomous vehicles, smartphone voice assistants, human psychological analysis, and medical services [1], [2], [3]. For example, a driver's emotion can be analyzed in real time by a speech emotion recognition system to judge whether the driving situation is safe. This can be used to warn drivers when they are in a fatigued state and thus help avoid traffic accidents. In medical research, speech emotion recognition can be utilized to analyze emotional changes in depressive patients or autistic children, serving as an effective tool for disease diagnosis and adjuvant treatment. Speech emotion recognition aims to infer emotional status from low-level features extracted from speech signals. It can be regarded as a classification task over speech signal sequences, consisting of emotional database compilation, speech emotional feature extraction, feature dimensionality reduction, and emotion classification/recognition. Traditional speech emotion recognition techniques include the hidden Markov model (HMM), artificial neural network (ANN), Gaussian mixture model (GMM), support vector machine (SVM), and K-nearest neighbor (KNN). However, the performance of these algorithms varies significantly because the corpora differ greatly. For example, SVM- and KNN-based emotion recognition algorithms produce largely deterministic decisions, whereas human emotions are complex and highly uncertain. Therefore, these algorithms are not sufficiently effective for speech emotion recognition.

Both physiological and psychological studies have demonstrated that facial expression and speech signals are informative for recognizing human emotion. Human beings express different feelings by adjusting the strength of their facial muscles and changing their tones, and they infer the corresponding emotions by perceiving speech signals. In addition, psychological experiments have shown that visual information alters speech perception, so emotional type can be judged from both visual and speech information. The database is the basis of emotion recognition research; among all types of multi-modal emotional databases, those based on facial expression and voice are the most complete. In early research, Ekman [4] pointed out that human facial expressions can be classified into six emotional categories: happiness, anger, fear, sadness, disgust, and surprise, and that each emotion is related to a distinct facial expression. This research is fundamental to emotion recognition. Combined with facial expression recognition, speech emotion recognition can achieve better performance. Traditional RNN-based algorithms can fully exploit contextual information to construct language models and achieve good performance in emotion analysis; however, they suffer from vanishing and exploding gradients. To address the problems of existing algorithms, in this paper we propose a bimodal emotion recognition framework that uses an improved AlexNet to describe human facial expressions. Since speech signals are highly correlated in the temporal dimension, we combine an LSTM and a CNN to recognize emotion from human speech.
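As a rough illustration of the LSTM-plus-CNN speech branch referred to above, a minimal PyTorch sketch is given below; the layer widths, the single LSTM layer, and the six-class output are our assumptions for illustration and are not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class SpeechCNNLSTM(nn.Module):
        """CNN over an MFCC 'image' followed by an LSTM over the frame axis."""
        def __init__(self, n_mfcc=40, hidden=128, num_classes=6):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                            # halves both axes
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            feat_dim = 64 * (n_mfcc // 4)                   # channels x reduced MFCC axis
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, num_classes)

        def forward(self, x):                               # x: (batch, 1, n_mfcc, frames)
            f = self.conv(x)                                # (batch, 64, n_mfcc/4, frames/4)
            f = f.permute(0, 3, 1, 2).flatten(2)            # (batch, time, features)
            out, _ = self.lstm(f)
            return self.fc(out[:, -1])                      # logits from the last time step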


Related work

Speech emotion recognition has become a well-known research topic in modern intelligent systems. Williams and Stevens studied the principles of speech production from a physiological point of view. When people are in a state of anger, fear, or pleasure, the sympathetic nervous system is triggered: the tone of the voice becomes higher, speech becomes faster, and there is more high-frequency energy. When people are in a sad state, the parasympathetic

The proposed method

In our work, we combine facial expression and speech information to achieve emotion recognition. The pipeline of our method is elaborated in Fig. 1.
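Since Fig. 1 is not reproduced here, the sketch below shows one common form of the weighted decision fusion described in the abstract, assuming each branch outputs per-class probabilities over the same emotion classes; the fusion weight of 0.6 is an illustrative choice, not the paper's reported setting.

    import numpy as np

    def fuse_decisions(p_face, p_speech, alpha=0.6):
        """Weighted decision fusion of facial and speech class probabilities.

        p_face, p_speech: arrays of shape (num_classes,) summing to 1.
        alpha: weight of the facial branch; the speech branch gets 1 - alpha.
        """
        fused = alpha * np.asarray(p_face) + (1.0 - alpha) * np.asarray(p_speech)
        return int(np.argmax(fused)), fused

    # Example with made-up probabilities over six emotion classes:
    face = [0.10, 0.05, 0.60, 0.10, 0.10, 0.05]
    speech = [0.20, 0.05, 0.40, 0.15, 0.10, 0.10]
    label, scores = fuse_decisions(face, speech)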

Experiments and analysis

In this section, we conduct experiments to verify the effectiveness of the proposed method. The experiments are run on a PC equipped with an Intel Broadwell E5 CPU, an Nvidia GTX 1080 Ti GPU, and 32 GB of RAM. Three human emotion datasets are used: RML [19], AFEW6.0 [20], and eNTERFACE’05 [21].

Conclusions

Emotion recognition is widely used in modern computer vision systems. In this paper, we propose a bimodal fusion-based emotion recognition method in which facial expression and speech information are seamlessly integrated. More specifically, we first combine a CNN and an RNN to optimally realize facial emotion recognition. We then leverage the MFCC to convert speech signals into images, so that an LSTM and a CNN can be employed to understand speech emotion. Finally, we

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (26)

  • Ververidis, D., et al., Emotional speech recognition: resources, features, and methods, Speech Commun. (2006).
  • Rong, J., et al., Acoustic feature selection for automatic emotion recognition from speech, Inf. Process. Manage. (2009).
  • Graves, A., Mohamed, A.R., Hinton, G., Speech recognition with deep recurrent neural networks, in: IEEE International...
  • Shen, H., Zhou, X., Speech recognition techniques for a sign language recognition system, in: Interspeech, Conference of...
  • Abdelhamid, O., et al., Convolutional neural networks for speech recognition, IEEE/ACM Trans. Audio Speech Lang. Process. (2014).
  • Ekman, P., Universals and cultural differences in facial expressions of emotion.
  • Kamaruddin, N., Wahab, A., Features extraction for speech emotion, in: International Conference on Software Engineering &...
  • Kuo, H.K.J., et al., Maximum entropy direct models for speech recognition, IEEE Trans. Audio Speech Lang. Process. (2006).
  • Nicholson, J., et al., Emotion recognition in speech using neural networks, Neural Comput. Appl. (2000).
  • Fayek, H.M., et al., Evaluating deep learning architectures for speech emotion recognition, Neural Netw. (2017).
  • Neumann, M., et al., Attentive convolutional neural network based speech emotion recognition: a study on the impact of input features, signal length, and acted speech (2017).
  • Zhang, S.Q., et al., Speech emotion recognition based on an improved supervised manifold learning algorithm, J. Electron. Inf. Technol. (2010).
  • Tzelepi, M., et al., Exploiting supervised learning for finetuning deep CNNs in content based image retrieval.

Xusheng Wang was born in Shanxi, P.R. China, in 1988. He received the Ph.D. degree from the University of Paris-Saclay. He now works at Xi’an University of Technology. His research interests include computational intelligence, big data analysis, and integrated circuit design.

E-mail: [email protected]

Xing Chen was born in Shaanxi, P.R. China, in 1999. She received the Bachelor’s degree from Xi’an University of Technology, P.R. China. She now works as a college instructor at Xi’an University of Technology. Her research interests include big data analysis and computational intelligence.

Congjun Cao was born in October 1970. She graduated from Northwestern University with a Ph.D. in Computer Software and Theory. She is currently a full professor at Xi’an University of Technology, P.R. China. Her research focuses on cross-media color reproduction, quality control technology, and computational intelligence.
