Human emotion recognition by optimally fusing facial expression and speech feature
Introduction
Emotion recognition plays a significant role in modern intelligent systems, such as autonomous vehicles, smartphone voice assistants, human psychological analysis, and medical services [1], [2], [3]. For example, a driver's emotion can be analyzed in real time by a speech emotion recognition system, which can judge whether the driving circumstance is safe. Such a system can warn drivers when they are in a fatigued state and thereby help avoid traffic accidents. In medical research, speech emotion recognition can be used to analyze emotional changes in patients with depression or in autistic children, serving as an effective tool for disease diagnosis and adjuvant treatment. Speech emotion recognition aims to effectively infer emotional status from low-level features extracted from speech signals. It can be regarded as a classification task over speech signal sequences, consisting of emotional database compilation, speech emotional feature extraction, feature dimensionality reduction, and emotion classification/recognition. Traditional speech emotion recognition techniques include the hidden Markov model (HMM), artificial neural network (ANN), Gaussian mixture model (GMM), support vector machine (SVM), and K-nearest neighbor (KNN). However, the performance of these algorithms varies significantly because corpora differ greatly. For example, SVM- and KNN-based emotion recognition algorithms generally make highly deterministic decisions, whereas human emotions are complex and highly uncertain; such algorithms are therefore insufficiently effective for speech emotion recognition.
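To illustrate the deterministic behavior of the distance-based classifiers mentioned above, the following minimal sketch applies KNN to two-dimensional acoustic feature vectors. This is not the paper's implementation; the toy features and emotion labels are invented purely for illustration.

```python
import numpy as np

def knn_classify(train_feats, train_labels, query, k=3):
    """Classify a query feature vector by majority vote of its
    k nearest training vectors under Euclidean distance."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    values, counts = np.unique(train_labels[nearest], return_counts=True)
    return values[np.argmax(counts)]

# Toy 2-D "acoustic features" (e.g. mean pitch, mean energy) -- illustrative only
train_feats = np.array([[0.9, 0.8], [1.0, 0.9], [0.1, 0.2], [0.2, 0.1]])
train_labels = np.array(["angry", "angry", "sad", "sad"])

print(knn_classify(train_feats, train_labels, np.array([0.95, 0.85])))  # angry
```

Note that the decision is a hard majority vote: the classifier returns exactly one label with no measure of uncertainty, which is the "high certainty" property the text contrasts with the inherent ambiguity of human emotion.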
Both physiological and psychological studies have demonstrated that facial expressions and speech signals are informative for recognizing human emotion. Human beings express different feelings by adjusting the strength of their facial muscles and changing their tone, and they recognize the corresponding emotions by perceiving speech signals. In addition, psychological experiments have shown that visual information alters speech perception, so emotional categories can be judged from combined visual and speech information. The database is the basis of emotion recognition research; among the various types of multimodal emotional databases, those based on facial expression and voice are the most complete. In early research, Friesen et al. [4] pointed out that human facial expression can be classified into six emotional categories, namely happiness, anger, fear, sadness, disgust, and surprise, each associated with a characteristic facial expression. This work is foundational for emotion recognition, and combining facial expression recognition with speech emotion recognition can achieve better performance. Traditional RNN-based algorithms can fully exploit contextual information to construct language models and achieve good performance in emotional analysis; however, such algorithms suffer from vanishing and exploding gradients. To address the problems of existing algorithms, in this paper we propose a bimodal speech emotion recognition framework that uses an improved AlexNet to describe human facial expressions. Since speech signals are highly correlated in the temporal domain, we combine an LSTM and a CNN to recognize emotion from human speech.
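A minimal sketch of the LSTM building block referred to above, written as a single NumPy step (the dimensions and random weights are assumptions for illustration, not the paper's configuration). The gated, additively updated cell state is what mitigates the vanishing- and exploding-gradient problems of plain RNNs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. The cell state c is updated additively through
    gates, letting gradients flow over long sequences."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b        # stacked pre-activations, shape (4*H,)
    i = sigmoid(z[0:H])               # input gate
    f = sigmoid(z[H:2*H])             # forget gate
    o = sigmoid(z[2*H:3*H])           # output gate
    g = np.tanh(z[3*H:4*H])           # candidate cell update
    c = f * c_prev + i * g            # additive cell-state update
    h = o * np.tanh(c)                # hidden state exposed to the next layer
    return h, c

rng = np.random.default_rng(0)
D, H = 8, 4                           # illustrative feature and hidden sizes
W = 0.1 * rng.normal(size=(4 * H, D))
U = 0.1 * rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for t in range(10):                   # run over a 10-frame feature sequence
    h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
print(h.shape)  # (4,)
```

In the bimodal setting described in the text, per-frame speech features would feed a stack of such steps, while a CNN processes the facial-expression stream.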
Section snippets
Related work
Speech emotion recognition has become a well-known research topic in modern intelligent systems. Williams and Stevens studied the principle of speech production from a physiological point of view. When people are in a state of anger, fear, or pleasure, the sympathetic nervous system is triggered: the tone of the voice becomes higher, speech becomes faster, and there is more high-frequency energy. When people are in a sad state, the parasympathetic…
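The arousal cues described above (louder, higher-pitched speech under anger or fear; quieter, flatter speech under sadness) can be approximated by two classical frame-level features, short-time energy and zero-crossing rate. The sketch below runs on synthetic tones and is purely illustrative; it is not the feature set used in the paper.

```python
import numpy as np

def frame_features(signal, frame_len=256, hop=128):
    """Per-frame short-time energy and zero-crossing rate, two simple
    correlates of vocal arousal."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = float(np.mean(frame ** 2))
        # each sign change contributes 2 to |diff(sign)|, hence the /2
        zcr = float(np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0)
        feats.append((energy, zcr))
    return np.array(feats)

# Synthetic check: a loud high-frequency tone vs. a quiet low-frequency one
sr = 8000
t = np.arange(sr) / sr
high = 0.9 * np.sin(2 * np.pi * 800 * t)   # "aroused": loud, high pitch
low = 0.2 * np.sin(2 * np.pi * 100 * t)    # "subdued": quiet, low pitch
fh, fl = frame_features(high), frame_features(low)
print(fh[:, 0].mean() > fl[:, 0].mean())   # True: more energy
print(fh[:, 1].mean() > fl[:, 1].mean())   # True: more zero crossings
```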
The proposed method
In our work, we combine both the facial expression and speech information to achieve emotion recognition. The pipeline of our method is elaborated in Fig. 1.
Experiments and analysis
In this section, we conduct experiments to verify the effectiveness of the proposed method. Our experiments are conducted on a PC equipped with an Intel Broadwell E5 CPU, an Nvidia 1080 Ti GPU, and 32 GB of RAM. Three human emotion datasets are used: RML [19], AFEW6.0 [20], and eNTERFACE’05 [21].
Conclusions
Emotion recognition is widely used in modern computer vision systems. In this paper, we propose a bimodal fusion-based speech emotion recognition method in which facial expression recognition and speech signal analysis are seamlessly integrated. More specifically, we first combine a CNN and an RNN to realize facial emotion recognition. Subsequently, we leverage MFCCs to convert the speech signal into images, so that an LSTM and a CNN can be employed to understand speech emotion. Finally, we…
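The MFCC-based conversion of speech into image-like input mentioned above can be sketched as follows. This is a simplified, self-contained NumPy version with assumed parameters (e.g. `n_fft=512`, 26 mel filters, 13 coefficients); the paper's exact front end may differ.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_image(signal, sr, n_fft=512, hop=256, n_mels=26, n_ceps=13):
    """Frame the signal, take power spectra, pool them through a mel
    filterbank, and apply a DCT, yielding a (frames x coefficients)
    array that can be treated as a 2-D image for a CNN."""
    # 1. Short-time power spectra
    window = np.hanning(n_fft)
    frames = [np.abs(np.fft.rfft(signal[s:s + n_fft] * window)) ** 2
              for s in range(0, len(signal) - n_fft + 1, hop)]
    power = np.array(frames)                       # (T, n_fft//2 + 1)

    # 2. Triangular mel filterbank
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)      # (T, n_mels)

    # 3. DCT-II over the mel axis, keeping the first n_ceps coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T                         # (T, n_ceps)

sr = 8000
t = np.arange(sr) / sr
img = mfcc_image(np.sin(2 * np.pi * 440 * t), sr)  # 1 s of a 440 Hz tone
print(img.shape)                                   # (frames, 13)
```

The resulting 2-D array is what allows a CNN, optionally stacked with an LSTM over the time axis, to be applied to speech exactly as it would be to an image.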
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Xusheng Wang was born in Shanxi, P.R. China, in 1988. He received his Ph.D. degree from the University of Paris-Saclay. He now works at Xi’an University of Technology. His research interests include computational intelligence, big data analysis, and integrated circuit design.
E-mail: [email protected]
References (26)
- et al., Emotional speech recognition: resources, features, and methods, Speech Commun. (2006)
- et al., Acoustic feature selection for automatic emotion recognition from speech, Inf. Process. Manage. (2009)
- A. Graves, A.R. Mohamed, G. Hinton, Speech recognition with deep recurrent neural networks, in: IEEE International...
- H. Shen, X. Zhou, Speech recognition techniques for a sign language recognition system, in: Interspeech, Conference of...
- et al., Convolutional neural networks for speech recognition, IEEE/ACM Trans. Audio Speech Lang. Process. (2014)
- Universals and cultural differences in facial expressions of emotion
- N. Kamaruddin, A. Wahab, Features extraction for speech emotion, in: International Conference on Software Engineering &...
- et al., Maximum entropy direct models for speech recognition, IEEE Trans. Audio Speech Lang. Process. (2006)
- et al., Emotion recognition in speech using neural networks, Neural Comput. Appl. (2000)
- et al., Evaluating deep learning architectures for speech emotion recognition, Neural Netw. (2017)
- Attentive convolutional neural network based speech emotion recognition: a study on the impact of input features, signal length, and acted speech
- Speech emotion recognition based on an improved supervised manifold learning algorithm, Dianzi Yu Xinxi Xuebao/J. Electron. Inf. Technol.
- Exploiting supervised learning for finetuning deep CNNs in content based image retrieval
Cited by (66)
- Spanish MEACorpus 2023: A multimodal speech–text corpus for emotion analysis in Spanish from natural environments (2024, Computer Standards and Interfaces)
- STP-MFM: Semi-tensor product-based multi-modal factorized multilinear pooling for information fusion in sentiment analysis (2024, Digital Signal Processing: A Review Journal)
- Enhancement multi-module network for few-shot leaky cable fixture detection in railway tunnel (2023, Signal Processing: Image Communication)
- An ongoing review of speech emotion recognition (2023, Neurocomputing)
Xing Chen was born in Shaanxi, P.R. China, in 1999. She received her Bachelor’s degree from Xi’an University of Technology, P.R. China. She now works as a college instructor at Xi’an University of Technology. Her research interests include big data analysis and computational intelligence.
Congjun Cao was born in October 1970. She graduated from Northwestern University with a Ph.D. in Computer Software and Theory. She is currently a full professor at Xi’an University of Technology, P.R. China. Her research focuses on cross-media color reproduction, quality control technology, and computational intelligence.