Human emotion recognition by optimally fusing facial expression and speech feature

https://doi.org/10.1016/j.image.2020.115831

Highlights

  • We leverage the MFCC to convert speech signals into images.

  • We utilize a weighted decision fusion method to fuse facial expression and speech information for emotion recognition.

  • Comprehensive experimental results demonstrate that bimodal feature-based emotion recognition outperforms uni-modal emotion recognition.

Abstract

Emotion recognition is a hot research topic in modern intelligent systems. The technique is pervasively used in autonomous vehicles, remote medical services, and human–computer interaction (HCI). Traditional speech emotion recognition algorithms generalize poorly because they assume that training and testing data come from the same domain and share the same data distribution. In practice, however, speech data are acquired from different devices and recording environments, and may therefore differ significantly in language, emotion types, and labels. To address this problem, we propose a bimodal fusion algorithm for emotion recognition in which facial expression and speech information are optimally fused. We first combine a CNN and an RNN to achieve facial emotion recognition. We then leverage the MFCC to convert speech signals into images, so that an LSTM and a CNN can be used to recognize speech emotion. Finally, we apply a weighted decision fusion method to fuse the facial expression and speech predictions. Comprehensive experimental results demonstrate that, compared with uni-modal emotion recognition, bimodal feature-based emotion recognition achieves better performance.
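To make the MFCC step above concrete, the following Python sketch (ours, not the authors' released code) converts a speech clip into a fixed-size MFCC matrix that a CNN can consume as a single-channel image; the 16 kHz sampling rate, 40 coefficients, and 300-frame length are illustrative assumptions.

    import numpy as np
    import librosa

    def speech_to_mfcc_image(wav_path, sr=16000, n_mfcc=40, target_frames=300):
        """Load a speech clip and return an (n_mfcc, target_frames) MFCC matrix."""
        signal, sr = librosa.load(wav_path, sr=sr)                   # mono, resampled
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        # Pad or crop along the time axis so every clip yields the same "image" size.
        if mfcc.shape[1] < target_frames:
            mfcc = np.pad(mfcc, ((0, 0), (0, target_frames - mfcc.shape[1])))
        else:
            mfcc = mfcc[:, :target_frames]
        # Min-max normalize so the matrix can be treated as a grayscale image.
        return (mfcc - mfcc.min()) / (mfcc.max() - mfcc.min() + 1e-8)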

Introduction

Emotion recognition plays a significant role in modern intelligent systems, such as autonomous vehicles, smartphone voice assistants, human psychological analysis, and medical services [1], [2], [3]. For example, a driver's emotion can be analyzed in real time by a speech emotion recognition system to judge whether the driving situation is safe. This can be used to warn drivers when they are in a fatigued state and thus help avoid traffic accidents. In medical research, speech emotion recognition can be utilized to analyze emotional changes in depressive patients or autistic children, serving as an effective tool for disease diagnosis and adjuvant treatment. Speech emotion recognition aims to infer emotional status from low-level features extracted from speech signals. It can be regarded as a classification task over speech signal sequences, consisting of emotional database compilation, speech emotional feature extraction, feature dimensionality reduction, and emotion classification/recognition. Traditional speech emotion recognition techniques include the hidden Markov model (HMM), artificial neural network (ANN), Gaussian mixture model (GMM), support vector machine (SVM), and K-nearest neighbor (KNN). However, the performance of these algorithms varies significantly because the corpora differ greatly. For example, SVM- and KNN-based emotion recognition algorithms produce largely deterministic decisions, whereas human emotions are complex and highly uncertain. Therefore, these algorithms are not sufficiently effective for speech emotion recognition.

Both physiological and psychological studies have demonstrated that facial expression and speech signals are informative for recognizing human emotion. Human beings express different feelings by adjusting the strength of their facial muscles and changing their tones, and they infer the corresponding emotions by perceiving speech signals. In addition, psychological experiments have shown that visual information alters speech perception, so emotional type can be judged from both visual and speech information. The database is the basis of emotion recognition research; among all types of multi-modal emotional databases, those based on facial expression and voice are the most complete. In early research, Ekman [4] pointed out that human facial expressions can be classified into six emotional categories: happiness, anger, fear, sadness, disgust, and surprise, and that each emotion is related to a distinct facial expression. This research is fundamental to emotion recognition. Combined with facial expression recognition, speech emotion recognition can achieve better performance. Traditional RNN-based algorithms can fully exploit contextual information to construct language models and achieve good performance in emotion analysis; however, they suffer from vanishing and exploding gradients. To address the problems of existing algorithms, in this paper we propose a bimodal emotion recognition framework that uses an improved AlexNet to describe human facial expressions. Since speech signals are highly correlated in the temporal dimension, we combine an LSTM and a CNN to recognize emotion from human speech.
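As a rough illustration of the LSTM-plus-CNN speech branch referred to above, a minimal PyTorch sketch is given below; the layer widths, the single LSTM layer, and the six-class output are our assumptions for illustration and are not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class SpeechCNNLSTM(nn.Module):
        """CNN over an MFCC 'image' followed by an LSTM over the frame axis."""
        def __init__(self, n_mfcc=40, hidden=128, num_classes=6):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),                            # halves both axes
                nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
            )
            feat_dim = 64 * (n_mfcc // 4)                   # channels x reduced MFCC axis
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.fc = nn.Linear(hidden, num_classes)

        def forward(self, x):                               # x: (batch, 1, n_mfcc, frames)
            f = self.conv(x)                                # (batch, 64, n_mfcc/4, frames/4)
            f = f.permute(0, 3, 1, 2).flatten(2)            # (batch, time, features)
            out, _ = self.lstm(f)
            return self.fc(out[:, -1])                      # logits from the last time step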


Related work

Speech emotion recognition has become a well-known research topic in modern intelligent systems. Williams and Stevens studied the principles of speech production from a physiological point of view. When people are in a state of anger, fear, or pleasure, the sympathetic nervous system is triggered: the tone of the voice becomes higher, speech becomes faster, and there is more high-frequency energy. When people are in a sad state, the parasympathetic

The proposed method

In our work, we combine facial expression and speech information to achieve emotion recognition. The pipeline of our method is elaborated in Fig. 1.
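Since Fig. 1 is not reproduced here, the sketch below shows one common form of the weighted decision fusion described in the abstract, assuming each branch outputs per-class probabilities over the same emotion classes; the fusion weight of 0.6 is an illustrative choice, not the paper's reported setting.

    import numpy as np

    def fuse_decisions(p_face, p_speech, alpha=0.6):
        """Weighted decision fusion of facial and speech class probabilities.

        p_face, p_speech: arrays of shape (num_classes,) summing to 1.
        alpha: weight of the facial branch; the speech branch gets 1 - alpha.
        """
        fused = alpha * np.asarray(p_face) + (1.0 - alpha) * np.asarray(p_speech)
        return int(np.argmax(fused)), fused

    # Example with made-up probabilities over six emotion classes:
    face = [0.10, 0.05, 0.60, 0.10, 0.10, 0.05]
    speech = [0.20, 0.05, 0.40, 0.15, 0.10, 0.10]
    label, scores = fuse_decisions(face, speech)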

Experiments and analysis

In this section, we conduct experiments to verify the effectiveness of the proposed method. The experiments are run on a PC equipped with an Intel Broadwell E5 CPU, an Nvidia GTX 1080 Ti GPU, and 32 GB of RAM. Three human emotion datasets are used: RML [19], AFEW6.0 [20], and eNTERFACE’05 [21].

Conclusions

Emotion recognition is widely used in modern computer vision systems. In this paper, we propose a bimodal fusion-based emotion recognition method in which facial expression and speech information are seamlessly integrated. More specifically, we first combine a CNN and an RNN to optimally realize facial emotion recognition. We then leverage the MFCC to convert speech signals into images, so that an LSTM and a CNN can be employed to understand speech emotion. Finally, we

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (26)

  • Ververidis, D., et al., Emotional speech recognition: resources, features, and methods, Speech Commun. (2006).
  • Rong, J., et al., Acoustic feature selection for automatic emotion recognition from speech, Inf. Process. Manage. (2009).
  • Graves, A., Mohamed, A.R., Hinton, G., Speech recognition with deep recurrent neural networks, in: IEEE International...
  • Shen, H., Zhou, X., Speech recognition techniques for a sign language recognition system, in: Interspeech, Conference of...
  • Abdelhamid, O., et al., Convolutional neural networks for speech recognition, IEEE/ACM Trans. Audio Speech Lang. Process. (2014).
  • Ekman, P., Universals and cultural differences in facial expressions of emotion.
  • Kamaruddin, N., Wahab, A., Features extraction for speech emotion, in: International Conference on Software Engineering &...
  • Kuo, H.K.J., et al., Maximum entropy direct models for speech recognition, IEEE Trans. Audio Speech Lang. Process. (2006).
  • Nicholson, J., et al., Emotion recognition in speech using neural networks, Neural Comput. Appl. (2000).
  • Fayek, H.M., et al., Evaluating deep learning architectures for speech emotion recognition, Neural Netw. (2017).
  • Neumann, M., et al., Attentive convolutional neural network based speech emotion recognition: a study on the impact of input features, signal length, and acted speech (2017).
  • Zhang, S.Q., et al., Speech emotion recognition based on an improved supervised manifold learning algorithm, J. Electron. Inf. Technol. (2010).
  • Tzelepi, M., et al., Exploiting supervised learning for finetuning deep CNNs in content based image retrieval.

Xusheng Wang was born in Shanxi, P.R. China, in 1988. He received the Ph.D. degree from the University of Paris-Saclay. He now works at Xi’an University of Technology. His research interests include computational intelligence, big data analysis, and integrated circuit design.

E-mail: [email protected]

Xing Chen was born in Shaanxi, P.R. China, in 1999. She received the Bachelor’s degree from Xi’an University of Technology, P.R. China. She now works as a college instructor at Xi’an University of Technology. Her research interests include big data analysis and computational intelligence.

Congjun Cao was born in October 1970. She graduated from Northwestern University with a Ph.D. in Computer Software and Theory. She is currently a full professor at Xi’an University of Technology, P.R. China. Her research focuses on cross-media color reproduction, quality control technology, and computational intelligence.
