Human emotion recognition by optimally fusing facial expression and speech feature, Signal Processing: Image Communication



Signal Processing: Image Communication (IF 3.4), Pub Date: 2020-03-13, DOI: 10.1016/j.image.2020.115831
Xusheng Wang, Xing Chen, Congjun Cao

Emotion recognition is an active research topic in modern intelligent systems. The technique is widely used in autonomous vehicles, remote medical services, and human–computer interaction (HCI). Traditional speech emotion recognition algorithms generalize poorly because their training and testing data come from the same domain and thus share the same data distribution. In practice, however, speech data are acquired from different devices and recording environments, so the data may differ significantly in language, emotion types, and labels. To solve this problem, we propose a bimodal fusion algorithm to realize speech emotion recognition, in which facial expression and speech information are optimally fused. We first combine a CNN and an RNN to achieve facial emotion recognition. Subsequently, we use Mel-frequency cepstral coefficients (MFCC) to convert the speech signal into images, so that an LSTM and a CNN can be used to recognize speech emotion. Finally, we use a weighted decision fusion method to fuse the facial expression and speech signals to achieve speech emotion recognition. Comprehensive experimental results demonstrate that, compared with uni-modal emotion recognition, emotion recognition based on bimodal features achieves better performance.
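The weighted decision fusion step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the emotion classes, the fusion weight, or how the weight is chosen, so the label list `EMOTIONS`, the function names, and the default `face_weight` below are all hypothetical. The sketch only shows the general idea of combining per-class probabilities from a facial model and a speech model as a convex combination and picking the highest-scoring class.

```python
from typing import List, Sequence

# Hypothetical label set; the abstract does not list the emotion classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]


def fuse_decisions(
    face_probs: Sequence[float],
    speech_probs: Sequence[float],
    face_weight: float = 0.6,  # assumed value, not taken from the paper
) -> List[float]:
    """Combine two per-class probability vectors by a weighted sum."""
    if len(face_probs) != len(speech_probs):
        raise ValueError("both modalities must score the same classes")
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must lie in [0, 1]")
    speech_weight = 1.0 - face_weight
    return [
        face_weight * f + speech_weight * s
        for f, s in zip(face_probs, speech_probs)
    ]


def predict_emotion(
    face_probs: Sequence[float],
    speech_probs: Sequence[float],
    face_weight: float = 0.6,
) -> str:
    """Return the label with the highest fused score."""
    fused = fuse_decisions(face_probs, speech_probs, face_weight)
    best_index = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best_index]


if __name__ == "__main__":
    # Illustrative scores only; these are not results from the paper.
    face = [0.1, 0.6, 0.2, 0.1]
    speech = [0.5, 0.2, 0.2, 0.1]
    print(predict_emotion(face, speech, face_weight=0.6))
    print(predict_emotion(face, speech, face_weight=0.2))
```

In a sketch like this, the weight would typically be tuned on held-out data; because the weights sum to one, the fused vector remains a valid probability distribution whenever both inputs are.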


Updated: 2020-03-22
