Fusion of Deep Learning Features with Mixture of Brain Emotional Learning for Audio-Visual Emotion Recognition
Speech Communication (IF 2.4) Pub Date: 2020-12-03, DOI: 10.1016/j.specom.2020.12.001
Zeinab Farhoudi, Saeed Setayeshi

Multimodal emotion recognition is a challenging task because emotions are expressed through different modalities over the course of a video clip. Considering the spatial-temporal correlation present in video, we propose an audio-visual fusion model that combines deep learning features with a Mixture of Brain Emotional Learning (MoBEL) model inspired by the brain's limbic system. The proposed model is composed of two stages. First, deep learning methods, specifically Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), are applied to represent highly abstract features. Second, the fusion model, namely MoBEL, is designed to learn the previously joined audio-visual features simultaneously. For the visual modality, a 3D-CNN is used to learn the spatial-temporal features of visual expression. For the auditory modality, Mel-spectrograms of the speech signals are fed into a CNN-RNN for spatial-temporal feature extraction. A high-level feature fusion approach with the MoBEL network is presented to exploit the correlation between the visual and auditory modalities and thereby improve the performance of emotion recognition. Experimental results on the eNterface'05 database demonstrate that the proposed method outperforms hand-crafted features and other state-of-the-art information fusion models in video emotion recognition.
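To make the two-stage pipeline concrete, below is a minimal PyTorch sketch under stated assumptions: the layer sizes, the concatenation fusion, the softmax gating over experts, and the reduction of each brain-emotional-learning expert to an excitatory amygdala path minus an inhibitory orbitofrontal path (E = A(s) - O(s)) are illustrative choices based on the standard BEL formulation, not the authors' exact configuration. The six output classes follow the eNterface'05 emotion set.

```python
# Hedged sketch of the described pipeline: 3D-CNN visual branch,
# CNN-RNN audio branch over Mel-spectrograms, MoBEL-style fusion.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class VisualBranch(nn.Module):
    """3D-CNN over a clip of frames -> spatial-temporal visual feature."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, clip):                 # clip: (B, 3, T, H, W)
        return self.fc(self.net(clip).flatten(1))

class AudioBranch(nn.Module):
    """CNN over the Mel-spectrogram, then an RNN over the time axis."""
    def __init__(self, n_mels=64, feat_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),            # pool frequency, keep time steps
        )
        self.rnn = nn.GRU(16 * (n_mels // 2), feat_dim, batch_first=True)

    def forward(self, mel):                  # mel: (B, 1, n_mels, T)
        h = self.cnn(mel)                    # (B, C, n_mels/2, T)
        h = h.permute(0, 3, 1, 2).flatten(2) # (B, T, C * n_mels/2)
        _, last = self.rnn(h)
        return last.squeeze(0)               # (B, feat_dim)

class BELUnit(nn.Module):
    """Simplified brain-emotional-learning expert: E = A(s) - O(s)."""
    def __init__(self, in_dim, n_classes):
        super().__init__()
        self.amygdala = nn.Linear(in_dim, n_classes)       # excitatory path
        self.orbitofrontal = nn.Linear(in_dim, n_classes)  # inhibitory path

    def forward(self, s):
        return self.amygdala(s) - self.orbitofrontal(s)

class MoBEL(nn.Module):
    """Mixture of BEL experts over the fused audio-visual feature."""
    def __init__(self, in_dim=256, n_experts=4, n_classes=6):
        super().__init__()
        self.experts = nn.ModuleList(
            [BELUnit(in_dim, n_classes) for _ in range(n_experts)])
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, fused):
        g = torch.softmax(self.gate(fused), dim=-1)              # (B, E)
        outs = torch.stack([e(fused) for e in self.experts], 1)  # (B, E, C)
        return (g.unsqueeze(-1) * outs).sum(1)                   # (B, C)

# Usage: fuse branch features by concatenation, classify with MoBEL.
visual, audio, mobel = VisualBranch(), AudioBranch(), MoBEL()
clip = torch.randn(2, 3, 16, 64, 64)   # 2 clips, 16 RGB frames of 64x64
mel = torch.randn(2, 1, 64, 100)       # 2 Mel-spectrograms, 100 time frames
logits = mobel(torch.cat([visual(clip), audio(mel)], dim=-1))
print(logits.shape)                    # torch.Size([2, 6])
```

In practice the Mel-spectrogram input could be computed with, e.g., librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64); the concatenate-then-gate design above is one plausible reading of "high-level feature fusion", chosen so each BEL expert sees the joint audio-visual feature while the gate learns how much to trust each expert per sample.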



Updated: 2020-12-03