Learning deep multimodal affective features for spontaneous speech emotion recognition
Speech Communication (IF 3.2), Pub Date: 2020-12-26, DOI: 10.1016/j.specom.2020.12.009
Shiqing Zhang, Xin Tao, Yuelong Chuang, Xiaoming Zhao

Recently, spontaneous speech emotion recognition has become an active and challenging research subject. This paper proposes a new method for spontaneous speech emotion recognition that uses deep multimodal audio feature learning based on multiple deep convolutional neural networks (multi-CNNs). The proposed method first generates three different audio inputs for the multi-CNNs so as to learn deep multimodal segment-level features from the original 1D audio signal in three ways: 1) a 1D CNN for raw-waveform modeling, 2) a 2D CNN for time-frequency Mel-spectrogram modeling, and 3) a 3D CNN for temporal-spatial dynamic modeling. Then, average pooling is performed on the segment-level classification results obtained from the 1D, 2D, and 3D CNNs to produce utterance-level classification results. Finally, a score-level fusion strategy is adopted as the multi-CNN fusion method, integrating the different utterance-level classification results for final emotion classification. The learned deep multimodal audio features are shown to be complementary to each other, so combining them in the multi-CNN fusion network yields significantly improved emotion classification performance. Experiments conducted on two challenging spontaneous emotional speech datasets, i.e., the AFEW 5.0 and BAUM-1s databases, demonstrate the promising performance of the proposed method.
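
To make the described pipeline concrete, the following is a minimal PyTorch sketch of the multi-CNN, score-level fusion idea: three branches (1D CNN on raw waveform segments, 2D CNN on Mel-spectrogram segments, 3D CNN on stacked spectrogram slices), segment-level scores average-pooled into utterance-level scores, and a weighted score-level fusion of the three branches. The layer configurations, segment sizes, fusion weights, and the seven-class output are illustrative assumptions, not the authors' exact architecture.

# Minimal sketch of multi-CNN score-level fusion (illustrative assumptions,
# not the authors' exact configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 7  # assumed number of emotion categories

class WaveformCNN1D(nn.Module):
    """1D CNN over a raw-waveform segment (batch, 1, samples)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=32, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, NUM_CLASSES)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class SpectrogramCNN2D(nn.Module):
    """2D CNN over a Mel-spectrogram segment (batch, 1, mels, frames)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, NUM_CLASSES)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class SpectrogramCNN3D(nn.Module):
    """3D CNN over stacked spectrogram slices (batch, 1, depth, mels, frames)
    for temporal-spatial dynamic modeling."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, NUM_CLASSES)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def utterance_scores(model, segments):
    """Average-pool segment-level class scores into one utterance-level score."""
    seg_logits = model(segments)                 # (num_segments, NUM_CLASSES)
    return F.softmax(seg_logits, dim=1).mean(0)  # (NUM_CLASSES,)

def fuse_scores(score_1d, score_2d, score_3d, weights=(1.0, 1.0, 1.0)):
    """Score-level fusion: weighted average of the three utterance-level scores."""
    w = torch.tensor(weights)
    stacked = torch.stack([score_1d, score_2d, score_3d])  # (3, NUM_CLASSES)
    fused = (w[:, None] * stacked).sum(0) / w.sum()
    return fused.argmax().item(), fused

if __name__ == "__main__":
    # One utterance split into 5 segments per input representation (dummy data).
    raw_segs  = torch.randn(5, 1, 16000)       # 1 s of 16 kHz audio per segment
    mel_segs  = torch.randn(5, 1, 64, 100)     # 64 Mel bands x 100 frames
    mel_cubes = torch.randn(5, 1, 8, 64, 100)  # 8 stacked spectrogram slices

    s1 = utterance_scores(WaveformCNN1D(), raw_segs)
    s2 = utterance_scores(SpectrogramCNN2D(), mel_segs)
    s3 = utterance_scores(SpectrogramCNN3D(), mel_cubes)
    label, fused = fuse_scores(s1, s2, s3)
    print("predicted emotion class:", label)

In this sketch each branch is trained and evaluated on its own segments; only their utterance-level probability vectors are combined, which is what distinguishes score-level fusion from feature-level fusion.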




Updated: 2021-01-13