Learning deep multimodal affective features for spontaneous speech emotion recognition
Speech Communication (IF 3.2), Pub Date: 2020-12-26, DOI: 10.1016/j.specom.2020.12.009
Shiqing Zhang, Xin Tao, Yuelong Chuang, Xiaoming Zhao

Recently, spontaneous speech emotion recognition has become an active and challenging research subject. This paper proposes a new method for spontaneous speech emotion recognition that uses deep multimodal audio feature learning based on multiple deep convolutional neural networks (multi-CNNs). The proposed method first generates three different audio inputs for the multi-CNNs so as to learn deep multimodal segment-level features from the original 1D audio signal in three ways: 1) a 1D CNN for raw-waveform modeling, 2) a 2D CNN for time-frequency Mel-spectrogram modeling, and 3) a 3D CNN for temporal-spatial dynamic modeling. Then, average pooling is performed on the segment-level classification results obtained from the 1D, 2D, and 3D CNNs to produce utterance-level classification results. Finally, a score-level fusion strategy is adopted as the multi-CNN fusion method, integrating the different utterance-level classification results for final emotion classification. The learned deep multimodal audio features are shown to be complementary to each other, so combining them in the multi-CNN fusion network yields significantly improved emotion classification performance. Experiments conducted on two challenging spontaneous emotional speech datasets, i.e., the AFEW 5.0 and BAUM-1s databases, demonstrate the promising performance of the proposed method.
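
To make the described pipeline concrete, the following is a minimal PyTorch sketch of the multi-CNN, score-level fusion idea: three branches (1D CNN on raw waveform segments, 2D CNN on Mel-spectrogram segments, 3D CNN on stacked spectrogram slices), segment-level scores average-pooled into utterance-level scores, and a weighted score-level fusion of the three branches. The layer configurations, segment sizes, fusion weights, and the seven-class output are illustrative assumptions, not the authors' exact architecture.

# Minimal sketch of multi-CNN score-level fusion (illustrative assumptions,
# not the authors' exact configuration).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_CLASSES = 7  # assumed number of emotion categories

class WaveformCNN1D(nn.Module):
    """1D CNN over a raw-waveform segment (batch, 1, samples)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=32, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, NUM_CLASSES)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class SpectrogramCNN2D(nn.Module):
    """2D CNN over a Mel-spectrogram segment (batch, 1, mels, frames)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, NUM_CLASSES)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class SpectrogramCNN3D(nn.Module):
    """3D CNN over stacked spectrogram slices (batch, 1, depth, mels, frames)
    for temporal-spatial dynamic modeling."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.classifier = nn.Linear(32, NUM_CLASSES)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def utterance_scores(model, segments):
    """Average-pool segment-level class scores into one utterance-level score."""
    seg_logits = model(segments)                 # (num_segments, NUM_CLASSES)
    return F.softmax(seg_logits, dim=1).mean(0)  # (NUM_CLASSES,)

def fuse_scores(score_1d, score_2d, score_3d, weights=(1.0, 1.0, 1.0)):
    """Score-level fusion: weighted average of the three utterance-level scores."""
    w = torch.tensor(weights)
    stacked = torch.stack([score_1d, score_2d, score_3d])  # (3, NUM_CLASSES)
    fused = (w[:, None] * stacked).sum(0) / w.sum()
    return fused.argmax().item(), fused

if __name__ == "__main__":
    # One utterance split into 5 segments per input representation (dummy data).
    raw_segs  = torch.randn(5, 1, 16000)       # 1 s of 16 kHz audio per segment
    mel_segs  = torch.randn(5, 1, 64, 100)     # 64 Mel bands x 100 frames
    mel_cubes = torch.randn(5, 1, 8, 64, 100)  # 8 stacked spectrogram slices

    s1 = utterance_scores(WaveformCNN1D(), raw_segs)
    s2 = utterance_scores(SpectrogramCNN2D(), mel_segs)
    s3 = utterance_scores(SpectrogramCNN3D(), mel_cubes)
    label, fused = fuse_scores(s1, s2, s3)
    print("predicted emotion class:", label)

In this sketch each branch is trained and evaluated on its own segments; only their utterance-level probability vectors are combined, which is what distinguishes score-level fusion from feature-level fusion.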




Updated: 2021-01-13