Speech emotion recognition using fusion of three multi-task learning-based classifiers: HSF-DNN, MS-CNN and LLD-RNN
Speech Communication (IF 2.4), Pub Date: 2020-03-28, DOI: 10.1016/j.specom.2020.03.005
Zengwei Yao, Zihao Wang, Weihuang Liu, Yaqian Liu, Jiahui Pan

Speech emotion recognition plays an increasingly important role in affective computing and remains a challenging task due to its complexity. In this study, we developed a framework integrating three distinctive classifiers: a deep neural network (DNN), a convolutional neural network (CNN), and a recurrent neural network (RNN). The framework was used for categorical recognition of four discrete emotions (i.e., angry, happy, neutral and sad). Frame-level low-level descriptors (LLDs), segment-level mel-spectrograms (MS), and utterance-level outputs of high-level statistical functions (HSFs) on LLDs were passed to the RNN, CNN, and DNN, respectively, yielding three individual models: LLD-RNN, MS-CNN, and HSF-DNN. In the MS-CNN and LLD-RNN models, an attention-based weighted-pooling method was utilized to aggregate the CNN and RNN outputs. To exploit the interdependencies between the two approaches to emotion description (discrete emotion categories and continuous emotion attributes), a multi-task learning strategy was implemented in all three models to acquire generalized features by simultaneously performing classification of discrete categories and regression of continuous attributes. Finally, a confidence-based fusion strategy was developed to combine the strengths of the different classifiers in recognizing different emotional states. Three emotion recognition experiments were conducted on the IEMOCAP corpus. Our experimental results show that the attention-based weighted-pooling method endowed the neural networks with the capability to focus on emotionally salient parts of an utterance. The generalized features learned through multi-task learning helped the neural networks achieve higher accuracies in emotion classification. Furthermore, our proposed fusion system achieved a weighted accuracy of 57.1% and an unweighted accuracy of 58.3%, both significantly higher than those of each individual classifier. The effectiveness of the proposed classifier-fusion approach was thus validated.
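The attention-based weighted pooling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a single learnable attention vector is dotted with each frame-level output, the resulting scores are softmax-normalized over time, and the frames are averaged under those weights to produce one utterance-level vector.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_pool(frames, w):
    """Aggregate frame-level outputs into one utterance-level vector.

    frames: (T, D) array of RNN/CNN outputs over T time steps
    w:      (D,)  attention parameter vector (learned in practice)
    """
    scores = frames @ w          # (T,) unnormalized attention scores
    alpha = softmax(scores)      # (T,) weights emphasizing salient frames
    return alpha @ frames        # (D,) weighted average over time
```

With uniform scores the pooling reduces to a plain mean over time; training the attention vector lets the network up-weight emotionally salient frames.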
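The abstract does not spell out the confidence-based fusion rule, so the sketch below shows one common confidence-weighted voting scheme as an assumption: each classifier's posterior distribution is weighted by its own confidence (taken here as the maximum posterior), and the weighted sums are compared.

```python
import numpy as np

def fuse_by_confidence(posteriors):
    """Fuse class posteriors from several classifiers.

    posteriors: list of (C,) probability vectors, one per classifier
                (e.g., from HSF-DNN, MS-CNN, and LLD-RNN).
    Returns the index of the winning class.
    """
    fused = np.zeros_like(posteriors[0])
    for p in posteriors:
        fused += p.max() * p  # more confident classifiers contribute more
    return int(np.argmax(fused))
```

Here confidence is a hypothetical proxy (max posterior); the actual strategy in the paper may define and combine confidences differently.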




Updated: 2020-03-28