Learning Alignment for Multimodal Emotion Recognition from Speech
arXiv - CS - Sound. Pub Date: 2019-09-06, DOI: arxiv-1909.05645
Haiyang Xu, Hui Zhang, Kun Han, Yun Wang, Yiping Peng, Xiangang Li

Speech emotion recognition is a challenging problem because humans convey emotions in subtle and complex ways. For emotion recognition on human speech, one can either extract emotion-related features from audio signals or employ speech recognition techniques to generate text from speech and then apply natural language processing to analyze the sentiment. Furthermore, although emotion recognition can benefit from audio-textual multimodal information, it is not trivial to build a system that learns from multiple modalities. One can build models for the two input sources separately and combine them at the decision level, but this method ignores the interaction between speech and text in the temporal domain. In this paper, we propose to use an attention mechanism to learn the alignment between speech frames and text words, aiming to produce more accurate multimodal feature representations. The aligned multimodal features are fed into a sequential model for emotion recognition. We evaluate the approach on the IEMOCAP dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on the dataset.
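The attention-based alignment described in the abstract can be illustrated with a short sketch: each text word attends over all speech frames, the resulting acoustic context is concatenated with the word representation, and the fused sequence is fed to a sequential classifier. The code below is a minimal PyTorch illustration under assumed settings (40-dimensional acoustic frames, 300-dimensional word embeddings, an LSTM classifier, four emotion classes); it is not the authors' exact architecture.

```python
# Minimal sketch: attention alignment between speech frames and text words,
# followed by a sequential model for emotion classification.
# Dimensions and layer choices are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignedMultimodalEmotion(nn.Module):
    def __init__(self, audio_dim=40, word_dim=300, hidden_dim=128, num_classes=4):
        super().__init__()
        # Project both modalities into a shared space for attention scoring.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.text_proj = nn.Linear(word_dim, hidden_dim)
        # Sequential model over the aligned multimodal features.
        self.lstm = nn.LSTM(hidden_dim * 2, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, audio_frames, word_embs):
        # audio_frames: (batch, n_frames, audio_dim)
        # word_embs:    (batch, n_words, word_dim)
        a = self.audio_proj(audio_frames)          # (B, T_a, H)
        t = self.text_proj(word_embs)              # (B, T_w, H)
        # Attention scores: each word attends over all speech frames.
        scores = torch.bmm(t, a.transpose(1, 2))   # (B, T_w, T_a)
        align = F.softmax(scores, dim=-1)
        # Aligned acoustic context for each word.
        audio_ctx = torch.bmm(align, a)            # (B, T_w, H)
        # Fuse each word with its aligned acoustic features.
        fused = torch.cat([t, audio_ctx], dim=-1)  # (B, T_w, 2H)
        out, _ = self.lstm(fused)
        # Use the final hidden state for utterance-level emotion prediction.
        return self.classifier(out[:, -1, :])


if __name__ == "__main__":
    model = AlignedMultimodalEmotion()
    audio = torch.randn(2, 200, 40)   # e.g. 200 frames of 40-dim filterbank features
    words = torch.randn(2, 12, 300)   # e.g. 12 words with 300-dim embeddings
    logits = model(audio, words)
    print(logits.shape)               # torch.Size([2, 4])
```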

Updated: 2020-04-06