Multi-modal Attention for Speech Emotion Recognition
arXiv - CS - Sound. Pub Date: 2020-09-09. DOI: arXiv:2009.04107
Zexu Pan, Zhaojie Luo, Jichen Yang, Haizhou Li

Emotion is an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as the multi-modal attention network (MMAN), to make use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates attention across the three modalities and selectively fuses the information. cLSTM-MMA is combined with other uni-modal sub-networks in late fusion. Experiments show that speech emotion recognition benefits significantly from visual and textual cues, and that cLSTM-MMA alone is as competitive as other fusion methods in accuracy, while having a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on the IEMOCAP database for emotion recognition.
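To make the fusion scheme concrete, here is a minimal PyTorch sketch of the general idea: per-modality LSTM encoders, cross-modal attention in which each modality queries the others, and late (score-level) fusion with uni-modal heads. All layer sizes, the use of nn.MultiheadAttention, the mean-pooling, and the 4-class output are illustrative assumptions, not the paper's actual cLSTM-MMA implementation.

```python
# Hypothetical sketch of multi-modal attention fusion in the spirit of
# cLSTM-MMA; details differ from the paper.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One modality's sequence attends over the other modalities' sequences."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, query_seq, context_seq):
        fused, _ = self.attn(query_seq, context_seq, context_seq)
        return fused


class MultiModalAttentionNet(nn.Module):
    """Hybrid fusion: LSTM encoders per modality, cross-modal attention,
    then late fusion of multi-modal and uni-modal scores."""

    def __init__(self, dims, hidden=128, n_classes=4):
        super().__init__()
        # One LSTM encoder per modality (speech, visual, text).
        self.encoders = nn.ModuleList(
            [nn.LSTM(d, hidden, batch_first=True) for d in dims]
        )
        # Each modality attends to the concatenation of the other two.
        self.cross = nn.ModuleList([CrossModalAttention(hidden) for _ in dims])
        self.fusion_head = nn.Linear(3 * hidden, n_classes)
        # Uni-modal heads used in the late-fusion stage.
        self.uni_heads = nn.ModuleList(
            [nn.Linear(hidden, n_classes) for _ in dims]
        )

    def forward(self, speech, visual, text):
        seqs = [enc(x)[0] for enc, x in zip(self.encoders, (speech, visual, text))]
        # Cross-modal attention: modality i queries the other two modalities.
        attended = []
        for i, seq in enumerate(seqs):
            others = torch.cat([s for j, s in enumerate(seqs) if j != i], dim=1)
            attended.append(self.cross[i](seq, others).mean(dim=1))
        fusion_logits = self.fusion_head(torch.cat(attended, dim=-1))
        # Late fusion: average multi-modal and uni-modal class scores.
        uni_logits = [h(s.mean(dim=1)) for h, s in zip(self.uni_heads, seqs)]
        return torch.stack([fusion_logits, *uni_logits]).mean(dim=0)


if __name__ == "__main__":
    net = MultiModalAttentionNet(dims=(40, 512, 300))
    a = torch.randn(2, 100, 40)   # e.g. 40-dim filterbank frames
    v = torch.randn(2, 30, 512)   # e.g. per-frame face features
    t = torch.randn(2, 20, 300)   # e.g. word embeddings
    print(net(a, v, t).shape)     # torch.Size([2, 4])
```

The 4-class output here mirrors the common IEMOCAP setup (angry, happy, sad, neutral), but the paper should be consulted for the exact label set and network configuration.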

Updated: 2020-09-10