Multimodal and Context-Aware Emotion Perception Model With Multiplicative Fusion
IEEE Multimedia (IF 2.3). Pub Date: 2021-03-23. DOI: 10.1109/mmul.2021.3068387
Trisha Mittal, Aniket Bera, Dinesh Manocha

We present a learning model for multimodal, context-aware emotion recognition. Our approach combines multiple co-occurring human modalities (facial, audio, textual, and pose/gait cues) with two interpretations of context. For the first context interpretation, we use a self-attention-based CNN to gather and encode background semantic information from the input image/video. For the second, we use depth maps to model the sociodynamic interactions among people in the input image/video. We combine the modality and context channels with multiplicative fusion, which learns to focus on the more informative input channels and suppress the others for every incoming datapoint. We demonstrate the efficiency of our model on four benchmark emotion recognition datasets (IEMOCAP, CMU-MOSEI, EMOTIC, and GroupWalk). Our model outperforms state-of-the-art (SOTA) learning methods, with an average 5%–9% improvement across all datasets. We also perform ablation studies to motivate the importance of multimodality, context, and multiplicative fusion.
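To make the fusion step concrete, the sketch below illustrates one generic form of multiplicative fusion: per-channel classifier heads whose log-probabilities are combined as a weighted product of experts, with learned per-sample gates that can suppress uninformative channels. This is a minimal illustration in PyTorch, not the authors' implementation; the class name, gating heads, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiplicativeFusion(nn.Module):
    """Hypothetical sketch: combine per-channel class predictions as a
    weighted product of per-channel softmax distributions. Learned,
    input-dependent gates let the model down-weight (suppress) less
    informative modality/context channels for each datapoint."""

    def __init__(self, num_channels: int, num_classes: int, feat_dim: int):
        super().__init__()
        # one small classifier head per input channel (modality or context)
        self.heads = nn.ModuleList(
            nn.Linear(feat_dim, num_classes) for _ in range(num_channels)
        )
        # per-channel gating network producing a non-negative weight per sample
        self.gates = nn.ModuleList(
            nn.Linear(feat_dim, 1) for _ in range(num_channels)
        )

    def forward(self, feats):
        # feats: list of tensors, one per channel, each of shape (B, feat_dim)
        log_probs, weights = [], []
        for head, gate, x in zip(self.heads, self.gates, feats):
            log_probs.append(F.log_softmax(head(x), dim=-1))  # (B, C)
            weights.append(F.softplus(gate(x)))               # (B, 1), >= 0
        log_probs = torch.stack(log_probs, dim=0)             # (K, B, C)
        weights = torch.stack(weights, dim=0)                 # (K, B, 1)
        # weighted product of experts == weighted sum of log-probabilities
        fused = (weights * log_probs).sum(dim=0)              # (B, C)
        return F.log_softmax(fused, dim=-1)                   # renormalize

# usage sketch: four channels (e.g., face, audio, text, pose) with random features
if __name__ == "__main__":
    model = MultiplicativeFusion(num_channels=4, num_classes=6, feat_dim=128)
    feats = [torch.randn(8, 128) for _ in range(4)]
    out = model(feats)  # (8, 6) log-probabilities over emotion classes
    print(out.shape)
```

Because the product of experts turns into a sum of log-probabilities, a near-zero gate effectively removes a channel's vote, which mirrors the described behavior of focusing on informative channels and suppressing the rest per input.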
