Cross-modal dynamic convolution for multi-modal emotion recognition
Journal of Visual Communication and Image Representation (IF 2.6), Pub Date: 2021-06-03, DOI: 10.1016/j.jvcir.2021.103178
Huanglu Wen, Shaodi You, Ying Fu

Understanding human emotions requires information from different modalities such as vocal, visual, and verbal signals. Since human emotion is time-varying, the related information is usually represented as temporal sequences, and we need to identify both the emotion-related clues and the cross-modal interactions within them. However, emotion-related clues are sparse and misaligned in temporally unaligned sequences, making it hard for previous multi-modal emotion recognition methods to capture helpful cross-modal interactions. To this end, we present cross-modal dynamic convolution. To deal with sparsity, cross-modal dynamic convolution models the temporal dimension locally, avoiding being overwhelmed by unrelated information. Cross-modal dynamic convolution is easy to stack, enabling it to model long-range cross-modal temporal interactions. In addition, models with cross-modal dynamic convolution train more stably than those with cross-modal attention, opening up more possibilities for multi-modal sequential model design. Extensive experiments show that our method achieves competitive performance compared with previous works while being more efficient.
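To make the idea concrete, below is a minimal PyTorch sketch of what a cross-modal dynamic convolution layer could look like: per-time-step kernels are predicted from a source modality and applied over a local temporal window of a target modality, so each output position mixes only nearby frames rather than attending over the whole sequence. The class name CrossModalDynamicConv, the multi-head channel grouping, and the softmax-normalized kernels are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalDynamicConv(nn.Module):
    # Sketch only: predicts a local convolution kernel at every time step of the
    # source modality and applies it to a window of the target modality.
    def __init__(self, dim: int, kernel_size: int = 5, heads: int = 4):
        super().__init__()
        assert dim % heads == 0 and kernel_size % 2 == 1
        self.dim, self.kernel_size, self.heads = dim, kernel_size, heads
        self.kernel_proj = nn.Linear(dim, heads * kernel_size)  # kernels from source
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, target, source):
        # target, source: (batch, time, dim); only the sequence lengths need to match.
        B, T, D = target.shape
        H, K = self.heads, self.kernel_size

        # Per-step kernels predicted from the source modality, normalized over the window.
        kernels = F.softmax(self.kernel_proj(source).view(B, T, H, K), dim=-1)

        # Slice the target sequence into overlapping windows of length K.
        x = F.pad(target.transpose(1, 2), (K // 2, K // 2))      # (B, D, T + K - 1)
        windows = x.unfold(dimension=2, size=K, step=1)          # (B, D, T, K)
        windows = windows.reshape(B, H, D // H, T, K)

        # Each head's kernel mixes only a local neighborhood of the target.
        out = torch.einsum('bhdtk,bthk->bhdt', windows, kernels)
        out = out.reshape(B, D, T).transpose(1, 2)               # (B, T, D)
        return self.out_proj(out)


# Hypothetical usage: text features filtered by kernels predicted from audio.
layer = CrossModalDynamicConv(dim=32, kernel_size=5, heads=4)
text_feats = torch.randn(2, 50, 32)     # target modality
audio_feats = torch.randn(2, 50, 32)    # source modality
fused = layer(text_feats, audio_feats)  # (2, 50, 32)

Because each layer only covers a window of kernel_size frames, stacking several such layers (possibly alternating which modality predicts the kernels) would grow the receptive field and let the model cover the long-range cross-modal interactions described in the abstract.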



Updated: 2021-06-08