End-to-End Learning for Multimodal Emotion Recognition in Video With Adaptive Loss
IEEE Multimedia (IF 2.3), Pub Date: 2021-05-14, DOI: 10.1109/mmul.2021.3080305
Van Thong Huynh, Hyung-Jeong Yang, Guee-Sang Lee, Soo-Hyung Kim

This work presents an approach for emotion recognition in video through the interaction of visual, audio, and language information in an end-to-end learning manner, with three key points: 1) a lightweight feature extractor, 2) an attention strategy, and 3) an adaptive loss. We propose a lightweight deep architecture of approximately 1 MB for feature extraction, the most crucial component of emotion recognition systems. Temporal relationships among features are explored with a temporal convolutional network instead of an RNN-based architecture, to leverage parallelism and avoid the vanishing-gradient problem. The attention strategy adjusts the temporal network's knowledge along the time dimension and learns each modality's contribution to the final result. The interaction between modalities is also investigated by training with an adaptive objective function that adjusts the network's gradients. Experimental results on a large-scale dataset for emotion recognition on Koreans demonstrate the superiority of our method when the attention mechanism and adaptive loss are employed during training.
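The two core ideas in the abstract — causal temporal convolution over per-modality feature sequences, and softmax attention that weights each modality's contribution before fusion — can be illustrated with a minimal numpy sketch. This is not the authors' architecture: the feature sequences, kernel, and scalar per-modality scores below are hypothetical placeholders standing in for the paper's learned extractors and temporal networks.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution: the output at time t sees only x[:t+1].

    x: (T, C) feature sequence, w: (K, C) kernel. Returns (T,) activations.
    Left-padding with zeros keeps the convolution causal, as in a TCN layer.
    """
    K = w.shape[0]
    xp = np.vstack([np.zeros((K - 1, x.shape[1])), x])  # pad earlier timesteps
    return np.array([np.sum(xp[t:t + K] * w) for t in range(x.shape[0])])

def attention_fuse(scores):
    """Softmax attention over per-modality scores -> fused score + weights."""
    e = np.exp(scores - np.max(scores))  # shift for numerical stability
    w = e / e.sum()
    return float(np.dot(w, scores)), w

rng = np.random.default_rng(0)
T, C, K = 8, 4, 3          # sequence length, channels, kernel size (assumed)
kernel = rng.standard_normal((K, C))

# Hypothetical per-modality feature sequences (visual, audio, language),
# each reduced to a scalar score by the temporal convolution.
modal_scores = np.array([
    causal_conv1d(rng.standard_normal((T, C)), kernel).mean()
    for _ in range(3)
])
fused, weights = attention_fuse(modal_scores)
print(fused, weights)
```

Because the attention weights form a convex combination, the fused score always lies between the weakest and strongest modality score — the mechanism reweights modalities rather than inventing new evidence.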
