Knowledge-Augmented Multimodal Deep Regression Bayesian Networks for Emotion Video Tagging
IEEE Transactions on Multimedia (IF 7.3), Pub Date: 2020-04-01, DOI: 10.1109/tmm.2019.2934824
Shangfei Wang , Longfei Hao , Qiang Ji

The inherent dependencies between the audio and visual modalities extracted from video content, together with well-established film grammar (i.e., domain knowledge), are important for emotion recognition and regression from video. However, neither has yet been exploited successfully. We therefore propose a multimodal deep regression Bayesian network (MMDRBN) that captures the relationship between the audio and visual modalities for emotion video tagging, and we then modify its structure to incorporate domain knowledge. A regression Bayesian network (RBN) consists of one latent layer, one visible layer, and directed links from the latent layer to the visible layer. An RBN can fully represent the data, since it captures the dependencies not only among the visible variables but also among the latent variables given the visible variables. To build the MMDRBN, we first learn several layers of RBNs from the audio and visual modalities and stack them into two deep networks; a joint representation obtained from the top layers of the two networks then captures the deep dependencies between the modalities. We also summarize the main audio and visual elements filmmakers use to convey emotion and formulate them as a semantically meaningful mid-level representation, i.e., attributes. With these attributes we construct the knowledge-augmented MMDRBN, which learns a hybrid mid-level video representation from the video data and the summarized attributes. Experimental results for both emotion recognition and regression on the LIRIS-ACCEDE database demonstrate that the proposed model successfully captures the intrinsic connections between the audio and visual modalities and integrates the mid-level representations learned from video data with the semantic attributes summarized from film grammar. It thus achieves superior performance on emotion video tagging compared with state-of-the-art methods.
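As a rough illustration of the data flow the abstract describes (not the authors' implementation, whose learning procedure is not given here), the Python sketch below stacks simple RBN-style inference layers into two modality streams, fuses their top-layer outputs into a joint representation, and appends film-grammar attribute scores to form the hybrid mid-level representation. All layer sizes, the sigmoid inference rule, and the attribute values are placeholder assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBNLayer:
    """One regression Bayesian network layer: a latent layer h, a visible
    layer v, and directed links h -> v. Only the upward inference pass
    q(h | v), used to feed the next layer, is sketched; the weights would
    normally be learned from data (training is not specified here)."""
    def __init__(self, n_visible, n_latent, rng):
        self.W = rng.normal(scale=0.01, size=(n_visible, n_latent))
        self.b = np.zeros(n_latent)

    def infer(self, v):
        # Posterior mean of the latent units given the visible layer.
        return sigmoid(v @ self.W + self.b)

def stack_infer(layers, x):
    """Propagate features through a stack of RBN layers (one deep network)."""
    h = x
    for layer in layers:
        h = layer.infer(h)
    return h

rng = np.random.default_rng(0)

# Hypothetical feature dimensions for the two modalities.
audio_dim, visual_dim, hid, top, n_attr = 64, 128, 32, 16, 8

# Two modality-specific deep networks built by stacking RBN layers.
audio_stack  = [RBNLayer(audio_dim, hid, rng), RBNLayer(hid, top, rng)]
visual_stack = [RBNLayer(visual_dim, hid, rng), RBNLayer(hid, top, rng)]

# Toy inputs standing in for extracted audio/visual descriptors of one clip.
a = rng.random(audio_dim)
v = rng.random(visual_dim)

# Joint representation from the top layers of the two deep networks,
# capturing dependencies between the audio and visual modalities.
joint = np.concatenate([stack_infer(audio_stack, a),
                        stack_infer(visual_stack, v)])

# Knowledge augmentation: append film-grammar attribute scores (e.g., tempo,
# brightness, shot length) to form the hybrid mid-level representation that
# would be fed to the emotion tagger.
attributes = rng.random(n_attr)  # placeholder for attribute predictors
hybrid = np.concatenate([joint, attributes])
print(hybrid.shape)  # (2 * top + n_attr,) = (40,)
```

In the actual model the two streams and the fusion layer form a single directed network, and the attributes enter through a modified MMDRBN structure rather than by plain concatenation; the sketch only shows the shape of the pipeline.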

Updated: 2020-04-01