Neural Multimodal Cooperative Learning Toward Micro-Video Understanding.
IEEE Transactions on Image Processing (IF 10.6). Pub Date: 2019-07-01. DOI: 10.1109/tip.2019.2923608
Yinwei Wei, Xiang Wang, Weili Guan, Liqiang Nie, Zhouchen Lin, Baoquan Chen

The prevailing characteristics of micro-videos limit the descriptive power of each individual modality. Several pioneering efforts on micro-video representation are limited to implicitly exploring the consistency between different modalities while ignoring their complementarity. In this paper, we focus on how to explicitly separate the consistent features and the complementary features from the mixed information and harness their combination to improve the expressiveness of each modality. Toward this end, we present a neural multimodal cooperative learning (NMCL) model that separates the consistent component from the complementary component via a novel relation-aware attention mechanism. Specifically, the computed attention score measures the correlation between features extracted from different modalities. A threshold is then learned for each modality to distinguish consistent from complementary features according to this score. Thereafter, we integrate the consistent parts to enhance the representations and supplement the complementary ones to reinforce the information in each modality. To address redundant information, which is hard to distinguish and may cause overfitting, we devise an attention network that dynamically captures the features closely related to the target category and outputs a discriminative representation for prediction. Experimental results on a real-world micro-video dataset show that NMCL outperforms state-of-the-art methods. Further studies verify the effectiveness of the attention mechanism and the cooperative effects it brings.
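The mechanism outlined in the abstract (score cross-modal correlations, learn a per-modality threshold, then enhance with consistent features and supplement with complementary ones) can be illustrated in a few lines of PyTorch. The sketch below is a hedged reading of that description, not the authors' released implementation; the names (RelationAwareSplit, host, guest, tau) and the soft-gated split are illustrative assumptions.

```python
# A minimal sketch of the relation-aware split described above, assuming
# per-modality feature sets of equal dimensionality. Names and the soft-gated
# thresholding are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class RelationAwareSplit(nn.Module):
    """Scores guest-modality features against a host modality, splits them
    into consistent and complementary parts with a learned threshold, and
    uses both to enrich the host representation."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)       # relation-aware attention score
        self.tau = nn.Parameter(torch.zeros(1))  # learned threshold (pre-sigmoid)

    def forward(self, host: torch.Tensor, guest: torch.Tensor) -> torch.Tensor:
        # host, guest: (batch, num_features, dim) features of two modalities
        pair = torch.cat([host, guest], dim=-1)
        a = torch.sigmoid(self.score(pair))       # cross-modal correlation in (0, 1)
        t = torch.sigmoid(self.tau)               # threshold in (0, 1)
        gate = torch.sigmoid(10.0 * (a - t))      # soft, differentiable split
        consistent = gate * guest                 # features both modalities agree on
        complementary = (1.0 - gate) * guest      # information unique to the guest
        # Integrate consistent parts to enhance the representation and
        # supplement the complementary ones to reinforce the host modality.
        return host + a * consistent + (1.0 - a) * complementary

# Example: enhance visual features with the acoustic modality.
if __name__ == "__main__":
    split = RelationAwareSplit(dim=128)
    visual = torch.randn(4, 32, 128)      # hypothetical visual feature set
    acoustic = torch.randn(4, 32, 128)    # hypothetical acoustic feature set
    print(split(visual, acoustic).shape)  # torch.Size([4, 32, 128])
```

A soft sigmoid gate stands in for a hard cutoff so the threshold tau stays differentiable during training; the paper's exact splitting rule may differ.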

Updated: 2020-04-22