Robust Multimodal Representation Learning With Evolutionary Adversarial Attention Networks
IEEE Transactions on Evolutionary Computation (IF 14.3), Pub Date: 2021-03-17, DOI: 10.1109/tevc.2021.3066285
Feiran Huang, Alireza Jolfaei, Ali Kashif Bashir

Multimodal representation learning benefits many multimedia-oriented applications, such as social image recognition and visual question answering. The different modalities of the same instance (e.g., a social image and its corresponding description) are usually correlated and complementary. Most existing approaches to multimodal representation learning do not effectively model the deep correlation between different modalities, and they struggle to handle the noise within social images. In this article, we propose a deep learning-based approach named evolutionary adversarial attention networks (EAANs), which combines the attention mechanism with adversarial networks through evolutionary training for robust multimodal representation learning. Specifically, a two-branch visual-textual attention model is proposed to correlate visual and textual content for joint representation. Adversarial networks are then employed to regularize the representation by matching its posterior distribution to the given priors. Finally, the attention model and adversarial networks are integrated into an evolutionary training framework for robust multimodal representation learning. Extensive experiments have been conducted on four real-world datasets: PASCAL, MIR, CLEF, and NUS-WIDE. Substantial performance improvements on image classification and tag recommendation demonstrate the superiority of the proposed approach.
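To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the three components the abstract names: a two-branch visual-textual attention encoder, an adversarial discriminator that regularizes the joint representation by matching it to a prior (here, an assumed standard Gaussian), and a simple mutate-and-select evolutionary loop over candidate encoders. All module names, dimensions, and the fitness definition are illustrative assumptions, not the paper's actual specification.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionFusion(nn.Module):
    """Two-branch encoder: attends over region and word features, then fuses them."""
    def __init__(self, vis_dim=2048, txt_dim=300, joint_dim=256):
        super().__init__()
        self.vis_att = nn.Linear(vis_dim, 1)   # attention score per image region
        self.txt_att = nn.Linear(txt_dim, 1)   # attention score per word
        self.proj = nn.Linear(vis_dim + txt_dim, joint_dim)

    def forward(self, vis, txt):  # vis: (B, regions, vis_dim), txt: (B, words, txt_dim)
        v = (torch.softmax(self.vis_att(vis), dim=1) * vis).sum(dim=1)
        t = (torch.softmax(self.txt_att(txt), dim=1) * txt).sum(dim=1)
        return torch.tanh(self.proj(torch.cat([v, t], dim=-1)))  # joint representation

class Discriminator(nn.Module):
    """Tells prior samples apart from encoder outputs (adversarial regularization)."""
    def __init__(self, joint_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(joint_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, z):
        return self.net(z)

def disc_step(disc, opt, encoder, vis, txt):
    """One discriminator update: 'real' = samples from the assumed N(0, I) prior,
    'fake' = the encoder's joint representations."""
    z_fake = encoder(vis, txt).detach()
    z_real = torch.randn_like(z_fake)
    loss = F.binary_cross_entropy_with_logits(disc(z_real), torch.ones(len(z_real), 1)) \
         + F.binary_cross_entropy_with_logits(disc(z_fake), torch.zeros(len(z_fake), 1))
    opt.zero_grad(); loss.backward(); opt.step()

def fitness(encoder, disc, vis, txt):
    """Fitness = how prior-like the encoder's outputs look to the discriminator."""
    with torch.no_grad():
        return torch.sigmoid(disc(encoder(vis, txt))).mean().item()

def evolve(population, disc, vis, txt, sigma=0.01):
    """Keep the fitter half of the encoders, refill with Gaussian-mutated copies."""
    ranked = sorted(population, key=lambda e: fitness(e, disc, vis, txt), reverse=True)
    survivors = ranked[: len(ranked) // 2]
    children = [copy.deepcopy(p) for p in survivors]
    for child in children:
        for p in child.parameters():
            p.data.add_(sigma * torch.randn_like(p))  # parameter mutation
    return survivors + children

if __name__ == "__main__":
    vis = torch.randn(8, 36, 2048)   # 8 images, 36 region features each
    txt = torch.randn(8, 20, 300)    # 8 descriptions, 20 word vectors each
    disc = Discriminator()
    opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
    population = [AttentionFusion() for _ in range(4)]
    for _ in range(3):               # alternate adversarial and evolutionary steps
        disc_step(disc, opt, population[0], vis, txt)
        population = evolve(population, disc, vis, txt)
    print("best fitness:", fitness(population[0], disc, vis, txt))
```

A real system would combine this adversarial fitness with the downstream task losses (image classification, tag recommendation) and gradient-based updates; the sketch only illustrates how the three pieces interact.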

Updated: 2021-03-17