Generalized pyramid co-attention with learnable aggregation net for video question answering
Pattern Recognition (IF 8) Pub Date: 2021-06-30, DOI: 10.1016/j.patcog.2021.108145
Lianli Gao, Tangming Chen, Xiangpeng Li, Pengpeng Zeng, Lei Zhao, Yuan-Fang Li

Video-based visual question answering (V-VQA) remains challenging, sitting as it does at the intersection of vision and language. In this paper, we propose a novel architecture, the Generalized Pyramid Co-attention with Learnable Aggregation Net (GPC), to address two central problems: 1) how to deploy co-attention in the V-VQA task given the complex and diverse content of videos; and 2) how to aggregate frame-level (or word-level) features without destroying their feature distributions and temporal information. To solve the first problem, we propose a Generalized Pyramid Co-attention structure with a novel diversity learning module that explicitly encourages attention accuracy and diversity. We first instantiate it as a Multi-path Pyramid Co-attention (MPC) module to capture diverse features. We then find that the attention branches of the original co-attention mechanism do not interact with one another, which results in coarse attention maps, so we extend the MPC structure into a Cascaded Pyramid Transformer Co-attention (CPTC) module, in which co-attention is replaced with transformer co-attention. To solve the second problem, we propose a new learnable aggregation method with a set of evidence gates. It automatically aggregates adaptively weighted frame-level (or word-level) features to extract rich video (or question) context semantics; the evidence gates then select the signals most relevant to the evidence for predicting the answer. Extensive validation on two V-VQA datasets, TGIF-QA and TVQA, shows that both MPC and CPTC achieve state-of-the-art performance, with CPTC performing better across various settings and metrics. Code and models have been released at: https://github.com/lixiangpengcs/LAD-Net-for-VideoQA.
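To make the two mechanisms above concrete, below is a minimal PyTorch sketch of (a) a diversity regularizer that pushes multiple attention paths apart, in the spirit of the diversity learning module, and (b) an evidence-gated learnable aggregation over frame-level features. All names (`attention_diversity_loss`, `EvidenceGatedAggregation`), layer shapes, and gating details are illustrative assumptions, not the authors' released implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def attention_diversity_loss(attn_maps):
    """Hypothetical diversity regularizer: penalize pairwise cosine
    similarity between the attention maps of different paths so that
    each path attends to different video content.
    attn_maps: (batch, num_paths, seq_len), rows softmax-normalized;
    assumes num_paths >= 2."""
    normed = F.normalize(attn_maps, dim=-1)           # unit-norm per path
    sim = torch.bmm(normed, normed.transpose(1, 2))   # (B, P, P) pairwise sims
    p = attn_maps.size(1)
    off_diag = sim - torch.eye(p, device=sim.device)  # drop self-similarity
    return off_diag.abs().sum(dim=(1, 2)).mean() / (p * (p - 1))

class EvidenceGatedAggregation(nn.Module):
    """Illustrative sketch: adaptively weight frame-level (or word-level)
    features, aggregate them into one context vector, then apply a sigmoid
    evidence gate so only answer-relevant channels pass through."""

    def __init__(self, feat_dim, out_dim=512):
        super().__init__()
        self.weight_fc = nn.Linear(feat_dim, 1)        # scores each time step
        self.gate_fc = nn.Linear(feat_dim, feat_dim)   # per-channel evidence gate
        self.out_fc = nn.Linear(feat_dim, out_dim)

    def forward(self, feats):
        # feats: (batch, seq_len, feat_dim) frame- or word-level features
        weights = F.softmax(self.weight_fc(feats), dim=1)  # adaptive weights
        context = (weights * feats).sum(dim=1)             # weighted aggregation
        gate = torch.sigmoid(self.gate_fc(context))        # evidence gate
        return self.out_fc(gate * context)                 # gated evidence vector

# Usage: aggregate 36 frame features (dim 2048) for a batch of 8 videos.
frames = torch.randn(8, 36, 2048)
agg = EvidenceGatedAggregation(feat_dim=2048)
video_context = agg(frames)                 # -> (8, 512)
paths = F.softmax(torch.randn(8, 4, 36), dim=-1)
div_loss = attention_diversity_loss(paths)  # scalar regularizer
```

The sketch separates the two ideas so either can be dropped into a pipeline independently: the diversity loss would be added to the training objective to keep the multiple co-attention paths from collapsing onto the same frames, while the gated aggregation replaces plain mean-pooling over time.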



Updated: 2021-07-25