Video Question Answering With Prior Knowledge and Object-Sensitive Learning
IEEE Transactions on Image Processing (IF 10.6), Pub Date: 2022-09-09, DOI: 10.1109/tip.2022.3205212
Pengpeng Zeng, Haonan Zhang, Lianli Gao, Jingkuan Song, Heng Tao Shen

Video Question Answering (VideoQA), which explores the spatial-temporal visual information of a video given a linguistic query, has received unprecedented attention in recent years. One of the main challenges lies in locating the relevant visual and linguistic information, so various attention-based approaches have been proposed. Despite this impressive progress, current methods leave two aspects underexplored when computing attention. First, prior knowledge, which plays an important role in assisting the reasoning process of VideoQA in human cognition, is not fully utilized. Second, structured visual information (e.g., objects), as opposed to the raw video, is underestimated. To address these two issues, we propose Prior Knowledge and Object-sensitive Learning (PKOL), which exploits prior knowledge and learns object-sensitive representations to boost the VideoQA task. Specifically, we first propose a Prior Knowledge Exploring (PKE) module that acquires prior knowledge and integrates it into the question feature for feature enrichment, where an information retriever is constructed to retrieve related sentences from a massive corpus as prior knowledge. In addition, we propose an Object-sensitive Representation Learning (ORL) module that generates object-sensitive features by interacting object-level features with frame- and clip-level features. Our proposed PKOL achieves consistent improvements on three competitive benchmarks (i.e., MSVD-QA, MSRVTT-QA, and TGIF-QA) and attains state-of-the-art performance. The source code is available at https://github.com/zchoi/PKOL .
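To make the two modules more concrete, below is a minimal, hypothetical PyTorch sketch of the ideas described in the abstract: top-k sentence retrieval fused into the question feature (PKE), and cross-attention from frame/clip-level features to object-level features (ORL). The class names, feature dimensions, top-k value, and the multi-head-attention fusion are assumptions made for illustration only, not the authors' implementation; the official code is in the linked repository.

    # Hypothetical sketch of the PKE / ORL ideas; not the authors' code.
    # See https://github.com/zchoi/PKOL for the official implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PriorKnowledgeExploring(nn.Module):
        """Retrieve top-k corpus sentences for a question and fuse them
        into the question feature (fusion via attention is an assumption)."""
        def __init__(self, dim: int, top_k: int = 5):
            super().__init__()
            self.top_k = top_k
            self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

        def forward(self, question: torch.Tensor, corpus: torch.Tensor) -> torch.Tensor:
            # question: (B, D); corpus: (N, D) pre-encoded sentence features
            scores = F.normalize(question, dim=-1) @ F.normalize(corpus, dim=-1).t()  # (B, N)
            topk = scores.topk(self.top_k, dim=-1).indices                            # (B, k)
            prior = corpus[topk]                                                      # (B, k, D)
            enriched, _ = self.fuse(question.unsqueeze(1), prior, prior)              # (B, 1, D)
            return question + enriched.squeeze(1)  # prior-knowledge-enriched question feature

    class ObjectSensitiveLearning(nn.Module):
        """Let frame- or clip-level features attend to object-level features
        to obtain object-sensitive visual representations."""
        def __init__(self, dim: int):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

        def forward(self, frame_or_clip: torch.Tensor, objects: torch.Tensor) -> torch.Tensor:
            # frame_or_clip: (B, T, D); objects: (B, M, D) flattened per-frame object features
            attended, _ = self.cross_attn(frame_or_clip, objects, objects)
            return frame_or_clip + attended  # residual, object-sensitive features

Under these assumptions, the enriched question feature and the object-sensitive frame/clip features would then be combined by a downstream answer decoder, which is outside the scope of this sketch.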

Last updated: 2022-09-09