Learning to Answer Visual Questions from Web Videos
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8). Pub Date: 2022-05-09, DOI: 10.1109/tpami.2022.3173208
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and the VideoQA feature probe evaluation setting and show excellent results. Furthermore, our method achieves competitive results on MSRVTT-QA, ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our approach generalizes to another source of web video and text data. We generate the WebVidVQA3M dataset from videos with alt-text annotations, and show its benefits for training VideoQA models. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
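The training procedure described above pairs a video-question multi-modal transformer with an answer transformer and aligns their embeddings via a contrastive loss, which lets the model handle an open answer vocabulary. A minimal in-batch contrastive loss of this kind can be sketched as follows (this is an illustrative NumPy version with an assumed temperature parameter, not the paper's exact formulation):

```python
import numpy as np

def contrastive_loss(vq_emb, ans_emb, temperature=0.07):
    """In-batch contrastive loss between video-question embeddings and
    answer embeddings. Matching pairs share a row index; all other
    answers in the batch act as negatives. Temperature is an assumed
    hyperparameter, not taken from the paper."""
    # L2-normalize both embedding sets so the dot product is cosine similarity.
    vq = vq_emb / np.linalg.norm(vq_emb, axis=-1, keepdims=True)
    ans = ans_emb / np.linalg.norm(ans_emb, axis=-1, keepdims=True)
    logits = vq @ ans.T / temperature          # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Matched (video-question, answer) pairs lie on the diagonal.
    return -np.mean(np.diag(log_probs))
```

When the two encoders map a matching video-question pair and its answer to nearby points, the diagonal of the similarity matrix dominates and the loss approaches zero; mismatched embeddings drive it up. Because scoring is done by embedding similarity rather than a fixed classifier head, any answer that can be encoded can be ranked, which also enables the zero-shot VideoQA evaluation described above.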

Updated: 2024-08-26