End-to-End Video Question-Answer Generation with Generator-Pretester Network
arXiv - CS - Multimedia. Pub Date: 2021-01-05, DOI: arxiv-2101.01447
Hung-Ting Su, Chen-Hsi Chang, Po-Wei Shen, Yu-Siang Wang, Ya-Liang Chang, Yu-Cheng Chang, Pu-Jen Cheng, Winston H. Hsu

We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia. Because data annotation is expensive, many widely used, large-scale Video QA datasets such as Video-QA, MSVD-QA and MSRVTT-QA are annotated automatically via Caption Question Generation (CapQG), which takes captions as input instead of the video itself. Since captions neither fully represent a video nor are always available in practice, it is crucial to generate question-answer pairs directly from a video via VQAG. Existing video-to-text (V2T) approaches, despite taking a video as input, generate only a question. In this work, we propose a novel model, the Generator-Pretester Network, built on two components: (1) the Joint Question-Answer Generator (JQAG), which generates a question together with its corresponding answer, enabling Video Question "Answering" training; and (2) the Pretester (PT), which verifies a generated question by attempting to answer it, checking the pretested answer against both the model's proposed answer and the ground-truth answer. We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance. Furthermore, applying our generated QA pairs to the Video QA task, we surpass some supervised baselines using generated questions only. As a pre-training strategy, we outperform both CapQG and transfer learning approaches under semi-supervised (20%) or fully supervised learning with annotated data. These experimental results suggest new perspectives for Video QA training.
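
The abstract does not include an implementation, but the two-component design it describes can be illustrated with a short sketch. Below is a minimal, hypothetical PyTorch rendering, assuming pre-extracted video features and a classification-style answer vocabulary; all module names, dimensions, and the loss combination (`JointQAGenerator`, `Pretester`, `feat_dim`, etc.) are assumptions made for illustration, not the authors' actual architecture.

```python
# Minimal sketch of the JQAG + Pretester idea, assuming pre-extracted
# video features and a fixed answer vocabulary. Everything here is an
# illustrative guess at the design, not the paper's implementation.
import torch
import torch.nn as nn

class JointQAGenerator(nn.Module):
    """JQAG sketch: encodes video features, decodes a question token
    sequence, and proposes an answer from a fixed vocabulary."""
    def __init__(self, feat_dim=2048, hidden=512, vocab=10000, n_answers=1000):
        super().__init__()
        self.video_enc = nn.GRU(feat_dim, hidden, batch_first=True)
        self.question_dec = nn.GRU(hidden, hidden, batch_first=True)
        self.word_head = nn.Linear(hidden, vocab)        # question tokens
        self.answer_head = nn.Linear(hidden, n_answers)  # proposed answer

    def forward(self, video_feats, max_len=12):
        _, h = self.video_enc(video_feats)               # (1, B, hidden)
        ctx = h.transpose(0, 1)                          # (B, 1, hidden)
        dec_in = ctx.repeat(1, max_len, 1)               # simple unrolled input
        dec_out, _ = self.question_dec(dec_in)
        question_logits = self.word_head(dec_out)        # (B, T, vocab)
        answer_logits = self.answer_head(h.squeeze(0))   # (B, n_answers)
        return question_logits, answer_logits

class Pretester(nn.Module):
    """PT sketch: tries to answer the generated question from the video,
    so its prediction can be checked against both the generator's
    proposed answer and the ground-truth answer."""
    def __init__(self, hidden=512, vocab=10000, n_answers=1000, feat_dim=2048):
        super().__init__()
        self.q_embed = nn.Linear(vocab, hidden)          # consume soft tokens
        self.q_enc = nn.GRU(hidden, hidden, batch_first=True)
        self.v_enc = nn.GRU(feat_dim, hidden, batch_first=True)
        self.answer_head = nn.Linear(2 * hidden, n_answers)

    def forward(self, question_logits, video_feats):
        soft_tokens = question_logits.softmax(-1)        # differentiable "words"
        _, hq = self.q_enc(self.q_embed(soft_tokens))
        _, hv = self.v_enc(video_feats)
        fused = torch.cat([hq.squeeze(0), hv.squeeze(0)], dim=-1)
        return self.answer_head(fused)                   # pretested answer

# One hypothetical training step: the pretester's answer is pulled toward
# both the generator's proposed answer and the ground-truth answer.
gen, pt = JointQAGenerator(), Pretester()
video = torch.randn(4, 20, 2048)                         # B=4 clips, 20 frames
gt_answer = torch.randint(0, 1000, (4,))
q_logits, a_logits = gen(video)
pt_logits = pt(q_logits, video)
ce = nn.CrossEntropyLoss()
loss = (ce(a_logits, gt_answer)                          # generator vs. GT
        + ce(pt_logits, gt_answer)                       # pretester vs. GT
        + ce(pt_logits, a_logits.argmax(-1).detach()))   # pretester vs. proposed
loss.backward()
```

The third loss term mirrors the Pretester's role as described in the abstract: a generated question is useful only insofar as it can be answered consistently with both the proposed and the ground-truth answers.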

Last updated: 2021-01-06