Overcoming Language Priors with Self-supervised Learning for Visual Question Answering
arXiv - CS - Multimedia. Pub Date: 2020-12-17. DOI: arxiv-2012.11528
Xi Zhu, Zhendong Mao, Chunxiao Liu, Peng Zhang, Bin Wang, Yongdong Zhang

Most Visual Question Answering (VQA) models suffer from the language prior problem, which is caused by inherent data biases. Specifically, VQA models tend to answer questions (e.g., "what color is the banana?") based on high-frequency answers (e.g., "yellow") while ignoring image content. Existing approaches tackle this problem by designing sophisticated models or introducing additional visual annotations to reduce question dependency while strengthening image dependency. However, they are still subject to the language prior problem, since the underlying data biases are not alleviated at all. In this paper, we introduce a self-supervised learning framework to solve this problem. Concretely, we first automatically generate labeled data to balance the biased data, and then propose a self-supervised auxiliary task that utilizes the balanced data to help the base VQA model overcome language priors. Our method compensates for the data biases by generating balanced data without introducing external annotations. Experimental results show that our method significantly outperforms the state of the art, improving the overall accuracy from 49.50% to 57.59% on the most commonly used benchmark, VQA-CP v2. In other words, we improve the performance of annotation-based methods by 16% without using external annotations.
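The abstract does not spell out the auxiliary task, so what follows is a minimal PyTorch sketch of one plausible instantiation: balanced (image, question) pairs are generated by randomly substituting images within a batch and automatically labeled relevant or irrelevant, and a relevance head is trained alongside the answer head. The wrapper class, the helper make_balanced_batch, the relevance_head, and the weighting alpha are all illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of the self-supervised balancing idea described in the
# abstract. Labels, heads, and loss weighting here are assumptions for
# illustration, not the authors' published code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_balanced_batch(images, questions):
    """Generate self-supervised labels without external annotations:
    original (image, question) pairs are marked relevant (1), and pairs
    with randomly shuffled images are marked irrelevant (0)."""
    perm = torch.randperm(images.size(0))
    shuffled = images[perm]                        # mismatched images
    all_images = torch.cat([images, shuffled], dim=0)
    all_questions = torch.cat([questions, questions], dim=0)
    relevance = torch.cat([torch.ones(images.size(0)),
                           torch.zeros(images.size(0))]).to(images.device)
    return all_images, all_questions, relevance


class SelfSupervisedVQA(nn.Module):
    """Wraps any base VQA encoder with an auxiliary relevance head."""
    def __init__(self, base_vqa, hidden_dim, num_answers):
        super().__init__()
        self.base = base_vqa                       # joint image-question encoder
        self.answer_head = nn.Linear(hidden_dim, num_answers)
        self.relevance_head = nn.Linear(hidden_dim, 1)

    def forward(self, images, questions):
        joint = self.base(images, questions)       # fused features (B, hidden_dim)
        return self.answer_head(joint), self.relevance_head(joint).squeeze(-1)


def training_step(model, images, questions, answers, alpha=1.0):
    # Supervised VQA loss on the original (relevant) pairs only.
    logits, _ = model(images, questions)
    vqa_loss = F.cross_entropy(logits, answers)

    # Self-supervised loss on the balanced, automatically labeled batch.
    bal_imgs, bal_qs, relevance = make_balanced_batch(images, questions)
    _, rel_logits = model(bal_imgs, bal_qs)
    ssl_loss = F.binary_cross_entropy_with_logits(rel_logits, relevance)

    return vqa_loss + alpha * ssl_loss             # alpha trades off the two tasks
```

Under this reading, the self-supervised signal costs nothing beyond a random shuffle per batch, which is consistent with the abstract's claim that balanced data is generated without introducing external annotations.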

Updated: 2020-12-22