From known to the unknown: Transferring knowledge to answer questions about novel visual and semantic concepts
Image and Vision Computing (IF 4.7), Pub Date: 2020-08-04, DOI: 10.1016/j.imavis.2020.103985
Moshiur R. Farazi , Salman H. Khan , Nick Barnes

Current Visual Question Answering (VQA) systems can answer intelligent questions about ‘known’ visual content. However, their performance drops significantly when questions about visually and linguistically ‘unknown’ concepts are presented during inference (‘Open-world’ scenario). A practical VQA system should be able to deal with novel concepts in real world settings. To address this problem, we propose an exemplar-based approach that transfers learning (i.e., knowledge) from previously ‘known’ concepts to answer questions about the ‘unknown’. We learn a highly discriminative joint embedding (JE) space, where visual and semantic features are fused to give a unified representation. Once novel concepts are presented to the model, it looks for the closest match from an exemplar set in the JE space. This auxiliary information is used alongside the given Image-Question pair to refine visual attention in a hierarchical fashion. Our novel attention model is based on a dual-attention mechanism that combines the complementary effect of spatial and channel attention. Since handling the high dimensional exemplars on large datasets can be a significant challenge, we introduce an efficient matching scheme that uses a compact feature description for search and retrieval. To evaluate our model, we propose a new dataset for VQA, separating unknown visual and semantic concepts from the training set. Our approach shows significant improvements over state-of-the-art VQA models on the proposed Open-World VQA dataset and other standard VQA datasets.
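To make the two core mechanisms concrete, below is a minimal PyTorch sketch of (1) nearest-neighbour exemplar lookup in a joint embedding (JE) space and (2) a dual-attention block combining spatial and channel attention. The class names, dimensions, and the multiplicative fusion of the two attention maps are illustrative assumptions, not the authors' exact architecture; the paper additionally uses a compact feature description for scalable search, whereas this sketch matches against the full exemplar bank.

```python
# Illustrative sketch only: names, shapes, and fusion choices are assumptions,
# not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExemplarRetriever(nn.Module):
    """Finds the closest 'known' exemplar for a fused image-question embedding."""

    def __init__(self, exemplar_bank: torch.Tensor):
        super().__init__()
        # exemplar_bank: (num_exemplars, dim) joint embeddings of known concepts,
        # L2-normalised so cosine similarity reduces to a matrix product.
        self.register_buffer("bank", F.normalize(exemplar_bank, dim=-1))

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        query = F.normalize(query, dim=-1)        # (batch, dim)
        sims = query @ self.bank.t()              # (batch, num_exemplars)
        nearest = sims.argmax(dim=-1)             # index of the best match
        return self.bank[nearest]                 # (batch, dim) auxiliary cue


class DualAttention(nn.Module):
    """Combines spatial ('where') and channel ('what') attention on conv features."""

    def __init__(self, channels: int, guide_dim: int):
        super().__init__()
        self.spatial = nn.Conv2d(channels, 1, kernel_size=1)
        self.channel = nn.Linear(guide_dim, channels)

    def forward(self, feats: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # feats: (batch, C, H, W) image features; guide: (batch, guide_dim)
        # question/exemplar embedding that steers both attention maps.
        s = torch.sigmoid(self.spatial(feats))                    # (batch, 1, H, W)
        c = torch.sigmoid(self.channel(guide))[..., None, None]   # (batch, C, 1, 1)
        attended = feats * s * c                  # complementary gating
        return attended.flatten(2).sum(-1)        # (batch, C) pooled features


# Hypothetical usage: retrieve an exemplar cue, then refine attention with it.
# bank = torch.randn(1000, 512); retriever = ExemplarRetriever(bank)
# cue = retriever(fused_iq)  # fused_iq: (batch, 512) joint image-question embedding
```

In the full model, the retrieved exemplar embedding is used alongside the given Image-Question pair to refine visual attention hierarchically; brute-force matching as above would not scale to large exemplar sets, which is why the paper introduces its compact descriptor-based retrieval scheme.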




Updated: 2020-08-04