Visual commonsense reasoning with directional visual connections
Frontiers of Information Technology & Electronic Engineering (IF 3), Pub Date: 2021-05-28, DOI: 10.1631/fitee.2000722
Yahong Han, Aming Wu, Linchao Zhu, Yi Yang

Visual commonsense reasoning (VCR) has been proposed to boost research into cognition-level visual understanding, i.e., making accurate inferences based on a thorough understanding of visual details. Whereas traditional visual question answering requires models only to select correct answers, VCR requires models to select both the correct answers and the correct rationales. Recent research into human cognition indicates that brain function, or cognition, can be regarded as a global and dynamic integration of local neuron connectivity, which helps in solving specific cognition tasks. Inspired by this idea, we propose a directional connective network for VCR. The network dynamically reorganizes visual neuron connectivity, contextualized by the meaning of the questions and answers, and leverages directional information to enhance its reasoning ability. Specifically, we first develop a GraphVLAD module that captures visual neuron connectivity in order to fully model correlations in the visual content. Next, we propose a contextualization process that fuses sentence representations with visual neuron representations. Finally, based on the output of the contextualized connectivity, we propose directional connectivity, which includes a ReasonVLAD module, to infer answers and rationales. Experimental results on the VCR dataset and visualization analysis demonstrate the effectiveness of our method.
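
The abstract does not say how GraphVLAD or ReasonVLAD are built. Their names suggest VLAD-style feature aggregation, so the sketch below illustrates only generic soft-assignment VLAD pooling (the idea popularized by NetVLAD), not the authors' modules. The function names, the `alpha` sharpness parameter, and the toy features and centers are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vlad_pool(features, centers, alpha=1.0):
    """Generic soft-assignment VLAD pooling (hypothetical helper, not GraphVLAD).

    For each center, accumulate the residuals (feature - center) weighted by
    the soft assignment of that feature to the center, then L2-normalize
    each per-center vector.
    """
    dim = len(centers[0])
    pooled = [[0.0] * dim for _ in centers]
    for x in features:
        # Soft-assign x to every center via negative squared distance.
        scores = [
            -alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c))
            for c in centers
        ]
        weights = softmax(scores)
        for k, (w, c) in enumerate(zip(weights, centers)):
            for d in range(dim):
                pooled[k][d] += w * (x[d] - c[d])
    # L2-normalize each per-center descriptor (leave all-zero vectors as zeros).
    normalized = []
    for v in pooled:
        norm = math.sqrt(sum(t * t for t in v))
        normalized.append([t / norm if norm > 0 else 0.0 for t in v])
    return normalized


if __name__ == "__main__":
    # Toy 2-D "region features" and two assumed cluster centers.
    toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    toy_centers = [[0.0, 0.0], [1.0, 1.0]]
    for descriptor in vlad_pool(toy_features, toy_centers):
        print(descriptor)
```

In a learned variant, the centers and assignment weights would be trainable parameters. According to the abstract, the proposed network additionally models connectivity among visual elements and fuses in sentence representations, neither of which this sketch attempts.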




Updated: 2021-05-28