HVLM: Exploring Human-Like Visual Cognition and Language-Memory Network for Visual Dialog
Information Processing & Management (IF 8.6). Pub Date: 2022-07-18. DOI: 10.1016/j.ipm.2022.103008
Kaili Sun, Chi Guo, Huyin Zhang, Yuan Li

Visual dialog, a vision-language task, enables an AI agent to engage in conversation with humans grounded in a given image. To generate appropriate answers to a series of questions in the dialog, the agent must understand both the comprehensive visual content of the image and the fine-grained textual context of the dialog. However, previous studies typically used object-level visual features to represent a whole image, which captures only the local perspective of an image and ignores the importance of its global information. In this paper, we propose a novel model, the Human-Like Visual Cognition and Language-Memory Network for Visual Dialog (HVLM), to simulate the global and local dual-perspective cognition of the human visual system and understand an image comprehensively. HVLM consists of two key modules: Local-to-Global Graph Convolutional Visual Cognition (LG-GCVC) and Question-guided Language Topic Memory (T-Mem). Specifically, in the LG-GCVC module, we design a question-guided dual-perspective reasoning scheme that jointly learns visual content from both local and global perspectives through a simple spectral graph convolutional network. Furthermore, in the T-Mem module, we design an iterative learning strategy that gradually enriches fine-grained textual context details via an attention mechanism. Experimental results demonstrate the superiority of our proposed model, which achieves competitive performance on the benchmark datasets VisDial v1.0 and VisDial v0.9.
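The abstract gives only the high-level design, so the following PyTorch sketch is a rough illustration rather than the authors' implementation: part (a) shows one spectral graph convolution layer of the Kipf–Welling form that a module like LG-GCVC could build on, and part (b) shows one question-guided attention read and memory update of the kind T-Mem's iterative strategy describes. All class names, shapes, and the GRU-based update are our assumptions; the paper's actual dimensions, fusion order, and number of iterations are not stated in the abstract.

```python
# Minimal sketch of the two mechanisms named in the abstract.
# Names (SpectralGraphConv, TopicMemoryStep) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpectralGraphConv(nn.Module):
    """One Kipf-Welling-style spectral graph convolution layer:
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, in_dim) node features; adj: (B, N, N) adjacency.
        a_hat = adj + torch.eye(adj.size(-1), device=adj.device)  # self-loops
        d_inv_sqrt = a_hat.sum(-1).clamp(min=1e-6).pow(-0.5)      # D^{-1/2}
        a_norm = d_inv_sqrt.unsqueeze(-1) * a_hat * d_inv_sqrt.unsqueeze(-2)
        return F.relu(self.proj(a_norm @ feats))


class TopicMemoryStep(nn.Module):
    """One iteration of a question-guided attention read over dialog
    history that refines a topic vector (an assumed form of T-Mem)."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Linear(dim * 2, 1)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, topic: torch.Tensor, history: torch.Tensor) -> torch.Tensor:
        # topic: (B, D) current topic state; history: (B, T, D) utterances.
        query = topic.unsqueeze(1).expand_as(history)
        scores = self.attn(torch.cat([query, history], dim=-1)).squeeze(-1)
        weights = scores.softmax(dim=-1)                     # attention over T
        read = (weights.unsqueeze(-1) * history).sum(dim=1)  # attended context
        return self.update(read, topic)                      # refined topic
```

Under this reading, the local/global dual perspective would amount to running the same convolution over an object-level graph and a scene-level graph and fusing the two results with question-guided attention, while T-Mem would apply `TopicMemoryStep` repeatedly so each pass sharpens the textual context used for answer generation.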



Updated: 2022-07-19