Visual Dialog
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8), Pub Date: 2018-04-19, DOI: 10.1109/tpami.2018.2828437
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, Stefan Lee, Jose M. F. Moura, Devi Parikh, Dhruv Batra

We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task to serve as a general test of machine intelligence, while being sufficiently grounded in vision to allow objective evaluation of individual responses and benchmarking of progress. We develop a novel two-person real-time chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). VisDial v0.9 has been released and consists of ~1.2M dialog question-answer pairs from 10-round, human-human dialogs grounded in ~120k images from the COCO dataset. We introduce a family of neural encoder-decoder models for Visual Dialog with three encoders (Late Fusion, Hierarchical Recurrent Encoder, and Memory Network, optionally with attention over image features) and two decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog in which the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as mean reciprocal rank and recall@k of the human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies. Putting it all together, we demonstrate the first 'visual chatbot'! Our dataset, code, pretrained models, and visual chatbot are available at https://visualdialog.org.
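The retrieval-based evaluation described above can be made concrete with a small sketch. The abstract states that the agent ranks a set of candidate answers and is scored by mean reciprocal rank (MRR) and recall@k of the human response; the function below computes those metrics from the 1-based rank assigned to the human answer in each round. The function name and the choice of k values are illustrative, not taken from the paper's code.

```python
def mrr_and_recall(ranks, ks=(1, 5, 10)):
    """Compute mean reciprocal rank and recall@k.

    ranks: list of 1-based ranks of the human answer, one per
           dialog round (rank 1 means the model placed the human
           response first among the candidates).
    ks:    cutoffs for recall@k.
    Returns (mrr, {k: recall@k}).
    """
    n = len(ranks)
    # MRR: average of 1/rank over all rounds.
    mrr = sum(1.0 / r for r in ranks) / n
    # recall@k: fraction of rounds where the human answer
    # appears in the top k of the sorted candidate list.
    recall = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, recall
```

For example, if the human answer is ranked 1st, 2nd, and 10th across three rounds, MRR is (1 + 1/2 + 1/10)/3 and recall@5 is 2/3.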

Updated: 2024-08-22