VideoABC: A Real-World Video Dataset for Abductive Visual Reasoning
IEEE Transactions on Image Processing (IF 10.8), Pub Date: 2022-09-14, DOI: 10.1109/TIP.2022.3205207
Wenliang Zhao, Yongming Rao, Yansong Tang, Jie Zhou, Jiwen Lu

In this paper, we investigate the problem of abductive visual reasoning (AVR), which requires vision systems to infer the most plausible explanation for visual observations. Unlike previous work that performs visual reasoning on static images or synthesized scenes, we exploit long-term reasoning from instructional videos, which contain a wealth of detailed information about the physical world. We conceptualize two tasks for this emerging and challenging topic. The primary task is AVR: given the initial configuration and the desired goal from an instructional video, the model is expected to figure out the most plausible sequence of steps to achieve the goal. To avoid trivial solutions based on appearance information rather than reasoning, we construct a second task, AVR++, which requires the model to explain why the unselected options are less plausible. We introduce a new dataset called VideoABC, which consists of 46,354 unique steps derived from 11,827 instructional videos, formulated as 13,526 abductive reasoning questions with an average reasoning duration of 51 seconds. Through an adversarial hard hypothesis mining algorithm, non-trivial and high-quality problems are generated efficiently and effectively. To achieve human-level reasoning, we propose a Hierarchical Dual Reasoning Network (HDRNet) to capture the long-term dependencies among steps and observations. We establish a benchmark for abductive visual reasoning; our method sets the state of the art on AVR (~74%) and AVR++ (~45%), while humans easily achieve over 90% accuracy on both tasks. The large performance gap reveals the limitation of current video understanding models on temporal reasoning and leaves substantial room for future research on this challenging problem. Our dataset and code are available at https://github.com/wl-zhao/VideoABC.
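
As a rough illustration of the task setup described above, the sketch below shows one hypothetical way to represent a VideoABC-style question (initial observation, goal observation, candidate step sequences) and to score predictions. The field and function names are illustrative assumptions, not the dataset's actual schema or the authors' released code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AVRQuestion:
    """One abductive reasoning question built from an instructional video.

    The model observes the initial configuration and the desired goal
    (two short clips) and must select the most plausible hypothesis,
    i.e. the candidate sequence of intermediate steps connecting them.
    All field names here are hypothetical.
    """
    video_id: str                # source instructional video
    observation_start: str       # clip showing the initial configuration
    observation_goal: str        # clip showing the desired goal
    hypotheses: List[List[str]]  # candidate step sequences (lists of clips)
    answer: int                  # index of the most plausible hypothesis


def accuracy(predictions: List[int], questions: List[AVRQuestion]) -> float:
    """Fraction of questions where the predicted hypothesis index matches the answer."""
    if not questions:
        return 0.0
    correct = sum(p == q.answer for p, q in zip(predictions, questions))
    return correct / len(questions)
```

Under this sketch, AVR corresponds to choosing the correct index among the hypotheses, while AVR++ would additionally ask the model to justify why each unselected hypothesis is less plausible.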

Updated: 2022-09-14