Grounding Physical Concepts of Objects and Events Through Dynamic Visual Reasoning
arXiv - CS - Symbolic Computation Pub Date : 2021-03-30 , DOI: arxiv-2103.16564
Zhenfang Chen, Jiayuan Mao, Jiajun Wu, Kwan-Yee Kenneth Wong, Joshua B. Tenenbaum, Chuang Gan

We study the problem of dynamic visual reasoning on raw videos. This is a challenging problem; current state-of-the-art models often require dense supervision on physical object properties and events from simulation, which is impractical to obtain in real life. In this paper, we present the Dynamic Concept Learner (DCL), a unified framework that grounds physical objects and events from video and language. DCL first adopts a trajectory extractor to track each object over time and to represent it as a latent, object-centric feature vector. Building upon this object-centric representation, DCL learns to approximate the dynamic interactions among objects using graph networks. DCL further incorporates a semantic parser to parse questions into semantic programs and, finally, a program executor to run the program to answer the question, leveraging the learned dynamics model. After training, DCL can detect and associate objects across frames, ground visual properties and physical events, understand the causal relationships between events, make future and counterfactual predictions, and leverage these extracted representations to answer queries. DCL achieves state-of-the-art performance on CLEVRER, a challenging causal video reasoning dataset, even without using ground-truth attributes and collision labels from simulations for training. We further test DCL on a newly proposed video-retrieval and event-localization dataset derived from CLEVRER, demonstrating its strong generalization capacity.
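The abstract describes a four-stage pipeline: trajectory extraction, a learned dynamics model over object interactions, a semantic parser, and a program executor. A minimal toy sketch of how these stages fit together is given below; all function names are hypothetical illustrations, the "dynamics model" is replaced by a simple proximity heuristic rather than a trained graph network, and the "parser" is keyword matching rather than a learned model.

```python
# Toy sketch of the DCL-style pipeline (hypothetical names, not the authors' code).

def extract_trajectories(video_frames):
    """Track each object across frames and summarize it as an
    object-centric feature vector (here: mean 2D position)."""
    # video_frames: list of {object_id: (x, y)} detections per frame
    tracks = {}
    for frame in video_frames:
        for obj_id, pos in frame.items():
            tracks.setdefault(obj_id, []).append(pos)
    features = {}
    for obj_id, positions in tracks.items():
        xs = [p[0] for p in positions]
        ys = [p[1] for p in positions]
        features[obj_id] = (sum(xs) / len(xs), sum(ys) / len(ys))
    return tracks, features

def predict_collisions(tracks, threshold=1.0):
    """Stand-in for the graph-network dynamics model: flag object pairs
    whose trajectories ever come within `threshold` of each other."""
    events = []
    ids = sorted(tracks)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            for pa, pb in zip(tracks[a], tracks[b]):
                d = ((pa[0] - pb[0]) ** 2 + (pa[1] - pb[1]) ** 2) ** 0.5
                if d <= threshold:
                    events.append((a, b))
                    break
    return events

def parse_question(question):
    """Toy semantic parser: map a natural-language question to a program."""
    if "collide" in question:
        return ["filter_collisions", "count"]
    return ["count_objects"]

def execute_program(program, tracks, events):
    """Run the parsed program over the grounded scene representation."""
    if program == ["filter_collisions", "count"]:
        return len(events)
    return len(tracks)
```

For example, on a three-frame scene where objects A and B approach each other, `execute_program(parse_question("How many pairs collide?"), tracks, events)` counts the detected collision pairs, while a question without "collide" falls through to counting objects.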

Updated: 2021-03-31