Learning from Narrated Instruction Videos
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8) Pub Date: 2017-09-05, DOI: 10.1109/tpami.2017.2749223
Jean-Baptiste Alayrac, Piotr Bojanowski, Nishant Agrawal, Josef Sivic, Ivan Laptev, Simon Lacoste-Julien

Automatic assistants could guide a person or a robot in performing new tasks, such as changing a car tire or repotting a plant. Creating such assistants, however, is non-trivial and requires understanding of visual and verbal content of a video. Towards this goal, we here address the problem of automatically learning the main steps of a task from a set of narrated instruction videos. We develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method sequentially clusters textual and visual representations of a task, where the two clustering problems are linked by joint constraints to obtain a single coherent sequence of steps in both modalities. To evaluate our method, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains videos for five different tasks with complex interactions between people and objects, captured in a variety of indoor and outdoor settings. We experimentally demonstrate that the proposed method can automatically discover, learn and localize the main steps of a task in input videos.
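The ordering constraint described in the abstract — the two clustering problems must agree on "a single coherent sequence of steps" over time — can be illustrated with a toy dynamic program. This is a hedged sketch, not the authors' actual formulation: the scores, intervals, and step count here are hypothetical, standing in for whatever combined text/visual affinities the method produces. The sketch shows only the key idea that each video interval is assigned to a step such that step indices never go backwards in time.

```python
# Toy sketch (not the paper's exact algorithm): assign each time
# interval to one of K candidate steps, maximizing total affinity
# subject to the constraint that step indices are non-decreasing
# over time -- a "single coherent sequence of steps".

def ordered_assignment(scores):
    """scores[t][k] = affinity of time interval t to step k.
    Returns the max-score step labels with non-decreasing order."""
    T, K = len(scores), len(scores[0])
    best = [[float("-inf")] * K for _ in range(T)]  # best[t][k]: best total ending at (t, k)
    back = [[0] * K for _ in range(T)]              # backpointers for recovery
    for k in range(K):
        best[0][k] = scores[0][k]
    for t in range(1, T):
        for k in range(K):
            # previous step index must be <= k (monotonicity)
            prev_k = max(range(k + 1), key=lambda j: best[t - 1][j])
            best[t][k] = best[t - 1][prev_k] + scores[t][k]
            back[t][k] = prev_k
    # backtrack from the best final state
    k = max(range(K), key=lambda j: best[T - 1][j])
    labels = [k]
    for t in range(T - 1, 0, -1):
        k = back[t][k]
        labels.append(k)
    return labels[::-1]

# Hypothetical example: 5 intervals, 3 steps. Interval 2 locally
# prefers step 0, but the ordering constraint keeps it at step 1.
scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.6, 0.2],
    [0.1, 0.3, 0.9],
    [0.0, 0.2, 0.8],
]
print(ordered_assignment(scores))  # -> [0, 1, 1, 2, 2]
```

In the paper this role is played by joint constraints linking the text and video clusterings; the dynamic program above merely illustrates why a sequential prior can override locally noisy per-interval evidence.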

Updated: 2017-09-05