Rolling-Unrolling LSTMs for Action Anticipation from First-Person Video
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 23.6). Pub Date: 2020-05-06. DOI: 10.1109/tpami.2020.2992889
Antonino Furnari, Giovanni Maria Farinella

In this paper, we tackle the problem of egocentric action anticipation, i.e., predicting what actions the camera wearer will perform in the near future and which objects they will interact with. Specifically, we contribute Rolling-Unrolling LSTM, a learning architecture to anticipate actions from egocentric videos. The method is based on three components: 1) an architecture composed of two LSTMs to model the sub-tasks of summarizing the past and inferring the future, 2) a Sequence Completion Pre-Training technique which encourages the LSTMs to focus on their different sub-tasks, and 3) a Modality ATTention (MATT) mechanism to efficiently fuse multi-modal predictions obtained by processing RGB frames, optical flow fields, and object-based features. The proposed approach is validated on EPIC-Kitchens, EGTEA Gaze+, and ActivityNet. The experiments show that the proposed architecture is state-of-the-art in the domain of egocentric videos, achieving top performance in the 2019 EPIC-Kitchens egocentric action anticipation challenge. The approach also achieves competitive performance on ActivityNet with respect to methods not based on unsupervised pre-training, and generalizes to the tasks of early action recognition and action recognition. To encourage research on this challenging topic, we make our code, trained models, and pre-extracted features available at http://iplab.dmi.unict.it/rulstm.
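To make the three components concrete, below is a minimal PyTorch sketch of one Rolling-Unrolling branch and the MATT fusion step. This is an illustrative sketch, not the authors' implementation: the hidden sizes, class count, number of anticipation steps, and the choice to feed the last observed feature at every unrolling step are all assumptions, and the Sequence Completion Pre-Training stage is omitted. For the actual implementation, see the authors' released code at http://iplab.dmi.unict.it/rulstm.

```python
import torch
import torch.nn as nn


class RollingUnrollingBranch(nn.Module):
    """One modality branch (e.g., RGB, optical flow, or object features).

    A 'rolling' LSTM summarizes the observed past; an 'unrolling' LSTM,
    initialized from the rolling state, is stepped forward to anticipate
    future actions. All dimensions are illustrative assumptions.
    """

    def __init__(self, feat_dim=1024, hidden_dim=512, num_classes=2513,
                 anticipation_steps=8):
        super().__init__()
        self.rolling = nn.LSTMCell(feat_dim, hidden_dim)
        self.unrolling = nn.LSTMCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)
        self.anticipation_steps = anticipation_steps

    def forward(self, feats):
        # feats: (batch, time, feat_dim) pre-extracted features.
        B, T, _ = feats.shape
        h = feats.new_zeros(B, self.rolling.hidden_size)
        c = feats.new_zeros(B, self.rolling.hidden_size)
        for t in range(T):  # rolling phase: encode the observed past
            h, c = self.rolling(feats[:, t], (h, c))
        hu, cu = h, c
        last = feats[:, -1]  # simplification: reuse the last observed feature
        scores = []
        for _ in range(self.anticipation_steps):  # unrolling phase
            hu, cu = self.unrolling(last, (hu, cu))
            scores.append(self.classifier(hu))
        # Per-step action scores plus the past summary, kept for MATT.
        return torch.stack(scores, dim=1), h


class MATT(nn.Module):
    """Modality ATTention: predicts one fusion weight per modality from the
    concatenated rolling summaries, then mixes the branch predictions."""

    def __init__(self, hidden_dim=512, num_modalities=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim * num_modalities, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_modalities),
        )

    def forward(self, summaries, branch_scores):
        # summaries: list of (batch, hidden);
        # branch_scores: list of (batch, steps, classes).
        weights = torch.softmax(self.net(torch.cat(summaries, dim=-1)), dim=-1)
        stacked = torch.stack(branch_scores, dim=-1)  # (batch, steps, classes, M)
        return (stacked * weights[:, None, None, :]).sum(dim=-1)


if __name__ == "__main__":
    branches = [RollingUnrollingBranch() for _ in range(3)]  # RGB, flow, objects
    matt = MATT()
    clips = [torch.randn(2, 14, 1024) for _ in range(3)]  # dummy feature streams
    scores, summaries = zip(*(b(x) for b, x in zip(branches, clips)))
    fused = matt(list(summaries), list(scores))
    print(fused.shape)  # torch.Size([2, 8, 2513])
```

Note the design choice the sketch mirrors: because each branch anticipates actions independently per modality, MATT can weight the branches per sample from the encoded past, rather than using a fixed late-fusion average.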

Updated: 2020-05-06