Forecasting future action sequences with attention: a new approach to weakly supervised action forecasting


IEEE Transactions on Image Processing (IF 10.8), Pub Date: 2020-09-10, DOI: 10.1109/tip.2020.3021497
Yan Bin Ng, Basura Fernando

Forecasting future human actions from partial observations of activities is an important problem in many practical applications such as assistive robotics, video surveillance and security. We present a method to forecast actions for the unseen future of a video using a neural machine translation technique based on an encoder-decoder architecture. The input to this model is the observed RGB video, and the objective is to forecast the correct future symbolic action sequence. Unlike prior methods that predict one action per frame for some unseen percentage of the video, we predict the complete action sequence required to accomplish the activity. We call this task action sequence forecasting. To cater for two types of uncertainty in future predictions, we propose a novel loss function. We show that a combination of optimal transport and future uncertainty losses helps to improve results. We evaluate our model on three challenging video datasets (Charades, MPII Cooking and Breakfast). We extend our action sequence forecasting model to perform weakly supervised action forecasting on two challenging datasets, Breakfast and 50Salads. Specifically, we propose a model that predicts actions of future unseen frames without using frame-level annotations during training. Using Fisher vector features, our supervised model outperforms the state-of-the-art action forecasting model by 0.83% and 7.09% on the Breakfast and 50Salads datasets, respectively. Our weakly supervised model is only 0.6% behind the most recent state-of-the-art supervised model, obtains results comparable to other published fully supervised methods, and sometimes even outperforms them on the Breakfast dataset. Most interestingly, our weakly supervised model outperforms prior models by 1.04% by leveraging the proposed weakly supervised architecture and making effective use of the attention mechanism and loss functions.
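The abstract does not specify which form of attention the encoder-decoder uses. As a rough illustration only, the sketch below shows one decoding step of plain dot-product attention over encoded video segments, which is an assumed stand-in and not necessarily the authors' formulation. The names `attend`, `encoder_states` and `decoder_state`, and the toy feature values, are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """One step of dot-product attention (an assumed variant).

    decoder_state: current decoder hidden vector (list of floats).
    encoder_states: one encoded vector per observed video segment.
    Returns the attention weights and the weighted context vector.
    """
    # Score each observed segment by its similarity to the decoder state.
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context vector is the attention-weighted sum of encoder states.
    context = [
        sum(w * enc[i] for w, enc in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy example: three observed segments with 2-dimensional features.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]
weights, context = attend(decoder_state, encoder_states)
```

In a sequence forecasting model of this kind, the decoder would typically combine such a context vector with its own state to predict the next symbolic action, repeating the step until the sequence ends.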


Updated: 2020-09-18
