Prediction and Description of Near-Future Activities in Video, Computer Vision and Image Understanding

Prediction and Description of Near-Future Activities in Video
Computer Vision and Image Understanding (IF 4.3), Pub Date: 2021-05-29, DOI: 10.1016/j.cviu.2021.103230
Tahmida Mahmud, Mohammad Billah, Mahmudul Hasan, Amit K. Roy-Chowdhury

Most existing work on human activity analysis focuses on recognition or early recognition of activity labels from complete or partial observations. Similarly, almost all existing video captioning approaches describe only the events observed in a video. Predicting the labels and captions of future activities, where no frames of the predicted activities have been observed, is a challenging problem with important applications that require an anticipatory response. In this work, we propose a system that infers the labels and captions of a sequence of future activities. Our proposed network for label prediction of a future activity sequence has three branches: the first takes visual features from the objects present in the scene, the second takes observed sequential activity features, and the third captures the features of the last observed activity. The predicted labels and the observed scene context are then mapped to meaningful captions using a sequence-to-sequence learning-based method. Experiments on four challenging activity analysis datasets and a video description dataset demonstrate that our label prediction approach achieves performance comparable to the state of the art and that our captioning framework outperforms the state of the art.
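
To make the three-branch design in the abstract concrete, here is a minimal sketch of how three feature streams might be fused into a single label distribution. It is not the authors' implementation: the abstract does not specify layer types, sizes, or the fusion rule, so the class name `ThreeBranchPredictor`, all dimensions, the mean pooling used in place of a learned sequence encoder, and the concatenate-then-classify fusion are assumptions made for illustration. The weights are random and untrained, and the sketch predicts only one future label rather than a sequence.

```python
import math
import random


def linear(weights, bias, x):
    """Affine map: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero bias; a trained model would learn these."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mean_pool(sequence):
    """Assumed stand-in for a sequence encoder: average the features over time."""
    n = len(sequence)
    return [sum(step[i] for step in sequence) / n for i in range(len(sequence[0]))]


class ThreeBranchPredictor:
    """Hypothetical fusion of object, sequence, and last-activity features."""

    def __init__(self, obj_dim, act_dim, hidden, n_labels, seed=0):
        rng = random.Random(seed)
        self.obj_branch = init_layer(rng, hidden, obj_dim)    # branch 1: scene objects
        self.seq_branch = init_layer(rng, hidden, act_dim)    # branch 2: observed sequence
        self.last_branch = init_layer(rng, hidden, act_dim)   # branch 3: last activity
        self.classifier = init_layer(rng, n_labels, 3 * hidden)

    def predict(self, object_feats, activity_seq):
        h_obj = relu(linear(*self.obj_branch, object_feats))
        h_seq = relu(linear(*self.seq_branch, mean_pool(activity_seq)))
        h_last = relu(linear(*self.last_branch, activity_seq[-1]))
        fused = h_obj + h_seq + h_last  # list concatenation of the three branches
        return softmax(linear(*self.classifier, fused))


model = ThreeBranchPredictor(obj_dim=4, act_dim=3, hidden=5, n_labels=6)
probs = model.predict(
    object_feats=[0.2, 0.1, 0.0, 0.7],
    activity_seq=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
)
next_label = max(range(len(probs)), key=probs.__getitem__)
```

Here `probs` is a probability distribution over six hypothetical future-activity labels and `next_label` is the index of the most likely one. One way to extend such a sketch to a sequence of future labels would be to append the predicted activity's features to `activity_seq` and call `predict` again, though the abstract does not say whether the paper works this way.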

Updated: 2021-06-13
