Temporal-Spatial Mapping for Action Recognition
IEEE Transactions on Circuits and Systems for Video Technology (IF 8.4), Pub Date: 2020-03-01, DOI: 10.1109/tcsvt.2019.2896029
Xiaolin Song, Cuiling Lan, Wenjun Zeng, Junliang Xing, Xiaoyan Sun, Jingyu Yang

Deep learning models have enjoyed great success in image-related computer vision tasks such as image classification and object detection. For video-related tasks such as human action recognition, however, the advances are not yet as significant. The main challenge lies in the lack of effective and efficient models for capturing the rich temporal–spatial information in a video. We introduce a simple yet effective operation, termed temporal–spatial mapping, which captures the temporal evolution of the frames by jointly analyzing all the frames of a video. We propose a video-level 2D feature representation, referred to as VideoMap, obtained by transforming the convolutional features of all frames into a 2D feature map. With each row being the vectorized feature representation of one frame, the temporal–spatial features are compactly represented and the temporal dynamics are well embedded. Based on the VideoMap representation, we further propose a temporal attention model within a shallow convolutional neural network to efficiently exploit the temporal–spatial dynamics. Experimental results show that the proposed scheme achieves state-of-the-art performance, with a 4.2% accuracy gain over the temporal segment network, a competing baseline method, on the challenging human action benchmark dataset HMDB51.
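To make the construction concrete, below is a minimal sketch of the VideoMap idea in PyTorch. It is not the authors' exact architecture: the per-frame backbone is omitted (per-frame convolutional features are assumed precomputed), and the channel width, kernel size, and softmax attention form are illustrative assumptions. Only the overall structure follows the abstract: the rows of a 2D map are vectorized frame features, a shallow conv net reads the map, and a learned temporal attention weights each frame.

```python
import torch
import torch.nn as nn


class VideoMapAttention(nn.Module):
    """Hypothetical sketch: stack vectorized per-frame features into a
    T x D "VideoMap" (one row per frame) and learn per-frame attention
    weights with a shallow conv net, as the abstract describes."""

    def __init__(self, feat_dim: int, channels: int = 8):
        super().__init__()
        # Shallow 2D conv over the single-channel T x D map (width assumed).
        self.conv = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Maps each row's conv response to a scalar attention score.
        self.score = nn.Linear(channels * feat_dim, 1)

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (T, D) vectorized conv features, one row per frame.
        videomap = frame_feats.unsqueeze(0).unsqueeze(0)   # (1, 1, T, D)
        h = self.conv(videomap)                            # (1, C, T, D)
        h = h.permute(0, 2, 1, 3).flatten(2)               # (1, T, C*D)
        weights = torch.softmax(self.score(h), dim=1)      # (1, T, 1)
        # Attention-weighted pooling over time gives a video-level feature.
        pooled = (weights * frame_feats.unsqueeze(0)).sum(dim=1)  # (1, D)
        return pooled, weights.squeeze(-1).squeeze(0)      # (1, D), (T,)


# Usage on dummy data: 16 frames with 512-D per-frame features.
feats = torch.randn(16, 512)
pooled, attn = VideoMapAttention(feat_dim=512)(feats)
print(pooled.shape, attn.shape)  # torch.Size([1, 512]) torch.Size([16])
```

In a full pipeline, the per-frame features would come from a deep image CNN, and a classifier head would be applied to the attention-pooled video-level vector.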

Updated: 2020-03-01