T-VLAD: Temporal vector of locally aggregated descriptor for multiview human action recognition
Pattern Recognition Letters (IF 3.9) Pub Date: 2021-05-04, DOI: 10.1016/j.patrec.2021.04.023
Hajra Binte Naeem, Fiza Murtaza, Muhammad Haroon Yousaf, Sergio A. Velastin

Robust view-invariant human action recognition (HAR) requires an effective representation of the temporal structure of actions in multi-view videos. This study explores a view-invariant action representation based on convolutional features. Representing actions over long video segments is computationally expensive, whereas features computed over short segments restrict temporal coverage to a local neighbourhood. Previous methods rely on complex multi-stream deep convolutional feature maps extracted over short segments. To address this, a novel framework is proposed based on a temporal vector of locally aggregated descriptors (T-VLAD). T-VLAD encodes the long-term temporal structure of a video using single-stream convolutional features computed over short segments. The size of a standard VLAD vector is a multiple of its feature codebook size (typically 256). VLAD is modified here to incorporate the time order of segments, so the T-VLAD vector size is a multiple of a much smaller time-order codebook. Previous methods have not been extensively validated under view variation; results are therefore validated in a challenging setup where one view is held out for testing and the remaining views are used for training. State-of-the-art results are obtained on three fixed-camera multi-view datasets, IXMAS, MuHAVi and MCAD, and the proposed T-VLAD encoding works equally well on UCF101, a dataset with dynamic backgrounds.
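As a rough sketch of the encoding idea, the NumPy code below contrasts standard VLAD aggregation with a time-order variant. The abstract does not give the exact T-VLAD formulation, so the details here are assumptions for illustration: the helper names (`build_codebook`, `vlad`, `t_vlad`), the choice of k-means codebooks, and the use of residuals against the global mean feature within each temporal bin are hypothetical, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(samples, k):
    """Hypothetical helper: k-means codebook over sample vectors."""
    return KMeans(n_clusters=k, n_init=10).fit(samples).cluster_centers_

def _normalize(v):
    """Power normalization followed by L2 normalization (standard for VLAD)."""
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / (np.linalg.norm(v) + 1e-12)

def vlad(features, codebook):
    """Standard VLAD: sum residuals of each feature to its nearest codeword.
    Output length is k * d, so a 256-word feature codebook gives a long vector."""
    k, d = codebook.shape
    assign = np.argmin(((features[:, None] - codebook[None]) ** 2).sum(-1), axis=1)
    v = np.zeros((k, d))
    for i in range(k):
        members = features[assign == i]
        if len(members):
            v[i] = (members - codebook[i]).sum(axis=0)
    return _normalize(v.ravel())

def t_vlad(segment_features, n_time_words=8):
    """Assumed T-VLAD sketch: per-segment single-stream features are grouped
    by a small codebook over normalized segment time positions, so the output
    length scales with the small time-order codebook rather than with a
    256-word feature codebook."""
    n, d = segment_features.shape
    times = np.linspace(0.0, 1.0, n)[:, None]            # segment order in [0, 1]
    time_codebook = build_codebook(times, n_time_words)  # small temporal codebook
    assign = np.argmin((times - time_codebook.T) ** 2, axis=1)
    mu = segment_features.mean(axis=0)                   # illustrative residual anchor
    v = np.zeros((n_time_words, d))
    for i in range(n_time_words):
        members = segment_features[assign == i]
        if len(members):
            v[i] = (members - mu).sum(axis=0)            # residuals per temporal bin
    return _normalize(v.ravel())

# e.g. 30 short segments with 512-dim convolutional features each:
segments = np.random.randn(30, 512)
code = t_vlad(segments)   # length 8 * 512, versus 256 * 512 for standard VLAD
```

The point the abstract emphasizes is dimensionality: with a time-order codebook of, say, 8 words, the encoded vector has length 8 × d instead of 256 × d, while segment order is still preserved through the temporal assignment.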




Updated: 2021-05-26