Joint Feature Optimization and Fusion for Compressed Action Recognition
IEEE Transactions on Image Processing (IF 10.8), Pub Date: 2021-09-17, DOI: 10.1109/tip.2021.3112008
Hanhui Li, Xudong Jiang, Boliang Guan, Raymond Rui Ming Tan, Ruomei Wang, Nadia Magnenat Thalmann

Recent methods, including CoViAR and DMC-Net, provide a new paradigm for action recognition because they operate directly on compressed videos (e.g., MPEG-4 files). This avoids the cumbersome decoding procedure of traditional methods and leverages the motion vectors and residuals already encoded in compressed videos to perform recognition efficiently. However, motion vectors and residuals are noisy, sparse, and highly correlated, so plain, separate networks cannot exploit them effectively. To tackle these issues, we propose a joint feature optimization and fusion framework that makes better use of motion vectors and residuals in three respects. (i) We model feature optimization as a reconstruction process that represents features by a set of bases, and propose a joint feature optimization module that extracts bases in both modalities. (ii) A low-rank non-local attention module, which combines the non-local operation with a low-rank constraint, is proposed to tackle noise and sparsity during feature reconstruction. (iii) A lightweight feature fusion module and a self-adaptive knowledge distillation method are introduced; they use motion vectors and residuals to generate predictions similar to those of networks that use optical flow. With these components embedded in a baseline network, the proposed network not only achieves state-of-the-art performance on HMDB-51 and UCF-101 but also retains its advantage in computational complexity.
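
The abstract does not give the exact formulation of points (i) and (ii). The general idea of reconstructing features from a small set of bases, so that the reconstruction is low-rank, can still be illustrated with a generic toy. The sketch below is hypothetical and written in pure Python. The function name `low_rank_nonlocal`, the projection matrix `w_basis`, and the two-softmax pooling/assignment scheme are assumptions made for illustration. They are not the authors' module.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]


def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """(n x m) @ (m x p) -> (n x p) using plain lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def low_rank_nonlocal(x, w_basis):
    """Hypothetical basis-reconstruction attention (illustrative only).

    x:       N x C features (N positions, C channels).
    w_basis: C x K projection giving K basis logits per position (K << N).
    Returns an N x C reconstruction whose rank is at most K.
    """
    logits = matmul(x, w_basis)                         # N x K
    # Pool all positions into K bases: softmax over positions for each basis.
    pool = [softmax(col) for col in transpose(logits)]  # K x N
    bases = matmul(pool, x)                             # K x C
    # Re-express each position as a convex mix of the K bases.
    assign = [softmax(row) for row in logits]           # N x K
    return matmul(assign, bases)                        # N x C
```

The output is an N×K matrix times a K×C matrix, so its rank is at most K. Choosing K much smaller than N is what makes the reconstruction low-rank. Each output value is also a convex combination of the input values in its channel, which tends to smooth out isolated noisy or sparse entries. A practical version would likely work on tensors, learn `w_basis`, and add the reconstruction back to the input as a residual.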

Updated: 2021-09-17