Real-time and accurate object detection in compressed video by long short-term feature aggregation
Computer Vision and Image Understanding (IF 4.5), Pub Date: 2021-03-05, DOI: 10.1016/j.cviu.2021.103188
Xinggang Wang, Zhaojin Huang, Bencheng Liao, Lichao Huang, Yongchao Gong, Chang Huang

Video object detection is a fundamental problem in computer vision and has a wide spectrum of applications. Based on deep networks, video object detection is actively studied to push the limits of detection speed and accuracy. To reduce the computation cost, we sparsely sample key frames in a video and treat the remaining frames as non-key frames; a large, deep network extracts features for key frames and a tiny network extracts features for non-key frames. To enhance the features of non-key frames, we propose a novel short-term feature aggregation method that quickly propagates the rich information in key frame features to non-key frame features. The fast feature aggregation is enabled by the freely available motion cues in compressed videos. Further, key frame features are also aggregated based on optical flow. The propagated deep features are then integrated with the directly extracted features for object detection. The feature extraction and feature integration parameters are optimized in an end-to-end manner. The proposed video object detection network is evaluated on the large-scale ImageNet VID benchmark and achieves 77.2% mAP, which is on par with state-of-the-art accuracy, at a speed of 30 FPS using a Titan X GPU. The source code is available at https://github.com/hustvl/LSFA.
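The short-term propagation step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the repository linked above for that); it only shows the general idea of warping a key-frame feature map with per-location motion offsets and blending it with features extracted directly from a non-key frame. Every name here (`is_key_frame`, `warp_features`, `fuse`), the fixed key-frame interval, the nearest-neighbour sampling, and the fixed blending weight are assumptions made for illustration; the paper learns its integration parameters end to end, and the optical-flow aggregation of key-frame features is not covered.

```python
import numpy as np


def is_key_frame(index: int, interval: int = 10) -> bool:
    """Assumed schedule: every `interval`-th frame is treated as a key frame."""
    return index % interval == 0


def warp_features(key_feat: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Warp a key-frame feature map of shape (C, H, W) to a non-key frame.

    `motion` has shape (2, H, W): motion[0] holds dy and motion[1] holds dx,
    pointing from each location in the current frame back to its source
    location in the key frame. Sampling is nearest-neighbour with border
    clamping, chosen only to keep the sketch short.
    """
    _, h, w = key_feat.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    src_y = np.clip(np.rint(ys + motion[0]).astype(int), 0, h - 1)
    src_x = np.clip(np.rint(xs + motion[1]).astype(int), 0, w - 1)
    # Advanced indexing on the two spatial axes keeps the channel axis first.
    return key_feat[:, src_y, src_x]


def fuse(propagated: np.ndarray, direct: np.ndarray, weight: float = 0.5) -> np.ndarray:
    """Blend propagated and directly extracted features.

    A fixed convex combination stands in for the learned integration.
    """
    return weight * propagated + (1.0 - weight) * direct


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    key_feat = rng.standard_normal((8, 4, 6))     # features from the large network
    direct_feat = rng.standard_normal((8, 4, 6))  # features from the tiny network
    motion = np.zeros((2, 4, 6))
    motion[1] = -1.0                              # content moved one cell to the right
    fused = fuse(warp_features(key_feat, motion), direct_feat)
    print(fused.shape)
```

In a real compressed-video pipeline the offsets would come from the codec's motion vectors, resized to the feature-map resolution, rather than being set by hand as in the demo.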




Updated: 2021-03-10