Context and Structure Mining Network for Video Object Detection
International Journal of Computer Vision (IF 19.5) Pub Date: 2021-08-13, DOI: 10.1007/s11263-021-01507-2
Liang Han, Zhaozheng Yin, Pichao Wang, Fan Wang, Hao Li

Aggregating temporal features from other frames has proven very effective for video object detection, helping to overcome challenges that also afflict still-image detection, such as occlusion, motion blur, and rare poses. Currently, proposal-level feature aggregation dominates this direction. However, holistic proposal-level feature aggregation suffers from two main problems. First, the object proposals generated by the region proposal network ignore the useful context information around the object, which has been shown to benefit object classification. Second, traditional proposal-level feature aggregation treats each proposal as a whole, ignoring the important structure information of the object; this makes the similarity comparison between two proposals less effective when the proposal objects are occluded or pose-misaligned. To deal with these problems, we propose the Context and Structure Mining Network to better aggregate features for video object detection. In our method, we first encode spatial-temporal context information into the object features in a global manner, which benefits object classification. In addition, each holistic proposal is divided into several patches to capture the structure information of the object, and cross-patch matching is conducted to alleviate pose misalignment between objects in the target and support proposals. Moreover, an importance weight is learned for each target proposal patch, indicating how informative the patch is for the final feature aggregation, so that occluded patches can be neglected. This enables the aggregation module to leverage the most informative patches when computing the final aggregated feature. The proposed framework outperforms all the latest state-of-the-art methods on the ImageNet VID dataset by a large margin. The project is publicly available at https://github.com/LiangHann/Context-and-Structure-Mining-Network-for-Video-Object-Detection.
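To make the patch-based aggregation concrete, the following is a minimal sketch of cross-patch matching combined with learned per-patch importance weights, assuming a PyTorch implementation. The module name `PatchAggregation`, the projection layers, and the fusion rule are illustrative simplifications of the ideas above, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchAggregation(nn.Module):
    """Match target-proposal patches against support-proposal patches across
    frames, then fuse support features into the target using (a) cross-patch
    similarity and (b) a learned per-patch importance score that can
    down-weight occluded or uninformative patches."""

    def __init__(self, feat_dim=256):
        super().__init__()
        # Query/key embeddings for patch similarity comparison.
        self.q_proj = nn.Linear(feat_dim, feat_dim)
        self.k_proj = nn.Linear(feat_dim, feat_dim)
        # Scores how informative each target patch is for aggregation.
        self.importance = nn.Linear(feat_dim, 1)

    def forward(self, target_feat, support_feat):
        # target_feat:  (P, C)    P patches of one target proposal
        # support_feat: (N, P, C) patches of N support proposals
        N, P, C = support_feat.shape
        q = self.q_proj(target_feat)                      # (P, C)
        k = self.k_proj(support_feat).reshape(N * P, C)   # (N*P, C)

        # Cross-patch matching: every target patch attends to every support
        # patch, so a patch can find its counterpart even when the two
        # objects are pose-misaligned.
        sim = torch.matmul(q, k.t()) / C ** 0.5           # (P, N*P)
        attn = F.softmax(sim, dim=-1)
        aggregated = torch.matmul(attn, support_feat.reshape(N * P, C))

        # Learned importance per target patch; occluded patches get a low
        # weight and contribute little to the fused proposal feature.
        w = torch.sigmoid(self.importance(target_feat))   # (P, 1)
        return w * aggregated + (1 - w) * target_feat     # (P, C)


# Example with random tensors: 4 patches per proposal, 8 support proposals.
agg = PatchAggregation(feat_dim=256)
fused = agg(torch.randn(4, 256), torch.randn(8, 4, 256))
print(fused.shape)  # torch.Size([4, 256])
```

In a full detector, `target_feat` would come from dividing an RoI-aligned target proposal feature map into patches, `support_feat` from proposals sampled in neighboring frames, and the fused patch features would be pooled before the classification and regression heads.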




Updated: 2021-08-13