Video object detection with a convolutional regression tracker, ISPRS Journal of Photogrammetry and Remote Sensing


Video object detection with a convolutional regression tracker
ISPRS Journal of Photogrammetry and Remote Sensing (IF 10.6), Pub Date: 2021-04-30, DOI: 10.1016/j.isprsjprs.2021.04.004
Ye Lyu, Michael Ying Yang, George Vosselman, Gui-Song Xia

Video object detection is a fundamental research task for scene understanding. Compared with object detection in images, object detection in videos has been less researched because labelled video datasets are scarce. Because frames in a video clip are highly correlated, many more video labels are needed to achieve good data variation, and such labels are not always available because they are much more expensive to obtain. As a result, it is easy to train an image object detector, but not always possible to train a video object detector when video labels for certain classes are insufficient. To address this problem and improve the performance of an image object detector on classes without video labels, we propose to augment a well-trained image object detector with an efficient and effective class-agnostic convolutional regression tracker for the video object detection task. The tracker learns to track objects by reusing the features from the image object detector, which makes it a lightweight addition to the detector with only a slight drop in speed. The performance of our model is evaluated on the large-scale ImageNet VID dataset. Our strategy improves the mean average precision (mAP) of the image object detector by around 5%, and that of the image object detector plus Seq-NMS post-processing by around 3%.
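The abstract does not say how boxes propagated by the tracker are combined with the detector's per-frame output. As a purely illustrative sketch, one simple possibility is to match tracked boxes to same-class detections by intersection over union (IoU) and to keep unmatched tracked boxes as extra candidates. The function names, the 0.5 threshold, and the max-score rule below are all assumptions made for this example, not the authors' method.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)
Det = Tuple[Box, float, str]              # (box, score, class label)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def merge_tracked(detections: List[Det], tracked: List[Det],
                  iou_thresh: float = 0.5) -> List[Det]:
    """Combine current-frame detections with boxes tracked from the previous frame.

    Hypothetical rule (an assumption, not taken from the paper):
    - a tracked box that overlaps a same-class detection keeps the detector's
      box and takes the higher of the two scores;
    - a tracked box with no matching detection is added as a new candidate.
    """
    merged = list(detections)
    n_det = len(detections)  # only match against the original detections
    for t_box, t_score, t_cls in tracked:
        best_i, best_iou = -1, 0.0
        for i in range(n_det):
            d_box, _, d_cls = merged[i]
            if d_cls != t_cls:
                continue
            overlap = iou(t_box, d_box)
            if overlap > best_iou:
                best_i, best_iou = i, overlap
        if best_i >= 0 and best_iou >= iou_thresh:
            d_box, d_score, d_cls = merged[best_i]
            merged[best_i] = (d_box, max(d_score, t_score), d_cls)
        else:
            merged.append((t_box, t_score, t_cls))
    return merged


# Toy example: one weak detection supported by a confident track,
# plus one tracked object the detector missed in this frame.
dets = [((0.0, 0.0, 10.0, 10.0), 0.4, "car")]
trks = [((1.0, 1.0, 11.0, 11.0), 0.9, "car"),
        ((50.0, 50.0, 60.0, 60.0), 0.7, "dog")]
result = merge_tracked(dets, trks)
```

In this toy case the weakly scored car detection inherits the higher score of its matching track, and the tracked dog box is carried over even though the detector missed it. This mirrors the general idea of using temporal information to support a per-frame detector, under the stated assumptions.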


Updated: 2021-04-30
