Single Shot Video Object Detector
IEEE Transactions on Multimedia ( IF 8.4 ) Pub Date : 2020-01-01 , DOI: 10.1109/tmm.2020.2990070
Jiajun Deng , Yingwei Pan , Ting Yao , Wengang Zhou , Houqiang Li , Tao Mei

Single shot detectors, which are potentially faster and simpler than two-stage detectors, tend to be more applicable to object detection in videos. Nevertheless, extending such detectors from images to videos is not trivial, especially when appearance deteriorates in videos, e.g., through motion blur or occlusion. A valid question is how to exploit temporal coherence across frames to boost detection. In this paper, we propose to address the problem by enhancing per-frame features through aggregation of neighboring frames. Specifically, we present Single Shot Video Object Detector (SSVD), a new architecture that integrates feature aggregation into a one-stage detector for object detection in videos. Technically, SSVD takes Feature Pyramid Network (FPN) as the backbone network to produce multi-scale features. Unlike existing feature aggregation methods, SSVD uses a two-stream structure: on one hand, it estimates the motion and aggregates the nearby features along the motion path; on the other, it hallucinates features by directly sampling features from the adjacent frames. Extensive experiments are conducted on the ImageNet VID dataset, and competitive results are reported when compared with state-of-the-art approaches. More remarkably, for 448 × 448 input, SSVD achieves 79.2% mAP on ImageNet VID while processing one frame in 85 ms on an Nvidia Titan X Pascal GPU. The code is available at \url{this https URL}.
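
As a rough illustration of what "enhancing per-frame features through aggregation of neighboring frames" can mean in practice, the following minimal NumPy sketch fuses several feature maps into one. It is not the SSVD implementation: the function name `aggregate_features`, the input shapes, and the cosine-similarity weighting are assumptions made for illustration only, and the motion estimation and feature sampling streams described in the abstract are left out (the neighbor features are assumed to be already aligned to the reference frame).

```python
import numpy as np


def aggregate_features(aligned, ref_index):
    """Fuse per-frame feature maps into one enhanced map for the reference frame.

    aligned:   array of shape (T, C, H, W); neighbor features are assumed to be
               already aligned to the reference frame (hypothetical input).
    ref_index: position of the reference frame along the T axis.
    """
    aligned = np.asarray(aligned, dtype=np.float64)
    ref = aligned[ref_index]                                   # (C, H, W)

    # Cosine similarity between each frame and the reference, per location.
    # This weighting scheme is an assumption, not taken from the paper.
    eps = 1e-8
    dot = (aligned * ref[None]).sum(axis=1)                    # (T, H, W)
    norms = np.linalg.norm(aligned, axis=1) * np.linalg.norm(ref, axis=0)[None]
    similarity = dot / (norms + eps)

    # Softmax over the temporal axis so the weights at each location sum to 1.
    similarity = similarity - similarity.max(axis=0, keepdims=True)
    weights = np.exp(similarity)
    weights = weights / weights.sum(axis=0, keepdims=True)     # (T, H, W)

    # Weighted sum over frames gives the aggregated feature map.
    return (weights[:, None] * aligned).sum(axis=0)            # (C, H, W)


rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8, 4, 4))   # 5 frames, 8 channels, 4x4 feature map
fused = aggregate_features(frames, ref_index=2)
print(fused.shape)
```

With a multi-scale backbone such as FPN, a function like this would presumably be applied to each pyramid level separately; that detail is likewise an assumption rather than something the abstract states.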

Updated: 2020-01-01