Synthesizing a Scene-Specific Pedestrian Detector and Pose Estimator for Static Video Surveillance
International Journal of Computer Vision (IF 19.5), Pub Date: 2018-03-16, DOI: 10.1007/s11263-018-1077-3
Hironori Hattori, Namhoon Lee, Vishnu Naresh Boddeti, Fares Beainy, Kris M. Kitani, Takeo Kanade

We consider scenarios in which no real pedestrian data is available (e.g., a newly installed surveillance system in a novel location for which neither labeled nor unsupervised real data exists yet) and a pedestrian detector must be developed before any pedestrians have been observed. Given a single image and auxiliary scene information in the form of camera parameters and the geometric layout of the scene, our approach infers and generates a large variety of geometrically and photometrically accurate potential images of synthetic pedestrians, along with accurate ground-truth labels, using a computer graphics rendering engine. We first present an efficient discriminative learning method that takes these synthetic renders and generates a unique spatially-varying and geometry-preserving pedestrian appearance classifier customized for every possible location in the scene. To extend our approach to further analysis (i.e., estimating the pose and segmentation of pedestrians in addition to detection), we build a more general model that employs a fully convolutional neural network architecture for multi-task learning, leveraging the "free" ground-truth annotations that can be obtained from our pedestrian synthesizer. We demonstrate that when real human-annotated data is scarce or non-existent, our data generation strategy can provide an excellent solution for an array of human activity analysis tasks, including detection, pose estimation, and segmentation. Experimental results show that our approach (1) outperforms classical models and hybrid synthetic-real models, (2) outperforms various combinations of off-the-shelf state-of-the-art pedestrian detectors and pose estimators that are trained on real data, and (3) surprisingly, using purely synthetic data, outperforms models trained on real scene-specific data when such data is limited.
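
The abstract's idea that pedestrian appearance depends on scene location can be illustrated with a small geometric sketch. This is not the authors' implementation: it assumes a simple pinhole camera mounted above a flat ground plane, and every name in it (`Camera`, `expected_box`, the 1.7 m person height, the 0.4 width-to-height ratio) is a hypothetical choice made for illustration. It shows only how known camera parameters fix the expected bounding box of a pedestrian standing at a given ground position, which is the kind of label a synthetic renderer can supply at no annotation cost.

```python
import math
from dataclasses import dataclass


@dataclass
class Camera:
    """Hypothetical pinhole camera above a flat ground plane (illustrative only)."""
    focal_px: float   # focal length in pixels
    cx: float         # principal point, x (pixels)
    cy: float         # principal point, y (pixels)
    height_m: float   # camera height above the ground (meters)
    tilt_rad: float   # downward pitch of the optical axis (radians)

    def project(self, x, y, z):
        """Project a world point (x right, y forward, z up) to pixel (u, v)."""
        s, c = math.sin(self.tilt_rad), math.cos(self.tilt_rad)
        # Offset of the point from the camera center.
        dx, dy, dz = x, y, z - self.height_m
        # Camera axes in world coordinates:
        # right = (1, 0, 0), down = (0, -s, -c), forward = (0, c, -s).
        xc = dx
        yc = -s * dy - c * dz
        zc = c * dy - s * dz
        if zc <= 0:
            raise ValueError("point is behind the camera")
        return (self.cx + self.focal_px * xc / zc,
                self.cy + self.focal_px * yc / zc)


def expected_box(cam, ground_x, ground_y, person_height_m=1.7, aspect=0.4):
    """Return (left, top, width, height) in pixels for a person standing
    at (ground_x, ground_y) on the ground plane.

    person_height_m and aspect (width / height) are assumed values.
    """
    u_foot, v_foot = cam.project(ground_x, ground_y, 0.0)
    _, v_head = cam.project(ground_x, ground_y, person_height_m)
    height = v_foot - v_head          # image v grows downward
    width = aspect * height
    return (u_foot - width / 2.0, v_head, width, height)


if __name__ == "__main__":
    cam = Camera(focal_px=1000.0, cx=640.0, cy=360.0,
                 height_m=5.0, tilt_rad=math.radians(15.0))
    for depth in (10.0, 20.0, 40.0):
        print(depth, expected_box(cam, 0.0, depth))
```

Under these assumptions the box shrinks as the pedestrian moves away from the camera, so a detector tuned per location can search at a single, geometry-determined scale instead of all scales.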

Updated: 2018-03-16