Skip to main content
Log in

A Shape Transformation-based Dataset Augmentation Framework for Pedestrian Detection

  • Published:
International Journal of Computer Vision Aims and scope Submit manuscript

Abstract

Deep learning-based computer vision is usually data-hungry. Many researchers attempt to augment datasets with synthesized data to improve model robustness. However, the augmentation of popular pedestrian datasets, such as Caltech and Citypersons, can be extremely challenging because real pedestrians are commonly in low quality. Due to the factors like occlusions, blurs, and low-resolution, it is significantly difficult for existing augmentation approaches, which generally synthesize data using 3D engines or generative adversarial networks (GANs), to generate realistic-looking pedestrians. Alternatively, to access much more natural-looking pedestrians, we propose to augment pedestrian detection datasets by transforming real pedestrians from the same dataset into different shapes. Accordingly, we propose the Shape Transformation-based Dataset Augmentation (STDA) framework. The proposed framework is composed of two subsequent modules, i.e.  the shape-guided deformation and the environment adaptation. In the first module, we introduce a shape-guided warping field to help deform the shape of a real pedestrian into a different shape. Then, in the second stage, we propose an environment-aware blending map to better adapt the deformed pedestrians into surrounding environments, obtaining more realistic-looking pedestrians and more beneficial augmentation results for pedestrian detection. Extensive empirical studies on different pedestrian detection benchmarks show that the proposed STDA framework consistently produces much better augmentation results than other pedestrian synthesis approaches using low-quality pedestrians. By augmenting the original datasets, our proposed framework also improves the baseline pedestrian detector by up to 38% on the evaluated benchmarks, achieving state-of-the-art performance.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14

Similar content being viewed by others

Notes

  1. https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

References

  • Alcorn, M. A., Li, Q., Gong, Z., Wang, C., Mai, L., Ku, W. S., Nguyen, A. (2018). Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. arXiv preprint arXiv:1811.11553

  • Arjovsky, M., Chintala, S., Bottou, L. (2017). Wasserstein gan. arXiv preprint arXiv:1701.07875

  • Bar-Hillel, A., Levi, D., Krupka, E., & Goldberg, C. (2010). Part-based feature synthesis for human detection. ECCV (pp. 127–142). New york: Springer.

    Google Scholar 

  • Brazil, G., Yin, X., Liu, X. (2017). Illuminating pedestrians via simultaneous detection & segmentation. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy

  • Cai, Z., Fan, Q., Feris, R. S., & Vasconcelos, N. (2016). A unified multi-scale deep convolutional neural network for fast object detection. ECCV (pp. 354–370). New york: Springer.

    Google Scholar 

  • Chen, Z., Li, J., Chen, Z., & You, X. (2017). Generic pixel level object tracker using bi-channel fully convolutional network. International conference on neural information processing (pp. 666–676). New york: Springer.

    Chapter  Google Scholar 

  • Chen, Z., Zhang, J., & Tao, D. (2019). Progressive lidar adaptation for road detection. IEEE/CAA Journal of Automatica Sinica, 6(3), 693–702.

    Article  Google Scholar 

  • Cireşan, D. C., Meier, U., Gambardella, L. M., & Schmidhuber, J. (2010). Deep, big, simple neural nets for handwritten digit recognition. Neural Computation, 22(12), 3207–3220.

    Article  Google Scholar 

  • Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y. (2017). Deformable convolutional networks. In: Proceedings of the IEEE international conference on computer vision, pp 764–773

  • Dao, T., Gu, A., Ratner, A., Smith, V., De Sa, C., Re, C. (2019). A kernel theory of modern data augmentation. In: International Conference on Machine Learning, pp 1528–1537

  • Dollár, P., Wojek, C., Schiele, B., Perona, P. (2009). Pedestrian detection: A benchmark. In: CVPR, IEEE, pp 304–311

  • Dollar, P., Wojek, C., Schiele, B., & Perona, P. (2012). Pedestrian detection: An evaluation of the state of the art. T-PAMI, 34(4), 743–761.

    Article  Google Scholar 

  • Dosovitskiy, A., Fischer, P., Springenberg, J. T., Riedmiller, M., & Brox, T. (2015). Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9), 1734–1747.

    Article  Google Scholar 

  • Du, X., El-Khamy, M., Lee, J., Davis, L. (2017). Fused dnn: A deep neural network fusion approach to fast and robust pedestrian detection. In: WACV, IEEE, pp 953–961

  • Enzweiler, M., & Gavrila, D. M. (2008). Monocular pedestrian detection: Survey and experiments. T-PAMI, 12, 2179–2195.

    Google Scholar 

  • Felzenszwalb, P., McAllester, D., Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In: CVPR, IEEE, pp 1–8

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D. (2010a). Cascade object detection with deformable part models. In: CVPR, IEEE, pp 2241–2248

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010b). Object detection with discriminatively trained part-based models. T-PAMI, 32(9), 1627–1645.

    Article  Google Scholar 

  • Ge, Y., Li, Z., Zhao, H., Yin, G., Yi, S., Wang, X., et al. (2018). Fd-gan: Pose-guided feature distilling gan for robust person re-identification. Advances in Neural Information Processing Systems, 31, 1230–1241.

    Google Scholar 

  • Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11), 1231–1237.

    Article  Google Scholar 

  • Girshick, R., Donahue, J., Darrell, T., Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 580–587

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y. (2014). Generative adversarial nets. In: NIPS, pp 2672–2680

  • Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1, No. 2). Cambridge: MIT press.

  • Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., Courville, A. C. (2017). Improved training of wasserstein gans. In: NIPS, pp 5767–5777

  • Hattori, H., Naresh Boddeti, V., Kitani, K. M., Kanade, T.(2015). Learning scene-specific pedestrian detectors without real data. In: CVPR, pp 3819–3827

  • He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep residual learning for image recognition. In: CVPR, pp 770–778

  • He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017). Mask r-cnn. In: ICCV, IEEE, pp 2980–2988

  • Huang, S., Ramanan, D. (2017). Expecting the unexpected: Training detectors for unusual pedestrians with adversarial imposters. In: CVPR, IEEE, vol 1

  • Isola, P., Zhu, J. Y., Zhou, T., Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. arXiv preprint

  • Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spatial transformer networks. In: Advances in neural information processing systems, pp 2017–2025

  • Lee, D., Liu, S., Gu, J., Liu, M. Y., Yang, M.H., Kautz, J. (2018). Context-aware synthesis and placement of object instances. In: NeurIPS, pp 10393–10403

  • Lerer, A., Gross, S., Fergus, R. (2016). Learning physical intuition of block towers by example. arXiv preprint arXiv:1603.01312

  • Li, J., Liang, X., Shen, S., Xu, T., Feng, J., & Yan, S. (2018). Scale-aware fast r-cnn for pedestrian detection. TMM, 20(4), 985–996.

    Google Scholar 

  • Lin, C., Lu, J., Wang, G., & Zhou, J. (2018). Graininess-aware deep feature learning for pedestrian detection. ECCV (pp. 732–747). New york: Springer.

    Google Scholar 

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft coco: Common objects in context. ECCV (pp. 740–755). New york: Springer.

    Google Scholar 

  • Lin, T.Y., Dollár, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J. (2017). Feature pyramid networks for object detection. In: CVPR, vol 1, p 4

  • Liu, J., Ni, B., Yan, Y., Zhou P., Cheng, S., Hu, J. (2018). Pose transferrable person re-identification. In: CVPR, IEEE, pp 4099–4108

  • Liu, L., Muelly, M., Deng, J., Pfister, T., Li, L. J. (2019). Generative modeling for small-data object detection. In: ICCV, pp 6073–6081

  • Liu, M. Y., Breuel, T., Kautz, J .(2017a). Unsupervised image-to-image translation networks. In: NIPS, pp 700–708

  • Liu, T., Lugosi, G., Neu, G., Tao, D. (2017b). Algorithmic stability and hypothesis complexity. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org, pp 2159–2167

  • Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., et al. (2016). Ssd: Single shot multibox detector. ECCV (pp. 21–37). New york: Springer.

    Google Scholar 

  • Loy, C. C., Lin, D., Ouyang, W., Xiong, Y., Yang, S., Huang, Q., Zhou, D., Xia, W., Li, Q., Luo, P., et al. (2019). Wider face and pedestrian challenge 2018: Methods and results. arXiv preprint arXiv:1902.06854

  • Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L. (2017). Pose guided person image generation. In: Advances in Neural Information Processing Systems, pp 406–416

  • Ma, L., Sun, Q., Georgoulis, S., Van Gool, L., Schiele, B., Fritz, M. (2018). Disentangled person image generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 99–108

  • Ouyang, W., Wang, X. (2013). Joint deep learning for pedestrian detection. In: Proceedings of the IEEE international conference on computer vision, pp 2056–2063

  • Ouyang, W., Zhou, H., Li, H., Li, Q., Yan, J., & Wang, X. (2017). Jointly learning deep features, deformable parts, occlusion and classification for pedestrian detection. IEEE transactions on pattern analysis and machine intelligence, 40(8), 1874–1887.

    Article  Google Scholar 

  • Ouyang, W., Zhou, H., Li, H., Li, Q., Yan, J., & Wang, X. (2018a). Jointly learning deep features, deformable parts, occlusion and classification for pedestrian detection. T-PAMI, 40(8), 1874–1887.

    Article  Google Scholar 

  • Ouyang, X., Cheng, Y., Jiang, Y., Li, C. L., Zhou, P. (2018b). Pedestrian-synthesis-gan: Generating pedestrian data in real scene and beyond. arXiv preprint arXiv:1804.02047

  • Park, D., Ramanan, D., & Fowlkes, C. (2010). Multiresolution models for object detection. ECCV (pp. 241–254). New york: Springer.

    Google Scholar 

  • Pishchulin, L., Jain, A., Wojek, C., Andriluka, M., Thormählen, T., Schiele, B. (2011). Learning people detection models from few training samples. In: CVPR, IEEE, pp 1473–1480

  • Radford, A., Metz, L., Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434

  • Ran, Y., Weiss, I., Zheng, Q., & Davis, L. S. (2007). Pedestrian detection via periodic motion analysis. International Journal of Computer Vision, 71(2), 143–160.

    Article  Google Scholar 

  • Redmon, J., Divvala, S., Girshick, R., Farhadi, A. (2016). You only look once: Unified, real-time object detection. In: CVPR, IEEE, pp 779–788

  • Ren, S., He, K., Girshick, R., Sun, J. (2015). Faster r-cnn: Towards real-time object detection with region proposal networks. In: NIPS, pp 91–99

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical image computing and computer-assisted intervention (pp. 234–241). New york: Springer.

    Google Scholar 

  • Ros, G., Sellart, L., Materzynska, J., Vazquez, D., Lopez, A. M. (2016). The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In: CVPR, IEEE, pp 3234–3243

  • Sajjadi, M., Javanmardi, M., Tasdizen, T. (2016). Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In: Advances in Neural Information Processing Systems, pp 1163–1171

  • Siarohin, A., Sangineto, E., Lathuilière, S., Sebe, N. (2018). Deformable gans for pose-based human image generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 3408–3416

  • Simonyan, K., Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  • Song, T., Sun, L., Xie, D., Sun, H., Pu, S. (2018). Small-scale pedestrian detection based on topological line localization and temporal feature aggregation. In: The European Conference on Computer Vision (ECCV)

  • Vapnik, V. (2013). The nature of statistical learning theory. New york: Springer.

    MATH  Google Scholar 

  • Villegas, R., Yang, J., Zou, Y., Sohn, S., Lin, X., Lee, H. (2017). Learning to generate long-term future via hierarchical prediction. arXiv preprint arXiv:1704.05831

  • Viola, P., Jones, M. J., & Snow, D. (2005). Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision, 63(2), 153–161.

    Article  Google Scholar 

  • Vobecky, A., Uricár, M., Hurych, D., Skoviera, R. (2019). Advanced pedestrian dataset augmentation for autonomous driving. In: ICCV Workshops, pp 0–0

  • Wang, X., Shrivastava, A., Gupta, A. (2017). A-fast-rcnn: Hard positive generation via adversary for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2606–2615

  • Wang, X., Xiao, T., Jiang, Y., Shao, S., Sun, J., Shen, C. (2018). Repulsion loss: detecting pedestrians in a crowd. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 7774–7783

  • Yan, Y., Xu, J., Ni, B., Zhang, W., Yang, X. (2017). Skeleton-aided articulated motion generation. In: Proceedings of the 2017 ACM on Multimedia Conference, ACM , pp 199–207

  • Zanfir, M., Popa, A. I., Zanfir, A., Sminchisescu, C. (2018). Human appearance transfer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 5391–5399

  • Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O. (2016a). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530

  • Zhang, J., Chen, Z., Tao, D. (2020). Towards high performance human keypoint detection. arXiv preprint arXiv:2002.00537

  • Zhang, L., Lin, L., Liang, X., & He, K. (2016b). Is faster r-cnn doing well for pedestrian detection? ECCV (pp. 443–457). New york: Springer.

    Google Scholar 

  • Zhang, S., Benenson, R., Omran, M., Hosang, J., Schiele, B. (2016c). How far are we from solving pedestrian detection? In: CVPR, IEEE, pp 1259–1267

  • Zhang, S., Benenson, R., Schiele, B. (2017). Citypersons: A diverse dataset for pedestrian detection. In: CVPR, IEEE, vol 1, p 3

  • Zhang, S., Wen, L., Bian, X., Lei, Z., Li, S. Z. (2018a). Occlusion-aware r-cnn: detecting pedestrians in a crowd. In: Proceedings of the European Conference on Computer Vision (ECCV), pp 637–653

  • Zhang, S., Yang, J., Schiele, B. (2018b). Occluded pedestrian detection through guided attention in cnns. In: CVPR, IEEE, pp 6995–7003

  • Zheng, Z., Zheng, L., Yang, Y. (2017). Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In: Proceedings of the IEEE International Conference on Computer Vision, pp 3754–3762

  • Zhu, J.Y., Park, T., Isola, P., Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV, IEEE

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Dacheng Tao.

Additional information

Communicated by Cha Zhang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by Australian Research Council Projects FL-170100117, IH-180100002, IC-190100031, DP200103223, and Australian Medical Research Future Fund MRFAI000085.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Chen, Z., Ouyang, W., Liu, T. et al. A Shape Transformation-based Dataset Augmentation Framework for Pedestrian Detection. Int J Comput Vis 129, 1121–1138 (2021). https://doi.org/10.1007/s11263-020-01412-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11263-020-01412-0

Keywords

Navigation