A Shape Transformation-based Dataset Augmentation Framework for Pedestrian Detection

Chen, Zhe; Ouyang, Wanli; Liu, Tongliang; Tao, Dacheng

doi:10.1007/s11263-020-01412-0

Published: 09 January 2021

Volume 129, pages 1121–1138, (2021)

International Journal of Computer Vision

Zhe Chen¹ (ORCID: orcid.org/0000-0001-5004-8975), Wanli Ouyang¹, Tongliang Liu¹ & Dacheng Tao¹

Abstract

Deep learning-based computer vision is usually data-hungry, and many researchers attempt to augment datasets with synthesized data to improve model robustness. However, augmenting popular pedestrian datasets, such as Caltech and Citypersons, can be extremely challenging because the real pedestrians in them are commonly of low quality. Owing to factors such as occlusion, blur, and low resolution, existing augmentation approaches, which generally synthesize data using 3D engines or generative adversarial networks (GANs), have great difficulty generating realistic-looking pedestrians. As an alternative that yields much more natural-looking pedestrians, we propose to augment pedestrian detection datasets by transforming real pedestrians from the same dataset into different shapes. Accordingly, we propose the Shape Transformation-based Dataset Augmentation (STDA) framework. The framework is composed of two sequential modules: shape-guided deformation and environment adaptation. In the first module, we introduce a shape-guided warping field that deforms the shape of a real pedestrian into a different shape. In the second module, we propose an environment-aware blending map that better adapts the deformed pedestrians to their surrounding environments, yielding more realistic-looking pedestrians and more beneficial augmentation results for pedestrian detection. Extensive empirical studies on different pedestrian detection benchmarks show that the proposed STDA framework consistently produces much better augmentation results than other pedestrian synthesis approaches when low-quality pedestrians are used. By augmenting the original datasets, the proposed framework also improves the baseline pedestrian detector by up to 38% on the evaluated benchmarks, achieving state-of-the-art performance.
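
To make the two operations named in the abstract concrete, the following is a minimal, generic sketch of applying a dense warping field to an image patch and then compositing the result onto a background with a per-pixel blending map. It is not the STDA implementation: the abstract does not say how the shape-guided warping field or the environment-aware blending map are computed, so both are hand-set here, and all names (`bilinear_sample`, `warp`, `blend`) are hypothetical. Images are small grayscale grids stored as nested lists so that the example has no dependencies.

```python
from typing import List

Image = List[List[float]]  # grayscale image stored as rows of floats


def bilinear_sample(img: Image, x: float, y: float) -> float:
    """Sample img at a real-valued (x, y), clamping coordinates to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def warp(img: Image, field_x: Image, field_y: Image) -> Image:
    """Backward warp: output pixel (x, y) reads from (x + dx, y + dy)."""
    h, w = len(img), len(img[0])
    return [
        [bilinear_sample(img, x + field_x[y][x], y + field_y[y][x]) for x in range(w)]
        for y in range(h)
    ]


def blend(foreground: Image, background: Image, alpha: Image) -> Image:
    """Per-pixel convex combination: alpha * foreground + (1 - alpha) * background."""
    return [
        [a * f + (1 - a) * b for f, b, a in zip(f_row, b_row, a_row)]
        for f_row, b_row, a_row in zip(foreground, background, alpha)
    ]


# Toy data (assumed for illustration only).
patch = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
zero = [[0.0] * 3 for _ in range(3)]
shift = [[1.0] * 3 for _ in range(3)]        # every pixel reads its right-hand neighbour
background = [[10.0] * 3 for _ in range(3)]
half = [[0.5] * 3 for _ in range(3)]          # uniform 50% blending map

warped = warp(patch, shift, zero)
composite = blend(warped, background, half)
```

In this sketch the warping field is a constant one-pixel shift and the blending map is uniform. In a learned setting, both would vary per pixel, which is what allows a deformation to follow a target shape and the blend to adapt to the local background.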

Notes

https://github.com/junyanz/pytorch-CycleGAN-and-pix2pix

Author information

Authors and Affiliations

University of Sydney, Sydney, NSW, Australia
Zhe Chen, Wanli Ouyang, Tongliang Liu & Dacheng Tao

Corresponding author

Correspondence to Dacheng Tao.

Additional information

Communicated by Cha Zhang.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by Australian Research Council Projects FL-170100117, IH-180100002, IC-190100031, DP200103223, and Australian Medical Research Future Fund MRFAI000085.

About this article

Cite this article

Chen, Z., Ouyang, W., Liu, T. et al. A Shape Transformation-based Dataset Augmentation Framework for Pedestrian Detection. Int J Comput Vis 129, 1121–1138 (2021). https://doi.org/10.1007/s11263-020-01412-0

Received: 09 December 2019
Accepted: 30 November 2020
Published: 09 January 2021
Issue Date: April 2021
DOI: https://doi.org/10.1007/s11263-020-01412-0
