Elsevier

Pattern Recognition Letters

Volume 136, August 2020, Pages 244-250

Data augmentation method for improving the accuracy of human pose estimation with cropped images

https://doi.org/10.1016/j.patrec.2020.06.015

Highlights

  • We propose a data augmentation technique that incorporates cropped images for human pose estimation.

  • The proposed method enhances the accuracy of state-of-the-art networks.

  • The proposed method effectively reduces false negatives in keypoint localization.

  • Proper cropping and training strategies maximize the effect of the proposed method.

Abstract

Neural networks have improved the accuracy of human pose estimation from a single RGB image. However, such estimation remains difficult, especially when the human body is only partially visible because of occlusions or the limited field of view of the camera. In this paper, we introduce a data augmentation method called body-cropping augmentation (BCA), which generalizes the dataset for effective training in human pose estimation. The technique comprises policies for data generation and a training strategy that uses the augmented data. Experiments on the COCO val2017 dataset with ground-truth bounding boxes show that BCA consistently improves the accuracy of state-of-the-art neural networks by an average of 1.08% without any modification to the network architecture. Moreover, BCA effectively reduces false negatives in keypoint localization, especially when the input image contains only a few visible keypoints.

Introduction

Body poses are primarily used to understand human behavior. The availability of large-scale image datasets of human poses (e.g., [1] and [2]) has encouraged researchers and practitioners to develop neural networks for human pose estimation. However, the pose estimation task remains challenging, especially when a body is only partially visible; the performance of such estimation tends to deteriorate because the image offers too few clues. Ruggero Ronchi and Perona [3] also reported that errors increase when an input image has only a few visible keypoints and that a large portion of the errors are false negatives, which indicates that the network fails to localize certain keypoints.

Data augmentation is a regularization method that improves the training performance of neural networks without modifying their architectures. This technique allows networks to generalize effectively by adding modified data to the training process. The affine transform-based data augmentation introduced by Cireşan et al. [4], Krizhevsky et al. [5] and Simonyan and Zisserman [6] has become an essential process in the human pose estimation domain. In addition to augmentations that change the feature patterns within the region of interest (ROI), altering the ROI itself can reinforce the training data. When we modify the ROI that captures a human figure, the new ROI contains a different amount of information for human pose estimation.
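As an illustration of the affine transform-based augmentation mentioned above, the following minimal sketch rotates and scales keypoint annotations about a chosen center with NumPy. The function names `affine_matrix` and `transform_keypoints` and the parameter values are hypothetical and are not taken from the cited works; in practice the same matrix would also be applied to the image.

```python
import numpy as np


def affine_matrix(angle_deg, scale, center):
    """Build a 2x3 matrix that rotates and scales points about `center`."""
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = scale * np.cos(theta), scale * np.sin(theta)
    cx, cy = center
    # p' = A (p - c) + c, written as a single 2x3 matrix.
    return np.array([
        [cos_t, -sin_t, cx - cos_t * cx + sin_t * cy],
        [sin_t, cos_t, cy - sin_t * cx - cos_t * cy],
    ])


def transform_keypoints(keypoints, matrix):
    """Apply a 2x3 affine matrix to an (N, 2) array of (x, y) keypoints."""
    keypoints = np.asarray(keypoints, dtype=float)
    homogeneous = np.hstack([keypoints, np.ones((len(keypoints), 1))])
    return homogeneous @ matrix.T


# Hypothetical usage: rotate by 30 degrees and enlarge by 20% about (50, 100).
matrix = affine_matrix(30.0, 1.2, (50.0, 100.0))
augmented = transform_keypoints([[50.0, 10.0], [30.0, 40.0]], matrix)
```

Note that this kind of augmentation changes how the body appears inside the ROI but leaves the set of annotated keypoints unchanged, which is the contrast the paragraph above draws with ROI alteration.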

In the present study, we propose body-cropping augmentation (BCA) to improve the accuracy of human pose estimation. BCA generates new training data by selecting a human ROI in various ways. As a result, a neural network learns to localize keypoints using only partial features of the human body, which in turn improves the general performance of human pose estimation. BCA includes policies for compiling proper training data so that the augmented data do not contain overly small segments or images that are highly similar to one another. In addition, we provide a learning strategy that maximizes the effectiveness of the proposed data augmentation. Using the results of various experiments involving state-of-the-art neural networks (e.g., Chen et al. [7], Xiao et al. [8] and Sun et al. [9]), we verify that the proposed BCA can improve the performance of a state-of-the-art network by an average of 1.08% on the COCO val2017 dataset. Furthermore, using the results from the benchmarking tool of [3], we show that BCA contributes to the reduction of false negatives, especially when the input image has only a few visible keypoints.
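The idea of generating cropped ROIs while filtering out overly small or overly similar crops can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the cropping policy of the paper: the function name `body_crops`, the restriction to upper and lower crops, and the thresholds `fractions`, `min_keypoints` and `max_iou` are all hypothetical.

```python
import numpy as np


def box_iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def body_crops(box, keypoints, visible, fractions=(0.5, 0.75, 0.95),
               min_keypoints=3, max_iou=0.9):
    """Return (crop_box, keypoint_mask) pairs for upper and lower body crops.

    Assumed filters (hypothetical thresholds):
      * drop crops that keep fewer than `min_keypoints` visible keypoints;
      * drop crops whose IoU with the original box exceeds `max_iou`,
        because they are nearly identical to the original sample.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    x0, y0, x1, y1 = box
    height = y1 - y0
    kept = []
    for frac in fractions:
        upper = (x0, y0, x1, y0 + frac * height)
        lower = (x0, y1 - frac * height, x1, y1)
        for crop in (upper, lower):
            inside = (
                (keypoints[:, 0] >= crop[0]) & (keypoints[:, 0] <= crop[2])
                & (keypoints[:, 1] >= crop[1]) & (keypoints[:, 1] <= crop[3])
                & visible
            )
            if inside.sum() < min_keypoints:
                continue  # too little body information left in the crop
            if box_iou(crop, box) > max_iou:
                continue  # crop is almost the same as the original ROI
            kept.append((crop, inside))
    return kept


# Hypothetical person: box 100 x 200 pixels with six annotated keypoints.
person_box = (0.0, 0.0, 100.0, 200.0)
person_kpts = [[50, 10], [30, 40], [70, 40], [35, 110], [65, 110], [50, 190]]
crops = body_crops(person_box, person_kpts, [True] * 6)
```

In this sketch, keypoints outside a crop are masked out so that a training sample built from the crop would label them as not visible; how the paper actually treats such keypoints is not stated in the text above.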


Keypoint detection for human pose estimation

Recently, various types of neural network architectures have been used to estimate probability maps of keypoint locations from a single image. Research on human pose estimation follows two main streams, namely, the top-down and the bottom-up approaches. Chen et al. [7], Xiao et al. [8], Newell et al. [10], He et al. [11], and Sun et al. [9] adopt top-down approaches, in which human pose estimation is performed after human detection, which provides the ROI of a target human. The accuracy of a

Body-cropping augmentation (BCA)

Cropping may not always be effective for data augmentation. For example, a cropped image might contain too little information, which can distract from the original problem, or differ too slightly from the original image to play any meaningful role in data augmentation. Moreover, the augmented data can be combined with the original data in various ways. In this section, we first introduce data cropping strategies to alleviate the possible problems caused by the cropping method (Section 3.1). Then, we

Dataset and experimental setting

Various datasets, such as those of [1], [21] and [2], are available for the study. Among them, we choose the COCO dataset of [2] because it has the largest volume of images, with plenty of annotations such as bounding boxes, keypoints, and segmentation masks. The dataset also contains various wild cases. The term ‘wild’ here indicates that the dataset includes images with large variance and without any constraints for human pose estimation. For example, the data include an image with a

Conclusion

We have proposed body-cropping augmentation (BCA), which includes a data collection method and a learning strategy, for enhancing human pose estimation. On the COCO dataset, the proposed BCA improves the accuracy of state-of-the-art neural networks by approximately 1.08% and 0.92% on average on val2017 and test2017, respectively. In addition, BCA allows the networks to alleviate false negatives effectively, especially when an image has only a few keypoints. The result is even promising when we

Declaration of Competing Interest

  • All authors have participated in (a) conception and design, or analysis and interpretation of the data; (b) drafting the article or revising it critically for important intellectual content; and (c) approval of the final version.

  • This manuscript has not been submitted to, nor is under review at, another journal or other publishing venue.

  • The authors have no affiliation with any organization with a direct or indirect financial interest in the subject matter discussed in the manuscript.

Acknowledgments

This work was supported by the Institute of Information and Communications Technology Planning and Evaluation (IITP) grant funded by the Korea government (MIST) [R0118-19-1004, Development of Intelligent Interaction Technology based on Recognition of User’s State and Intention for Digital Life].

References (23)

  • M. Andriluka et al.

    2D human pose estimation: new benchmark and state of the art analysis

    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2014)
  • T.-Y. Lin et al.

    Microsoft COCO: common objects in context

    Proceedings of the European Conference on Computer Vision (ECCV)

    (2014)
  • M. Ruggero Ronchi et al.

    Benchmarking and error diagnosis in multi-instance pose estimation

    Proceedings of IEEE International Conference on Computer Vision (ICCV)

    (2017)
  • D.C. Cireşan et al.

    Deep, big, simple neural nets for handwritten digit recognition

    Neural Comput.

    (2010)
  • A. Krizhevsky et al.

    ImageNet classification with deep convolutional neural networks

    NIPS

    (2012)
  • K. Simonyan et al.

    Very deep convolutional networks for large-scale image recognition

    ICLR

    (2014)
  • Y. Chen et al.

    Cascaded pyramid network for multi-person pose estimation

    Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    (2018)
  • B. Xiao et al.

    Simple baselines for human pose estimation and tracking

    Proceedings of the European Conference on Computer Vision (ECCV)

    (2018)
  • K. Sun, B. Xiao, D. Liu, J. Wang, Deep high-resolution representation learning for human pose estimation,...
  • A. Newell et al.

    Stacked hourglass networks for human pose estimation

    Proceedings of the European Conference on Computer Vision (ECCV)

    (2016)
  • K. He et al.

    Mask R-CNN

    Proceedings of IEEE International Conference on Computer Vision (ICCV)

    (2017)