Abstract

Batch Normalization (BN) is a milestone technique in the development of deep learning, enabling various networks to train. However, normalizing along the batch dimension introduces problems: BN’s error increases rapidly as the batch size becomes smaller, because the batch statistics are estimated inaccurately. This limits BN’s usage for training larger models and for transferring features to computer vision tasks including detection, segmentation, and video, which require small batches because of memory constraints. In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes the mean and variance within each group for normalization. GN’s computation is independent of batch size, and its accuracy is stable over a wide range of batch sizes. On ResNet-50 trained on ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparable to BN and outperforms other normalization variants. Moreover, GN can be naturally transferred from pre-training to fine-tuning. GN can outperform its BN-based counterparts for object detection and segmentation in COCO (https://github.com/facebookresearch/Detectron/blob/master/projects/GN), and for video classification in Kinetics, showing that GN can effectively replace the powerful BN in a variety of tasks. GN can be easily implemented in a few lines of code in modern libraries.
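
The grouping step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name group_norm, the (N, C, H, W) layout, the eps value, the optional per-channel gamma and beta, and the choice of 2 groups are all assumptions made for illustration.

```python
import numpy as np

def group_norm(x, num_groups, gamma=None, beta=None, eps=1e-5):
    """Illustrative Group Normalization over an (N, C, H, W) array."""
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    # Split the channel axis into groups: (N, G, C // G, H, W).
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    # Statistics are computed per sample and per group;
    # the batch axis is never reduced.
    mean = grouped.mean(axis=(2, 3, 4), keepdims=True)
    var = grouped.var(axis=(2, 3, 4), keepdims=True)
    normalized = (grouped - mean) / np.sqrt(var + eps)
    normalized = normalized.reshape(n, c, h, w)
    # Optional per-channel scale and shift (assumed here, not taken
    # from the abstract).
    if gamma is not None:
        normalized = normalized * gamma.reshape(1, c, 1, 1)
    if beta is not None:
        normalized = normalized + beta.reshape(1, c, 1, 1)
    return normalized

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 5, 5))          # batch of 4, 8 channels
full = group_norm(x, num_groups=2)          # normalize the whole batch
single = group_norm(x[:1], num_groups=2)    # normalize the first sample alone
# The first sample gets the same output in both cases, because no
# statistic depends on the other samples in the batch.
batch_independent = np.allclose(full[:1], single)
```

Because the mean and variance are taken over the channel-group and spatial axes only, the result for one sample does not change with the batch size, which is the property the abstract contrasts with BN.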

Notes

  1. In the context of this paper, we use “batch size” to refer to the number of samples per worker (e.g., GPU), unless noted. BN’s statistics are computed for each worker, but not broadcast across workers, as is standard in many libraries.

  2. https://github.com/pytorch/pytorch/blob/master/caffe2/operators/group_norm_op.h.

  3. https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/Normalization.cpp.

  4. For completeness, we have also trained ResNet-50 with WN (Salimans and Kingma 2016), which is filter (instead of feature) normalization. WN’s result is 28.2%.

  5. Detectron (Girshick et al. 2018) uses pre-trained models provided by the authors of He et al. (2016). For fair comparisons, we instead use the models pre-trained in this paper. The object detection and segmentation accuracy is statistically similar between these pre-trained models.

  6. For models trained from scratch, we turn off the default StopGrad in Detectron that freezes the first few layers.

  7. We refer to “distributed training” as training with multiple workers (GPUs), which are often hosted in multiple machines. In our infrastructure, typical settings are 8 GPUs per machine.

References

  • Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. (2016). TensorFlow: A system for large-scale machine learning. In Operating systems design and implementation (OSDI).

  • Arpit, D., Zhou, Y., Kota, B., & Govindaraju, V. (2016). Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In International conference on machine learning (ICML).

  • Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450.

  • Bottou, L., Curtis, F. E., & Nocedal, J. (2016). Optimization methods for large-scale machine learning. arXiv:1606.04838.

  • Carandini, M., & Heeger, D. J. (2012). Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13, 51.

  • Carreira, J., & Zisserman, A. (2017). Quo vadis, action recognition? A new model and the kinetics dataset. In Computer vision and pattern recognition (CVPR).

  • Chollet, F. (2017). Xception: Deep learning with depthwise separable convolutions. In Computer vision and pattern recognition (CVPR).

  • Cohen, T., & Welling, M. (2016). Group equivariant convolutional networks. In International conference on machine learning (ICML).

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Computer vision and pattern recognition (CVPR).

  • Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. (2012). Large scale distributed deep networks. In Neural information processing systems (NeurIPS).

  • Dieleman, S., De Fauw, J., & Kavukcuoglu, K. (2016). Exploiting cyclic symmetry in convolutional neural networks. In International conference on machine learning (ICML).

  • Girshick, R. (2015). Fast R-CNN. In International conference on computer vision (ICCV).

  • Girshick, R., Radosavovic, I., Gkioxari, G., Dollár, P., & He, K. (2018). Detectron. https://github.com/facebookresearch/detectron.

  • You, Y., Gitman, I., & Ginsburg, B. (2017). Scaling SGD batch size to 32K for ImageNet training. arXiv:1708.03888.

  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics (AISTATS).

  • Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Neural information processing systems (NeurIPS).

  • Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., et al. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677.

  • Gross, S., & Wilber, M. (2016). Training and investigating residual nets. https://github.com/facebook/fb.resnet.torch.

  • He, K., Girshick, R., & Dollár, P. (2018). Rethinking imagenet pre-training. arXiv:1811.08883.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In International conference on computer vision (ICCV).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In International conference on computer vision (ICCV).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Computer vision and pattern recognition (CVPR).

  • Heeger, D. J. (1992). Normalization of cell responses in cat striate cortex. Visual Neuroscience, 9, 181–197.

  • Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9, 1735–1780.

  • Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., et al. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861.

  • Huang, G., Liu, Z., van der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. In Computer vision and pattern recognition (CVPR).

  • Ioffe, S. (2017). Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Neural information processing systems (NeurIPS).

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning (ICML).

  • Isola, P., Zhu, J. Y., Zhou, T., & Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In Computer vision and pattern recognition (CVPR).

  • Jarrett, K., Kavukcuoglu, K., LeCun, Y., et al. (2009). What is the best multi-stage architecture for object recognition? In International conference on computer vision (ICCV).

  • Jegou, H., Douze, M., Schmid, C., & Perez, P. (2010). Aggregating local descriptors into a compact image representation. In Computer vision and pattern recognition (CVPR).

  • Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., et al. (2017). The Kinetics human action video dataset. arXiv:1705.06950.

  • Krizhevsky, A. (2014). One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997.

  • Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. In Neural information processing systems (NeurIPS).

  • LeCun, Y., Bottou, L., Orr, G. B., & Müller, K. R. (1998). Efficient backprop. In Neural networks: Tricks of the trade.

  • Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., & Sun, J. (2018). DetNet: A backbone network for object detection. arXiv:1804.06215.

  • Lin, T. Y., Dollár, P., Girshick, R., He, K., Hariharan, B., & Belongie, S. (2017a). Feature pyramid networks for object detection. In Computer vision and pattern recognition (CVPR).

  • Lin, T. Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017b). Focal loss for dense object detection. In International conference on computer vision (ICCV).

  • Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European conference on computer vision (ECCV).

  • Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In Computer vision and pattern recognition (CVPR).

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60, 91–110.

  • Lyu, S., & Simoncelli, E. P. (2008). Nonlinear image representation using divisive normalization. In Computer vision and pattern recognition (CVPR).

  • Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision (IJCV), 42, 145–175.

  • Peng, C., Xiao, T., Li, Z., Jiang, Y., Zhang, X., Jia, K., Yu, G., & Sun, J. (2018). MegDet: A large mini-batch object detector. In Computer vision and pattern recognition (CVPR).

  • Perronnin, F., & Dance, C. (2007). Fisher kernels on visual vocabularies for image categorization. In Computer vision and pattern recognition (CVPR).

  • Rebuffi, S. A., Bilen, H., & Vedaldi, A. (2017). Learning multiple visual domains with residual adapters. In Neural information processing systems (NeurIPS).

  • Ren, M., Liao, R., Urtasun, R., Sinz, F. H., & Zemel, R. S. (2017a). Normalizing the normalizers: Comparing and extending network normalization schemes. In International conference on learning representations (ICLR).

  • Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural information processing systems (NeurIPS).

  • Ren, S., He, K., Girshick, R., Zhang, X., & Sun, J. (2017b). Object detection networks on convolutional feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39, 1476–1481.

  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.

  • Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115, 211–252.

  • Salimans, T., & Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Neural information processing systems (NeurIPS).

  • Schwartz, O., & Simoncelli, E. P. (2001). Natural signal statistics and sensory gain control. Nature Neuroscience, 4, 819.

  • Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2014). Overfeat: Integrated recognition, localization and detection using convolutional networks. In International conference on learning representations (ICLR).

  • Shillingford, B., Assael, Y., Hoffman, M. W., Paine, T., Hughes, C., Prabhu, U., et al. (2018). Large-scale visual speech recognition. arXiv:1807.05162.

  • Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., et al. (2017). Mastering the game of go without human knowledge. Nature, 550, 354.

  • Simoncelli, E. P., & Olshausen, B. A. (2001). Natural image statistics and neural representation. Annual Review of Neuroscience, 24, 1193–1216.

  • Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition. In International conference on learning representations (ICLR).

  • Szegedy, C., Ioffe, S., & Vanhoucke, V. (2016a). Inception-v4, inception-resnet and the impact of residual connections on learning. In ICLR workshop.

  • Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In Computer vision and pattern recognition (CVPR).

  • Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016b). Rethinking the inception architecture for computer vision. In Computer vision and pattern recognition (CVPR).

  • Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning spatiotemporal features with 3D convolutional networks. In International conference on computer vision (ICCV).

  • Ulyanov, D., Vedaldi, A., & Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022.

  • Wang, X., Girshick, R., Gupta, A., & He, K. (2018). Non-local neural networks. In Computer vision and pattern recognition (CVPR).

  • Wu, Y., & He, K. (2018). Group normalization. In European conference on computer vision (ECCV).

  • Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. In Computer vision and pattern recognition (CVPR).

  • Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional neural networks. In European conference on computer vision (ECCV).

  • Zhang, X., Zhou, X., Lin, M., & Sun, J. (2018). ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Computer vision and pattern recognition (CVPR).

Author information

Corresponding author

Correspondence to Yuxin Wu.

Additional information

Communicated by M. Hebert.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wu, Y., He, K. Group Normalization. Int J Comput Vis 128, 742–755 (2020). https://doi.org/10.1007/s11263-019-01198-w

  • DOI: https://doi.org/10.1007/s11263-019-01198-w
