Pattern Recognition
Volume 121, January 2022, 108159

Delving deep into spatial pooling for squeeze-and-excitation networks

https://doi.org/10.1016/j.patcog.2021.108159

Highlights

  • We revisit the squeeze operation in SENets, and shed light on why and how to embed rich (both global and local) spatial information into the excitation module to improve accuracy.

  • We propose an integrated two-stage spatial pooling method with two efficient implementation approaches for rich descriptor extraction.

  • We conduct extensive experiments that demonstrate consistent improvements over SENets and their extensions on various fundamental computer vision tasks.

Abstract

Squeeze-and-Excitation (SE) blocks have demonstrated significant accuracy gains for state-of-the-art deep architectures by re-weighting channel-wise feature responses. The SE block is an architectural unit that integrates two operations: a squeeze operation that employs global average pooling to aggregate spatial convolutional features into a channel feature, and an excitation operation that learns instance-specific channel weights from the squeezed feature to re-weight each channel. In this paper, we revisit the squeeze operation in SE blocks and shed light on why and how to embed rich (both global and local) information into the excitation module at minimal extra cost. In particular, we introduce a simple but effective two-stage spatial pooling process: rich descriptor extraction and information fusion. The rich descriptor extraction step aims to obtain a set of diverse (i.e., global and especially local) deep descriptors that contain more informative cues than global average-pooling. Meanwhile, absorbing the information delivered by these descriptors via a fusion step aids the excitation operation in returning more accurate re-weighting scores in a data-driven manner. We validate the effectiveness of our method by extensive experiments on ImageNet for image classification and on MS-COCO for object detection and instance segmentation. In these experiments, our method achieves consistent improvements over SENets on all tasks, in some cases by a large margin.

Introduction

Convolutional neural networks (CNNs) are at the core of state-of-the-art solutions for central vision tasks, such as image classification [1], [2], [3], [4], [5], object detection [6], [7], and semantic segmentation [8], as well as some real-life applications [9], [10], [11], [12]. Since their impressive, record-breaking performance in the 2012 ImageNet competition [13], CNNs have been widely studied by both the academic and industrial communities from different aspects, yielding advances in architecture design [2], [3], [4], [14], optimization [15], [16], regularization [17], initialization [18], normalization [19], and acceleration [20], [21], [22]. These research achievements have significantly pushed the performance of CNN algorithms [23], e.g., surpassing human performance in image classification [18].

Apart from the above research lines, a recent research trend is to explicitly model the spatial or channel correlations of feature responses to enhance the representational power of deep CNNs [24], [25], [26], [27]. Among them, “Squeeze-and-Excitation” (SE) networks [24] have shown remarkable improvements over various deep architectures by introducing so-called SE blocks. The SE block is a computational unit that learns to selectively emphasize channel-wise informative features and suppress less useful ones. Specifically, in each SE block, a squeeze operation (i.e., global average-pooling) is first performed to aggregate the global spatial information of the input features into a channel feature, and then an excitation module (i.e., a multi-layer perceptron) induces instance-specific channel activations from the squeezed descriptor to re-weight each channel.

Despite compelling results, a limitation of the SE block arguably lies in the squeeze operation, which performs global information embedding. However, local information, which is obscured by global average-pooling, may be crucial for identifying the importance of different channels. As shown in Fig. 1, without local information as necessary cues, the excitation module may generate high weights for noisy channels with improper activations on backgrounds. Directly exploiting additional local information in the excitation module might alleviate this problem, but doing so would change the structure of the excitation module and introduce a considerable computational burden to the whole network.

To address the aforementioned issues, in this paper we propose a simple but effective two-stage spatial pooling process: rich descriptor extraction and information fusion. The rich descriptor extraction step aims to obtain a set of diverse deep descriptors that collaboratively express both the global and local information of the inputs. The information fusion step then absorbs the rich information delivered by these descriptors and returns a powerful channel feature.

Specifically, we utilize two different strategies for rich descriptor extraction: 1) spatial pyramid pooling (SPP) [28], which enjoys a multi-scale representation of the inputs and generates a fixed number of descriptors across all stages, and 2) a novel resolution-guided pooling, which generates a stage-aware number of descriptors. The resolution-guided pooling is implemented by using GAP for the last stage (conv5), and using this GAP window (i.e., 7×7 for ResNet) as a fixed window to perform non-overlapping average-pooling for all earlier stages. The second step, i.e., information fusion, is simply implemented by a depth-wise fully-connected layer, followed by batch normalization (BN) [19] and ReLU [29]. Figure 2 gives the pipeline of our two-stage spatial pooling method.
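To make this concrete, below is a minimal PyTorch sketch of the resolution-guided variant as we describe it above; the module name `RichPoolSqueeze` and its constructor arguments are illustrative choices of this sketch, not code from the paper.

```python
import torch
import torch.nn as nn

class RichPoolSqueeze(nn.Module):
    """Two-stage spatial pooling (sketch): resolution-guided pooling,
    then depth-wise fully-connected fusion with BN and ReLU. This
    replaces the plain GAP squeeze of an SE block; shapes assume the
    stage's spatial size is a multiple of the 7x7 GAP window."""

    def __init__(self, channels, feat_size, window=7):
        super().__init__()
        # Stage 1: non-overlapping average-pooling with the fixed GAP window
        self.pool = nn.AvgPool2d(kernel_size=window, stride=window)
        n = (feat_size // window) ** 2  # stage-aware number of descriptors
        # Stage 2: depth-wise fully-connected fusion, i.e. one linear map
        # per channel over its n descriptors (a grouped Conv1d).
        self.fuse = nn.Conv1d(channels, channels, kernel_size=n,
                              groups=channels, bias=False)
        self.bn = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, u):                     # u: (B, C, H, W)
        d = self.pool(u)                      # rich descriptors: (B, C, H/7, W/7)
        d = d.flatten(2)                      # (B, C, n)
        z = self.relu(self.bn(self.fuse(d)))  # fused channel feature: (B, C, 1)
        return z.squeeze(-1)                  # (B, C), fed to the excitation MLP

# e.g., for a ResNet conv2 stage with 256 channels on 56x56 features:
# squeeze = RichPoolSqueeze(channels=256, feat_size=56)
```

Note that at the last stage (7×7 features) n = 1, so the sketch reduces to GAP followed by a learned per-channel scale, consistent with using plain GAP there.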

To summarize, our main contributions lie in three aspects:

  • We revisit the squeeze operation in SENets, and shed light on why and how to embed rich (both global and local) spatial information into the excitation module to improve performance.

  • We propose an integrated two-stage spatial pooling method with two efficient implementation approaches for rich descriptor extraction. Our method leverages more informative cues that aid the excitation module in returning more accurate channel weights, at minimal additional computational cost.

  • We conduct extensive experiments that demonstrate consistent improvements over SENets [24] and their extension [26] on various fundamental computer vision tasks. For example, our method decreases the top-1 error by 0.94% and 1.11% for SE-ResNet-50 and SE-ResNet-101 on ImageNet [13] classification, and increases mmAP by 1.1% for Faster R-CNN with SE-ResNet-50 as the backbone on MS-COCO [30] object detection.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 details the proposed method and Section 4 describes the implementation details. Experiments and analyses are provided in Section 5, followed by the conclusion in Section 6.


Related work

In this section, we briefly review two closely related research topics, i.e., attention mechanism and spatial feature pooling.

Approach

We start with a brief recap of the “Squeeze-and-Excitation” building block. Given input X to a CNN block, a set of learned convolutional filters are applied on X to produce corresponding feature responses U ∈ ℝ^{H×W×C}, where H×W is the spatial dimension, and C is the channel dimension. Then, a squeeze operation and an excitation operation are applied on U sequentially to re-weight channel-wise feature responses.

Specifically, the squeeze operation employs global average-pooling (GAP) to aggregate the spatial information of U into a channel descriptor.
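For reference, the vanilla SE block of [24] can be rendered compactly in PyTorch as follows; this is a sketch of the published design (GAP squeeze, two-layer excitation MLP with reduction ratio r, sigmoid gating), not the authors' code.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Vanilla SE block: GAP squeeze + excitation MLP + channel re-weighting."""

    def __init__(self, channels, r=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)      # global average-pooling
        self.excite = nn.Sequential(                # instance-specific weights
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, u):                           # u: (B, C, H, W)
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)              # squeezed channel feature
        s = self.excite(z).view(b, c, 1, 1)         # per-channel re-weight scores
        return u * s                                # re-weighted responses
```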

Implementation details

We follow the practice of [1], [4] for ImageNet classification. Input images are 224×224 patches with the per-pixel mean subtracted, randomly cropped from resized images. Standard data augmentation and random horizontal flipping are performed to prevent the model from overfitting. Optimization is performed by SGD with a momentum of 0.9 and a mini-batch size of 256 on 8 GPUs, and the weight decay is 0.0001. We start from a learning rate of 0.1 and divide it by 10 every 30 epochs. We adopt the weight initialization strategy of [18].
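A minimal sketch of this recipe in PyTorch follows; the stand-in ResNet-50 backbone, the elided loop body, and the 90-epoch total are our assumptions, not stated in the snippet above.

```python
import torch
from torchvision.models import resnet50

model = resnet50()  # stand-in backbone; an SE or two-stage-pooling variant would slot in here
# SGD with momentum 0.9 and weight decay 1e-4, starting lr 0.1,
# divided by 10 every 30 epochs, as described above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):  # assumed total, as the snippet is truncated
    # ... one epoch of ImageNet training with mini-batch size 256 ...
    scheduler.step()
```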

Experiments

To validate the effectiveness of our spatial pooling method, we conduct comprehensive experiments on ImageNet-1K [13] for classification, and on MS-COCO [30] for object detection and instance segmentation. We also give some visualization analyses to better interpret our method. All baselines are re-implemented by ourselves with PyTorch for fair comparisons.

Conclusion

In this paper, we revisited the squeeze operation in SE blocks and proposed a simple but effective two-stage spatial pooling process consisting of rich descriptor extraction and information fusion. In our method, the rich descriptor extraction step produces a set of deep descriptors containing more informative (and local) cues than global average-pooling. Meanwhile, absorbing these cues via information fusion aids the excitation operation in returning more accurate re-weighting scores in a data-driven manner.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the Fundamental Research Funds for the Central Universities (No. 30920041111) and CAAI-Huawei MindSpore Open Fund (CAAIXSJLJJ-2020-022A).


References (54)

  • S. Ren et al., Faster R-CNN: towards real-time object detection with region proposal networks, Proc. Advances in Neural Inf. Process. Syst., 2015.
  • J. Long et al., Fully convolutional networks for semantic segmentation, Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
  • X. Jin et al., Pornographic image recognition via weighted multiple instance learning, IEEE Trans. Cybern., 2018.
  • O. Russakovsky et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vision, 2015.
  • B. Du et al., Stacked convolutional denoising auto-encoders for feature representation, IEEE Trans. Cybern., 2016.
  • D.P. Kingma et al., Adam: a method for stochastic optimization, Proc. Int. Conf. Learn. Representations, 2014.
  • Y. Sun et al., A particle swarm optimization-based flexible convolutional autoencoder for image classification, IEEE Trans. Neural Netw. Learn. Syst., 2018.
  • N. Srivastava et al., Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res., 2014.
  • K. He et al., Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, Proc. IEEE Int. Conf. Comp. Vis., 2015.
  • S. Ioffe et al., Batch normalization: accelerating deep network training by reducing internal covariate shift, Proc. Int. Conf. Mach. Learn., 2015.
  • J. Cheng et al., Quantized CNN: a unified approach to accelerate and compress convolutional networks, IEEE Trans. Neural Netw. Learn. Syst., 2017.
  • R.J. Cintra et al., Low-complexity approximate convolutional neural networks, IEEE Trans. Neural Netw. Learn. Syst., 2018.
  • A. Aimar et al., NullHop: a flexible convolutional neural network accelerator based on sparse representations of feature maps, IEEE Trans. Neural Netw. Learn. Syst., 2018.
  • J. Hu et al., Squeeze-and-Excitation networks, Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
  • F. Wang et al., Residual attention network for image classification, Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2017.
  • S. Woo et al., CBAM: convolutional block attention module, Proc. Eur. Conf. Comp. Vis., 2018.
  • J. Hu et al., Gather-Excite: exploiting feature context in convolutional neural networks, Proc. Advances in Neural Inf. Process. Syst., 2018.

    Xin Jin received the BS, MS, and PhD degrees from the Department of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2009, 2012, and 2017, respectively. He is currently a researcher with Megvii Research Nanjing. His research interests include computer vision and deep learning, especially focusing on face landmark detection and general object detection.

    Yanping Xie received the BS degree from the Department of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2017, and is currently working toward the MS degree. His research interests include computer vision and deep learning.

    Xiu-Shen Wei received his Ph.D. degree in computer science and technology from Nanjing University. He is a Professor at Nanjing University of Science and Technology (NJUST). Before joining NJUST, he served as the Founding Director of Megvii Research Nanjing, Megvii Technology. He has published more than thirty academic papers in top-tier international journals and conferences, such as IEEE TPAMI, IEEE TIP, IEEE TNNLS, IEEE TKDE, Machine Learning, CVPR, ICCV, ECCV, IJCAI, ICDM, ACCV, etc. He won four world championships in international authoritative computer vision competitions, including iWildCam (in association with CVPR 2020), iNaturalist (in association with CVPR 2019), Apparent Personality Analysis (in association with ECCV 2016), etc. He also received the Presidential Special Scholarship (the highest honor for Ph.D. students) at Nanjing University, and received the Outstanding Reviewer Award at CVPR 2017. His research interests are computer vision and machine learning. He has served as a PC member of CVPR, ICCV, ECCV, NeurIPS, IJCAI, AAAI, etc. He is a member of the IEEE.

    Bo-Rui Zhao received his BS and MS degrees in 2016 and 2019 from the Department of Electronic Science and Engineering of Nanjing University, China. He is currently a researcher with Megvii Research Nanjing. His research interests include computer vision, deep learning, and general object detection.

    Zhao-Min Chen received the B.S. degree from Hunan University and is now a Ph.D. candidate in computer science and technology at Nanjing University. He has published several academic papers at international conferences, such as ICME and CVPR. His research interests are deep learning, computer vision, and multi-label image recognition.

    Xiaoyang Tan received the BS and MS degrees in computer applications from the Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China, in 1993 and 1996, respectively, and the PhD degree from the Department of Computer Science and Technology, Nanjing University, Nanjing, in 2005. In 1996, he was an Assistant Lecturer with NUAA. From 2006 to 2007, he was a Post-Doctoral Researcher with the Learning and Recognition in Vision team, INRIA Rhone-Alpes, Grenoble, France. His current research interests include face recognition, machine learning, pattern recognition, and computer vision.

    The first two authors contributed equally.
