Delving deep into spatial pooling for squeeze-and-excitation networks☆
Introduction
Convolutional neural networks (CNNs) are at the core of state-of-the-art solutions for central vision tasks, such as image classification [1], [2], [3], [4], [5], object detection [6], [7], and semantic segmentation [8], as well as some real-life applications [9], [10], [11], [12]. Since their impressive, record-breaking performance in the 2012 ImageNet competition [13], CNNs have been widely studied by both the academic and industry communities from different aspects, yielding good returns in architecture design [2], [3], [4], [14], optimization [15], [16], regularization [17], initialization [18], normalization [19], and acceleration [20], [21], [22]. These research achievements have significantly pushed the performance of CNN algorithms [23], e.g., surpassing human performance in image classification [18].
Apart from the above research lines, a recently emerged trend is to explicitly model the spatial or channel correlations of feature responses to enhance the representational power of deep CNNs [24], [25], [26], [27]. Among them, the “Squeeze-and-Excitation” (SE) networks [24] have shown remarkable improvements over various deep architectures by introducing so-called SE blocks. The SE block is a computational unit that learns to selectively emphasize channel-wise informative features and suppress less useful ones. Specifically, in each SE block, a squeeze operation (i.e., global average-pooling) is first performed to aggregate the global spatial information of the input features into a channel feature, and then an excitation module (i.e., a multi-layer perceptron) induces instance-specific channel activations from the squeezed descriptor to re-weight each channel.
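As a concrete illustration, the squeeze-then-excite computation can be sketched in a few lines of NumPy. This is a simplified sketch for a single example, not the authors' implementation: the weight matrices `W1`, `b1`, `W2`, `b2` are hypothetical stand-ins for the learned MLP parameters, and the channel-reduction ratio is implied by their shapes.

```python
import numpy as np

def se_block(U, W1, b1, W2, b2):
    """Sketch of an SE block on a feature map U of shape (H, W, C).

    W1/b1: reduction FC (C -> C/r); W2/b2: expansion FC (C/r -> C).
    """
    # Squeeze: global average-pooling over spatial dims -> (C,)
    z = U.mean(axis=(0, 1))
    # Excitation: two-layer MLP, ReLU then sigmoid -> per-channel weights
    h = np.maximum(z @ W1 + b1, 0.0)            # reduction FC + ReLU
    s = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))    # expansion FC + sigmoid
    # Re-weight: scale each channel of U by its learned weight
    return U * s                                 # (C,) broadcasts over (H, W, C)
```

Because the squeeze step collapses all spatial positions into a single mean per channel, any local cue is lost before the excitation MLP ever sees the features, which is exactly the limitation the paper targets.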
Despite compelling results, a limitation of the SE block arguably lies in the squeeze operation, which performs global information embedding. Local information obscured by global average-pooling, however, may be crucial for identifying the importance of different channels. As shown in Fig. 1, without local information as a necessary cue, the excitation module may generate high weights for noisy channels with improper activations on backgrounds. Directly exploiting additional local information in the excitation module might alleviate this problem, but it would change the structure of the excitation module and introduce a considerable computational burden to the whole network.
To address the aforementioned issues, in this paper we propose a simple but effective two-stage spatial pooling process: rich descriptor extraction and information fusion. The rich descriptor extraction step aims to obtain a set of diverse deep descriptors that collaboratively express both the global and local information of the inputs. The information fusion step then absorbs the rich information delivered by these descriptors and returns a powerful channel feature.
Specifically, we utilize two different strategies for rich descriptor extraction: 1) spatial pyramid pooling (SPP) [28], which enjoys a multi-scale representation of the inputs and generates a fixed number of descriptors across all stages, and 2) a novel resolution-guided pooling that generates a stage-aware number of descriptors. The resolution-guided pooling is implemented by using GAP for the last (conv) stage, and using this GAP window (i.e., for ResNet) as a fixed window to perform non-overlapping average-pooling for all earlier stages. The second step, i.e., information fusion, is simply implemented by a depth-wise fully connected layer, followed by batch normalization (BN) [19] and ReLU [29]. Fig. 2 illustrates the pipeline of our two-stage spatial pooling method.
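The resolution-guided pooling step can be made concrete with a minimal NumPy sketch. The function name and the convention of passing the last stage's spatial size in as `last_hw` are illustrative assumptions, not from the paper; the key idea shown is that the same fixed window reduces the last stage to one descriptor (plain GAP) while earlier, higher-resolution stages yield several descriptors each.

```python
import numpy as np

def resolution_guided_pool(feat, last_hw):
    """Non-overlapping average-pooling of feat (H, W, C) with a fixed
    window of size last_hw x last_hw (the last stage's GAP window).

    Returns an (n_descriptors, C) array: one row per pooled window.
    """
    H, W, C = feat.shape
    k = last_hw
    nh, nw = H // k, W // k
    # Split into (nh, k, nw, k, C) blocks and average each k x k window
    blocks = feat[:nh * k, :nw * k].reshape(nh, k, nw, k, C)
    return blocks.mean(axis=(1, 3)).reshape(nh * nw, C)
```

For a last-stage feature map of size `last_hw` x `last_hw` this reduces to GAP (one descriptor); a stage with twice the spatial resolution yields four descriptors, so the number of descriptors is stage-aware, as described above.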
To summarize, our main contributions lie in three aspects:
- We revisit the squeeze operation in SENets and shed light on why and how to embed rich (both global and local) spatial information into the excitation module to improve performance.
- We propose an integrated two-stage spatial pooling method with two efficient implementations for rich descriptor extraction. Our method leverages more informative cues that aid the excitation module in returning more accurate channel weights, at minimal additional computational cost.
- We conduct extensive experiments that demonstrate convincing improvements over SENets [24] and their extension [26] on various fundamental computer vision tasks. For example, our method decreases the top-1 error by 0.94% and 1.11% for SE-ResNet-50 and SE-ResNet-101 on ImageNet [13] classification, and increases mmAP by 1.1% for Faster R-CNN with SE-ResNet-50 as the backbone on MS-COCO [30] object detection.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 details the proposed method and Section 4 describes the implementation details. Experiments and analysis are provided in Section 5, followed by the conclusion in Section 6.
Related work
In this section, we briefly review two closely related research topics, i.e., attention mechanism and spatial feature pooling.
Approach
We start with a brief recap of the “Squeeze-and-Excitation” building block. Given an input $X$ to a CNN block, a set of learned convolutional filters is applied to $X$ to produce the corresponding feature responses $U \in \mathbb{R}^{H \times W \times C}$, where $H \times W$ is the spatial dimension and $C$ is the channel dimension. Then, a squeeze operation and an excitation operation are applied to $U$ sequentially to re-weight the channel-wise feature responses.
Specifically, the squeeze operation employs global average-pooling (GAP) to aggregate the spatial information of each channel of $U$ into a single scalar descriptor.
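Written compactly, the standard SE formulation [24] on a feature map $U \in \mathbb{R}^{H \times W \times C}$ reads as follows, where $\delta$ denotes ReLU, $\sigma$ the sigmoid, and $r$ the channel-reduction ratio:

```latex
\begin{align}
  z_c &= \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j),
    && \text{(squeeze: GAP per channel)} \\
  \mathbf{s} &= \sigma\!\left( W_2\, \delta\!\left( W_1 \mathbf{z} \right) \right),
    \quad W_1 \in \mathbb{R}^{\frac{C}{r} \times C},\;
          W_2 \in \mathbb{R}^{C \times \frac{C}{r}},
    && \text{(excitation: bottleneck MLP)} \\
  \tilde{u}_c &= s_c \cdot u_c .
    && \text{(channel re-weighting)}
\end{align}
```

The proposed method replaces the first equation, the squeeze, with a richer two-stage spatial pooling while leaving the excitation structure untouched.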
Implementation details
We follow the practice of [1], [4] for ImageNet classification. Input images are patches randomly cropped from resized images, with the per-pixel mean subtracted. Standard data augmentation and random horizontal flipping are performed to prevent the model from overfitting. Optimization is performed by SGD with a momentum of 0.9 and a mini-batch size of 256 on 8 GPUs, and the weight decay is 0.0001. We start from a learning rate of 0.1 and divide it by 10 every 30 epochs. We adopt the weight initialization of [18].
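The step learning-rate schedule stated above can be sketched as a small helper; this is an illustrative function, not the authors' training code:

```python
def learning_rate(epoch, base_lr=0.1, step=30, factor=0.1):
    """Step schedule from the text: start at base_lr = 0.1 and
    divide by 10 (factor = 0.1) every 30 epochs."""
    return base_lr * factor ** (epoch // step)
```

In a PyTorch training loop, the same schedule is conventionally expressed with `torch.optim.SGD(momentum=0.9, weight_decay=1e-4)` together with a step scheduler that multiplies the rate by 0.1 every 30 epochs.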
Experiments
To validate the effectiveness of our spatial pooling method, we conduct comprehensive experiments on ImageNet-1K [13] for classification, and on MS-COCO [30] for object detection and instance segmentation. We also give some visualization analyses to better interpret our method. All baselines are re-implemented by ourselves with PyTorch for fair comparisons.
Conclusion
In this paper, we revisited the squeeze operation in SE blocks and proposed a simple but effective two-stage spatial pooling process consisting of rich descriptor extraction and information fusion. In our method, the rich descriptor extraction step produces a set of deep descriptors containing more informative (both global and local) cues than global average-pooling, while absorbing these cues through information fusion aids the excitation operation in returning more accurate re-weighting scores in a data-driven manner.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the Fundamental Research Funds for the Central Universities (No. 30920041111) and CAAI-Huawei MindSpore Open Fund (CAAIXSJLJJ-2020-022A).
Xin Jin received the BS, MS, and PhD degrees from the Department of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2009, 2012, and 2017, respectively. He is currently a researcher with Megvii Research Nanjing. His research interests include computer vision and deep learning, especially focusing on face landmark detection and general object detection.
References (54)
- et al., Towards better exploiting convolutional neural networks for remote sensing scene classification, Pattern Recognit. (2017)
- et al., Deep and joint learning of longitudinal data for Alzheimer’s disease prediction, Pattern Recognit. (2020)
- et al., Handling Gaussian blur without deconvolution, Pattern Recognit. (2020)
- et al., Visual and semantic prototypes-jointly guided CNN for generalized zero-shot and open-set recognition, Pattern Recognit. (2020)
- et al., Recent advances in convolutional neural networks, Pattern Recognit. (2018)
- et al., ImageNet classification with deep convolutional neural networks, Proc. Advances in Neural Inf. Process. Syst. (2012)
- et al., Very deep convolutional networks for large-scale image recognition, Proc. Int. Conf. Learn. Representations (2015)
- et al., Going deeper with convolutions, Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2015)
- et al., Deep residual learning for image recognition, Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2016)
- et al., Rich feature hierarchies for accurate object detection and semantic segmentation, Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2014)
- Faster R-CNN: towards real-time object detection with region proposal networks, Proc. Advances in Neural Inf. Process. Syst.
- Fully convolutional networks for semantic segmentation, Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- Pornographic image recognition via weighted multiple instance learning, IEEE Trans. Cybern.
- ImageNet large scale visual recognition challenge, Int. J. Comput. Vision
- Stacked convolutional denoising auto-encoders for feature representation, IEEE Trans. Cybern.
- Adam: a method for stochastic optimization, Proc. Int. Conf. Learn. Representations
- A particle swarm optimization-based flexible convolutional autoencoder for image classification, IEEE Trans. Neural Netw. Learn. Syst.
- Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res.
- Delving deep into rectifiers: surpassing human-level performance on ImageNet classification, Proc. IEEE Int. Conf. Comp. Vis.
- Batch normalization: accelerating deep network training by reducing internal covariate shift, Proc. Int. Conf. Mach. Learn.
- Quantized CNN: a unified approach to accelerate and compress convolutional networks, IEEE Trans. Neural Netw. Learn. Syst.
- Low-complexity approximate convolutional neural networks, IEEE Trans. Neural Netw. Learn. Syst.
- NullHop: a flexible convolutional neural network accelerator based on sparse representations of feature maps, IEEE Trans. Neural Netw. Learn. Syst.
- Squeeze-and-Excitation networks, Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- Residual attention network for image classification, Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- CBAM: convolutional block attention module, Proc. Eur. Conf. Comp. Vis.
- Gather-Excite: exploiting feature context in convolutional neural networks, Proc. Advances in Neural Inf. Process. Syst.
Yanping Xie received the BS degree from the Department of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 2017, and is currently working toward the MS degree. His research interests include computer vision and deep learning.
Xiu-Shen Wei received his Ph.D. degree in computer science and technology from Nanjing University. He is a Professor at Nanjing University of Science and Technology (NJUST). Before joining NJUST, he served as the Founding Director of Megvii Research Nanjing, Megvii Technology. He has published more than thirty academic papers on the top-tier international journals and conferences, such as IEEE TPAMI, IEEE TIP, IEEE TNNLS, IEEE TKDE, Machine Learning, CVPR, ICCV, ECCV, IJCAI, ICDM, ACCV, etc. He won four world championships in international authoritative computer vision competitions, including iWildCam (in association with CVPR 2020), iNaturalist (in association with CVPR 2019), Apparent Personality Analysis (in association with ECCV 2016), etc. He also received the Presidential Special Scholarship (the highest honor for Ph.D. students) in Nanjing University, and received the Outstanding Reviewer Award in CVPR 2017. His research interests are computer vision and machine learning. He has served as a PC member of CVPR, ICCV, ECCV, NeurIPS, IJCAI, AAAI, etc. He is a member of the IEEE.
Bo-Rui Zhao received his BS and MS degrees in 2016 and 2019, at the Department of Electronic Science and Engineering of Nanjing University, China. He is currently a researcher with Megvii Research Nanjing. His research interests include computer vision, deep learning, and general object detection.
Zhao-Min Chen received the B.S. degree from Hunan University and is now a Ph.D. candidate of computer science and technology from Nanjing University. He has published several academic papers on international conferences, such as ICME, CVPR, etc. His research interests are deep learning, computer vision and multi-label image recognition.
Xiaoyang Tan received the BS and MS degrees in computer applications from the Nanjing University of Aeronautics and Astronautics (NUAA), Nanjing, China, in 1993 and 1996, respectively, and the PhD degree from the Department of Computer Science and Technology, Nanjing University, Nanjing, in 2005. In 1996, he was an Assistant Lecturer with NUAA. From 2006 to 2007, he was a Post-Doctoral Researcher with the Learning and Recognition in Vision team, INRIA Rhone-Alpes, Grenoble, France. His current research interests include face recognition, machine learning, pattern recognition, and computer vision.
☆ The first two authors contributed equally.