Semantic segmentation using stride spatial pyramid pooling and dual attention decoder
Introduction
As a basic task in computer vision, semantic segmentation has great application potential in autonomous vehicles [1], human parsing [2], and medical diagnosis [3]. Different from image classification, semantic segmentation assigns a semantic label to each pixel rather than to the whole image, which increases its complexity. Traditional methods [4], [5], [6] select features manually and typically exhibit low performance on this task. Owing to the development of deep convolutional neural networks (DCNNs) [7] and the success of fully convolutional networks (FCNs) [8], an increasing number of DCNN-based methods have been applied to semantic segmentation and have exhibited striking performance improvements [9], [10]. Nevertheless, applying DCNNs to semantic segmentation still encounters challenges caused by the low utilization of feature maps. In particular, existing methods often fail to fully exploit the semantic information captured by high-level feature maps and the spatial information preserved by low-level feature maps. To improve the semantic information captured by the high-level feature map, multiscale semantic information extraction is the most important problem to consider. In general, a class of objects may appear at different scales in different images, and a good network should be able to capture this property, which contributes to more accurate semantic information. Recently, a number of state-of-the-art methods have been developed to address this problem, among which the DeepLab series [11], [12], [13] based on atrous convolution is perhaps the most popular.
Considering a two-dimensional signal, for each location i on the output y_a and a filter w_a, the atrous convolution is applied over the input feature map x_a as y_a[i] = Σ_k x_a[i + r·k] w_a[k], where the atrous rate r corresponds to the stride with which we sample the input signal. This is equivalent to convolving the input x_a with an up-sampled filter produced by inserting r−1 zeros between two consecutive filter values along each spatial dimension. The atrous convolution allows us to adaptively modify the field-of-view of a filter by changing the rate value. Hence, it can be used to capture multiscale semantic information via different rates. The atrous spatial pyramid pooling (ASPP) structure [12], as shown in Fig. 1(a), is motivated by this property and has achieved satisfactory performance. Although ASPP is helpful for multiscale semantic information extraction, it has a low utilization of the input feature map, where nine tenths of the information is ignored, as proven in [14]. In addition, atrous convolution can produce gridding artifacts and decrease network performance [15], [16].
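The definition above, and its equivalence to convolving with a zero-upsampled filter, can be checked with a minimal one-dimensional sketch (numpy, 'valid' padding; the helper names are ours, not the paper's):

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """Atrous (dilated) 1-D convolution: y[i] = sum_k x[i + rate*k] * w[k]."""
    k = len(w)
    span = rate * (k - 1) + 1  # receptive field of the dilated filter
    return np.array([sum(x[i + rate * j] * w[j] for j in range(k))
                     for i in range(len(x) - span + 1)])

def upsample_filter(w, rate):
    """Insert rate-1 zeros between consecutive filter taps."""
    wu = np.zeros(rate * (len(w) - 1) + 1)
    wu[::rate] = w
    return wu

x = np.arange(10.0)
w = np.array([1.0, 2.0, 1.0])

y_atrous = atrous_conv1d(x, w, rate=2)                      # dilated sampling view
y_dense = atrous_conv1d(x, upsample_filter(w, 2), rate=1)   # zero-inserted filter view
print(np.allclose(y_atrous, y_dense))                       # the two views agree
```

Note how the receptive field grows to rate·(k−1)+1 without adding parameters, which is exactly what lets ASPP probe multiple scales by varying the rate.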
Different from the high-level feature map, the low-level feature map, whose spatial information is accurate, can be utilized to address the resolution-decreasing problem caused by successive pooling and strided convolution in DCNNs. Commonly, this problem can be resolved in two ways. The first is the atrous convolution-based method proposed in [11], in which the stride is discarded in the last few blocks of the DCNN through the introduction of atrous convolution, as shown in Fig. 1(b). This method can efficiently maintain the resolution and the receptive field of the network simultaneously. Many state-of-the-art semantic segmentation methods [14], [17], [18] have adopted this approach. However, this method does not involve the low-level feature map and consumes too much inference time, which leads to an inefficient network. The second method is the U-shape structure introduced in FCN [8], as shown in Fig. 1(c). It uses skip connections to create a U-shape network, fuses low- and high-level feature maps, and increases the output resolution gradually. However, the low-level feature map has limited semantic information, and the high-level one has a low resolution; the gap between them hampers their fusion, leading to only a slight performance improvement, even when both feature maps are combined and many convolution blocks are used to refine the fusion result [19]. How to take full advantage of these feature maps and how to bridge the gap between them remain open and crucial problems for improving the performance of the U-shape structure. Note that although high-level feature maps have a low resolution, they always contain abundant semantic information. Therefore, they can provide guidance for producing low-level feature maps with more semantic information, and the performance of the network will significantly increase, as proven in PAnet [20].
Moreover, low-level feature maps have more spatial information than high-level ones and can thus provide spatial guidance for high-level feature maps. By exploiting these characteristics, the gap between the two kinds of feature maps can be reduced, and both can then be fully utilized.
On the basis of the above observations, we propose two novel structures to make full use of the high- and low-level feature maps. We first design a structure called stride spatial pyramid pooling (SSPP) to overcome the shortcomings of the ASPP structure and fully utilize the high-level feature map. Different from ASPP, the SSPP structure captures multiscale semantic information via pooling operations. It is able to take full advantage of the input feature maps, and it eliminates the gridding artifacts caused by atrous convolution. Moreover, SSPP consumes less inference time than ASPP because the input resolution of its convolution layers is smaller than that of ASPP. To address the resolution-decreasing problem via the low-level feature map, inspired by PAnet [20] and the convolutional block attention module (CBAM) [21], we develop a new decoder, called the dual attention decoder, with two branches. Through the first branch, i.e., the channel attention branch, we obtain a low-level feature map with abundant semantic information. The second branch, i.e., the spatial attention branch, produces a high-level feature map with more accurate spatial information. This decoder bridges the gap between low- and high-level feature maps and makes full use of them.
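To make the pooling-based idea behind SSPP concrete, the following numpy sketch pools a high-level map at several kernel/stride settings, upsamples each branch back, and concatenates them along channels. The kernel/stride pairs, the nearest-neighbour upsampling, and the function names are our illustrative assumptions, not the paper's exact configuration (which also applies convolutions to each pooled branch):

```python
import numpy as np

def avg_pool2d(x, k, s):
    """Average pooling with kernel k and stride s on an (H, W, C) map."""
    H, W, C = x.shape
    h, w = (H - k) // s + 1, (W - k) // s + 1
    out = np.empty((h, w, C))
    for i in range(h):
        for j in range(w):
            out[i, j] = x[i*s:i*s+k, j*s:j*s+k].mean(axis=(0, 1))
    return out

def upsample_nearest(x, H, W):
    """Nearest-neighbour upsampling back to (H, W)."""
    ih = np.arange(H) * x.shape[0] // H
    iw = np.arange(W) * x.shape[1] // W
    return x[ih][:, iw]

def stride_pyramid(x, configs=((2, 2), (4, 4), (8, 8))):
    """Pool the input at several strides to capture several scales, then
    upsample every branch and concatenate with the input along channels."""
    H, W, _ = x.shape
    branches = [x] + [upsample_nearest(avg_pool2d(x, k, s), H, W)
                      for k, s in configs]
    return np.concatenate(branches, axis=-1)

feat = np.random.rand(16, 16, 8)   # a toy high-level feature map
out = stride_pyramid(feat)
print(out.shape)                   # 4 branches of 8 channels each
```

Because the pooled branches are much smaller than the input, any convolutions applied to them are cheap, which is consistent with the inference-speed advantage claimed over ASPP.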
Our contributions in this work can be summarized as follows:
- We propose a new structure called SSPP to acquire multiscale semantic information from the high-level feature map. Compared with existing methods, SSPP significantly improves the utilization of its input while achieving a higher inference speed.
- We propose a novel decoder structure to bridge the gap between low- and high-level feature maps in semantic segmentation networks. It takes full advantage of these two kinds of feature maps by embedding abundant semantic information into the low-level feature map and accurate spatial information into the high-level feature map, which greatly benefits their fusion.
- We test the proposed network on the publicly available PASCAL VOC 2012, Cityscapes, and COCO-Stuff semantic image segmentation benchmarks, and achieve better performance than several state-of-the-art approaches.
Related work
With the development of DCNNs, an increasing number of DCNN-based semantic segmentation networks have been proposed and have exhibited good performance on different benchmarks. Among them, many methods attempt to improve the utilization of the high- and low-level feature maps. Here we briefly review the background material that serves as a reference for the current study.
Method
This section first introduces the baseline network that we use to encode semantic information. Then, we propose the SSPP structure, which improves the utilization of the high-level feature map. Subsequently, a dual attention decoder is developed to take full advantage of the low- and high-level feature maps. Finally, we present the complete network architecture that combines the SSPP structure and the dual attention decoder.
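A two-branch decoder of the kind described above can be sketched as follows: channel statistics of the high-level map reweight the low-level map (injecting semantic guidance), while a spatial mask pooled from the low-level map reweights the high-level map (injecting spatial guidance). This is a minimal numpy illustration of the general idea, assuming both maps have been brought to the same spatial size and channel count; the actual decoder uses learned convolutions rather than these parameter-free attention maps:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(low, high):
    """Channel attention branch (illustrative): reweight the low-level map
    with per-channel statistics of the high-level map."""
    w = sigmoid(high.mean(axis=(0, 1)))            # (C,) global average pooling
    return low * w                                  # broadcast over H and W

def spatial_attention(low, high):
    """Spatial attention branch (illustrative): reweight the high-level map
    with a spatial mask pooled from the low-level map."""
    m = sigmoid(low.mean(axis=-1, keepdims=True))   # (H, W, 1) channel pooling
    return high * m

H, W, C = 8, 8, 4
low = np.random.rand(H, W, C)    # low-level features
high = np.random.rand(H, W, C)   # high-level features, already upsampled
fused = channel_attention(low, high) + spatial_attention(low, high)
print(fused.shape)
```

The key design point is the cross-guidance: each branch modulates one feature map using a summary of the other, which is what narrows the semantic/spatial gap before fusion.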
Experimental results
To verify the effectiveness of the proposed structure, in the following we first evaluate it on the PASCAL VOC 2012 semantic segmentation dataset [33], a well-known benchmark that includes 20 object categories and 1 background class. This dataset is split into training, validation, and test sets, with 1,464, 1,449, and 1,456 images, respectively. The dataset is augmented by the extra annotations provided by [34], resulting in 10,582 training images. We conduct a complete ablation study on our
Conclusion
In this paper, we propose two novel structures for taking full advantage of the high- and low-level feature maps in a semantic segmentation network. The first one, i.e., SSPP, is able to capture multiscale semantic information from the high-level feature map with a satisfactory inference speed. Compared with similar previous methods based on atrous convolution, our proposed structure utilizes more information from its input and attains better performance. The second one, i.e. dual attention decoder,
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This work was supported by the National Natural Science Foundation of China under Grant No. 61773295 and the Natural Science Foundation of Hubei Province under Grant No. 2019CFA037.
Chengli Peng received the B.E. degree from the School of Electrical Engineering, Xinjiang University, Urumchi, China, in 2016, and the M.S. degree from the School of Engineering, Huazhong Agricultural University, Wuhan, China, in 2018. He is currently a Ph.D. student with the Electronic Information School, Wuhan University. His research interests include computer vision and deep learning.
References (42)
- et al., Dual-force convolutional neural networks for accurate brain tumor segmentation, Pattern Recognit. (2019)
- et al., Color image segmentation based on multi-level Tsallis-Havrda-Charvát entropy and 2D histogram using PSO algorithms, Pattern Recognit. (2019)
- et al., Active contours driven by global and local weighted signed pressure force for image segmentation, Pattern Recognit. (2019)
- et al., A multi-scale level set method based on local features for segmentation of images with intensity inhomogeneity, Pattern Recognit. (2019)
- et al., Deep gated attention networks for large-scale street-level scene segmentation, Pattern Recognit. (2019)
- et al., Semantic segmentation via highly fused convolutional network with multiple soft cost functions, Cogn. Syst. Res. (2019)
- et al., Importance-aware semantic segmentation for autonomous vehicles, IEEE Trans. Intell. Transp. Syst. (2019)
- et al., Deep human parsing with active template regression, IEEE Trans. Pattern Anal. Mach. Intell. (2015)
- et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)
- et al., Fully convolutional networks for semantic segmentation, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
- Large-scale sparse learning from noisy tags for semantic segmentation, IEEE Trans. Cybern.
- DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell.
- Rethinking atrous convolution for semantic image segmentation, arXiv preprint arXiv:1706.05587
- Encoder-decoder with atrous separable convolution for semantic image segmentation, Proceedings of the European Conference on Computer Vision
- Vortex pooling: improving context representation in semantic segmentation, arXiv preprint arXiv:1804.06242
- Understanding convolution for semantic segmentation, Proceedings of the IEEE Winter Conference on Applications of Computer Vision
- Dilated residual networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Pyramid scene parsing network, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- DenseASPP for semantic segmentation in street scenes, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- ExFuse: enhancing feature fusion for semantic segmentation, Proceedings of the European Conference on Computer Vision
- Pyramid attention network for semantic segmentation, Proceedings of the British Machine Vision Conference
Jiayi Ma received the B.S. degree in Information and Computing Science and the Ph.D. degree in Control Science and Engineering, both from the Huazhong University of Science and Technology, Wuhan, China, in 2008 and 2014, respectively. From 2012 to 2013, he was an Exchange Student with the Department of Statistics, University of California at Los Angeles, Los Angeles, CA, USA. He is currently a Professor with the Electronic Information School, Wuhan University, Wuhan, China. He has authored or co-authored over 130 refereed journal and conference papers, including IEEE TPAMI/TIP, IJCV, CVPR, ICCV, etc. He has been identified in the 2019 Highly Cited Researchers list from the Web of Science Group. He has won the Natural Science Award of Hubei Province (first class) as the first author. He has received the CAAI (Chinese Association for Artificial Intelligence) Excellent Doctoral Dissertation Award (a total of 8 winners in China), and the CAA (Chinese Association of Automation) Excellent Doctoral Dissertation Award (a total of 10 winners in China). He is an Editorial Board Member of Information Fusion and Neurocomputing, and a Guest Editor of Remote Sensing. His current research interests include the areas of computer vision, machine learning, and pattern recognition.