Pattern Recognition, Volume 107, November 2020, 107498

Semantic segmentation using stride spatial pyramid pooling and dual attention decoder

https://doi.org/10.1016/j.patcog.2020.107498

Highlights

  • We propose an SSPP structure to capture multiscale semantic information.

  • An attention mechanism is applied to bridge the information gap in segmentation networks.

  • We propose a new decoder to make full use of the low- and high-level feature maps.

  • An auxiliary loss is applied to make the network easier to train.

  • Our method attains state-of-the-art performance on PASCAL VOC 2012, Cityscapes and COCO-Stuff.

Abstract

Semantic segmentation is an end-to-end task that requires both semantic and spatial accuracy. It is therefore important for deep learning-based segmentation methods to effectively utilize the high-level feature map, whose semantic information is abundant, and the low-level feature map, whose spatial information is accurate. However, existing segmentation networks typically cannot take full advantage of these two kinds of feature maps, leading to inferior performance. This paper attempts to overcome this challenge by introducing two novel structures. On the one hand, we propose a structure called stride spatial pyramid pooling (SSPP) to capture multiscale semantic information from the high-level feature map. Compared with existing pyramid pooling methods based on atrous convolution, the SSPP structure is able to gather more information from the high-level feature map at a faster inference speed, which significantly improves the utilization efficiency of the high-level feature map. On the other hand, we propose a dual attention decoder consisting of a channel attention branch and a spatial attention branch to make full use of the high- and low-level feature maps simultaneously. The dual attention decoder yields a more “semantic” low-level feature map and a high-level feature map with more accurate spatial information, which bridges the gap between these two kinds of feature maps and benefits their fusion. We evaluate the proposed model on several publicly available semantic image segmentation benchmarks, including PASCAL VOC 2012, Cityscapes and COCO-Stuff. The qualitative and quantitative results demonstrate that our method achieves state-of-the-art performance.

Introduction

As a basic task in computer vision, semantic segmentation has great application potential in autonomous vehicles [1], human parsing [2], and medical diagnosis [3]. Different from image classification, the purpose of semantic segmentation is to assign a semantic label to each pixel rather than to the whole image, which increases its complexity. Traditional methods [4], [5], [6] select features manually and typically exhibit low performance on this task. Owing to the development of deep convolutional neural networks (DCNNs) [7] and the success of fully convolutional networks (FCNs) [8], an increasing number of DCNN-based methods have been applied to semantic segmentation and have exhibited striking performance improvements [9], [10]. Nevertheless, the application of DCNNs to semantic segmentation still encounters challenges caused by the low utilization of the feature maps. In particular, existing methods usually cannot fully exploit the semantic information captured by the high-level feature map and the spatial information maintained by the low-level feature map. To improve the semantic information capturing ability of the high-level feature map, multiscale semantic information extraction is the most important problem to be considered. In general, a class of objects may appear at different scales in different images, and a good network should be able to capture this property, which contributes to more accurate semantic information. Recently, a number of state-of-the-art methods have been developed to address this problem, among which the DeepLab series [11], [12], [13] based on atrous convolution is perhaps the most popular. Considering a two-dimensional signal, for each location $i$ on the output $y_a$ and a filter $w_a$, the atrous convolution is applied over the input feature map $x_a$ as $y_a[i] = \sum_{k} x_a[i + r \cdot k]\, w_a[k]$, where the atrous rate $r$ corresponds to the stride with which we sample the input signal; this is equivalent to convolving the input $x_a$ with up-sampled filters produced by inserting $r-1$ zeros between two consecutive filter values along each spatial dimension. The atrous convolution allows us to adaptively modify the field-of-view of a filter by changing the rate value. Hence, it can be used to capture multiscale semantic information via different rates. The atrous spatial pyramid pooling (ASPP) structure [12], as shown in Fig. 1(a), is motivated by this property and has achieved satisfactory performance. Although ASPP is helpful for multiscale semantic information extraction, it has a low utilization of the input feature map, where nine-tenths of its information is ignored, as shown in [14]. In addition, atrous convolution can result in gridding artifacts and decrease network performance [15], [16].
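To make the atrous convolution concrete, the following PyTorch sketch shows how ASPP applies parallel atrous convolutions with different rates $r$ to the same high-level feature map. The rates (6, 12, 18) are the common DeepLab setting and are only an assumption here, not necessarily the configuration compared against in this paper.

```python
import torch
import torch.nn as nn


class ASPP(nn.Module):
    """Parallel atrous convolutions with different rates gather multiscale
    context from the same high-level feature map (cf. Fig. 1(a))."""

    def __init__(self, in_ch, out_ch, rates=(6, 12, 18)):
        super().__init__()
        # One 1x1 branch plus one 3x3 atrous branch per rate; padding = dilation
        # keeps the spatial resolution unchanged.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1, bias=False)]
            + [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False)
               for r in rates])
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1, bias=False)

    def forward(self, x):
        # Concatenate all branch outputs along the channel axis and fuse them.
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```

Because each branch still runs at the full input resolution, the cost grows with the number of rates, which is the efficiency issue that SSPP addresses later.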

Different from the high-level feature map, the low-level feature map, whose spatial information is accurate, can be utilized to address the resolution decrease caused by successive pooling and strided convolution in DCNNs. Commonly, the resolution decrease is addressed in one of two ways. The first is the atrous convolution-based method proposed in [11], where the stride is removed in the last few blocks of the DCNN through the introduction of atrous convolution, as shown in Fig. 1(b). This method can maintain the resolution and the receptive field of the network simultaneously. Many state-of-the-art semantic segmentation methods [14], [17], [18] have adopted this approach. However, this method does not involve the low-level feature map and consumes too much inference time, which leads to an inefficient network. The second is the U-shape structure introduced in FCN [8], as shown in Fig. 1(c). It uses a skip structure to create a U-shape network, fuses the low- and high-level feature maps, and increases the output resolution gradually. However, the low-level feature map has limited semantic information, and the high-level one has a low resolution; the gap between them hampers their fusion, leading to only a slight performance improvement even when both feature maps are combined and many convolution blocks are used to refine the fusion result [19]. How to take full advantage of these feature maps and how to bridge the gap between them remain open and crucial problems for improving the performance of the U-shape structure. Note that although high-level feature maps have a low resolution, they always contain abundant semantic information. Therefore, they can guide the production of low-level feature maps with more semantic information, which significantly improves network performance, as demonstrated in PAnet [20]. Moreover, low-level feature maps have more spatial information than high-level ones and can thus provide spatial guidance for high-level feature maps. Through these characteristics, the gap between the two kinds of feature maps can be reduced, and both can then be fully utilized.
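As a rough illustration of the two strategies, and assuming a standard torchvision ResNet backbone, the atrous variant can be obtained by replacing the last stride with dilation, while the U-shape variant fuses an upsampled high-level map with a low-level one. The SkipFusion module below is a hypothetical, minimal example, not the fusion block of any particular published method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

# Strategy 1 (Fig. 1(b)): keep the resolution of the last stage by replacing its
# stride with atrous convolution, so the output stride is 16 instead of 32.
atrous_backbone = resnet50(replace_stride_with_dilation=[False, False, True])


class SkipFusion(nn.Module):
    """Strategy 2 (Fig. 1(c)): a minimal FCN-style skip connection that upsamples
    the high-level map and refines its concatenation with a low-level map."""

    def __init__(self, low_ch, high_ch, out_ch):
        super().__init__()
        self.refine = nn.Conv2d(low_ch + high_ch, out_ch, 3, padding=1)

    def forward(self, low, high):
        high = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                             align_corners=False)
        return self.refine(torch.cat([low, high], dim=1))
```

The dilation-based backbone preserves resolution at the cost of running the deepest layers on larger feature maps, which is the inference-time drawback noted above; the skip structure is cheaper but suffers from the semantic gap between the fused feature maps.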

On the basis of the above observations, we propose two novel structures to make full use of the high- and low-level feature maps. We first design a structure called stride spatial pyramid pooling (SSPP) to resolve the shortcoming of the ASPP structure and fully utilize the high-level feature map. Different from the ASPP structure, the SSPP structure captures multiscale semantic information via pooling operations. It is able to take full advantage of the input feature map, and it avoids the gridding artifacts caused by atrous convolution. Moreover, SSPP consumes less inference time than ASPP because the input resolution of its convolution layers is smaller than that of ASPP. To address the resolution decrease via the low-level feature map, inspired by PAnet [20] and the convolutional block attention module (CBAM) [21], we develop a new decoder, called the dual attention decoder, with two branches. Through the first branch, i.e., the channel attention branch, we obtain a low-level feature map with abundant “semantic” information. The second branch, i.e., the spatial attention branch, produces a high-level feature map with more accurate spatial information. This decoder bridges the gap between the low- and high-level feature maps and makes full use of them.
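The exact configurations of SSPP and the dual attention decoder are specified in the Method section; the sketch below only illustrates the general idea under our own assumptions. The class names, the average pooling with strides (2, 4, 8) standing in for SSPP, and the attention maps obtained from global pooling and a single convolution standing in for the two decoder branches are all hypothetical choices for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SSPPSketch(nn.Module):
    """Hypothetical SSPP layout: pool the high-level map with different strides,
    convolve at the reduced resolutions (cheaper than atrous branches at full
    resolution), then upsample and fuse."""

    def __init__(self, in_ch, out_ch, strides=(2, 4, 8)):
        super().__init__()
        self.strides = strides
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False) for _ in strides])
        self.project = nn.Conv2d(out_ch * len(strides), out_ch, 1, bias=False)

    def forward(self, x):
        outs = []
        for s, conv in zip(self.strides, self.branches):
            y = conv(F.avg_pool2d(x, kernel_size=s, stride=s))
            outs.append(F.interpolate(y, size=x.shape[2:], mode="bilinear",
                                      align_corners=False))
        return self.project(torch.cat(outs, dim=1))


class DualAttentionDecoderSketch(nn.Module):
    """Hypothetical decoder: the channel attention branch reweights the low-level
    map with global context from the high-level map, and the spatial attention
    branch reweights the upsampled high-level map with a mask from the low-level map."""

    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(high_ch, low_ch, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(
            nn.Conv2d(low_ch, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, low, high):
        high_up = F.interpolate(high, size=low.shape[2:], mode="bilinear",
                                align_corners=False)
        low = low * self.channel_att(high)         # semantic guidance for low level
        high_up = high_up * self.spatial_att(low)  # spatial guidance for high level
        return torch.cat([low, high_up], dim=1)
```

In this sketch, channel weights computed from the high-level map act as semantic guidance for the low-level map, while the single-channel spatial mask from the low-level map sharpens the upsampled high-level map, mirroring the gap-bridging behavior described above.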

Our contributions in this work can be summarized as follows:

  • We propose a new structure called SSPP to acquire multiscale semantic information from the high-level feature map. Compared with existing methods, SSPP can significantly improve the utilization of its input while achieving higher inference speed.

  • We propose a novel decoder structure to bridge the gap between low- and high-level feature maps in semantic segmentation networks. It is able to take full advantage of these two kinds of feature maps by embedding abundant semantic information into the low-level feature map and accurate spatial information into the high-level feature map, which greatly benefits their fusion.

  • We test the proposed network on the publicly available PASCAL VOC 2012, Cityscapes and COCO-Stuff semantic image segmentation benchmarks, attaining better performance than several state-of-the-art approaches.

Related work

With the development of DCNNs, an increasing number of DCNN-based semantic segmentation networks have been proposed and have exhibited good performance on different benchmarks. Among them, many methods attempt to improve the utilization of the high- and low-level feature maps. Here we briefly review the background material that serves as a reference for the current study.

Method

This section first introduces the baseline network that we use to encode the semantic information. Then, we propose the SSPP structure, which improves the utilization of the high-level feature map. Subsequently, a dual attention decoder is developed to take full advantage of the low- and high-level feature maps. Finally, we present the complete network architecture that combines the SSPP structure and the dual attention decoder.

Experimental results

To verify the effectiveness of the proposed structure, in the following we first evaluate it on the PASCAL VOC 2012 semantic segmentation dataset [33], a well-known benchmark that includes 20 object categories and 1 background class. This dataset is split into training, validation, and test sets, with 1,464, 1,449, and 1,456 images, respectively. The dataset is augmented by the extra annotations provided by [34], resulting in 10,582 training images. We conduct a complete ablation study on our

Conclusion

In this paper, we propose two novel structures for taking full advantage of the high- and low-level feature maps in a semantic segmentation network. The first, i.e., SSPP, is able to capture multiscale semantic information from the high-level feature map with satisfactory inference speed. Compared with similar previous methods based on atrous convolution, our proposed structure utilizes more information from its input and attains better performance. The second, i.e., the dual attention decoder,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the National Natural Science Foundation of China under Grant No. 61773295 and the Natural Science Foundation of Hubei Province under Grant No. 2019CFA037.

References (42)

  • A. Li et al., Large-scale sparse learning from noisy tags for semantic segmentation, IEEE Trans. Cybern. (2018)

  • L.-C. Chen et al., DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs, IEEE Trans. Pattern Anal. Mach. Intell. (2018)

  • L.-C. Chen et al., Rethinking atrous convolution for semantic image segmentation, arXiv preprint arXiv:1706.05587 (2017)

  • L.-C. Chen et al., Encoder-decoder with atrous separable convolution for semantic image segmentation, Proceedings of the European Conference on Computer Vision (2018)

  • C.-W. Xie et al., Vortex pooling: improving context representation in semantic segmentation, arXiv preprint arXiv:1804.06242 (2018)

  • P. Wang et al., Understanding convolution for semantic segmentation, Proceedings of the IEEE Winter Conference on Applications of Computer Vision (2018)

  • F. Yu et al., Dilated residual networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)

  • H. Zhao et al., Pyramid scene parsing network, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)

  • M. Yang et al., DenseASPP for semantic segmentation in street scenes, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)

  • Z. Zhang et al., ExFuse: enhancing feature fusion for semantic segmentation, Proceedings of the European Conference on Computer Vision (2018)

  • H. Li et al., Pyramid attention network for semantic segmentation, Proceedings of the British Machine Vision Conference (2018)

Chengli Peng received the B.E. degree from the School of Electrical Engineering, Xinjiang University, Urumchi, China, in 2016, and the M.S. degree from the School of Engineering, Huazhong Agricultural University, Wuhan, China, in 2018. He is currently a Ph.D. student with the Electronic Information School, Wuhan University. His research interests include computer vision and deep learning.

Jiayi Ma received the B.S. degree in Information and Computing Science and the Ph.D. degree in Control Science and Engineering, both from the Huazhong University of Science and Technology, Wuhan, China, in 2008 and 2014, respectively. From 2012 to 2013, he was an Exchange Student with the Department of Statistics, University of California at Los Angeles, Los Angeles, CA, USA. He is currently a Professor with the Electronic Information School, Wuhan University, Wuhan, China. He has authored or co-authored over 130 refereed journal and conference papers, including IEEE TPAMI/TIP, IJCV, CVPR, ICCV, etc. He has been identified in the 2019 Highly Cited Researchers list from the Web of Science Group. He has won the Natural Science Award of Hubei Province (first class) as the first author. He has received the CAAI (Chinese Association for Artificial Intelligence) Excellent Doctoral Dissertation Award (a total of 8 winners in China), and the CAA (Chinese Association of Automation) Excellent Doctoral Dissertation Award (a total of 10 winners in China). He is an Editorial Board Member of Information Fusion and Neurocomputing, and a Guest Editor of Remote Sensing. His current research interests include the areas of computer vision, machine learning, and pattern recognition.
