Pooling Attention-based Encoder–Decoder Network for semantic segmentation

https://doi.org/10.1016/j.compeleceng.2021.107260

Highlights

  • The encoder and decoder work together to enhance the consistency of pixels.

  • The work uses maximum, average and stochastic pooling to gain contoured, detailed and generalized information.

  • Two attention modules use pooling operations to integrate discriminative feature information.

Abstract

To address the challenge of poor pixel consistency within a category and pixel similarity between categories, we propose PAEDN, an Encoder–Decoder network for image semantic segmentation built on a pooling SE-ResNet attention module. The attention mechanism is an effective way to obtain aggregated information. Following the principle of SE-ResNet, the attention modules are formed from a collection of Average, Maximum and Stochastic global pooling operations, which concentrate on contoured, detailed, and generalized information in a given semantic segmentation. A Channel Pooling Attention Module (CPAM) and a Position Pooling Attention Module (PPAM) are designed and integrated into the Encoder to extract discriminative features from input images, and the Decoder uses an SE-ResNet attention module to fuse the high-resolution feature map with the low-resolution one. Experimental evaluations on the PASCAL and Cityscapes data sets show that the proposed Encoder–Decoder with pooling attention modules produces semantic labels with good pixel consistency and achieves a 15.1% improvement over FCN.
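To make the three global pooling operations concrete, the following is a minimal NumPy sketch, not the authors' implementation. It reduces a feature map of shape (C, H, W) to one descriptor per channel with each pooling type. The function name `global_pool_descriptors` is hypothetical, and the stochastic variant assumes the common formulation in which one activation is sampled with probability proportional to its non-negative value.

```python
import numpy as np

def global_pool_descriptors(feat, rng):
    """Return (average, maximum, stochastic) global descriptors, each of shape (C,).

    feat: feature map of shape (C, H, W).
    rng:  a numpy.random.Generator used for the stochastic sampling.
    """
    c = feat.shape[0]
    flat = feat.reshape(c, -1)            # (C, H*W)
    n = flat.shape[1]

    avg = flat.mean(axis=1)               # average pooling
    mx = flat.max(axis=1)                 # maximum pooling

    # Stochastic pooling (assumed formulation): sample one activation per
    # channel with probability proportional to its non-negative value.
    pos = np.maximum(flat, 0.0)
    totals = pos.sum(axis=1, keepdims=True)
    safe_totals = np.where(totals > 0, totals, 1.0)
    # Channels with no positive activation fall back to uniform sampling.
    probs = np.where(totals > 0, pos / safe_totals, 1.0 / n)
    idx = np.array([rng.choice(n, p=p) for p in probs])
    sto = flat[np.arange(c), idx]
    return avg, mx, sto
```

In an SE-style module, each of these descriptors would then be passed through a small gating network to produce per-channel weights.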

Introduction

The task of semantic segmentation is to assign consistent labels to pixels with similar semantic attributes. It has important applications in medical image recognition, autonomous driving, intelligent security, and other fields.

Early on, conventional methods such as the watershed algorithm and threshold optimization were used to segment images, and regions were classified and labeled by their geometric shape and texture. Later, probabilistic models [1], [2], [3] and machine learning methods were developed for image semantic segmentation in applications such as license plate recognition and video image segmentation.

Long et al. first proposed FCN [4] in 2014, adapting an image classification network into a dense prediction network that classifies the learned representations at the pixel level. A series of robust networks followed, such as Zoom-Out [5], Piecewise [6], and LRR [7]. Driven by deep neural networks [8], [9], [10], [11], [12], semantic segmentation has made great progress with convolutional layers. Some networks, such as DeepLab V2 [13], aggregate multi-scale context information by combining convolution filters with different dilation rates and pooling operations to generate feature maps. Thanks to the transposed convolution operator and the skip-layer structure, the FCN prediction is not only consistent with the input image size but also accurate and robust.

FCN serves as the baseline for modern image semantic segmentation and performs pixel-wise image classification. However, because each pixel is predicted on its own, the consistency of labels within a category is weak. Strengthening the connection between pixels helps improve the performance of FCN. Pixel-level recognition therefore requires improving the discriminative ability of the feature representation. A practical and straightforward way to do so is to use neighboring pixels or spatially related information in the convolution process.

The SE block proposed in SE-ResNet [14] is a lightweight gating mechanism that models channel relationships in a computationally efficient manner, with the aim of enhancing the representational ability of basic modules throughout the network. In addition, the encoder–decoder network [15] was proposed to fuse low-resolution and high-resolution semantic features.
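The gating mechanism of an SE block can be sketched as follows. This is a minimal NumPy illustration of the general squeeze-and-excitation idea (global average pooling, a two-layer bottleneck, a sigmoid gate, and channel-wise rescaling), not code from the paper. The function name `se_block`, the weight shapes, and the omission of bias terms are simplifying assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(feat, w1, w2):
    """Apply a squeeze-and-excitation style gate to a feature map.

    feat: (C, H, W) feature map.
    w1:   (C // r, C) weights of the reduction layer (r is the reduction ratio).
    w2:   (C, C // r) weights of the expansion layer.
    Biases are omitted to keep the sketch short.
    """
    squeeze = feat.mean(axis=(1, 2))          # squeeze: one value per channel
    hidden = np.maximum(w1 @ squeeze, 0.0)    # excitation, step 1: ReLU bottleneck
    gate = sigmoid(w2 @ hidden)               # excitation, step 2: weights in (0, 1)
    return feat * gate[:, None, None], gate   # rescale every channel by its weight
```

Because the gate is computed from a C-dimensional vector rather than from the full feature map, the extra cost is small compared with the convolutions around it.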

In this work, we aim to design a semantic segmentation network that achieves good pixel consistency, semantic contours and object boundaries. Motivated by SE-ResNet and the encoder–decoder design, we propose an Encoder–Decoder network for semantic segmentation using pooling attention modules. We use a Channel Pooling Attention Module (CPAM) and a Position Pooling Attention Module (PPAM) in the Encoder to improve segmentation accuracy. CPAM captures the channel dependence of any two pixels in the feature map, and PPAM captures the spatial dependence. The proposed Decoder generates features with high resolution and strong semantic information by fusing high-resolution and low-resolution features.
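One way to picture the fusion of high-resolution and low-resolution features is sketched below. The exact decoder design is not given in this excerpt, so everything here is an assumption made for illustration: the low-resolution map is upsampled by nearest-neighbor repetition, stacked with the high-resolution map along the channel axis, and reweighted with an SE-style channel gate. The names `upsample_nearest` and `fuse_features` are hypothetical.

```python
import numpy as np

def upsample_nearest(feat, factor):
    """Nearest-neighbor upsampling of a (C, H, W) map by an integer factor."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse_features(high_res, low_res, w1, w2):
    """Fuse a high-resolution and a low-resolution feature map (illustrative).

    high_res: (C1, H, W); low_res: (C2, H // f, W // f) for an integer f.
    w1: (K, C1 + C2) and w2: (C1 + C2, K) are the gate weights (no biases).
    """
    factor = high_res.shape[1] // low_res.shape[1]
    up = upsample_nearest(low_res, factor)              # bring to (C2, H, W)
    stacked = np.concatenate([high_res, up], axis=0)    # (C1 + C2, H, W)

    # SE-style channel gate over the stacked features.
    squeeze = stacked.mean(axis=(1, 2))
    hidden = np.maximum(w1 @ squeeze, 0.0)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))
    return stacked * gate[:, None, None]
```

The gate lets the network decide, per channel, how much of the detailed high-resolution signal and how much of the semantically stronger low-resolution signal to pass on.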

Our main contributions are summarized as follows.

(1) We propose an encoder–decoder network with attention modules. The Encoder–Decoder serves as the main framework for semantic segmentation. The attention modules enhance the feature discrimination of the encoder and improve the prediction accuracy of the decoder.

(2) We design pooling attention modules based on SE-ResNet. Pooling is used to aggregate average, maximum and stochastic information along channels, spatial positions and different resolutions. The collection of pooling modules is implemented through the SE-ResNet architecture.
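Pooling along spatial positions, as opposed to pooling along channels, can be illustrated with the sketch below. It is an assumption-laden simplification rather than the paper's PPAM: the features are pooled across the channel axis to obtain one average and one maximum value per position, the two maps are combined with scalar weights, and a sigmoid turns the result into a spatial gate. The function name `position_pooling_gate` and the scalar parameters `w_avg`, `w_max` and `bias` are hypothetical stand-ins for learned layers.

```python
import numpy as np

def position_pooling_gate(feat, w_avg=1.0, w_max=1.0, bias=0.0):
    """Reweight every spatial position of a (C, H, W) feature map.

    Pooling runs over the channel axis, so each position (h, w) is summarized
    by its average and maximum activation across channels.
    """
    avg_map = feat.mean(axis=0)                      # (H, W)
    max_map = feat.max(axis=0)                       # (H, W)
    logits = w_avg * avg_map + w_max * max_map + bias
    gate = 1.0 / (1.0 + np.exp(-logits))             # (H, W), values in (0, 1)
    return feat * gate[None, :, :], gate
```

A channel gate answers "which feature maps matter", whereas this spatial gate answers "which positions matter"; the two are complementary.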

The Encoder and the Decoder work together to strengthen the in-category consistency of pixels. The Pooling Attention-based Encoder–Decoder Network (PAEDN) performs well, reaching 77.30% mIoU on PASCAL VOC 2012.
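For readers unfamiliar with the metric, mean IoU and pixel accuracy are usually derived from a confusion matrix, as in the following sketch. It follows the standard definitions and is not taken from the authors' evaluation code. The function names are hypothetical, and treating any label outside [0, num_classes) as "ignore" is an assumption that matches the common practice of marking void pixels with a value such as 255.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    mask = (target >= 0) & (target < num_classes)    # drop ignored pixels
    idx = num_classes * target[mask].astype(int) + pred[mask].astype(int)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def mean_iou_and_pixel_acc(pred, target, num_classes):
    cm = confusion_matrix(pred, target, num_classes)
    tp = np.diag(cm).astype(float)
    # union = predicted as class + labeled as class - intersection
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    present = union > 0                  # skip classes absent from both maps
    iou = tp[present] / union[present]
    pixel_acc = tp.sum() / cm.sum()
    return iou.mean(), pixel_acc
```

For example, with ground truth `[[0, 0], [1, 1]]` and prediction `[[0, 1], [1, 1]]`, the per-class IoUs are 1/2 and 2/3, giving a mean IoU of about 0.583 and a pixel accuracy of 0.75.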

Related work

Since Long et al. first proposed FCN, semantic segmentation has developed dramatically. First, modifying the encoder–decoder structure to produce more accurate predictions has become an active topic; this approach fuses low-resolution and high-resolution semantic features. Second, more and more semantic segmentation networks add attention mechanisms to capture contextual information. These methods use contextual information as much as possible by changing the structure of

The proposed method

In this section, we present the overall architecture of our Pooling Attention Network and introduce the Encoder and the Decoder in turn.

Experimental results

The experiments are conducted with PyTorch 1.1.0 on Ubuntu 18.04, running on two GTX-1080Ti GPUs with 11 GB of memory each.

We introduce the datasets and parameter settings, and conduct a series of ablation and comprehensive experiments on PASCAL VOC 2012 to evaluate the proposed method. Finally, we report our results on PASCAL and Cityscapes. Training is carried out on the training sets of PASCAL and Cityscapes. The experimental results (Mean IoU and Pixel Acc.) and the visualization

Conclusions

We perform image semantic segmentation based on the correlation between pixels, rather than classifying pixels one by one. To this end, we propose an attention module based on pooling operations. With its low computational complexity, this module can effectively obtain long-range contextual information of images, which compensates for the high computational complexity and large GPU memory consumption of the Self-Attention mechanism. Therefore, we propose
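The memory argument can be made tangible with a rough back-of-the-envelope calculation. The sketch below compares the number of values in a pairwise self-attention map over all positions with the number of values in pooled channel descriptors. The feature-map size (64 × 64 with 512 channels) is a hypothetical example, not a figure from the paper, and the function names are illustrative.

```python
def self_attention_map_entries(height, width):
    """A position-wise self-attention map relates every position to every other."""
    n = height * width
    return n * n

def pooled_descriptor_entries(channels, num_poolings=3):
    """One descriptor value per channel for each pooling type."""
    return channels * num_poolings

# Hypothetical feature map: 64 x 64 positions, 512 channels, float32 values.
h, w, c = 64, 64, 512
attn_entries = self_attention_map_entries(h, w)      # 4096 * 4096 = 16,777,216
pool_entries = pooled_descriptor_entries(c)          # 512 * 3 = 1,536
attn_mib = attn_entries * 4 / (1024 ** 2)            # 64.0 MiB per image
```

Under these assumptions the attention map alone takes 64 MiB per image in float32, and it grows with the square of the number of positions, while the pooled descriptors stay at a few thousand values regardless of resolution.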

CRediT authorship contribution statement

Haixia Xu: Conceptualization, Methodology, Software, Writing - original draft. Yunjia Huang: Conceptualization, Methodology, Software, Writing - original draft. Edwin R. Hancock: Supervision. Shuailong Wang: Visualization, Writing - review & editing. Qijun Xuan: Validation, Investigation. Wei Zhou: Supervision, Writing - review & editing.

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.compeleceng.2021.107260.

Acknowledgments

This work was supported by the Joint Fund for Regional Innovation and Development of NSFC (U19A2083), by the Science and Technology Plan Project of Hunan Province, China (2016TP1020), and by the open fund project of the Hunan Provincial Key Laboratory of Intelligent Information Processing and Application at Hengyang Normal University, China (IIPA20K04).

References (24)

  • Zheng Shuai, Jayasumana Sadeep, Romera-Paredes Bernardino, Vineet Vibhav, Su Zhizhong, Du Dalong, Huang Chang, Torr...
  • Vemulapalli Raviteja, Tuzel Oncel, Liu Ming-Yu, Chellappa Rama. Gaussian conditional random field network for semantic...
  • Kreso Ivan, Causevic Denis, Krapac Josip, Segvic Sinisa. Convolution scale invariance for semantic segmentation. In:...
  • Long Jonathan, Shelhamer Evan, Darrell Trevor. Fully convolutional networks for semantic segmentation. In: IEEE...
  • Mostajabi Mohammadreza, Yadollahpour Payman, Shakhnarovich Gregory. Feedforward semantic segmentation with zoom-out...
  • Lin Guosheng, Shen Chunhua, Hengel Anton Van Den, Reid Ian. Efficient piecewise training of deep structured models for...
  • Ghiasi Golnaz, Fowlkes Charless C. Laplacian pyramid reconstruction and refinement for semantic segmentation. In:...
  • Noh Hyeonwoo, Hong Seunghoon, Han Bohyung. Learning deconvolution network for semantic segmentation. In: IEEE...
  • Liu Ziwei, Li Xiaoxiao, Luo Ping, Loy Chen Change, Tang Xiaoou. Semantic image segmentation via deep parsing network....
  • Yu Fisher, et al. Multi-scale context aggregation by dilated convolutions. 2015.
  • Yuan Yuhui, Chen Xilin, Wang Jingdong. Object-contextual Representations for semantic segmentation. In: European...
  • Sun Jiaxing, et al. Multi-feature fusion network for road scene semantic segmentation. Comput Electr Eng 2021.

Cited by (5)

    • Deformable attention-oriented feature pyramid network for semantic segmentation

      2022, Knowledge-Based Systems
      Citation Excerpt :

      The single-scale and multi-scale performance of DANet are 76.51% / 77.32% respectively, which is 1.23% / 1.38% lower than that of our model. PAEDN [40] designs an encoder–decoder structure, which uses Channel Pooling Attention Module and Position Pooling Attention Module to extract features, and the SE-ResNet attention module to fuse features. The multi-scale test result of PAEDN is 77.3%, which is 1.4% lower than our model.

    • Edge-aware and spectral–spatial information aggregation network for multispectral image semantic segmentation

      2022, Engineering Applications of Artificial Intelligence
      Citation Excerpt :

      In addition, some improved methods are proposed. Xu et al. (2021) proposed the pooling attention-based encoder–decoder network for semantic segmentation. Attention modules enhance the feature discrimination of the encoder and improve the prediction accuracy of the decoder.

    Haixia Xu received her Ph.D. from Hunan University. She is currently an assistant professor at Xiangtan University. Her research interests focus on computer vision and deep learning.

    Yunjia Huang is a graduate student at Xiangtan University. His research area is semantic segmentation.

    Edwin R. Hancock is a professor at the University of York, UK, and an IEEE Fellow. His research interests include computer vision, machine learning and complex networks.

    Shuailong Wang is a graduate student at Xiangtan University. His research area is semantic segmentation.

    Qijun Xuan is a graduate student at Xiangtan University. His research area is semantic segmentation.

    Wei Zhou is an assistant professor at Xiangtan University. His research interests include computer vision and network safety.

    This paper is for regular issues of CAEE. Reviews processed and approved for publication by the co-Editor-in-Chief Huimin Lu.
