Parsing very high resolution urban scene images by learning deep ConvNets with edge-aware loss

https://doi.org/10.1016/j.isprsjprs.2020.09.019

Abstract

Parsing very high resolution (VHR) urban scene images into regions with semantic meaning, e.g. buildings and cars, is a fundamental task in urban scene understanding. However, due to the huge quantity of detail contained in an image and the large variations of objects in scale and appearance, existing semantic segmentation methods often break one object into pieces, or confuse adjacent objects, and thus fail to depict these objects consistently. To address these issues in a unified way, we propose a standalone end-to-end edge-aware neural network (EaNet) for urban scene semantic segmentation. To preserve semantic consistency inside objects, the EaNet model incorporates a large kernel pyramid pooling (LKPP) module to capture rich multi-scale context with strong continuous feature relations. To effectively separate confusing objects with sharp contours, a Dice-based edge-aware loss function (EA loss) is devised to guide the EaNet to refine both pixel- and image-level edge information directly from the semantic segmentation prediction. In the proposed EaNet model, the LKPP and the EA loss couple to enable comprehensive feature learning across an entire semantic object. Extensive experiments on three challenging datasets demonstrate that our method generalizes readily to multi-scale ground/aerial urban scene images, achieving 81.7% mIoU on the Cityscapes test set and a mean F1-score of 90.8% on the ISPRS Vaihingen 2D test set. Code is available at: https://github.com/geovsion/EaNet.

Introduction

Semantic segmentation of urban scene images aims to locate objects at the pixel level and assign them categorical labels, which supports a wide range of urban applications, such as urban mapping and 3D modeling, autonomous driving, urban land cover classification and change detection (Zhu et al., 2017, Marcos et al., 2018, Zhao et al., 2018a). However, as a dense pixel-wise classification task, semantic image segmentation faces major challenges in urban areas, due to the volume of detailed information contained in very high resolution (VHR) images and the large variations in the scale and appearance of objects. Large numbers of image details hamper the extraction of features relevant to the global structure and semantic information of urban objects. Meanwhile, objects with large scale variation that frequently co-occur in an image, such as large buildings and small cars, make it difficult to balance segmentation quality across objects of different sizes. Moreover, the existence of many confusing categories, like trees and meadows, or similar objects with diverse appearances, like cars, makes it hard to achieve intra-class unification and inter-class discrimination simultaneously when parsing urban scenes.

Extensive investigations of the challenging urban scene parsing task have been presented based on convolutional neural networks (ConvNets) (Yang et al., 2018, Yu et al., 2018a, Zhao et al., 2018b), owing to the ability of ConvNets to learn hierarchical features and capture rich context (Chen et al., 2018). In particular, ConvNets based on the fully convolutional network (FCN) have become the mainstream approach for urban scene parsing since the success of the first end-to-end FCN for semantic segmentation (Long et al., 2015). However, the powerful abstraction capability of ConvNets in data-driven learning tasks creates two technical hurdles: imbalanced attention to multi-scale objects and loss of detail during encoding. Targeting these two issues, much effort has been devoted to improving semantic segmentation (Liu et al., 2018a, Yu et al., 2018b).

In urban scene semantic segmentation, when the objects in an image vary in scale, a neural network with an inappropriately sized receptive field will give unbalanced attention to differently sized objects. A network with a small receptive field pays more attention to small objects and divides larger objects into fragments, while one with a large receptive field ignores details and fails to separate small adjacent objects. Common solutions for multi-scale object segmentation focus on receptive field enlargement (Chen et al., 2018b, Zhao et al., 2017). Many methods were developed with image pyramids (Zhao et al., 2018) or extra subnetworks (Yang et al., 2018), but such methods are time-consuming. A more popular way is to deploy a spatial pyramid pooling (SPP) module in the network architecture (Chen et al., 2018b, Yuan and Wang, 2018, He et al., 2019). However, current SPPs have difficulty capturing relational information between long-range features while retaining continuity between neighboring features (Wang et al., 2018), due to inappropriate receptive field size design. Thus, when balancing the segmentation quality of multi-scale urban objects, large objects still tend to be divided into fragments.
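To make the receptive-field discussion concrete, the sketch below shows a minimal PyTorch implementation of an SPP-style context module with parallel dilated branches, in the spirit of atrous pyramid designs such as that of Chen et al. (2018b). The branch width and dilation rates are illustrative assumptions, not the LKPP configuration, which is detailed in Section 3.

```python
import torch
import torch.nn as nn

class PyramidContextModule(nn.Module):
    """Illustrative SPP-style context module: parallel dilated 3x3 branches
    enlarge the receptive field at several scales; branch outputs are
    concatenated and fused by a 1x1 convolution. Branch width and dilation
    rates here are assumptions for illustration only."""

    def __init__(self, in_ch, branch_ch=256, dilations=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, 3, padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )
        self.fuse = nn.Conv2d(branch_ch * len(dilations), branch_ch, 1)

    def forward(self, x):
        # Each branch sees a different receptive field; concatenation keeps
        # both short-range and long-range context before fusion.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# e.g. context = PyramidContextModule(in_ch=2048)(encoder_features)
```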

Another inevitable problem in urban scene semantic segmentation with ConvNets is detail degradation caused by downsampling. Detail degradation affects the accurate localization of objects at the pixel level, leading to blurry object boundaries. To tackle this problem, numerous methods have concentrated on enhancing the sensitivity of a model to boundary information. One way is to employ post-processing techniques such as a conditional random field (CRF) (Paisitkriangkrai et al., 2015, Sherrah, 2016, Chen et al., 2018b), which comes with high computational costs. The other relies on applying an extra edge-extraction sub-network (Cheng et al., 2017, Liu et al., 2018b) or even an individual edge detection model like HED (Xie and Tu, 2015, Marmanis et al., 2018) to merge boundary information during segmentation. However, employing extra edge detectors increases model complexity and requires more training parameters. Moreover, the edge detectors used in these methods only learn edge features with a pixel-level cross-entropy loss (CE loss) and are independent of the semantic feature learning of an object, which leads to incomplete learning across an entire object.
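To make the notion of pixel-level boundary supervision concrete, the following minimal PyTorch sketch derives a binary boundary mask from a semantic label map, which is the kind of target such edge sub-networks are trained against with a CE loss. This is a generic illustration; the max-pooling trick and the neighbourhood radius are our assumptions, not the edge extraction used by the cited methods.

```python
import torch
import torch.nn.functional as F

def label_boundaries(label, radius=1):
    """Derive a binary boundary mask from an integer label map of shape
    (B, H, W): a pixel is marked as boundary if any neighbour within
    `radius` carries a different class label."""
    lab = label.float().unsqueeze(1)                          # (B, 1, H, W)
    k = 2 * radius + 1
    lmax = F.max_pool2d(lab, k, stride=1, padding=radius)     # local max label
    lmin = -F.max_pool2d(-lab, k, stride=1, padding=radius)   # local min label
    return (lmax != lmin).float()                             # (B, 1, H, W)
```

An edge sub-network would then be trained to predict such a mask pixel by pixel, whereas the EA loss introduced below derives edge information from the segmentation prediction itself.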

In this paper, we propose an edge-aware neural network (EaNet) for precise semantic segmentation of urban scenes. For the basic architecture of EaNet, we deploy a balanced encoder-decoder structure with skip pathways. To address the two aforementioned issues in a unified framework, we append two modules, i.e., a large kernel pyramid pooling (LKPP) module and a Dice-based edge-aware loss function (EA loss), on top of the encoder and the decoder of EaNet, respectively. The LKPP captures rich context information at multiple scales and builds strong continuous relations between long-range and neighboring features by constructing several branches with densely extending receptive field sizes. It effectively strengthens semantic unification inside objects and prevents them from being segmented into fragments. The EA loss optimizes segmentation predictions via a standard cross-entropy loss and learns edge information directly from the segmentation prediction map using a Dice-based edge loss. In this way, the EA loss module works at both the pixel and image level with no extra training parameters, which is more efficient and effective than many existing solutions for object boundary learning. By integrating the LKPP and the EA loss in a single one-stream EaNet model, the two modules communicate directly through forward and backward propagation, enabling more comprehensive learning of semantic objects than many existing methods.
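As a rough illustration of how a Dice-based edge term can be computed directly from the segmentation prediction without extra trainable parameters, the following is a minimal PyTorch sketch. The Laplacian edge operator, the assumption that every pixel carries a valid label, and the loss weighting are illustrative choices, not the exact EA loss formulation given in Section 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeAwareLoss(nn.Module):
    """Sketch of a pixel-level cross-entropy loss combined with a Dice-based
    edge term computed directly from the prediction map (no extra trainable
    parameters). Assumes every pixel has a valid label in [0, num_classes)."""

    def __init__(self, num_classes, edge_weight=1.0):
        super().__init__()
        self.num_classes = num_classes
        self.edge_weight = edge_weight
        self.ce = nn.CrossEntropyLoss()
        # Fixed 3x3 Laplacian kernel as a differentiable edge extractor
        # (an illustrative choice of edge operator).
        lap = torch.tensor([[0., 1., 0.],
                            [1., -4., 1.],
                            [0., 1., 0.]]).view(1, 1, 3, 3)
        self.register_buffer("lap", lap)

    def _edges(self, x):
        # Channel-wise Laplacian response of a (B, C, H, W) probability map.
        c = x.shape[1]
        e = F.conv2d(x, self.lap.repeat(c, 1, 1, 1), padding=1, groups=c)
        return e.abs().clamp(0., 1.)

    def forward(self, logits, target):
        ce = self.ce(logits, target)                    # pixel-level term
        prob = torch.softmax(logits, dim=1)
        onehot = F.one_hot(target, self.num_classes).permute(0, 3, 1, 2).float()
        pred_e, gt_e = self._edges(prob), self._edges(onehot)
        inter = (pred_e * gt_e).sum()
        dice = 1. - (2. * inter + 1.) / (pred_e.sum() + gt_e.sum() + 1.)
        return ce + self.edge_weight * dice             # edge-aware term
```

Because the edge maps are derived from the full prediction and label tensors, the Dice term couples boundary quality to the same forward and backward pass that drives semantic learning, which is the behaviour described above.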

The EaNet is standalone and elegant, and generalizes well even to very large-scale urban scene data. Moreover, the two proposed modules, LKPP and EA loss, can be easily applied to other FCN frameworks. The main contributions of this work are as follows:

  • We propose a simple yet effective edge-aware neural network (EaNet) for a comprehensive learning of semantic objects.

  • An LKPP module is proposed to densely capture rich multi-scale context with strongly continuous feature relations, and thus robustly segment multi-scale urban objects with high intra-class consistency.

  • The EA loss module refines object boundaries directly from the segmentation prediction at both the pixel and image level, which significantly improves the discrimination of confusing urban objects. The module provides a new loss function for simultaneous semantic category and edge structure learning, which is superior to existing combined solutions.

  • We validate the proposed EaNet on three datasets with very different characteristics, i.e., Cityscapes, ISPRS Vaihingen 2D and WHU Aerial Buildings, and show that EaNet generalizes well to multi-scale ground/aerial urban scene data, achieving competitive performance.

The rest of this paper is organized as follows. Related work is reviewed in Section 2. The architecture of EaNet and its components are detailed in Section 3. The performance of the two general modules and the complete EaNet is evaluated in Section 4. Conclusions are drawn in Section 5.

Related work

Extensive work has been presented on urban scene semantic segmentation employing ConvNets, in both the computer vision and remote sensing fields (Zhu et al., 2017, Chen et al., 2018b). In this section, we briefly review the works most relevant to the two technical hurdles in urban scene parsing, i.e., imbalanced attention to multi-scale objects and loss of boundary detail during encoding.

Architecture of the proposed EaNet

In this section, we discuss the architecture of the proposed EaNet and its two major components in detail, starting with an overview of the EaNet workflow in general.

Experiments

We conducted experiments on three datasets, including a large-scale ground dataset, i.e., Cityscapes (Cordts et al., 2016), and two relatively small-scale aerial datasets, i.e., ISPRS Vaihingen 2D (Gerke, 2014) and the WHU Aerial Building Dataset (Ji et al., 2018), in order to comprehensively test the learning capacity and generalizability of the proposed EaNet model. Ablation studies were conducted for the two general modules, i.e., LKPP and EA loss, individually, to verify their efficacy when

Conclusion

In this paper, we propose an edge-aware neural network (EaNet) with large kernel pyramid pooling for robust semantic segmentation in urban areas. Extensive ablation experiments show that the proposed EaNet adapts to both ground and aerial urban scene images, and achieves consistently excellent performance on three benchmark datasets, i.e., Cityscapes, ISPRS Vaihingen, and the WHU Aerial Building datasets. Qualitative and quantitative analysis results verify that the two introduced modules,

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was funded by the National Key Research and Development Program of China under Grant 2018YFB0505401, the National Natural Science Foundation of China under Grants 41701445, 41871361 and 42071370, and the Fundamental Research Funds for the Central Universities.

References (58)

  • Chen, L., et al., 2016. Attention to scale: scale-aware semantic image segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Cheng, D., et al., 2017. FusionNet: Edge aware deep convolutional networks for semantic segmentation of remote sensing harbor images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.
  • Cordts, M., et al., 2016. The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Ding, H., et al., 2019. Boundary-aware feature propagation for scene segmentation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Ding, X., et al., 2019. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • Gerke, M., 2014. Use of the stair vision library within the ISPRS 2D semantic labeling benchmark (Vaihingen). Technical Report.
  • Ghassemi, S., et al., 2019. Learning and adapting robust features for satellite image segmentation on heterogeneous data sets. IEEE Trans. Geosci. Remote Sens.
  • He, J., et al., 2019. Adaptive pyramid context network for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • He, K., et al., 2017. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV).
  • He, K., et al., 2016. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Ji, S., et al., 2018. Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens.
  • Jiang, J., et al., 2017. Incorporating depth into both CNN and CRF for indoor semantic segmentation. In: 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS).
  • Kang, W., et al., 2019. EU-Net: an efficient fully convolutional network for building extraction from optical remote sensing images. Remote Sensing.
  • Krähenbühl, P., et al., 2011. Efficient inference in fully connected CRFs with Gaussian edge potentials. In: Advances in Neural Information Processing Systems.
  • Lin, G., et al., 2017. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Liu, H., et al., 2019. DE-Net: deep encoding network for building extraction from high-resolution remote sensing imagery. Remote Sensing.
  • Liu, Q., et al., 2020. Dense dilated convolutions' merging network for land cover classification. IEEE Trans. Geosci. Remote Sens.
  • Liu, S., et al., 2018. ERN: edge loss reinforced semantic segmentation network for remote sensing images. Remote Sensing.
  • Liu, Z., et al., 2016. Semantic image segmentation via deep parsing network. In: IEEE International Conference on Computer Vision (ICCV).