Parsing very high resolution urban scene images by learning deep ConvNets with edge-aware loss
Introduction
Semantic segmentation of urban scene images aims to locate objects at the pixel level and assign them categorical labels, which supports a wide range of urban applications, such as urban mapping and 3D modeling, autonomous driving, urban land cover classification and change detection (Zhu et al., 2017, Marcos et al., 2018, Zhao et al., 2018a). However, as a dense pixel-wise classification task, semantic image segmentation faces substantial challenges in urban areas, owing to the volume of detailed information contained in very high resolution (VHR) images and the large variations in the scale and appearance of objects. The abundance of image detail hampers the extraction of features relevant to the global structure and semantics of urban objects. Meanwhile, objects with large scale variation frequently co-occur in an image, such as large buildings and small cars, making it difficult to balance segmentation quality across objects of different sizes. Moreover, the existence of many confusing categories, like trees and meadows, or similar objects with diverse appearances, like cars, makes it hard to achieve intra-class unification and inter-class discrimination simultaneously when parsing urban scenes.
Extensive investigations of the challenging urban scene parsing task have been based on convolutional neural networks (ConvNets) (Yang et al., 2018, Yu et al., 2018a, Zhao et al., 2018b), owing to the ability of ConvNets to learn hierarchical features and capture rich context (Chen et al., 2018). In particular, ConvNets based on the fully convolutional network (FCN) have become the mainstream approach for urban scene parsing following the success of the first end-to-end FCN for semantic segmentation (Long et al., 2015). However, the powerful abstraction capability of ConvNets in data-driven learning creates two technical hurdles: imbalanced attention to multi-scale objects and loss of detail during encoding. Targeting these two issues, much effort has been devoted to improving semantic segmentation (Liu et al., 2018a, Yu et al., 2018b).
In urban scene semantic segmentation, when the objects in an image vary in scale, a neural network with an inappropriately sized receptive field gives unbalanced attention to differently sized objects. A network with a small receptive field pays more attention to small things and divides larger objects into fragments, while one with a large receptive field ignores details and fails to separate small adjacent objects. Common solutions for multi-scale object segmentation focus on receptive field enlargement (Chen et al., 2018b, Zhao et al., 2017). Many methods were developed with image pyramids (Zhao et al., 2018) or extra subnetworks (Yang et al., 2018), but such methods are time-consuming. A more popular approach is to deploy a spatial pyramid pooling (SPP) module in the network architecture (Chen et al., 2018b, Yuan and Wang, 2018, He et al., 2019). However, owing to inappropriate receptive field design, current SPPs have difficulty capturing relational information between long-range features while retaining continuity between neighboring features (Wang et al., 2018). Thus, when balancing the segmentation quality of multi-scale urban objects, large objects still tend to be divided into fragments.
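The receptive-field trade-off above can be made concrete with a short calculation. The sketch below is our own illustration (not code from the paper): it computes the theoretical 1-D receptive field of a stack of possibly dilated convolutions, showing how dilation enlarges the view field without adding parameters.

```python
def receptive_field(layers):
    """Theoretical 1-D receptive field of stacked convolutions.

    layers: list of (kernel_size, stride, dilation) tuples,
    ordered from input to output.
    """
    rf, jump = 1, 1  # current field size and input-step per output-step
    for k, s, d in layers:
        rf += (k - 1) * d * jump  # dilation widens each layer's contribution
        jump *= s
    return rf

# Three plain 3x3 convolutions see only 7 pixels...
print(receptive_field([(3, 1, 1)] * 3))  # -> 7
# ...while dilations 1, 2, 4 widen the view to 15 with the same parameter
# count, which is why atrous/SPP-style branches help large objects.
print(receptive_field([(3, 1, 1), (3, 1, 2), (3, 1, 4)]))  # -> 15
```

The same arithmetic also shows why receptive field sizes must be chosen densely: large dilation steps leave gaps between sampled positions, weakening the continuity between neighboring features that the text describes.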
Another inevitable problem in urban scene semantic segmentation with ConvNets is detail degradation caused by downsampling. Detail degradation hinders the accurate localization of objects at the pixel level, leading to blurry object boundaries. To tackle this problem, numerous methods have concentrated on enhancing a model's sensitivity to boundary information. One way is to employ post-processing techniques such as a conditional random field (CRF) (Paisitkriangkrai et al., 2015, Sherrah, 2016, Chen et al., 2018b), which comes with high computational costs. Another relies on an extra edge extraction subnetwork (Cheng et al., 2017, Liu et al., 2018b) or even a separate edge detection model such as HED (Xie and Tu, 2015, Marmanis et al., 2018) to merge boundary information during segmentation. However, employing extra edge detectors increases model complexity and requires more training parameters. Moreover, the edge detectors in these methods learn edge features only with a pixel-level cross-entropy (CE) loss, independently of the semantic feature learning of the object, which leads to incomplete learning across an entire object.
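One commonly cited weakness of supervising thin edges with a pixel-level CE loss alone is extreme class imbalance: edge pixels make up a tiny fraction of the image, so a prediction that misses every edge can still score a small mean CE. The numerical illustration below is our own (not from the paper) and motivates region-overlap measures such as Dice for boundary supervision.

```python
import numpy as np

# Synthetic ground truth: one thin vertical edge in a 64x64 image,
# so only 64 of 4096 pixels are positive.
gt = np.zeros((64, 64))
gt[:, 32] = 1.0
# A degenerate predictor that outputs "almost no edge" everywhere.
pred = np.full_like(gt, 0.01)

eps = 1e-7
# Mean pixel-wise binary cross-entropy: dominated by the easy background.
ce = -(gt * np.log(pred + eps) + (1 - gt) * np.log(1 - pred + eps)).mean()
# Dice coefficient: overlap-based, so missing all edge pixels is punished hard.
dice = (2 * (pred * gt).sum() + eps) / (pred.sum() + gt.sum() + eps)

print(ce)    # small mean CE despite missing every edge pixel
print(dice)  # Dice stays near zero, exposing the miss
```

Here the mean CE is below 0.1 even though no edge is detected, while the Dice score stays close to zero; a Dice-based term therefore gives a much stronger training signal on thin structures.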
In this paper, we propose an edge-aware neural network (EaNet) for precise semantic segmentation of urban scenes. The basic architecture of EaNet is a balanced encoder-decoder structure with skip pathways. To address the two issues above in a unified framework, we append two modules, a large kernel pyramid pooling (LKPP) module and a Dice-based edge-aware loss function (EA loss), on top of the encoder and the decoder of EaNet, respectively. The LKPP captures rich context information at multiple scales and builds strong continuous relations between long-range and neighboring features by constructing several branches with densely extending receptive field sizes. It effectively strengthens semantic unification inside objects, preventing them from being segmented into fragments. The EA loss optimizes segmentation predictions via a standard cross-entropy loss and learns edge information directly from the segmentation prediction map using a Dice-based edge loss. In this way, the EA loss works at both the pixel and image level with no extra training parameters, which is more efficient and effective than many existing solutions for object boundary learning. By integrating the LKPP and the EA loss in a single one-stream EaNet model, the two modules communicate directly through forward and backward propagation, enabling more comprehensive learning of semantic objects than many existing methods.
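To illustrate how an edge term can be derived directly from the prediction map with no extra trainable parameters, the sketch below combines a CE term with a Dice penalty on edge maps extracted by simple neighbor comparison. This is a minimal sketch under our own assumptions; the paper's exact EA loss formulation is given in Section 3, and all function names here are hypothetical.

```python
import numpy as np

def edge_map(labels):
    """Binary boundary map: a pixel is an edge if it differs from its
    left or upper neighbor (a parameter-free edge extractor)."""
    e = np.zeros(labels.shape, dtype=bool)
    e[:, 1:] |= labels[:, 1:] != labels[:, :-1]
    e[1:, :] |= labels[1:, :] != labels[:-1, :]
    return e.astype(float)

def dice_coeff(p, g, eps=1e-6):
    """Dice overlap between two (soft) binary maps."""
    inter = (p * g).sum()
    return (2.0 * inter + eps) / (p.sum() + g.sum() + eps)

def edge_aware_loss(ce_loss, pred_labels, gt_labels, lam=1.0):
    """Hypothetical combination: standard CE on the semantics plus a
    Dice edge penalty computed from the prediction map itself."""
    edge_dice = dice_coeff(edge_map(pred_labels), edge_map(gt_labels))
    return ce_loss + lam * (1.0 - edge_dice)

gt = np.array([[0, 0, 1, 1],
               [0, 0, 1, 1]])
# A perfect prediction leaves only the CE term: edge Dice is 1, penalty 0.
print(edge_aware_loss(0.0, gt, gt))  # -> 0.0
```

Because the edge maps are computed from the label/prediction maps rather than by a learned detector, the boundary signal backpropagates through the same prediction head as the semantic signal, which mirrors the parameter-free, one-stream design described above.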
The EaNet is standalone and elegant, and generalizes well even to very large-scale urban scene data. Moreover, the two proposed modules, LKPP and EA loss, can easily be applied to other FCN frameworks. The main contributions of this work are as follows:
- We propose a simple yet effective edge-aware neural network (EaNet) for comprehensive learning of semantic objects.
- An LKPP module is proposed to densely capture multi-scale rich context with strongly continuous feature relations, thus robustly segmenting multi-scale urban objects with high intra-class consistency.
- The EA loss module refines object boundaries directly from the segmentation prediction at both the pixel and image level, which significantly improves the discrimination of confusing urban objects. It provides a new loss function for simultaneous semantic category and edge structure learning, superior to existing combined solutions.
- We validate the proposed EaNet on three datasets with very different characteristics, i.e., Cityscapes, ISPRS Vaihingen 2D and WHU Aerial Buildings. We show that EaNet generalizes well to multi-scale ground/aerial urban scene data, achieving competitive performance.
The rest of this paper is organized as follows. Related work is reviewed in Section 2. The architecture of EaNet and its components are detailed in Section 3. The performance of the two general modules and the complete EaNet is evaluated in Section 4. Conclusions are drawn in Section 5.
Related work
Extensive works have been presented on urban scene semantic segmentation employing ConvNets, both in the field of computer vision and remote sensing (Zhu et al., 2017, Chen et al., 2018b). In this section, we briefly review the works most relevant to the two technical hurdles in urban scene parsing, i.e., imbalanced attention to multi-scale objects and loss of boundary detail during encoding.
Architecture of the proposed EaNet
In this section, we discuss the architecture of the proposed EaNet and its two major components in detail, starting with an overview of the EaNet workflow in general.
Experiments
We conducted experiments on three datasets, including a large-scale ground dataset, i.e., Cityscapes (Cordts et al., 2016), and two relatively small-scale aerial datasets, i.e., ISPRS Vaihingen 2D (Gerke, 2014) and the WHU Aerial Building Dataset (Ji et al., 2018), in order to comprehensively test the learning capacity and generalizability of the proposed EaNet model. Ablation studies were conducted on the two general modules, i.e., LKPP and EA loss, individually to verify their efficacy.
Conclusion
In this paper, we propose an edge-aware neural network (EaNet) with large kernel pyramid pooling for robust semantic segmentation in urban areas. Extensive ablation experiments show that the proposed EaNet can adapt to both ground and aerial urban scene images, achieving consistently excellent performance on three benchmark datasets, i.e., Cityscapes, ISPRS Vaihingen, and the WHU Aerial Building datasets. Qualitative and quantitative analyses verify the efficacy of the two introduced modules, LKPP and EA loss.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research was funded by the National Key Research and Development Program of China under Grant 2018YFB0505401, the National Natural Science Foundation of China under Grants 41701445, 41871361 and 42071370, and the Fundamental Research Funds for the Central Universities.
References
- et al. (2018). Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks.
- et al. (2018). Semantic labeling in very high resolution images via a self-cascaded convolutional neural network. ISPRS J. Photogramm. Remote Sens.
- et al. (2018). Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models. ISPRS J. Photogramm. Remote Sens.
- et al. (2018). Classification with an edge: Improving semantic image segmentation with boundary detection. ISPRS J. Photogramm. Remote Sens.
- et al. (2019). Problems of encoder-decoder frameworks for high-resolution remote sensing image segmentation: Structural stereotype and insufficient learning. Neurocomputing.
- et al. (2017). Gated convolutional neural network for semantic segmentation in high-resolution images. Remote Sensing.
- et al. (2016). Semantic segmentation of earth observation data using multimodal and multi-scale deep networks.
- et al. (2018). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell.
- et al. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. Proceedings of the European Conference on Computer Vision (ECCV).
- Chen, L., G. Papandreou, F. Schroff and H. Adam (2017). Rethinking atrous convolution for semantic image...
- Attention to scale: Scale-aware semantic image segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- FusionNet: Edge aware deep convolutional networks for semantic segmentation of remote sensing harbor images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens.
- The Cityscapes dataset for semantic urban scene understanding. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Boundary-aware feature propagation for scene segmentation. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Use of the stair vision library within the ISPRS 2D semantic labeling benchmark (Vaihingen). Technical Report.
- Learning and adapting robust features for satellite image segmentation on heterogeneous data sets. IEEE Trans. Geosci. Remote Sens.
- Adaptive pyramid context network for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision (ICCV).
- Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set. IEEE Trans. Geosci. Remote Sens.
- Incorporating depth into both CNN and CRF for indoor semantic segmentation. 2017 8th IEEE International Conference on Software Engineering and Service Science (ICSESS).
- EU-Net: An efficient fully convolutional network for building extraction from optical remote sensing images. Remote Sensing.
- Efficient inference in fully connected CRFs with Gaussian edge potentials. Advances in Neural Information Processing Systems.
- RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- DE-Net: Deep encoding network for building extraction from high-resolution remote sensing imagery. Remote Sensing.
- Dense dilated convolutions' merging network for land cover classification. IEEE Trans. Geosci. Remote Sens.
- ERN: Edge loss reinforced semantic segmentation network for remote sensing images. Remote Sensing.
- Semantic image segmentation via deep parsing network. IEEE International Conference on Computer Vision (ICCV).