UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery
ISPRS Journal of Photogrammetry and Remote Sensing (IF 12.7) · Pub Date: 2022-06-24 · DOI: 10.1016/j.isprsjprs.2022.06.008
Libo Wang, Rui Li, Ce Zhang, Shenghui Fang, Chenxi Duan, Xiaoliang Meng, Peter M. Atkinson

Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid developments in deep learning technologies, the convolutional neural network (CNN) has dominated semantic segmentation for many years. CNNs adopt hierarchical feature representations, demonstrating strong capabilities for information extraction. However, the local nature of the convolution layer prevents the network from capturing global context. Recently, as a hot topic in computer vision, the Transformer has demonstrated great potential in global information modelling, boosting many vision-related tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, the UNetFormer selects the lightweight ResNet18 as the encoder and develops an efficient global–local attention mechanism to model both global and local information in the decoder. Extensive experiments reveal that our method not only runs faster but also produces higher accuracy compared with state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieved 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while inference speed reaches up to 322.4 FPS with a 512 × 512 input on a single NVIDIA RTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves a state-of-the-art result (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/WangLibo1995/GeoSeg.
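
For readers who want a concrete picture of the architecture the abstract describes, the following is a minimal PyTorch sketch of a UNetFormer-style network: a ResNet18 encoder feeding a UNet-like top-down decoder whose blocks fuse a window-based global attention branch with a depth-wise convolutional local branch. This is an illustrative sketch, not the authors' implementation (that is available at the GitHub link above); the class names, window size, channel widths, and additive fusion rule are all assumptions made for clarity.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18


class GlobalLocalAttention(nn.Module):
    """Illustrative decoder block: a window-based multi-head self-attention
    branch (global context) fused with a depth-wise convolution branch
    (local detail). Window size and fusion rule are assumptions."""

    def __init__(self, dim, num_heads=8, window=8):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        ws = self.window                        # assumes H, W divisible by ws
        # Global branch: self-attention inside non-overlapping ws x ws windows.
        t = x.reshape(b, c, h // ws, ws, w // ws, ws)
        t = t.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, c)
        t = self.norm(t)
        g, _ = self.attn(t, t, t)
        g = g.reshape(b, h // ws, w // ws, ws, ws, c)
        g = g.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)
        # Local branch: depth-wise 3x3 convolution; additive fusion (assumption).
        return x + g + self.local(x)


class UNetFormerSketch(nn.Module):
    """ResNet18 encoder + UNet-like decoder built from the block above."""

    def __init__(self, num_classes=6, dim=64):
        super().__init__()
        r = resnet18(weights=None)              # lightweight CNN encoder
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
        chans = [64, 128, 256, 512]             # ResNet18 stage widths
        self.proj = nn.ModuleList([nn.Conv2d(c, dim, 1) for c in chans])
        self.blocks = nn.ModuleList([GlobalLocalAttention(dim) for _ in chans])
        self.head = nn.Conv2d(dim, num_classes, 1)

    def forward(self, x):
        size = x.shape[-2:]
        feats, y = [], self.stem(x)
        for stage in self.stages:
            y = stage(y)
            feats.append(y)                     # strides 4, 8, 16, 32
        # UNet-like decoding: upsample, add the skip, refine with attention.
        d = self.blocks[-1](self.proj[-1](feats[-1]))
        for i in range(len(feats) - 2, -1, -1):
            d = nn.functional.interpolate(d, scale_factor=2, mode="bilinear",
                                          align_corners=False)
            d = self.blocks[i](d + self.proj[i](feats[i]))
        out = self.head(d)                      # logits at 1/4 resolution
        return nn.functional.interpolate(out, size=size, mode="bilinear",
                                         align_corners=False)


if __name__ == "__main__":
    model = UNetFormerSketch(num_classes=8)     # UAVid has 8 classes
    logits = model(torch.randn(1, 3, 512, 512))
    print(logits.shape)                         # torch.Size([1, 8, 512, 512])
```

The windowed attention keeps the global branch cheap at every decoder scale (each 8 × 8 window attends only within itself), which is one plausible way to reconcile the paper's stated real-time speed with Transformer-style context modelling; the authors' actual global–local block may differ in its details.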




Last updated: 2022-06-26