Migratable urban street scene sensing method based on vision language pre-trained model
International Journal of Applied Earth Observation and Geoinformation (IF 7.5) Pub Date: 2022-09-05, DOI: 10.1016/j.jag.2022.102989
Yan Zhang, Fan Zhang, Nengcheng Chen

We propose a geographically reproducible approach to urban scene sensing based on large-scale pre-trained models. With the rise of GeoAI research, many high-quality urban observation datasets and deep learning models have emerged. However, geospatial heterogeneity makes these resources challenging to share and migrate to new application scenarios. This paper takes vision-language and semantic pre-trained models for street view image analysis as an example. The approach bridges data-format boundaries under location coupling, yielding objective text-image descriptions of urban scenes in physical space from a human perspective, including entities, entity attributes, and the relationships between entities. In addition, we propose the SFT-BERT model to extract text feature sets for 10 urban land use categories from 8,923 scenes in Wuhan. The results show that our method outperforms seven baseline models, including computer vision models, and improves accuracy by 15% over traditional deep learning methods, demonstrating the potential of the pre-train & fine-tune paradigm for GIS spatial analysis. Our model can also be reused in other cities, and more accurate image descriptions and scene judgments can be obtained by inputting street view images from different angles. The code is shared at: github.com/yemanzhongting/CityCaption.
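The abstract describes a two-stage pipeline: a vision-language model first turns each street view image into an objective textual description, and a fine-tuned language model then classifies the caption text into land use categories. A minimal sketch of the captioning stage is given below, using the publicly available BLIP model as a stand-in for the pre-trained vision-language model; the model checkpoint, file name, and generation settings are illustrative assumptions, not the authors' exact pipeline.

```python
# Caption-generation sketch: one street view image -> one scene description.
# BLIP is an assumed stand-in; the paper's own model may differ.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def describe_street_view(image_path: str) -> str:
    """Generate a textual description (entities, attributes, relations) for one image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# caption = describe_street_view("scene_0001_heading_090.jpg")  # hypothetical file
```

For the classification stage, the abstract names SFT-BERT but does not detail its architecture, so the sketch below uses a plain BERT sequence classifier with 10 output labels (the paper's 10 urban land use categories) as a hedged approximation of the pre-train & fine-tune paradigm.

```python
# Scene-classification sketch: pooled caption text -> land use category index.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

NUM_LAND_USE_CLASSES = 10  # the 10 categories reported for the Wuhan scenes

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=NUM_LAND_USE_CLASSES
)

def classify_scene(caption_text: str) -> int:
    """Predict a land use category index from a scene's caption text."""
    inputs = tokenizer(caption_text, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))
```

Captions from multiple viewing angles of the same location can be concatenated into one text record per scene before classification, consistent with the abstract's note that multi-angle inputs yield more accurate scene judgments.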



Updated: 2022-09-05