Migratable urban street scene sensing method based on vision language pre-trained model

https://doi.org/10.1016/j.jag.2022.102989
Open access under a Creative Commons license.

Highlights

  • High-level semantic compression of multi-view street view images.

  • Bridging the data gap between geo-tagged text and geo-tagged images.

  • Geographically replicable: easy to migrate and share across cities.

  • Pre-trained models reduce computational cost and carbon footprint.

Abstract

We propose a geographically reproducible approach to urban scene sensing based on large-scale pre-trained models. With the rise of GeoAI research, many high-quality urban observation datasets and deep learning models have emerged. However, geospatial heterogeneity makes these resources difficult to share and to migrate to new application scenarios. As an example, this paper introduces vision-language and semantic pre-trained models for street view image analysis. The approach bridges data formats coupled by location, yielding objective text-image descriptions of urban scenes in physical space from a human perspective, including entities, entity attributes, and the relationships between entities. In addition, we propose the SFT-BERT model to extract text feature sets for 10 urban land use categories from 8,923 scenes in Wuhan. The results show that our method outperforms seven baseline models, including computer vision approaches, and improves accuracy by 15% over traditional deep learning methods, demonstrating the potential of the pre-train & fine-tune paradigm for GIS spatial analysis. Our model can also be reused in other cities, and more accurate image descriptions and scene judgments can be obtained by inputting street view images from different angles. The code is shared at: github.com/yemanzhongting/CityCaption.
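The sketch below is a rough illustration of the pre-train & fine-tune pipeline described in the abstract: a generic pre-trained vision-language model turns a street view image into a text caption, and a BERT classifier is fine-tuned on such captions to predict one of 10 land use categories. The specific model checkpoints (BLIP, bert-base-uncased), file name, and label are placeholders for exposition, not the paper's SFT-BERT implementation; the authors' actual code is in the linked repository.

```python
# Minimal sketch, assuming the HuggingFace transformers and Pillow libraries.
# Checkpoints, file names, and the label below are illustrative placeholders.
import torch
from PIL import Image
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          BertTokenizerFast, BertForSequenceClassification)

# 1) Caption a street view image with a pre-trained vision-language model.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("streetview_0.jpg").convert("RGB")      # hypothetical image file
inputs = processor(images=image, return_tensors="pt")
caption_ids = captioner.generate(**inputs, max_new_tokens=30)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)

# 2) Fine-tune a BERT classifier on the generated caption text
#    for the 10 urban land use categories (one training step shown).
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
classifier = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=10)

encoded = tokenizer(caption, truncation=True, padding=True, return_tensors="pt")
labels = torch.tensor([3])                                  # hypothetical land use label
loss = classifier(**encoded, labels=labels).loss
loss.backward()                                             # gradients for one fine-tuning step
```

In practice, captions from multiple viewing angles of the same scene would be pooled into a text feature set per scene before fine-tuning, which is the scene-level setup the abstract describes.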

Keywords

GeoAI
Natural language processing
Data translation
Pretrained model
Street view
Multi-modal

Data availability

Data will be made available on request.
