Robust deep alignment network with remote sensing knowledge graph for zero-shot and generalized zero-shot remote sensing image scene classification
Introduction
Benefiting from rapid advances in aerospace, sensor and communication technologies, human beings have entered an era of remote sensing (RS) big data (Chi et al., 2016, Li et al., 2021a, Lobry et al., 2020). Accurate automatic classification of these massive RS images is a basic but important task for mining the value of RS big data (Cheng and Han, 2016, Gu et al., 2019, Li et al., 2020, Marcos et al., 2018). As the spatial resolution of RS imagery improves, pixel-level and object-level classification methods show great limitations (Blaschke, 2010, Li et al., 2016, Cheng et al., 2017). As a consequence, more attention has been given to scene-level RS image classification due to its stable classification performance and its wide applications in natural disaster monitoring (Cheng et al., 2013), multimodal data fusion (Gerke et al., 2014), functional zone classification (Zhang et al., 2018), object detection (Tao et al., 2019a, Tao et al., 2019b), and image retrieval (Demir and Bruzzone, 2016, Li et al., 2018).
To date, deep learning (LeCun et al., 2015) has greatly improved RS image scene classification (Li et al., 2021c, Zhang et al., 2016). However, current deep learning models perform well only when each scene category has sufficient samples. In the era of RS big data, the number of RS scene categories is growing explosively, and it is unrealistic to collect sufficient RS image samples and label them for all categories at once. Hence, identifying RS image scenes that never appear in the training stage has important practical value (Li et al., 2017a). Inspired by humans’ inference ability, embedding prior knowledge into the learning process is an ideal way to address this issue (Li et al., 2021b).
In the literature, the development of zero-shot learning (ZSL) (Larochelle et al., 2008, Palatucci et al., 2009, Ji et al., 2020) in recent years has provided promising solutions for recognizing samples from unseen categories. By leveraging prior knowledge of both seen and unseen categories as auxiliary information, ZSL learns from samples of seen categories to identify samples from unseen categories. Generally, the semantic information of seen and unseen classes reflects human common sense: it is universal and available in both the training and testing stages, whereas image samples of unseen classes do not exist in the training stage. Hence, how semantics are expressed is key to the performance of ZSL. For example, one can recognize a zebra image from images of tigers, pandas and horses, combined with semantic information such as tiger stripes, panda colors and horse shapes. This intuition also shows the indispensable role of semantic information in the ZSL task. As an extension of ZSL, generalized zero-shot learning (GZSL) learns from samples of seen categories to recognize both seen and unseen samples in the testing stage, which is a more challenging but practical task. In the field of computer vision, many ZSL and GZSL methods have been proposed. In contrast, ZSL and GZSL are rarely discussed in the field of RS (Sumbul et al., 2017). Compared with the computer vision field, the following characteristics of the RS field limit the development of ZSL and GZSL. On the one hand, the names of RS scene categories are often domain specific. If the semantic representations of RS scene categories are generated by directly applying a general natural language processing model (e.g., Word2Vec) to the names of RS scene categories, they cannot reflect the intrinsic semantic information of RS categories.
On the other hand, RS image scenes, which present large intraclass differences and large interclass similarities, generally have more complex appearances than natural images in the computer vision field. Consequently, ZSL and GZSL methods that achieve excellent results in computer vision cannot be directly extended to the RS domain. Overall, zero-shot and generalized zero-shot RS image scene classification deserves much more exploration.
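The ZSL inference idea described above (recognizing an unseen class via its semantic description) can be sketched as nearest-neighbor search in a semantic space. The projection matrix, feature dimensions and class names below are all hypothetical; in a real system the projection would be learned from seen-category data:

```python
import numpy as np

def zsl_classify(visual_feat, class_semantics, W):
    """Project a visual feature into the semantic space and return the
    unseen class whose semantic vector is most similar (cosine)."""
    projected = W @ visual_feat                      # visual -> semantic space
    projected = projected / np.linalg.norm(projected)
    names, sims = [], []
    for name, sem in class_semantics.items():
        sem = sem / np.linalg.norm(sem)
        names.append(name)
        sims.append(float(projected @ sem))
    return names[int(np.argmax(sims))]

# Toy example: 4-D visual features, 3-D semantic vectors for two unseen classes.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))                      # stand-in for a learned projection
unseen = {"airport": rng.standard_normal(3), "beach": rng.standard_normal(3)}
x = rng.standard_normal(4)                           # visual feature of a test scene
print(zsl_classify(x, unseen, W))
```

Because unseen classes never contribute training images, the only bridge between them and the visual domain is the semantic vector, which is why its quality matters so much.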
With the aforementioned considerations, this paper focuses on zero-shot and generalized zero-shot RS image scene classification. The quality of the semantic representation of categories plays an important role in ZSL and GZSL (Li et al., 2017a, Li et al., 2017b, Li et al., 2017c). To generate high-quality semantic representations of RS scene categories, this paper constructs a new remote sensing knowledge graph (RSKG) based on domain prior knowledge from human experts, where RSKG fully considers the rich connections between RS scene elements. To the best of our knowledge, this paper is the first to propose calculating the Semantic Representations of RS scene categories by representation learning of RSKG (SR-RSKG). Based on SR-RSKG, this paper proposes a new deep alignment network (DAN) with a series of well-designed constraints, which robustly matches visual features and semantic representations in the latent space, to address zero-shot and generalized zero-shot RS image scene classification. Experimental results on an integrated RS image scene dataset show that the proposed SR-RSKG is superior to traditional knowledge types (e.g., Word2Vec (Mikolov et al., 2013), BERT (Devlin et al., 2018), and manually annotated attribute vectors). In addition, the proposed DAN outperforms state-of-the-art methods under both the ZSL and GZSL settings. The major contributions of this paper are summarized as follows.
- 1)
To the best of our knowledge, this paper, for the first time, proposes to generate the semantic representations of RS scene categories by representation learning of RSKG. Extensive experiments verify its superiority compared with traditional prior knowledge types. The constructed RSKG will be made publicly available along with this paper.
- 2)
By pursuing the stable cross-modal alignment of the same category and scattered distribution of different categories, this paper proposes a novel DAN to robustly match visual features and semantic features in the latent space. Extensive experiments show that the proposed DAN outperforms the existing methods under both the ZSL and GZSL settings.
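Representation learning over a knowledge graph such as RSKG is commonly done with translational embedding models; the reference list includes TransE ("Translating embeddings for modeling multi-relational data"), whose margin-based update can be sketched as follows. The entities, relation and hyperparameters are purely illustrative, not the paper's actual RSKG configuration:

```python
import numpy as np

def transe_step(ent, rel, triple, neg_tail, lr=0.01, margin=1.0):
    """One margin-based SGD step of TransE: push h + r close to t
    and away from a corrupted tail t'."""
    h, r, t = triple
    d_pos = ent[h] + rel[r] - ent[t]
    d_neg = ent[h] + rel[r] - ent[neg_tail]
    loss = margin + np.linalg.norm(d_pos) - np.linalg.norm(d_neg)
    if loss > 0:
        g_pos = d_pos / (np.linalg.norm(d_pos) + 1e-9)
        g_neg = d_neg / (np.linalg.norm(d_neg) + 1e-9)
        ent[h] -= lr * (g_pos - g_neg)
        rel[r] -= lr * (g_pos - g_neg)
        ent[t] += lr * g_pos
        ent[neg_tail] -= lr * g_neg
    return max(loss, 0.0)

rng = np.random.default_rng(1)
ent = {e: rng.standard_normal(8) for e in ["airport", "runway", "beach"]}
rel = {"contains": rng.standard_normal(8)}
# Train on the triple (airport, contains, runway), corrupting the tail with "beach".
for _ in range(200):
    transe_step(ent, rel, ("airport", "contains", "runway"), "beach")
```

After training, the learned entity vectors for scene categories serve as their semantic representations, so categories connected by many shared relations in the graph end up close in the embedding space.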
The remainder of this paper is organized as follows. Section 2 discusses the related works. Section 3 introduces the construction process of RSKG and depicts representation learning of RSKG. Section 4 introduces the DAN model in detail. Section 5 summarizes the experimental results. Finally, the conclusion is detailed in Section 6.
Section snippets
Related work
In this section, we briefly review the most relevant works in the literature that include semantic representations of RS scene categories and zero-shot RS image scene classification.
Representation learning of remote sensing knowledge graph
In this section, we first introduce the construction process of RSKG and then discuss representation learning of RSKG.
Robust deep alignment network for zero-shot and generalized zero-shot remote sensing image scene classification
Section 4.1 introduces the definition of ZSL and GZSL. In Section 4.2, we clarify the robust deep alignment network for zero-shot and generalized zero-shot RS image scene classification. In addition, we introduce the process of classifying RS image scenes from unseen categories.
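The full network lies beyond this snippet, but the alignment objective stated in the introduction (stable cross-modal alignment within a category, scattered distribution across categories) can be sketched as a margin-based pairwise loss. All names and the margin value are hypothetical:

```python
import numpy as np

def alignment_loss(visual, semantic, labels, margin=1.0):
    """Pull visual/semantic embeddings of the same class together;
    push cross-class pairs at least `margin` apart."""
    loss, n = 0.0, len(labels)
    for i in range(n):
        for j in range(n):
            d = np.linalg.norm(visual[i] - semantic[j])
            if labels[i] == labels[j]:
                loss += d ** 2                      # same class: align
            else:
                loss += max(0.0, margin - d) ** 2   # different class: scatter
    return loss / (n * n)

# Perfectly aligned, well-separated latent embeddings give zero loss.
v = np.array([[0.0, 0.0], [3.0, 0.0]])
s = np.array([[0.0, 0.0], [3.0, 0.0]])
print(alignment_loss(v, s, ["airport", "beach"]))   # → 0.0
```

Minimizing such a loss jointly over visual and semantic encoders drives both modalities into a shared latent space where unseen-category semantics can act as class prototypes.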
Experimental analysis and discussion
In this section, we design extensive experiments to evaluate our proposed approach. In Section 5.1, we introduce the experimental settings. Then, we analyze the sensitivity of critical parameters in our proposed approach in Section 5.2. Finally, we compare our method with the state-of-the-art methods in Section 5.3.
Conclusion
Driven by the increasing practical demands of ZSL and GZSL in the RS field, this paper mainly focuses on zero-shot and generalized zero-shot RS image scene classification. Considering that natural language processing models based on generalized corpora have poor performance in describing RS-oriented scene categories appropriately, this paper, for the first time, proposes to generate semantic representations of RS scene categories through representation learning of RSKG and applies them to
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB0505003; the National Natural Science Foundation of China under Grant 41971284; the State Key Program of the National Natural Science Foundation of China under Grants 42030102 and 92038301; the Foundation for Innovative Research Groups of the Natural Science Foundation of Hubei Province under Grant 2020CFA003; the China Postdoctoral Science Foundation under Grants 2016M590716 and
References (67)
- Object based image analysis for remote sensing. ISPRS Journal of Photogrammetry and Remote Sensing (2010)
- A survey on object detection in optical remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing (2016)
- YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence (2013)
- Accurate cloud detection in high-resolution remote sensing imagery by weakly supervised deep learning. Remote Sensing of Environment (2020)
- Learning deep semantic segmentation network under multiple weakly-supervised constraints for cross-domain remote sensing image semantic segmentation. ISPRS Journal of Photogrammetry and Remote Sensing (2021)
- Land cover mapping at very high resolution with rotation equivariant CNNs: Towards small yet accurate models. ISPRS Journal of Photogrammetry and Remote Sensing (2018)
- Spatial information inference net: Road extraction using road-specific contextual information. ISPRS Journal of Photogrammetry and Remote Sensing (2019)
- Linking OpenStreetMap with knowledge graphs—link discovery for schema-agnostic volunteered geographic information. Future Generation Computer Systems (2021)
- Integrating bottom-up classification and top-down feedback for improving urban land-cover and functional-zone mapping. Remote Sensing of Environment (2018)
- PatternNet: A benchmark dataset for performance evaluation of remote sensing image retrieval. ISPRS Journal of Photogrammetry and Remote Sensing (2018)
- DBpedia: A nucleus for a web of open data. The Semantic Web
- Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics
- Freebase: A collaboratively created graph database for structuring human knowledge
- Translating embeddings for modeling multi-relational data. Proceedings of Neural Information Processing Systems
- A simple framework for contrastive learning of visual representations. International Conference on Machine Learning, PMLR
- Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA. International Journal of Remote Sensing
- Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE
- Big data for remote sensing: Challenges and opportunities. Proceedings of the IEEE
- A conceptual framework for modelling spatial relations. Information Technology and Control
- Hashing-based scalable remote sensing image search and retrieval in large archives. IEEE Transactions on Geoscience and Remote Sensing
- Convolutional 2D knowledge graph embeddings. Proceedings of the AAAI Conference on Artificial Intelligence
- Creativity inspired zero-shot learning
- Introducing Wikidata to the linked data web
- A survey on deep learning-driven remote sensing image scene understanding: Scene classification, scene retrieval and scene-guided object detection. Applied Sciences
- Deep multimodal representation learning: A survey. IEEE Access
- Deep residual learning for image recognition
- Momentum contrast for unsupervised visual representation learning
- Relation network for multilabel aerial image classification. IEEE Transactions on Geoscience and Remote Sensing
- Deep ranking for image zero-shot multi-label classification. IEEE Transactions on Image Processing
- Semantic autoencoder for zero-shot learning