Abstract

Although previous works have proposed sophisticated probabilistic models with a strong capability of extracting features from remote sensing data (e.g., convolutional neural networks, CNNs), approaches that exploit human semantics about the object to be recognized remain underexplored. Moreover, the limited interpretability of feature extraction is a major disadvantage of state-of-the-art CNNs. Especially for complex urban objects, which vary in geometrical shape, functional structure, environmental context, etc., the features derived from remote sensing data alone are insufficient for accurate recognition because of the heterogeneity between low-level data features and high-level semantics. In this paper, we present an ontology-based methodology framework that enables object recognition through rules extracted from high-level semantics, rather than unexplainable features extracted by a CNN. Firstly, we semantically organize the descriptions and definitions of the object as semantics (RDF-triple rules) through our developed domain ontology. Secondly, we exploit the Semantic Web Rule Language to propose an encoder model that decomposes the RDF-triple rules based on a multilayer strategy. Then, we map the low-level data features, which are defined from the optical satellite image and LiDAR height, to the decomposed parts of the RDF-triple rules. Eventually, we apply a probabilistic belief network (PBN) to probabilistically represent the relationships between low-level data features and high-level semantics, and a modified TanH function is used to optimize the recognition result. The experimental results show that, without any training process based on data samples, our proposed approach can reach accurate recognition with high-level semantics. This work is conducive to the development of complex urban object recognition toward fields including multilayer learning algorithms and knowledge graph-based relational reinforcement learning.

1. Introduction

Urban scenes characterize the representations of land cover and land use on the Earth’s surface and the interaction of natural forces and human activities. Recently, with the emergence of the Internet of Things and cyber infrastructure, urban areas are undergoing rapid development toward digital cities and even smart cities, making the urban landscape and the objects within it more complicated than ever [14]. The objects located in urban scenes, which vary in geometrical shape, functional structure, environmental context, etc., and which are named complex urban objects in this paper, may carry different social roles and urban functions. Effectively and efficiently recognizing the complex characteristics of these objects is useful for landscape planning, land management, and traffic monitoring in the era of the digital economy. A large number of scholars have focused on using remotely sensed imagery for land scene recognition [2, 59]. For example, the emergence of deep learning approaches has significantly shaped the frameworks of object recognition from remote sensing data, and a variety of CNN models have continuously broken records of state-of-the-art accuracy and precision in object recognition from remote sensing data [10].

However, these approaches rely heavily on data features and may perform poorly in urban object recognition. Rules, knowledge, and explainable features are therefore attracting researchers’ attention again [11, 12]. Recently, the pattern recognition community in remote sensing has begun to discuss whether deep learning models are overtrusted. On the one hand, researchers have noticed the expensive cost of preparing training data for deep learning-enhanced object recognition [13]. On the other hand, the number of object types that a CNN specifically designed for object recognition can recognize is limited and heavily depends on human labeling. Thus, how to extend the scope of deep learning-enhanced object recognition has become a main concern in recent works [14, 15].

To deal with the challenges mentioned above, multiple types of probabilistic relational models (PRMs), including Bayesian belief networks, Markov chain models, and Monte Carlo methods, have been successfully exploited in a wide variety of object recognition tasks, such as road extraction [16], terrestrial modeling [17], urban pattern analysis [18], and change detection [19]. PRMs have several advantages in modeling the relationships contained in data, such as addressing uncertainty, handling overfitting, and exhibiting robustness to noise. However, the features extracted by PRMs are still low-level features involving color, texture, shape, etc. These low-level features cannot accurately represent the characteristics of the objects to be recognized, since those characteristics are described by high-level human semantics involving concepts and definitions [20, 21].

The substantial difference between high-level semantics and low-level data features has become a critical challenge for accurate object recognition [8]. Thus, approaches for explicitly representing the content or the semantics of remote sensing data have attracted considerable attention. Initially, several commonly used relational models (e.g., the Entity-Relationship model and graph-based models) were applied to model the relationships between different objects in remote sensing images [22, 23]. However, relational models cannot explicitly reveal the uncertain and hidden relationships in remote sensing data. For example, a relational model is limited in representing information such as “bright color” or “complicated structure,” whose inherent meaning cannot be directly extracted from low-level features. More importantly, relational models cannot produce additional knowledge through logical reasoning.

Compared with traditional relational models, ontology has been viewed as a commonly used tool for effectively organizing semantic information and knowledge based on semantic web techniques [24, 25]. Ontology-based methodologies for remote sensing data analysis have attracted wide-ranging discussions in recent years [26–29]. Reviewing the ontology-based approaches applicable to object recognition, we classify them into three categories. The first type of approach semantically defines the characteristics and relationships of each object to interpret the result of classification or recognition [30–33]. This type of approach only applies knowledge from the ontological definition to postprocess the recognition result, rather than using the knowledge to assist the computational processing. The second type of approach focuses on predefining rules to extract low-level image features [34–39]. However, the knowledge and rules in these approaches act only as auxiliary information for object recognition and are still not incorporated into the computational process [27, 40, 41]. Recently, a third type of approach has been created to exploit the rules in the ontology to support the computational procedure of recognition, by establishing parameters that combine high-level semantics from the ontology with low-level features from the remote sensing image [42, 43].

In summary, based on ontology and semantic web techniques, existing methods are able to semantically model and formally organize the knowledge about the objects to be recognized [44]. Additionally, the low-level features of a specific object can be explicitly modeled in the ontology to support the computational processing of recognition [44, 45]. However, further investigation is still needed. First, previous ontology-based approaches do not apply probability to quantitatively model the semantics of the objects to be recognized during computation. Second, an unsupervised model built on ontology to map high-level semantics to low-level features has not yet been explored for object recognition.

Aiming to combine external knowledge with the techniques of data-driven probabilistic analysis [46, 47], this paper proposes an unsupervised approach for object recognition from remote sensing images, which integrates an ontology that formally defines the knowledge of the remote sensing image and the object with a probabilistic belief network (PBN) that probabilistically relates the features of the remote sensing image to the semantics of the objects in a more accurate way. The remainder of this paper is organized as follows. Section 2 presents the details of our proposed methodology framework. Section 3 demonstrates the experiments using different popular PRMs and our proposed approach. Section 4 discusses the contribution of our work and the prospects of related work.

2. Methodology

The proposed methodology mainly focuses on employing semantics to build an unsupervised approach for object recognition, rather than designing a complex architecture including feature extraction, feature learning, and feature-based classification. The framework of our proposed methodology consists of five sequential steps: (1) creating region proposals; (2) developing a domain ontology to model the semantics of the object to be recognized and generating knowledge (rules) from the semantics defined in the ontology; (3) creating the low-level features corresponding to these rules; (4) mapping the low-level features to the high-level semantics; and (5) recognizing the object by measuring the similarity between the features from the ontological definitions and the low-level features. Additionally, generating knowledge from the ontology is a top-down procedure, as is the procedure of creating the low-level features corresponding to that knowledge. The details of the methodology framework are elaborated in the following sections.

2.1. Region Proposal Generation

We design a strategy that contains two different procedures for generating region proposals. Firstly, we segment the whole optical satellite image through an object-based segmentation method named simple linear iterative clustering (SLIC) [48]. Then, based on the LiDAR height raster generated from the LiDAR point cloud data, we segment the whole height raster through aspect difference. The following expression shows the approach for computing the aspect difference based on eight-connectivity:

$$\left| A(p) - A(p_d) \right| > T, \quad d = 1, 2, \ldots, 8,$$

where $d$ denotes the index of direction ranging from 1 to 8, which, respectively, represents the directions starting clockwise at the north, $p$ denotes a pixel of the local window, $p_d$ denotes its neighboring pixel over direction $d$, $A(\cdot)$ denotes the aspect value, and $T$ denotes the threshold. The value of this threshold depends on the shape of the object to be recognized. Generally, the applicable thresholds for man-made objects and natural objects are 90° and 45° [49, 50], respectively.
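To make the aspect-difference segmentation concrete, the following sketch (Python, assuming NumPy) flags boundary pixels whose aspect differs from any of the eight neighbors by more than the threshold; the aspect convention, the wrap-around treatment of edges, and the parameter names are simplifying assumptions rather than the exact implementation used in this paper.

```python
import numpy as np

def aspect_difference_boundaries(height, threshold_deg=90.0):
    """Flag pixels whose aspect differs from any of the 8 neighbors by
    more than the threshold (hypothetical helper, not the paper's code)."""
    # Aspect (orientation of the steepest slope) from the LiDAR height raster.
    dy, dx = np.gradient(height.astype(float))
    aspect = np.degrees(np.arctan2(dy, -dx)) % 360.0

    boundaries = np.zeros_like(aspect, dtype=bool)
    # Eight neighbor offsets, clockwise from north; edges wrap around here
    # for brevity, whereas a real implementation would handle them explicitly.
    offsets = [(-1, 0), (-1, 1), (0, 1), (1, 1),
               (1, 0), (1, -1), (0, -1), (-1, -1)]
    for di, dj in offsets:
        neighbor = np.roll(np.roll(aspect, di, axis=0), dj, axis=1)
        diff = np.abs(aspect - neighbor)
        diff = np.minimum(diff, 360.0 - diff)      # circular angle difference
        boundaries |= diff > threshold_deg
    return boundaries
```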

If the object to be recognized has distinct altitude variation, its segmented result by aspect difference serves as the region proposal. Otherwise, the parameters proposed by Uijlings et al. [51] are used to generate region proposals based on the result of the object-based segmentation. Specifically, a larger scale is applied during the object-based segmentation to reduce the number of segmented fragments.

2.2. Ontology-Based Knowledge Generation

A domain ontology can transform the descriptions of the to-be-recognized object into semantics and facilitate logical reasoning based on those semantics [4, 9]. Using the Protégé software [52], we developed the domain ontologies with semantic web standards, including the Extensible Markup Language (XML), the Resource Description Framework Schema (RDFS), and the Web Ontology Language (OWL). In this paper, we develop two domain ontologies: one for transforming the descriptions of the to-be-recognized object into semantics (GeoOnto) and another for transferring the low-level features into semantics (ImgOnto). Specifically, ImgOnto is developed based on image regions, rather than image pixels. The conceptual model of these two domain ontologies is shown in Figure 1, which includes four components: Classes, Instances, Properties, and Functions. Classes refer to sets or types of individuals. Instances refer to individuals. Properties refer to the relationships that define the connections between classes and individuals. Functions refer to the attributes of relationships.

The semantics in GeoOnto and ImgOnto are encoded as RDF triples, the structure of which is shown as follows:

$$\langle s, \; p, \; o \rangle,$$

where $s$, $p$, and $o$ denote subject, predicate, and object, respectively. Specifically, Protégé software and some literature define the predicate as a property or relation. Subject and object belong to classes or instances, and predicate refers to relationships and their corresponding restrictions. For example, the description “road is one type of impervious surface” is written as the RDF triple road-typeOf-impervious surface, where “road,” “typeOf,” and “impervious surface” are subject, predicate, and object, respectively.

Then, based on the RDF triples, the knowledge is represented by rules, which are generated through logical reasoning with the Semantic Web Rule Language (SWRL); the structure of a rule is shown as follows:

$$T_1 \wedge T_2 \wedge \cdots \wedge T_n \rightarrow T_{n+1},$$

where $T_1$ refers to the RDF triple $\langle s_1, p_1, o_1 \rangle$, the same applies to $T_2$, …, and $T_n$ as well, and $n$ is the total number of RDF triples that lead to the consequent $T_{n+1}$.
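As a minimal illustration of how such rules can be handled programmatically, the sketch below represents RDF triples as Python tuples and a SWRL-style rule as a conjunction of antecedent triples implying a consequent; the triple contents (covers, hasShape, etc.) are hypothetical examples, not the rules actually defined in GeoOnto or ImgOnto.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]          # (subject, predicate, object)

@dataclass
class Rule:
    """A SWRL-style rule: the conjunction of antecedent triples implies
    the consequent triple."""
    antecedents: List[Triple]
    consequent: Triple

# Hypothetical rule: an image object covered by asphalt and elongated in
# shape is recognized as a road.
road_rule = Rule(
    antecedents=[("image_object", "covers", "asphalt"),
                 ("image_object", "hasShape", "elongated")],
    consequent=("image_object", "typeOf", "road"),
)

def rule_fires(rule: Rule, facts: Set[Triple]) -> bool:
    """Naive forward-chaining check: the rule fires only when every
    antecedent triple is present in the fact base."""
    return all(t in facts for t in rule.antecedents)
```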

2.3. Low-Level Features from Optical Satellite Image and LiDAR Height

Each waveband of the optical satellite image and the LiDAR height raster serve as the sources for generating low-level features. The low-level features are divided into three types: color features (height features), textural features (height textural features), and shape features (height shape features). The methods for computing these low-level features from the different sources are shown in Table 1. The parameters (rule) refer to the low-level features defined from the descriptions modeled by the ontology, and the parameters (test) refer to the low-level features extracted from each waveband of the optical satellite image and the LiDAR height raster.

2.3.1. Shape Features

The boundary of the region proposal obtained by aspect difference from LiDAR height is used as the shape feature for objects that have distinct altitude variation (e.g., buildings and trees). After acquiring the boundaries by aspect difference from LiDAR height, we further compute the statistical and texture features within the region proposal.

Otherwise, for objects without obvious altitude changes, the created region proposal refers to the minimum bounding rectangle (MBR) of the object, rather than its true boundaries. After acquiring the boundaries of such an object from the optical satellite image, we further compute the color and texture features within the region proposal.

2.3.2. Image Color and Height Statistics

For each region proposal, we compute its color features based on each waveband of the optical satellite image and its height features based on the LiDAR height raster. For a region proposal generated from the optical satellite image, we compute an independent color feature for each waveband of the optical satellite image. Otherwise, for a region proposal generated from the LiDAR height raster, we compute the height feature based on LiDAR height alone. Both the color features and the height features include commonly used statistical parameters, namely max, min, mean, standard deviation, skewness, and kurtosis; skewness and kurtosis are computed as follows:

$$S = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \mu}{\sigma}\right)^{3}, \qquad K = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \mu}{\sigma}\right)^{4},$$

where $S$ denotes skewness, $K$ denotes kurtosis, $N$ denotes the total number of pixels in the orthogonal retrieval window, $x_i$ denotes the value of each pixel, and $\mu$ and $\sigma$ denote the mean and standard deviation, respectively.
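A compact way to compute these statistics for one region is sketched below (assuming NumPy and SciPy); the function name and the dictionary layout are illustrative, but the skewness and kurtosis values match the standardized-moment definitions given above.

```python
import numpy as np
from scipy import stats

def region_statistics(pixels):
    """Statistical (color/height) features for one region proposal.
    `pixels` is a 1-D array of values from a single waveband or the
    LiDAR height raster inside the region."""
    pixels = np.asarray(pixels, dtype=float)
    mu, sigma = pixels.mean(), pixels.std()
    return {
        "max": pixels.max(),
        "min": pixels.min(),
        "mean": mu,
        "std": sigma,
        # Equivalent to (1/N) * sum(((x - mu) / sigma) ** k) for k = 3, 4.
        "skewness": stats.skew(pixels),
        "kurtosis": stats.kurtosis(pixels, fisher=False),
    }
```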

2.3.3. Texture Features

Textural features and height textural features for the region proposal are computed, respectively, from the fused waveband of the optical satellite image and from the LiDAR height raster. The fused waveband of the optical satellite image is obtained by combining the three visible wavebands into a gray image $I_{\text{gray}}$, where $R$, $G$, and $B$, respectively, denote the red, green, and blue wavebands of the optical satellite image.

We exploit a gray-level co-occurrence matrix (GLCM) to extract textural features and height textural features, respectively, from the fused waveband and from LiDAR height [53]. The details of computing the GLCM can be found at http://www.fp.ucalgary.ca/mhallbey/tutorial.htm. In this paper, the texture features need to account for the influence of rotation, which the traditional GLCM does not. Thus, we compute two sets of the normalized symmetrical matrix: one set includes four matrices over the cardinal directions (north, west, south, and east), and another set contains four matrices over the diagonal directions (northeast, northwest, southeast, and southwest). The GLCM features are computed from the elements of the normalized symmetrical matrix, where $P_{i,j}$ refers to the element at location $(i, j)$ and $N_g$ refers to the total number of gray levels specified for the image region included in the boundaries. Based on the GLCM mean $\mu = \sum_{i=1}^{N_g}\sum_{j=1}^{N_g} i \, P_{i,j}$, another two GLCM features, cluster shade and cluster prominence, are computed by the following expressions:

$$\text{shade} = \sum_{i=1}^{N_g}\sum_{j=1}^{N_g}\left(i + j - 2\mu\right)^{3} P_{i,j}, \qquad \text{prominence} = \sum_{i=1}^{N_g}\sum_{j=1}^{N_g}\left(i + j - 2\mu\right)^{4} P_{i,j}.$$

Moreover, to offset a limitation of the GLCM, namely that its textural result is sensitive to rotation, we compute the GLCM features for the original image region as well as for the region rotated by 90°, 180°, and 270°.
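The sketch below illustrates the GLCM-based texture features using scikit-image; the quantization to 32 gray levels, the unit pixel offset, and the averaging over directions (instead of computing the rotated copies explicitly) are simplifying assumptions, and cluster shade and prominence are computed directly from the normalized matrix because scikit-image does not provide them.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_region, levels=32):
    """GLCM texture features for one region (a sketch; the input is the
    already-fused gray band, and the quantization level is an assumption)."""
    # Quantize to a small number of gray levels before building the GLCM.
    q = np.digitize(gray_region, np.linspace(gray_region.min(),
                                             gray_region.max(), levels)) - 1
    angles = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]   # E, NE, N, NW offsets
    glcm = graycomatrix(q.astype(np.uint8), distances=[1], angles=angles,
                        levels=levels, symmetric=True, normed=True)
    feats = {}
    for prop in ("contrast", "homogeneity", "energy", "correlation"):
        feats[prop] = graycoprops(glcm, prop).mean()    # average over angles
    # Cluster shade and prominence from the direction-averaged matrix.
    P = glcm.mean(axis=(2, 3))
    i, j = np.meshgrid(np.arange(levels), np.arange(levels), indexing="ij")
    mu = (i * P).sum()
    feats["shade"] = (((i + j - 2 * mu) ** 3) * P).sum()
    feats["prominence"] = (((i + j - 2 * mu) ** 4) * P).sum()
    return feats
```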

2.3.4. Data Structure of Low-Level Features

In summary, the low-level feature set ($F$), which includes color (statistical), texture, and shape features, is a three-dimensional array with the following data structure:

$$F = \left\{ F_{img}, F_{height} \right\}, \qquad F_{k} = \left\{ f_{c}, f_{t}, f_{s} \right\}, \; k \in \left\{ img, height \right\}.$$

In equation (8), the first dimension includes $F_{img}$ and $F_{height}$, which, respectively, refer to the low-level features extracted from the optical satellite image and from LiDAR height. The second dimension is $F_{k}$, which includes $f_{c}$, $f_{t}$, and $f_{s}$. The structure inside $f_{c}$, $f_{t}$, and $f_{s}$ is the third dimension. $f_{c}$ is a list that records the color (statistical) features of each waveband of the optical satellite image (red, green, and blue) and the height features. $f_{t}$ is a list of GLCM features, indexed separately for the GLCM features generated from the fused waveband of the optical satellite image and for the height GLCM features derived from LiDAR height. $f_{s}$ contains two matrices that, respectively, refer to the shape features from the optical satellite image and the height shape features created by aspect difference from LiDAR height.
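A nested dictionary equivalent to this three-dimensional structure might look like the sketch below; the key names are illustrative assumptions rather than the paper's notation.

```python
# A minimal sketch of the three-dimensional feature structure described above.
low_level_features = {
    "image": {                    # features from the optical satellite image
        "color":   {"red": {}, "green": {}, "blue": {}},   # per-band statistics
        "texture": {"glcm": {}},                            # fused-band GLCM features
        "shape":   {"boundary_matrix": None},               # MBR / boundary matrix
    },
    "height": {                   # features from the LiDAR height raster
        "color":   {"height_stats": {}},                    # height statistics
        "texture": {"glcm": {}},                            # height GLCM features
        "shape":   {"boundary_matrix": None},               # aspect-difference boundary
    },
}
```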

2.4. Ontological Rules for Object Recognition
2.4.1. Mapping between High-Level Semantics and Low-Level Features

In the ontology, high-level semantics are explicitly represented by RDF-triple rules. To map the high-level semantics to low-level features, as shown in Table 1, we decompose these RDF-triple rules into indecomposable terms (subject, predicate, and object) and create low-level features corresponding to each term. Since the semantics of a to-be-recognized object generally consist of multiple RDF triples (e.g., the example in Table 1), the decomposition is completed gradually using a layer-by-layer procedure through the proposed SWRL-based fuzzy encoder shown in Figure 2.

In this encoder model, the encoding outcome of the previous layer acts as the input of the next layer; in this way, the layers are connected as a multilayer processing architecture. The initial input of the encoder model is the rule set, which includes a set of rules that share the structure shown in equation (2). In the 1st layer, the rule set is decomposed into multiple rules: R1, …, Rn. Each of these decomposed rules serves as one node in the 1st layer. In the 2nd layer, each rule is further decomposed into multiple independent RDF triples that carry the structure shown in equation (3). The next layer further decomposes each independent RDF triple into three terms: subject, property, and object.
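The layer-by-layer decomposition can be sketched as follows in Python; the layer indices follow the description above, and the rule contents are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
Rule = List[Triple]             # a rule is a conjunction of triples

def decompose(rule_set: List[Rule]):
    """Layer-by-layer decomposition used by the SWRL-based encoder (a sketch).
    Layer 1: individual rules.  Layer 2: individual RDF triples.
    Layer 3: indecomposable terms (subject, predicate, object)."""
    layer1 = list(rule_set)                              # rules
    layer2 = [t for rule in layer1 for t in rule]        # triples
    layer3 = [term for t in layer2 for term in t]        # terms
    return layer1, layer2, layer3

# Illustrative rule set (contents are hypothetical):
rules = [
    [("image_object", "covers", "asphalt"),
     ("image_object", "hasShape", "elongated")],          # Rule A
    [("image_object", "covers", "asphalt"),
     ("image_object", "adjacentTo", "building")],         # Rule B
]
layers = decompose(rules)
```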

2.4.2. Probabilistic Prediction of the Similarity between High-Level Semantics and Low-Level Features

A PBN is a probabilistic graphical model in which the belief is the probability of observing a state and the network is a directed acyclic graph for probabilistic inference [54]. In object recognition, the belief refers to the probability of recognizing (believing) an image object as the target object, and the network refers to the model of recognition. Moreover, the network is composed of nodes and links, which refer to features and relationships, respectively. The PBN has two advantages based on this network architecture. First, one node can be connected with other nodes, which makes multivariable-based probabilistic inference possible. Second, the graph of the PBN and the conceptual graph of the ontology are easy to integrate, which enables the use of ontological rules in the recognition procedure. This integration is shown in the procedure between the final layer and the low-level features in Figure 2.

The conceptual model of our proposed approach based on the PBN for recognition is completed by equation (9), where $W$ denotes the weighting matrix that controls the impact of each feature of $F$ in equation (8). The approach for computing $W$ is elaborated in Section 2.4.3. $F_{m}$ refers to the low-level features matched between the low-level features defined from the ontology and the ones extracted from the image regions in the orthogonal retrieval window. Thus, $P(O \mid F_{m})$ denotes the probability of recognizing (believing) an object as the target object, given the assembly of multiple matched features ($F_{m}$).

Moreover, based on equation (8), we have $F = \{F_{img}, F_{height}\}$, and equation (9) is extended accordingly into equation (10) by splitting the matched features into those from the optical satellite image and those from LiDAR height.

Then, based on the second dimension of equation (8), equation (10) is further extended into equation (11), where the matched low-level features for color (statistics), texture, and shape, respectively, correspond to the features $f_{c}$, $f_{t}$, and $f_{s}$ in equation (8).

We apply two separate approaches to measure the similarity of these matched low-level features ($F_{m}$). The first approach is used for the statistical (color) features and GLCM features in equation (11). Their similarity is measured by the following expression:

$$S\left(f_{test}, f_{rule}\right) = \begin{cases} 1, & f_{test} \in f_{rule} \\ 0, & \text{otherwise,} \end{cases}$$

where $f_{rule}$ and $f_{test}$, respectively, denote one statistical (color) or GLCM feature derived from the ontology and from the image region in the orthogonal retrieval window. Additionally, $f_{test}$ is a single value, whereas $f_{rule}$ is a range of values, and $f_{test} \in f_{rule}$ denotes that $f_{test}$ is inside the value range of $f_{rule}$.
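A direct reading of this matching criterion is sketched below; the function name and the example value range for asphalt are hypothetical.

```python
def statistical_similarity(test_value: float, rule_range: tuple) -> float:
    """Similarity for statistical (color) and GLCM features: 1.0 when the
    measured value falls inside the ontology-defined value range,
    0.0 otherwise."""
    low, high = rule_range
    return 1.0 if low <= test_value <= high else 0.0

# e.g., the ontology defines the mean gray value of asphalt in [60, 110]
# (a hypothetical range used only for illustration).
match = statistical_similarity(test_value=87.5, rule_range=(60, 110))
```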

The second approach is designed for measuring the shape features in equation (11). Since shape features are represented as a matrix rather than a single value, the similarity of shape features is measured by a modified mean absolute error (MAE). In a two-dimensional space, the traditional approach calculates the MAE along the horizontal and vertical dimensions, respectively. In our modified MAE, we divide the horizontal dimension into eastern and western directions and the vertical dimension into northern and southern directions. The similarity of the shape features is then measured based on four MAEs oriented to the north, south, east, and west.

Assume that the boundaries defined from the ontology and derived from the image regions in the orthogonal retrieval window are $B_{rule}$ and $B_{test}$, respectively. The modified MAE between $B_{rule}$ and $B_{test}$ includes $\mathrm{MAE}_{N}$, $\mathrm{MAE}_{S}$, $\mathrm{MAE}_{E}$, and $\mathrm{MAE}_{W}$, where the subscripts $N$, $S$, $E$, and $W$, respectively, refer to north, south, east, and west. The similarity is then measured based on the following idea: if $B_{rule}$ and $B_{test}$ are more similar at multiple scales, they should be closer to parallel to each other. Being parallel is represented by the modified MAE computed separately for each direction $d$, where $d$ ranges over north, south, west, and east.
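The sketch below computes a directional MAE between two boundary matrices; splitting the matrix into northern/southern and western/eastern halves is a simplifying assumption about how the four directions are formed, not the paper's exact formulation.

```python
import numpy as np

def directional_mae(boundary_rule, boundary_test):
    """Modified MAE for shape similarity (a sketch): the mean absolute
    error between two boundary matrices, computed separately toward the
    north, south, east, and west halves of the matrix."""
    r = np.asarray(boundary_rule, dtype=float)
    t = np.asarray(boundary_test, dtype=float)
    err = np.abs(r - t)
    h, w = err.shape
    return {
        "north": err[: h // 2, :].mean(),
        "south": err[h // 2 :, :].mean(),
        "west":  err[:, : w // 2].mean(),
        "east":  err[:, w // 2 :].mean(),
    }
```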

2.4.3. Weighted Probability Optimization

In previous probabilistic relational models without the support of semantics, the weighting vector $W$ in equation (9) is acquired from data rather than from semantics. As discussed previously, however, the low-level features derived from an image differ from the semantics. Thus, in this paper, $W$ is given based on high-level semantics. In equation (11), the weight of each image-level feature for the result of $P(O \mid F_{m})$ is defined semantically; previous probabilistic models fail to give adequate attention to a critical issue: the impact of each low-level feature on the to-be-recognized object varies. Generally speaking, the impact of each low-level feature cannot be represented by a linear function. This critical issue drives us to optimize the probabilistic result from the PBN by computing each weight through two sequential steps: qualitative computing (semantics) and quantitative computing (nonlinear regression). Qualitative computing is conducted through the following strategy, which involves three conditions:
(i) Condition 1: if a low-level feature is not semantically defined in the ontology, then the weight of this feature equals 0.
(ii) Condition 2: if the importance of the low-level features is defined equally in the ontology, then all features receive an equal weight.
(iii) Condition 3: if the importance of a low-level feature is semantically defined in a biased way, then its weight is derived from the restrictions on the property of the corresponding triples.

Condition 3 is the most commonly observed of these three conditions. According to research on selective attention [55, 56], due to the restriction of conscious experience, humans can only recognize one thing at a time. This means that the features or objects described earlier are more important than the ones described later. Thus, in Condition 3, the weight can be determined not only by the degree defined in the ontology but also by the order in which the features are described. For example, in the case of the definition in Table 2, the three definitions are ranked based on two indexes:
(i) Degree: generally > a majority of > restriction “in a city or town”
(ii) The sequence in which these definitions appear
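A possible implementation of the qualitative ranking is sketched below; the degree vocabulary, the scoring values, and the example features for “street” are illustrative assumptions drawn loosely from Table 2.

```python
def qualitative_weights(feature_definitions):
    """Qualitative computing of feature priority (a sketch).
    `feature_definitions` maps each low-level feature to the ontology
    information used for ranking: a degree term and the order in which
    the feature appears in the description."""
    degree_score = {"generally": 3, "a majority of": 2, "restriction": 1}
    ranked = sorted(
        feature_definitions.items(),
        key=lambda kv: (-degree_score.get(kv[1]["degree"], 0),
                        kv[1]["appearance_order"]),
    )
    # Rank index 1 = highest priority; undefined features receive weight 0 later.
    return {name: rank + 1 for rank, (name, _) in enumerate(ranked)}

# e.g., for the definition of "street" (values hypothetical):
ranks = qualitative_weights({
    "color_asphalt": {"degree": "generally",     "appearance_order": 1},
    "shape_elong":   {"degree": "a majority of", "appearance_order": 2},
    "context_city":  {"degree": "restriction",   "appearance_order": 3},
})
```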

The weights ranked by qualitative computing are then processed quantitatively by implementing a modified activation function. Many multilayer processing models have successfully applied a variety of activation functions, such as logistic, Gaussian, TanH, softsign [57], and ReLU [58, 59], to probabilistically predict uncertain conditions. In this paper, TanH is exploited for two reasons. First, previous literature has claimed that TanH is an appropriate activation function for a nonlinear classifier [60]. Second, TanH closely follows the nonlinear variation in modeling the probability of recognition based on multiple features.

To make the TanH function fit the proposed algorithm better, we make some modifications. The weights of the matched features ranked by qualitative computing are denoted as $w_1$, $w_2$, …, $w_N$, where $N$ is the total number of matched low-level features, obtained from the numbers of matched features in equation (13). Equation (14) is proposed to quantitatively measure the impact of each weight, where $r$ refers to the ranking index of each feature given by qualitative computing. The outline of equation (14) is shown in Figure 3. In equation (14), $a$ and $b$ are two constants, which are used to create a nonlinear TanH function that accurately fits the priority of each low-level feature.
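Since the exact form of equation (14) is not reproduced here, the sketch below shows one plausible modified-TanH weighting in the same spirit: weights decrease nonlinearly with the qualitative rank, shaped by the two constants a and b, and are normalized to sum to one.

```python
import numpy as np

def tanh_weights(rank_indices, a=1.0, b=0.5):
    """Illustrative modified-TanH weighting (an assumption, not the paper's
    exact equation (14)): higher-priority features (smaller rank index)
    receive larger weights, with the nonlinearity shaped by a and b."""
    r = np.asarray(rank_indices, dtype=float)
    raw = np.tanh(a / (b * r))          # decreasing, nonlinear in the rank
    return raw / raw.sum()              # normalize so the weights sum to 1

# Ranks 1..4 from qualitative computing (hypothetical):
weights = tanh_weights([1, 2, 3, 4])
```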

Eventually, by substituting each weight obtained from equation (14) into equation (13), we obtain $P(O \mid F_{m})$ between an image object and the object to be recognized. According to $P(O \mid F_{m})$, we determine whether this image region in the orthogonal retrieval window belongs to the object to be recognized.
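Putting the pieces together, a recognition decision for one region proposal can be sketched as below; the decision threshold of 0.5 and the example numbers are assumptions for illustration only.

```python
def recognition_belief(similarities, weights, threshold=0.5):
    """End-to-end belief for one region proposal (a sketch): matched-feature
    similarities are combined with the semantically derived weights, and the
    region is accepted when the belief exceeds a threshold."""
    belief = sum(w * s for w, s in zip(weights, similarities))
    return belief, belief >= threshold

# similarities: e.g. [1.0, 1.0, 0.0, 1.0] from the range / MAE matching above
# weights:      the normalized modified-TanH weights computed above
belief, is_target = recognition_belief([1.0, 1.0, 0.0, 1.0],
                                        [0.4, 0.3, 0.2, 0.1])
```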


3. Experimental Analysis

The first experimental dataset is from the CAP LTER project [61], which has been managed by the Julie Ann Wrigley Global Institute of Sustainability, Arizona State University. Figure 4(a) shows two types of remote sensing data: a multispectral Quickbird image and LiDAR height, which cover a partial area of Tempe, Arizona. The spatial resolution and image size of the multispectral Quickbird image are 0.63 meters and 1295 × 1843 pixels, and this image was produced on March 30th, 2008. The LiDAR height has a spatial resolution of 1 meter and an image size of 1295 × 1843 pixels, and was produced on May 5th, 2008. Specifically, there is a missing area in both the optical satellite image and the LiDAR height data. Moreover, there is a resolution difference between the optical satellite image and the LiDAR height. Thus, we conducted a registration between the LiDAR height and the optical satellite image using ERDAS software.

The second experimental dataset is shown in Figure 4(b), which consists of an optical satellite image obtained from Google Earth and LiDAR height accessed from a cyber infrastructure named OpenTopography [62]. This experimental dataset covers the main campus of Embry-Riddle Aeronautical University, Prescott, Arizona. The spatial resolution and image size of the optical satellite image are 1 meter and 1100 × 1100 pixels, and this image was produced in September 2009. The LiDAR height has a spatial resolution of 1 meter and an image size of 1100 × 1100 pixels, and was produced in January 2014. In previous work, we tested land cover classification based on LiDAR data alone [49]. Here, we recognize objects based on the semantics converted from the optical satellite image and LiDAR height.

Based on the classification system and the remote sensing data involving the optical satellite image and LiDAR height, the objects to be recognized in the first experimental dataset consist of residential buildings, commercial buildings, trees with a height greater than 10 meters, trees with a height of 5 to 10 meters, trees with a height of 1.5 to 5 meters, circular roads, and swimming pools. Figure 5 illustrates samples of these objects to be recognized.

We followed the workflow described in Section 2 to create low-level features from the semantics of each object category. Table 2 shows an example of transforming the descriptions of the object “street” into semantics and rules in the domain ontology.

Then, we use Rules 2 and 5 in Table 2 to illustrate how the high-level semantics and low-level features are mapped through the proposed multilayer architecture. The mapping procedure is shown in Figure 4. The initial input is the rule set involving Rule 2 and Rule 5. In the 1st layer, this rule set is decomposed into two independent rules: Rule 2 and Rule 5. In the 2nd layer, these two rules are decomposed into three RDF triples. Notably, Rule 2 and Rule 5 share the same triple: covers (?image_object, ?asphalt). Then, these three RDF triples are decomposed into multiple subjects, predicates, and objects in the next layer. In the 4th layer, the rules involving low-level features are created for each of these terms. Specifically, we use the node named asphalt as the illustration. In the 4th layer, the low-level features for asphalt are organized as a rule that contains four RDF triples. In the 5th layer, this rule is decomposed into four rules. Next, each of these rules is decomposed into subjects, predicates, and objects. The subjects, predicates, and objects in the 6th layer are the ultimate results of the proposed multilayer decomposition. The next step is to create the corresponding low-level features for each subject, predicate, and object in the 6th layer. Unlike the low-level features extracted from remote sensing data, the parameters of the low-level features here are given according to the semantic information obtained through fuzzy logic reasoning from the descriptions of these terms, the results of visual interpretation, and experts’ knowledge.

Uijlings et al. [51] proposed a strategy for discovering representative parameters for appropriate segmentation. In SLIC segmentation, the scale of segmentation studied in [51] refers to the number of segments and the compactness. Based on the complexity of the image content, we employed the strategy defined by Uijlings et al. [51] to determine the parameters regarding the number of segments and compactness. Table 3 lists the priority of each low-level feature, which was semantically defined by transforming the descriptions of the objects to be recognized into semantics in the ontology. The descriptions of each object category used for this transformation were mainly collected from visual interpretation of the optical satellite image. We also selected some text information from external resources, including Wikipedia, the Merriam-Webster dictionary, and land classification systems from the National Oceanic and Atmospheric Administration (NOAA) and the National Land Cover Database (NLCD).

Based on the information selected from multiple sources, we assigned the priority of the low-level features for each category. The color of the roof and the height are the key indicators for distinguishing buildings from other categories. Additionally, the differences in geometrical shape and size observed from the optical satellite image between residential and commercial buildings are used to distinguish these two types of buildings. In the LiDAR height, the roof structures of many residential buildings are distinctly different from those of commercial buildings. For circular roads, the color and unique shape observed from the optical satellite image are mainly used. In addition to the optical satellite image, text information about road materials with a specific color and unique geometric shape, such as asphalt, bitumen, and circle, is given in some external resources. For swimming pools, water is mentioned in the external sources at high frequency. In the optical satellite image, the color of water is much more significant than the geometric shape, since swimming pools carry a variety of shapes. The color of trees is distinct compared with other categories in the optical satellite image. Moreover, the area covered by the boundaries derived from LiDAR height is another significant indicator for distinguishing trees from other categories that have obvious altitude variation (e.g., buildings). Because the classification system classifies trees over different heights, the height derived from LiDAR height is also used for classifying these different types of trees.

Since the proposed approach is an unsupervised model without the support of data training, we compare the recognition results with an unsupervised model available in some commercial software (e.g., ENVI and ERDAS), k-nearest neighbors (KNN), as well as with a commonly used supervised model in remote sensing data classification, SVM [63]. To obtain the recognition results of KNN, we used ERDAS software to apply KNN for classifying five classes based on the optical satellite image. Then, we extracted the classified image regions that are similar to trees and further classified these regions into three classes based on LiDAR height. For SVM, we applied a popular open library named LIBSVM to create a binary classifier that classifies an image object as either a target object or a nontarget object. Based on this binary classification strategy, the optical satellite image was used as the data source for recognizing swimming pools and circular roads, and was combined with LiDAR height for classifying the other five categories. Based on the segmentation results generated from the optical satellite image and LiDAR height, we randomly selected image objects from the optical satellite image as training samples. The number of image objects used for training is less than 15% of the total number of objects. Table 4 shows the quantitative comparison of recognition accuracy.

In the second experimental dataset, based on our previous work on LiDAR data-based classification [49] and the content of the optical satellite image and LiDAR height, the objects to be recognized consist of buildings, parking structures, parking lots, shrubs, grass, bare soil, and roads. Figure 6 illustrates samples of these objects to be recognized.

Similarly, we followed the workflow described in Section 2 to create low-level features from the semantics of each object category. Table 5 lists the priority of each low-level feature, which was semantically defined by transforming the descriptions of the objects to be recognized into semantics in the ontology. The descriptions of each object category used for this transformation were mainly collected from visual interpretation of the optical satellite image and from our previous experiments on LiDAR data-based land cover classification.

Based on the information selected from multiple sources, we assigned the priority of the low-level features for each category. Height is the key indicator for distinguishing buildings from other categories. Parking lots and roads carry similar color and LiDAR height; however, vehicles and the texture features of roads play the key role in distinguishing these two types of objects. Moreover, LiDAR height is the key parameter for distinguishing parking structures from parking lots and roads. Similar to the first experimental dataset, the shape of a road is significantly different from those of parking structures and parking lots. Although bare soil and shrubs have similar land cover in the optical satellite image and LiDAR intensity, shrubs carry a specific pattern that can be represented by texture features. For grass, its color is distinct compared with other categories in the optical satellite image.

According to Tables 4 and 6, the recognition results of our proposed approach outperform both the unsupervised KNN approach and the supervised SVM approach. Without data training, KNN is not able to effectively support object recognition based on the optical satellite image and LiDAR height. First, without manual setting, the distance metric defined by the KNN classifier is difficult to self-adjust for a specific application. Second, KNN, as an unsupervised lazy classifier, struggles to effectively handle complicated variables. The performance of KNN in recognition shows that features extracted from data alone, without human interpretation (e.g., a distance metric and distances assigned with directions), cannot characterize the object from the optical satellite image and LiDAR height. Although SVM performs well in applying the low-level features to distinguish a variety of objects through the training process, it still struggles to achieve accurate recognition based on data features alone. Our proposed approach first defines the representations of the to-be-recognized objects based on high-level semantics. Then, it creates low-level features corresponding to the semantics, which facilitates discovering the low-level features that better fit the representations of the to-be-recognized objects. Even without a training process based on data samples, our proposed approach reaches recognition results with better accuracy.

Based on the results obtained from the above two experiments, several conclusions can be drawn from the comparison of these three methods. First, the existing techniques that focus on data features alone cannot support efficient object recognition without data training and knowledge assistance. Additionally, even though low-level features can be extracted through the training process in commonly used models and algorithms, the gap between low-level features and high-level semantics can still significantly impact the recognition accuracy. Last but not least, high-level semantics has proved to be an essential factor for improving the performance of machine learning models in object recognition. Thus, investigations on high-level semantics have great potential for facilitating object recognition from remote sensing data.

Meanwhile, the main disadvantage of the proposed method is also observed from the experiments. As mentioned in some papers [27, 44], semantics-driven object recognition might not have good transferability and is limited to the predefined object types. Moreover, more in-depth investigations are still needed on the mapping between semantics and data features.

4. Conclusion

This paper presents an ontology-embedded probabilistic relational model to support object recognition based on geographical knowledge derived from ontological rules. The first step semantically organizes the descriptions of the object to be recognized as rules in the ontology. Then, an SWRL-based encoder is used to decompose the rules into the three elements of a triple (subject, property, and object) and to map these elements to low-level features created from an optical satellite image and LiDAR height. Next, a PBN is developed to represent the low-level features corresponding to the rules in the ontology. Eventually, a modified TanH function is created to compute the feature-based probability distribution and to determine the recognized object based on the graph of the PBN.

A majority of sophisticated probabilistic models for object recognition mainly focus on extracting features from remote sensing data, rather than exploiting rich human semantics. The heterogeneity between low-level data features and high-level human semantics prevents these sophisticated algorithms from producing accurate object recognition results. Although some previous research has attempted to utilize knowledge to support remote sensing data analysis, few studies have discussed extracting low-level features that are applicable to the semantics of the object to be recognized. Moreover, integrating knowledge to establish an unsupervised approach for object recognition without data training is still an underexplored research field. In the future, several extended endeavors are worth studying. We hope our work can further help the development of object recognition as the field continues to move toward multilayer learning algorithms, knowledge graph-based relational reinforcement learning, etc.

Data Availability

The multispectral remote sensing image and LiDAR height data used to support the first experimental findings of this study have been deposited in the Quickbird and the CAP LTER project [61], respectively. The high-resolution remote sensing image and LiDAR height data used to support the second experimental findings of this study have been obtained from Google Earth and LiDAR height accessed from a cyber infrastructure named OpenTopography (https://opentopography.org/), respectively. Zhao, Q., Myint, S. W., Wentz, E. A., and Fan, C. [61]. Rooftop surface temperature analysis in an urban residential environment. Remote Sensing, 7(9), 12135–12159.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by “the Fundamental Research Funds for the Central Universities (Grant no. 2020QN28).”