Abstract

Autonomous object detection powered by cutting-edge artificial intelligence techniques has become an essential component for sustaining complex smart city systems. Fine-grained image classification focuses on recognizing the subcategories of objects within a given basic-level category. Because of the high similarity among images across subcategories of the same category and the high variance among images within the same subcategory, it has always been a challenging problem in computer vision. Traditional approaches usually rely on exploring only the visual information in images. Therefore, this paper proposes a novel Knowledge Graph Representation Fusion (KGRF) framework to introduce prior knowledge into the fine-grained image classification task. Specifically, a Graph Attention Network (GAT) is employed to learn the knowledge representation from a constructed knowledge graph that models the categories-subcategories and subcategories-attributes associations. By introducing the Multimodal Compact Bilinear (MCB) module, the framework can fully integrate the knowledge representation and the visual features to learn high-level image features. Extensive experiments on the Caltech-UCSD Birds-200-2011 dataset verify the superiority of our proposed framework over several existing state-of-the-art methods.

1. Introduction

In recent years, with the development of artificial intelligence and Internet of Things technology [1–5], the concept and construction of the smart city have been advancing constantly. As an important part of the smart city field [6, 7], object recognition based on computer vision has attracted much attention. Specifically, fine-grained image classification has been widely used in vehicle type recognition [8–10], goods recognition [11], content-based image retrieval [12], and other smart city applications [13–17]. In these applications, recognizing fine-grained images remains challenging, due to the high similarity among images across subcategories of the same category and the high dissimilarity among images within the same subcategory caused by different poses, behaviors, and so on, as shown in Figure 1.

Note. In this paper, we use “category” to refer to the abstract concept of object type. For example, the category of a bird refers to its family or genus, such as “Albatross,” “Turdus.” “Subcategory” refers to the concept of fine-grained object type. For example, the subcategory of a bird is the species, such as “Sooty Albatross,” “Rusty Blackbird.”

Traditional approaches to fine-grained image classification usually rely on low-level visual cues to capture features for recognition. These methods mainly involve part-based models and visual attention networks. Part-based models [18–20] first locate regions/parts of the object and capture visual features at the detected locations to learn to distinguish the nuances between different subcategories. However, these models require heavy annotations of object parts, which are more difficult to collect than image labels. Visual attention networks [21, 22] try to learn discriminative representations with attention mechanisms. However, these works only focus on capturing visual features and require a large number of labeled images.

Different from traditional approaches, human beings recognize objects in an image based not only on the visual information of the objects, but also on prior knowledge acquired from daily life experience. For example, we might know that a yellow-headed blackbird has a yellow head and chest with black around its eyes. With this knowledge, when we see an image of a “yellow headed blackbird,” we can reason out the classification correctly by combining it with the visual information. Recently, Chen et al. [23] and Xu et al. [24] have tried to incorporate prior knowledge, in the form of a knowledge graph, into fine-grained image classification. Although these methods achieve significant success, they still tend to consider prior knowledge only as the relations between subcategory labels and their attributes, or to introduce redundant text information as knowledge.

To take advantage of prior knowledge properly, this paper organizes prior knowledge about categories-subcategories hierarchical relationships and subcategories-attributes relationships into a knowledge graph, and designs a Knowledge Graph Representation Fusion (KGRF) framework that integrates the knowledge representation with visual features for fine-grained image classification. The proposed model involves two key components: (1) the Graph Attention Network (GAT) [25], which aggregates information about nodes in the graph to learn the knowledge representation; (2) the Multimodal Compact Bilinear (MCB) [26] module, which fuses the knowledge representation with the captured visual features to learn the categories-subcategories and subcategories-attributes associations. Furthermore, the proposed method is validated on the Caltech-UCSD Birds-200-2011 dataset [27] with 200 bird subcategories and 312 attributes. Compared with several baselines, the model shows superiority in the fine-grained image classification task. In summary, the main contributions of this paper include the following:

(1) We propose a novel KGRF framework for introducing a knowledge graph into fine-grained image classification.

(2) Our model incorporates the MCB module for the first time to integrate the knowledge representation with visual features for fine-grained image classification.

(3) Extensive evaluation shows that our model outperforms several strong baselines in fine-grained image classification.

This paper is organized as follows. The related works are introduced in Section 2. The proposed model is described in Section 3. The model is validated by several experiments and compared with other methods in Section 4. The conclusion is presented in Section 5.

2. Related Work

During the past several years, a number of researchers have worked on Convolutional Neural Networks (CNNs) [18, 28, 29] to capture discriminative visual features for fine-grained image classification. Compared with traditional handcrafted features [30–32], CNN-based methods show a significant improvement. Bilinear CNN [33] uses two independent CNNs to learn a high-order representation which can capture interactions between subtle visual features, but the learned bilinear feature dimension is extremely high. In order to reduce the bilinear feature dimension, Gao et al. [34] proposed a compact model for approximating the high-dimensional feature with polynomial kernels. Kong et al. [35] proposed a classifier co-decomposition to compress the Bilinear CNN model.

However, it is difficult for these approaches to capture subtle visual features. Therefore, a series of studies [18, 19, 36] attempt to learn part-based representations, which locate discriminative regions and capture the visual features there. However, these methods rely on heavy manual part and bounding box annotations, making them difficult to apply in the real world.

Instead, visual attention networks [37–41] have been proposed to automatically locate informative regions without part and bounding box annotations via the self-attention mechanism, and they show superiority in the fine-grained image classification task [21, 42–44]. Liu et al. [21] use a reinforcement learning framework to adaptively locate discriminative regions. Zheng et al. [44] propose a multi-attention CNN to localize parts and aggregate features from the located informative regions with the global image. Fu et al. [42] introduce a recurrent attention CNN that recursively locates attentional regions at multiple scales and learns region-based feature representations.

Although these methods avoid the need for numerous part and bounding box annotations, they can only capture features from regions roughly due to the lack of supervision. Therefore, some researchers try to introduce additional guidance to capture more semantic-related features to aid the fine-grained image classification task. For example, Liu et al. [45] incorporate part-level attributes to guide the location of discriminative regions. He and Peng [46] introduce detailed language descriptions to capture more discriminative parts and features. Chen et al. [23] utilize a knowledge graph to introduce subcategory-attribute relations for reasoning about discriminative features. Xu et al. [24] use a visual-semantic embedding framework to introduce text and a knowledge base to learn the relations between subcategories and images.

Different from existing approaches, our method also introduces additional guidance for fine-grained image classification, but in the form of a constructed knowledge graph involving both subcategories-attributes relationships and categories-subcategories hierarchical relationships. Several works have also introduced prior knowledge into other visual tasks. For example, Qi et al. [47] propose the 3DGNN network for semantic segmentation, and Marino et al. [48] use GSNN for multilabel image recognition.

3. Method

In this section, we first present our constructed knowledge graph, which contains the associations between subcategories and part-attributes as well as the hierarchy of categories and subcategories. Then, we describe the KGRF framework, which contains the knowledge representation module that uses GAT to model the constructed knowledge graph and the knowledge fusion module that uses the MCB module to integrate the knowledge representation into the captured visual features. An overview of the framework is shown in Figure 2. Detailed descriptions of the notations can be found in Table 1.

3.1. Knowledge Graph Construction

Essentially, a knowledge graph, which consists of nodes and edges, is a repository of semantic information about complex structures in our lives. In order to better utilize a knowledge graph in fine-grained image classification, we introduce a knowledge graph that includes categories-subcategories hierarchy relationships and subcategories-attributes relationships. The knowledge graph is constructed based on the subcategory labels, the part-attribute annotations of the images, and the existing knowledge base DBpedia [49].

Nodes. A node in the constructed knowledge graph refers to a specific category, subcategory, or part-attribute. A category is a coarse-grained type of an object in an image, such as “Albatross” or “Blackbird.” A subcategory refers to the specific type that needs to be identified in the fine-grained image classification task, such as “Sooty Albatross” or “Rusty Blackbird.” A part-attribute is a description of a part of an object, such as its color, shape, or size. Suppose that there are $N_c$ object categories, $N_s$ object subcategories, and $N_a$ part-attributes; then the knowledge graph has $N_c + N_s + N_a$ nodes.

Edges. There are two main types of edges in the constructed knowledge graph. An edge between a category node and a subcategory node indicates a hierarchical semantic relationship, such as “Sooty Albatross is a kind of Albatross.” An edge between a subcategory node and a part-attribute node indicates that the subcategory has the corresponding part-attribute, such as “Sooty Albatross has the part-attribute ‘has shape: swallow-like.’” It should be noted that there are no edges between two category nodes, two subcategory nodes, or two part-attribute nodes in the constructed knowledge graph.
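To make the graph structure concrete, the following is a minimal Python sketch of how such a two-relation graph could be assembled; the node names and the add_edge helper are illustrative assumptions, not taken from the paper.

from collections import defaultdict

# Three disjoint node sets: categories, subcategories, and part-attributes.
categories = {"Albatross", "Blackbird"}
subcategories = {"Sooty Albatross", "Rusty Blackbird"}
attributes = {"has_shape::swallow-like", "has_eye_color::black"}

# Undirected adjacency lists. Only two edge types exist:
# category <-> subcategory and subcategory <-> part-attribute.
adjacency = defaultdict(set)

def add_edge(u, v):
    adjacency[u].add(v)
    adjacency[v].add(u)

# Categories-subcategories hierarchy ("Sooty Albatross is a kind of Albatross").
add_edge("Albatross", "Sooty Albatross")
add_edge("Blackbird", "Rusty Blackbird")

# Subcategories-attributes relations (from the part-attribute annotations).
add_edge("Sooty Albatross", "has_shape::swallow-like")
add_edge("Rusty Blackbird", "has_eye_color::black")

# Total node count N_c + N_s + N_a (56 + 200 + 312 = 568 for CUB-200-2011).
num_nodes = len(categories) + len(subcategories) + len(attributes)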

3.2. Knowledge Representation

After constructing the knowledge graph, since there are two types of edges, we extract two subgraphs from it and use a GAT to learn feature vectors for the nodes of each subgraph. Then, we concatenate the features of the same node learned from the two subgraphs to obtain the final representation of each node in the constructed knowledge graph.

According to the edge types, the two extracted subgraphs are the categories-subcategories subgraph and the subcategories-attributes subgraph. The categories-subcategories subgraph contains only the category nodes, the subcategory nodes, and the edges between these two kinds of nodes. The subcategories-attributes subgraph contains only the subcategory nodes, the part-attribute nodes, and the edges between these two kinds of nodes. We input the two subgraphs into two separate GATs to learn the knowledge representation after node information propagation.
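As a hypothetical illustration, the split into the two subgraphs amounts to a simple filter over typed edges:

# Hypothetical split of the knowledge graph into two subgraphs by edge type.
def extract_subgraph(edges, keep):
    # edges: iterable of (u, v, edge_type) triples; keep: edge type to retain.
    return [(u, v) for (u, v, edge_type) in edges if edge_type == keep]

edges = [
    ("Albatross", "Sooty Albatross", "cat-subcat"),
    ("Sooty Albatross", "has_shape::swallow-like", "subcat-attr"),
]
cs_subgraph = extract_subgraph(edges, "cat-subcat")   # categories-subcategories
sa_subgraph = extract_subgraph(edges, "subcat-attr")  # subcategories-attributes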

Because the representation learning procedures for the two subgraphs are identical, we take the categories-subcategories subgraph as an example. We first initialize the nodes in the subgraph with the corresponding Word2Vec [50] features, which reflect the linguistic contexts of the concepts. Accordingly, the input to the GAT can be expressed as

$h = \{\vec{h}_1, \vec{h}_2, \ldots, \vec{h}_N\}, \quad \vec{h}_i \in \mathbb{R}^{F},$

where $N$ is the number of nodes in the categories-subcategories subgraph and $F$ is the initial feature dimension of each node. After the propagation of node information, we get a new set of node representations:

$h' = \{\vec{h}'_1, \vec{h}'_2, \ldots, \vec{h}'_N\}, \quad \vec{h}'_i \in \mathbb{R}^{F'}.$
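A sketch of this initialization, assuming pretrained Word2Vec vectors loaded with gensim (the vector file name below is a placeholder) and averaging the word vectors of multiword concept names:

import numpy as np
from gensim.models import KeyedVectors

# Load pretrained word vectors; the file name here is a placeholder.
wv = KeyedVectors.load_word2vec_format("word2vec_vectors.bin", binary=True)

def init_node_feature(name, dim=300):
    # Average the vectors of the in-vocabulary words of the concept name.
    words = [w for w in name.lower().split() if w in wv]
    return np.mean([wv[w] for w in words], axis=0) if words else np.zeros(dim)

h0 = np.stack([init_node_feature(n) for n in ("Albatross", "Sooty Albatross")])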

GAT is a convolution-style neural network that uses masked self-attention for aggregating information about neighbor nodes. In order to transform the initial input features into high-order knowledge features, it first applies a shared weight matrix $W \in \mathbb{R}^{F' \times F}$ to each node for linear transformation. Then, it utilizes the self-attention $a$ to compute attention coefficients

$e_{ij} = a(W\vec{h}_i, W\vec{h}_j),$

which represent the importance of the features of node $j$ to node $i$. In order to facilitate the comparison of coefficients between different nodes, the softmax function is used to normalize them over all neighborhood nodes of $i$:

$\alpha_{ij} = \operatorname{softmax}_j(e_{ij}) = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(e_{ik})},$

where $\mathcal{N}_i$ is the set of first-order neighbors of node $i$ in the subgraph. Since the attention mechanism in GAT is a single-layer feedforward neural network, it can be represented by a weight vector $\vec{a} \in \mathbb{R}^{2F'}$. Thus, the attention coefficients can be computed with the LeakyReLU activation function:

$\alpha_{ij} = \frac{\exp\left(\operatorname{LeakyReLU}\left(\vec{a}^{\,T}[W\vec{h}_i \,\Vert\, W\vec{h}_j]\right)\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(\operatorname{LeakyReLU}\left(\vec{a}^{\,T}[W\vec{h}_i \,\Vert\, W\vec{h}_k]\right)\right)},$

where $\cdot^{T}$ denotes the transposition operation and $\Vert$ denotes the concatenation operation. On this basis, the obtained attention coefficients are used to compute a linear combination of the corresponding features:

$\vec{h}'_i = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W \vec{h}_j\Big).$

In order to stabilize the learning process of self-attention and obtain accurate features for each node as output, we use multihead attention in GAT:

$\vec{h}'_i = \Big\Vert_{k=1}^{K} \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij}^{k} W^{k} \vec{h}_j\Big),$

where $\alpha_{ij}^{k}$ is the normalized attention coefficient computed by the $k$-th attention mechanism and $W^{k}$ is the $k$-th input linear transformation’s weight matrix.
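The following is a minimal single-head version of such a layer in PyTorch, written as a sketch under the assumption of a dense 0/1 adjacency matrix with self-loops; it is not the paper’s exact implementation. Stacking $K$ such heads and concatenating their outputs yields the multihead form above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # shared linear transform W
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention vector a

    def forward(self, h, adj):
        # h: (N, in_dim) node features; adj: (N, N) 0/1 adjacency with self-loops.
        Wh = self.W(h)                                   # (N, out_dim)
        N = Wh.size(0)
        # All pairwise concatenations [Wh_i || Wh_j]: shape (N, N, 2*out_dim).
        pairs = torch.cat([Wh.unsqueeze(1).expand(N, N, -1),
                           Wh.unsqueeze(0).expand(N, N, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1), negative_slope=0.2)  # (N, N)
        # Masked softmax: only first-order neighbors of node i contribute.
        e = e.masked_fill(adj == 0, float("-inf"))
        alpha = torch.softmax(e, dim=1)
        return F.elu(alpha @ Wh)                         # sigma = ELU, as in GAT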

Finally, we obtain node features from the two GATs. In order to combine the categories-subcategories relationships and the subcategories-attributes relationships, we concatenate the features learned for the same subcategory node in the two GATs to get the final knowledge representation of each subcategory node:

$g_i = \big[g_i^{c} \,\Vert\, g_i^{a}\big],$

where $g_i^{c}$ is the knowledge feature learned in the categories-subcategories subgraph and $g_i^{a}$ is the knowledge feature learned in the subcategories-attributes subgraph. Thus, the knowledge representation of the constructed knowledge graph can be expressed as $G = \{g_1, g_2, \ldots\}$ over all subcategory nodes.

3.3. Knowledge Fusion

After the knowledge representation is obtained, the MCB module is introduced to fuse it with the extracted visual features, so as to enhance the fine-grained image classification.

Visual Feature Extraction. Since there are several CNN models with good performance in fine-grained image classification, we directly choose CB-CNN [34] to extract the visual features $V$ of an input image $I$, i.e., $V = \operatorname{CB\text{-}CNN}(I)$.

Firstly, the input image of size $W \times H$ is processed through a convolutional network to obtain feature maps of size

$\Big(\frac{W - w + 2p}{s} + 1\Big) \times \Big(\frac{H - h + 2p}{s} + 1\Big),$

where $w$ and $h$ are the width and height of the filter in the CNN model, $p$ is the padding, and $s$ is the stride.

Then, the compact bilinear operation is performed to capture the final feature maps $V$. Since we will fuse the knowledge representation with the visual features, the final feature maps are not sum-pooled. For comparison with existing works, we adopt VGG16-Net as the CNN model.
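As a sketch, the convolutional feature maps of a VGG16 backbone can be obtained with torchvision as follows; the 448x448 input resolution is a common choice in fine-grained work and is an assumption here, not a setting reported in the paper.

import torch
import torchvision

# Convolutional layers of VGG16 only (the classifier head is dropped).
backbone = torchvision.models.vgg16(weights="IMAGENET1K_V1").features
backbone.eval()

img = torch.randn(1, 3, 448, 448)  # a dummy 448x448 RGB image
with torch.no_grad():
    fmap = backbone(img)           # (1, 512, 14, 14) after five 2x poolings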

Most models only consider the visual features in images but ignore a large amount of implicit semantic correlation information. In the fine-grained image classification task, visual features alone make it difficult to capture the subtle differences between subcategories. However, the obtained knowledge representation contains the categories-subcategories relations and the subcategories-attributes relations, which might help capture important subtle features. Thus, we fuse the knowledge representation with the visual features to learn a better unified feature representation. Since the traditional method only performs a simple concatenation of the two different features, without considering the interaction between them, we introduce the MCB module to efficiently and expressively integrate the knowledge representation and the visual features. Traditional bilinear models take the outer product of two vectors $x$ and $q$ and learn a linear model $W$ that allows all elements of the two vectors to interact with each other:

$z = W[x \otimes q],$

where $\otimes$ denotes the outer product and $[\cdot]$ denotes linearizing the matrix into a vector. The direct calculation of the outer product leads to a large amount of memory consumption and high calculation time. Thus, the MCB module uses the convolution of two count sketches to express the outer product of the vectors:

$\Psi(x \otimes q, h, s) = \Psi(x, h, s) \ast \Psi(q, h, s),$

where $\ast$ denotes the convolution operator and $\Psi$ is the Count Sketch projection function [51]. Additionally, according to the convolution theorem, convolution in the time domain is equivalent to element-wise product in the frequency domain. Therefore, the MCB module can be summarized in Figure 3 and described as Algorithm 1.

(1) input: $x_1 \in \mathbb{R}^{n_1}$, $x_2 \in \mathbb{R}^{n_2}$
(2) output: $\Phi \in \mathbb{R}^{d}$
(3) procedure MCB($x_1$, $x_2$, $n_1$, $n_2$, $d$)
(4)  for $k$ in {1, 2} do
(5)   if $h_k$, $s_k$ not initialized then
(6)    for $i$ in {1, …, $n_k$} do
(7)     sample $h_k[i]$ from {1, …, $d$}
(8)     sample $s_k[i]$ from {−1, 1}
(9)   $x'_k \leftarrow \Psi(x_k, h_k, s_k)$
(10) $\Phi \leftarrow \operatorname{FFT}^{-1}(\operatorname{FFT}(x'_1) \odot \operatorname{FFT}(x'_2))$
(11) return $\Phi$
(12) procedure $\Psi$($v$, $h$, $s$)
(13)  $y \leftarrow [0, \ldots, 0] \in \mathbb{R}^{d}$
(14)  for $i$ in {1, …, $n$} do
(15)   $y[h[i]] \leftarrow y[h[i]] + s[i] \cdot v[i]$
(16)  return $y$
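For concreteness, a minimal NumPy sketch of this algorithm follows; the projection dimension d and the fixed seed are illustrative assumptions, and the hash indices are sampled 0-based.

import numpy as np

def count_sketch(v, h, s, d):
    # Psi(v, h, s): project v into R^d using hash indices h and signs s.
    y = np.zeros(d)
    np.add.at(y, h, s * v)  # y[h[i]] += s[i] * v[i], accumulating repeated indices
    return y

def mcb(x1, x2, d=8000, seed=0):
    rng = np.random.default_rng(seed)
    sketches = []
    for x in (x1, x2):
        h = rng.integers(0, d, size=x.shape[0])        # sample h from {0, ..., d-1}
        s = rng.choice([-1.0, 1.0], size=x.shape[0])   # sample s from {-1, 1}
        sketches.append(count_sketch(x, h, s, d))
    # Convolution theorem: circular convolution via element-wise product in FFT space.
    return np.real(np.fft.ifft(np.fft.fft(sketches[0]) * np.fft.fft(sketches[1])))

phi = mcb(np.random.rand(512), np.random.rand(300))    # fused feature in R^d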

Suppose that $g$ is the knowledge representation of a subcategory node in the knowledge graph and $V$ is the visual feature of an image $I$. The purpose of our MCB module is to obtain the comprehensive feature representation $\Phi$ for image $I$: $\Phi = \operatorname{MCB}(g, V)$.

Firstly, Tensor Sketching [52] is used to compress the dimensions of $g$ and $V$, respectively:

$g' = \Psi(g, h_1, s_1), \quad V' = \Psi(V, h_2, s_2),$

where $\Psi$ is the Count Sketch projection function and $g'$ and $V'$ are the features after dimensionality reduction. Then, in the Fast Fourier Transform (FFT) space, the two features are fused by element-wise product:

$\Phi = \operatorname{FFT}^{-1}\big(\operatorname{FFT}(g') \odot \operatorname{FFT}(V')\big),$

where $\operatorname{FFT}$ denotes the Fast Fourier Transform, $\operatorname{FFT}^{-1}$ is the Inverse Fast Fourier Transform, and $\Phi$ is the high-order feature that combines the knowledge representation with the visual features. Finally, we feed the feature $\Phi$ into a fully connected layer to classify the subcategory of the image $I$.
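As a usage illustration, the fused feature can then be fed to a linear classification head as sketched below in PyTorch; the signed square-root and L2 normalization step is common practice after compact bilinear pooling but is an assumption here, not a step stated in the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

phi = torch.randn(1, 8000)                     # fused feature from the MCB module
phi = torch.sign(phi) * torch.sqrt(phi.abs())  # signed square-root (assumed)
phi = F.normalize(phi, dim=1)                  # L2 normalization (assumed)
classifier = nn.Linear(8000, 200)              # 200 bird subcategories in CUB-200-2011
logits = classifier(phi)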

4. Experiment

4.1. Experiment Settings

Datasets. We evaluate the proposed KGRF framework on the Caltech-UCSD Birds-200-2011 [27] dataset, which is a widely used benchmark in fine-grained image classification. There are 200 subcategories of birds, 5,994 training images, and 5,794 test images in the dataset. In addition to the basic subcategory annotations, each image is further labeled with 1 bounding box, 15 part key-points, and 312 part-attributes. In this work, we use accuracy to evaluate the effectiveness of models on fine-grained image classification.

Knowledge Graph Details. Since the constructed knowledge graph contains categories-subcategories hierarchy relations and subcategories-attributes relations, there are three types of nodes: category nodes, subcategory nodes, and part-attribute nodes. Specifically, a category node, obtained from DBpedia, refers to a coarse-grained type of bird species. A subcategory node represents an output class of the fine-grained image classification task, according to the image labels in the Caltech-UCSD Birds-200-2011 dataset. A part-attribute node represents a type of attribute of a particular part of a bird, according to the attribute annotations. In total, there are 56 category nodes, 200 subcategory nodes, and 312 part-attribute nodes in the knowledge graph.

4.2. Comparison with State-of-the-Art Methods

In this subsection, we compare the proposed model with several state-of-the-art methods, and the results are reported in Table 2. Some of the methods rely solely on image labels, while others use the information of bounding boxes and part key-points. Among the models that use bounding box and part annotations, the part-based model PN-CNN [28] performs well with an accuracy of 85.4, but this type of approach relies heavily on the guidance of those annotations. On the contrary, most existing methods do not rely on bounding box and part annotations. Instead, they attempt to search for distinguishing regions and capture high-level visual features of these regions for classification. For example, Bilinear CNN [33] uses two separate CNNs to locate the key parts and capture their visual features, achieving an accuracy of 84.1, but it relies on a very-high-dimensional representation of the visual features. A3M [53] utilizes an attribute-guided attention module that considers attribute information to select key features for different regions.

In addition, we compare the proposed model with methods that introduce external information. Specifically, CVL [46] introduces prior text descriptions to help locate the discriminative regions and achieves an accuracy of 85.55. HSE [57] introduces four levels of category hierarchical semantic information to improve the accuracy to 88.1. Furthermore, some methods try to introduce prior external knowledge, and these are the most closely related to our work. KERL [23] introduces a knowledge graph with a Gated Graph Neural Network (GGNN) [59], which models the correlations between categories and part-attributes. Ensemble T-CNN [24] introduces text descriptions and a knowledge base with visual-semantic embedding [60] at the same time. By contrast, our framework achieves very competitive results, especially compared with the two methods that also introduce prior knowledge.

4.3. Contribution of Knowledge Representation

Since our KGRF framework is based on CB-CNN [34] to extract visual features and fully integrates the obtained knowledge representation, we set up experiments to demonstrate the effectiveness of the knowledge representation. As shown in Table 3, CB-CNN relies solely on visual features and only achieves an accuracy of 84.6. Using this model as a baseline, our KGRF framework uses the MCB module to introduce the knowledge representation, improving the accuracy to 88.49, which is 3.89 higher than CB-CNN with visual features alone. This indicates that introducing prior knowledge into fine-grained image classification performs well. To further verify the contribution of our method of introducing prior knowledge, we use element-wise product and simple concatenation, respectively, to integrate the knowledge representation with the visual features for comparison with our framework. As shown in Table 3, the element-wise product and concatenation achieve accuracies of 85.8 and 86.5, respectively, slightly better than CB-CNN but still much worse than ours. This shows that the knowledge fusion method we adopt makes better use of the prior knowledge to promote fine-grained image classification.

5. Conclusion

In this paper, we propose a novel Knowledge Graph Representation Fusion (KGRF) framework to integrate knowledge representations and visual features for fine-grained image classification. In particular, the proposed framework includes the GAT to learn knowledge representations, and the MCB module to fuse these representations with the visual features captured by a CNN, modeling the categories-subcategories and subcategories-attributes associations. Furthermore, the proposed framework is validated on a widely used dataset, Caltech-UCSD Birds-200-2011, and the experimental results show its superiority in the fine-grained image classification task compared with several state-of-the-art methods. In the future, we will consider more reasonable and interpretable methods of introducing prior knowledge into relevant computer vision tasks.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research work was partly supported by Sichuan Science and Technology Program (2019YFG0507 and 2020YFG0328), the National Natural Science Foundation of China (NSFC) (U19A2059), and Young Scientists Fund of the National Natural Science Foundation of China (61802050).