Abstract

Visual relationships can capture essential information in images, such as the interactions between pairs of objects. Such relationships have become a prominent component of the knowledge within sparse image data collected by multimedia sensing devices, and both latent information and potential privacy can be contained in them. However, due to the high combinatorial complexity of modeling all potential relation triplets, previous studies on visual relationship detection have used mixed visual and semantic features separately for each object, which is inadequate for sparse data in IoT systems. Therefore, this paper proposes a new deep learning model for visual relationship detection, a novel attempt at combining computational intelligence (CI) methods with IoT. The model imports a knowledge graph and adopts features of both entities and the connections among them as extra information. It maps the visual features extracted from images into the knowledge-based embedding vector space, so as to benefit from information in the background knowledge domain and alleviate the impact of data sparsity. This is the first time that visual features are projected into and combined with prior knowledge for visual relationship detection. Moreover, the complexity of the network is reduced by avoiding the learning of redundant features from images. Finally, we show the superiority of our model by evaluating it on two datasets.

1. Introduction

Visual relationship detection aims to simultaneously detect the objects in an image and classify the predicate between each pair of these objects [1]. It has been considered a bridge that semantically connects low-level visual information [2–7] with high-level semantic information [8–11]. Generally, visual relationships indicate the types of relations between objects in images and are usually represented by triplets (subject, predicate, object), where the predicate can be a verb (person, ride, bicycle), spatial (cat, on, table), a preposition (person, with, shirt), or a comparative (elephant, taller, person) [1, 12]. The detection of these interactions can uncover diverse knowledge from images and significantly benefits the functionalities of IoT systems. Moreover, potential disclosure of sensitive information [13] can also be inferred through autonomous relationship detection, which provides guidelines for secure multimedia IoT data processing [14–16].

Early studies of visual relationship detection mainly rely on pure visual features to capture the complex visual variance of images [17, 18], and thus suffer from the lack of diverse information for predicate classification. Considering the sparsity of IoT data, both the scale of the image dataset and the details within these images are constrained. Sensing devices tend to be conservative about data publication [19, 20], especially when the image data contain abundant semantic information. Meanwhile, images may be masked or obfuscated before publication due to privacy concerns [21]. Both constraints caused by the sparsity of images aggravate the difficulty of visual relationship detection, so purely visual-feature-based methods are insufficient.

Recently, additional sources of information, such as prior knowledge and semantic information, have been incorporated into visual relationship detection [1, 22–24], as extra background information can be utilized to supplement and refine the detection. Generally, two essential tasks are considered when incorporating an additional source of information. (1) How to apply the semantic associations among relationships [12, 25, 26] to refine the detection. For example, the relationship (person, ride, horse) is semantically similar to (person, ride, elephant), since horses and elephants both belong to animals, even though they look quite different in images. In this case, a visual relationship detection model should be able to infer (person, ride, elephant) based on examples of (person, ride, horse). (2) How to alleviate the huge semantic space of possible relationships. Assume there are $N$ object categories and $K$ predicates. Then, the number of possible relationships is $N^2K$, as a relationship is composed of two objects and one predicate [27]. Therefore, the size of the semantic space in relationship detection increases by orders of magnitude, while many relationships appear only rarely in images; a visual relationship detection model should nevertheless learn all relationship classes sufficiently.
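As a concrete illustration (using the VRD statistics reported in Section 4, i.e., $N = 100$ object categories and $K = 70$ predicates), treating every triplet as its own class yields

$$N^2 K = 100 \times 100 \times 70 = 700{,}000$$

possible relationship classes, whereas learning objects and predicates separately involves only $N + K = 170$ classes.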

Towards these tasks, extensive studies have been conducted, mainly considering how to incorporate additional sources of information into relationship detection. Initially, Lu et al. [1] introduced language priors from semantic word embeddings to fine-tune the likelihood of a predicted relationship. Subsequently, Zhuang et al. [28] integrated the language representations of the subjects and objects as "context" to derive better classification results for visual relationship detection. Plummer et al. [8] then used a large collection of linguistic and visual cues for relationship detection in images, which contain attribute information and spatial relations between pairs of entities connected by verbs or prepositions. Furthermore, instead of using pretrained and fixed language representations directly, Zhang et al. [29] fine-tuned the subject and object representations jointly and employed the interaction between visual branches to predict the relationship.

Although these methods achieve significant success, they still tend to focus on word-level semantics [30] as the additional source of information and fall short of adopting more sophisticated knowledge and the deeper relations among objects. As such a kind of external knowledge, the knowledge graph is a typical category of structural information providing abundant clues about relations between entities. It has recently been applied in many areas, including computer vision, and achieves dramatic improvements. Generally, a knowledge graph is a multirelational graph composed of entities (nodes) and relations (different types of edges). Each edge is a relation in the form of a triplet (head entity, relation, tail entity), indicating that two entities are connected by a specific relation. This type of additional information can provide more semantic associations between the objects and relations in an image and can be used for more rational reasoning to improve visual relationship detection. However, its application to visual relationship detection has not yet been properly considered, and neither of the above-mentioned tasks has been investigated with it.

To take advantage of this type of information, this paper designs a deep neural network for visual relationship detection by considering the knowledge graph as an additional source of information. The input of the model includes the images and an external knowledge graph, and the outputs are the relationships in images. The proposed model includes a visual module extracting the visual features of images, a knowledge module introducing the additional prior knowledge via the knowledge graph embedding [31], and a mapping module combining the visual features with prior knowledge. Finally, a new loss function based on the triplet loss [32] is designed in the mapping module to tune the projection of visual features into the knowledge space.

The proposed model uses the vector translation of the knowledge space for the first time to capture the valuable structured information between objects and relations. By this means, the structured semantic associations in a knowledge graph help improve relationship detection. The proposed model also learns the objects and predicates separately and fuses them together to predict the relationship triplets [1]. This method can alleviate the impact of the huge semantic space of possible relationships by reducing the space from $N^2K$ to $N+K$. Furthermore, the model requires fewer parameters than state-of-the-art works [31], as it does not require learning visual features of predicates. The performance of the model is validated on two relation datasets: visual relationship detection (VRD) [1] with 5,000 images and 6,672 unique relations and visual genome (VG) [12] with 99,658 images and 19,237 unique relations. According to the comparison with several baselines, our model shows its superiority in visual relationship detection. In summary, the main contributions of this paper include the following:

(1) We propose a novel framework for introducing prior knowledge into visual relationship detection.

(2) Our model incorporates the priors in knowledge graph embedding for the first time to capture the valuable structured information between objects and relations.

(3) Our model reduces the parameters needed for extracting the visual features of predicates and designs a loss for combining the visual features with the prior knowledge.

(4) Extensive evaluation shows that our model outperforms several strong baselines in visual relationship detection.

This paper is organized as follows. The related works are introduced in Section 2. The proposed model is described in Section 3. The model is validated in VRD and VG datasets and compared with other methods in Section 4. The conclusion is described in Section 5.

2. Related Works

During the past years, there have been a number of studies on visual relationship detection. Earlier works regard visual relationships as an auxiliary cue to improve the performance of other tasks, such as object detection [33, 34], image retrieval [12, 35, 36], and action recognition [37]. They focus on specific types of relationships, such as spatial relationships [2, 38], positional relationships [2, 35, 39], and actions (e.g., the interactions between objects) [40–42].

Lu et al. [1] first formalized the visual relationship as the (subject, predicate, object) triplet, defined the visual relationship detection task, and proposed a method that leverages language priors to model more general correlations between objects. Afterwards, more studies on visual relationship detection have been developed, which can be divided into two categories: joint models and separate models.

The joint model detects (subject, predicate, object) simultaneously by considering the relationship triplet as an integrated whole [17, 22, 42–44]; e.g., (person, ride, horse) and (person, ride, elephant) are treated as different classes. ViP-CNN [18] considers each visual relationship as a phrase with three components and formulates visual relationship detection as three interconnected recognition problems. Plummer et al. [8] learned a Canonical Correlation Analysis (CCA) model on top of different combinations of the subject, object, and union regions and trained a RankSVM to learn the visual relationship. However, the joint model requires extremely large amounts of training data, because all possible combinations of predicates and entities (subject, object) are treated as independent classes. As a result, such approaches usually pose the problem as a classification task over a limited set of classes.

The separate model first detects the subjects and objects and then recognizes the possible interactions among them [1, 39, 45–47]. VTransE [48] uses the object detection output of a Faster R-CNN network and extracts features from every pair of objects to learn a visual translation embedding for relationship detection. Zhang et al. [29] embed the objects and relations of relationship triplets separately into independent semantic spaces and then implicitly learn the connections between them via visual feature fusion and semantic meaning preservation in the embedding space.

The method recently proposed by Zhang et al. [29] is the most related to ours. Compared with this work, instead of word-level semantic embeddings, our work incorporates a knowledge graph and embeds it in a knowledge space as the additional source of information. Owing to the use of TransE [31] as the knowledge graph embedding, our work barely needs to model the large visual variance of relations in images.

Finally, our method adopts additional semantic information to guide visual recognition. This is consistent with the trend of using language information for visual recognition. For example, language information has also been incorporated into visual question answering [49–52], few-shot learning [53–56], and image-sentence similarity tasks [57–60].

3. Method

3.1. Overview

The goal of the proposed model is to detect visual relationships from images, which requires discriminative power over a set of relationship categories. However, since object categories are often semantically associated, it is also critical for a model to preserve semantic similarities, so as to benefit both frequent and rarely seen relationship categories.

The overview of the proposed model is shown in Figure 1. It consists of three modules, namely, the visual module, the knowledge module, and the mapping module. The visual module detects a set of objects in images and extracts the visual features of the objects. The knowledge module consists of a knowledge graph, which is embedded in a low-dimensional vector space so that it can be used as the additional source of information. The mapping module considers the image and the additional source of information comprehensively and maps the visual features to the knowledge space for relationship detection. Any valid relationship is represented by a triplet (subject, predicate, object) with low-dimensional vectors $\mathbf{s}$, $\mathbf{p}$, and $\mathbf{o}$, respectively.

Note: in this paper, we use “relation” to refer to “predicate” in previous works and “relationship” to refer to the (subject, predicate, object) triplet. The detailed descriptions of notations can be found in Table 1.
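For orientation, the following is a minimal PyTorch-style sketch of how the three modules could be composed at inference time. All names (RelationshipDetector, visual_module, and the translation-based scoring of predicates) are illustrative assumptions used to visualize the data flow, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RelationshipDetector(nn.Module):
    """Illustrative composition of the visual, knowledge, and mapping modules."""
    def __init__(self, visual_module, projection, relation_emb):
        super().__init__()
        self.visual = visual_module      # detects objects, returns d-dim pair features
        self.project = projection        # mapping module: d -> k linear projection
        self.relation_emb = relation_emb # k-dim relation vectors learned by TransE

    def forward(self, image):
        scores = []
        # Visual module: visual features (x_s, x_o) for each candidate object pair
        for x_s, x_o in self.visual(image):
            v_s, v_o = self.project(x_s), self.project(x_o)   # into knowledge space
            # Score every predicate p by how well the translation v_s + p ~ v_o holds
            dist = torch.norm(v_s + self.relation_emb.weight - v_o, dim=1)
            scores.append(-dist)          # higher score = more plausible predicate
        return scores
```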

3.2. Visual Module

The design of the visual module is based on the intuition that a relationship exists only when its objects exist, but not vice versa. Therefore, to detect visual relationships from images, the first step is to detect the objects and their corresponding visual features in images.

In the visual module, object detection is based on a Faster R-CNN [61] network with the VGG-16 [62] architecture, composed of a Region Proposal Network (RPN) and a classification layer. In the Faster R-CNN network, the convolutional layers do not change the spatial size of their input.

After that, a Feature Extraction Layer is proposed to extract $\mathbf{x}_s$ and $\mathbf{x}_o$, where $\mathbf{x}_s, \mathbf{x}_o \in \mathbb{R}^d$ are the $d$-dimensional visual features of the subject and object, respectively. The visual features $\mathbf{x}_s$ and $\mathbf{x}_o$ are obtained by concatenating the feature vector from the last convolutional feature map of the Faster R-CNN network with the bounding box parameterization in [63].
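As a minimal sketch of this feature extraction step (the normalization used for the box parameterization is an assumption made for illustration; the paper only states that it follows [63]):

```python
import torch

def box_parameterization(box, image_size):
    """Encode a bounding box (x1, y1, x2, y2) as a small scale-invariant
    vector in the spirit of [63]; normalizing by the image width/height is
    an assumption made for this illustration."""
    x1, y1, x2, y2 = box
    W, H = image_size
    return torch.tensor([x1 / W, y1 / H, (x2 - x1) / W, (y2 - y1) / H])

def object_visual_feature(roi_feature, box, image_size):
    """Build the d-dimensional feature x_s (or x_o): the ROI vector taken
    from the last convolutional feature map of the detector, concatenated
    with the bounding box parameterization."""
    return torch.cat([roi_feature.flatten(), box_parameterization(box, image_size)])
```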

3.3. Knowledge Module

A knowledge graph is represented by $\mathcal{G} = (\mathcal{E}, \mathcal{R})$, where $\mathcal{E}$ is the set of nodes, representing the entities (subjects and objects), and $\mathcal{R}$ is the set of edges, representing the connections between entities. Hence, the relations between a subject and an object can be represented by the connections between the corresponding entities in the knowledge graph, which mainly describes real-world entities and their interrelations organized in a graph. Compared with word-level external information, this type of additional information captures richer semantic associations between objects and relations and can be used for rational reasoning to improve the results of visual relationship detection.

The knowledge module introduces a knowledge graph and projects it into an embedding space, so as to activate the rich prior knowledge for tuning relationship detection. Translation embedding (TransE) [31] is a remarkable model that represents a valid relationship (subject, predicate, object) in the knowledge graph by low-dimensional vectors $\mathbf{s}$, $\mathbf{p}$, and $\mathbf{o}$, respectively. The relation is represented as a translation in the vector space:

$$\mathbf{s} + \mathbf{p} \approx \mathbf{o}$$

when the relationship triplet holds, and $\mathbf{s} + \mathbf{p}$ should be far away from $\mathbf{o}$ otherwise.

Since TransE offers a simple but effective method for representing the complex relationships in large knowledge graphs, it is adopted in the knowledge module to represent the prior knowledge in the knowledge space. To learn such embeddings for the knowledge graph, we assume a training set $S$ of triplets $(s, p, o)$ composed of two entities $s, o \in \mathcal{E}$ (the set of entities) and a relation $p \in \mathcal{R}$ (the set of relations). Since the relation is represented as a translation in the vector space, the energy of a triplet is defined by $d(\mathbf{s} + \mathbf{p}, \mathbf{o})$, which regards the squared Euclidean distance as the dissimilarity function:

$$d(\mathbf{s} + \mathbf{p}, \mathbf{o}) = \|\mathbf{s} + \mathbf{p} - \mathbf{o}\|_2^2$$

To project the knowledge graph into the knowledge space, we minimize a margin-based ranking criterion over the training set:

$$\mathcal{L}_{KG} = \sum_{(s,p,o)\in S}\ \sum_{(s',p,o')\in S'_{(s,p,o)}} \big[\gamma + d(\mathbf{s} + \mathbf{p}, \mathbf{o}) - d(\mathbf{s}' + \mathbf{p}, \mathbf{o}')\big]_+$$

where $[x]_+$ denotes the positive part of $x$, $\gamma > 0$ is a margin hyperparameter, and $S'_{(s,p,o)}$ is the set of corrupted triplets:

$$S'_{(s,p,o)} = \big\{(s', p, o) \mid s' \in \mathcal{E}\big\} \cup \big\{(s, p, o') \mid o' \in \mathcal{E}\big\}$$

In the knowledge graph embedding, the ranking loss above yields lower values of the energy for training triplets than for corrupted (wrong) triplets, so the learned embeddings are able to distinguish wrong triplets. The corrupted triplets are constructed as defined above, i.e., from training triplets with either the subject or the object replaced by a random entity (but not both at the same time).
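The following is a minimal sketch of the TransE objective described above. Uniform negative sampling and mean reduction are simplifying assumptions; the original TransE also normalizes entity embeddings during training, which is omitted here.

```python
import torch
import torch.nn as nn

class TransE(nn.Module):
    """Minimal TransE [31] sketch: a valid triplet (s, p, o) should satisfy s + p ~= o."""
    def __init__(self, num_entities, num_relations, dim, margin=1.0):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)
        self.num_entities = num_entities
        self.margin = margin

    def energy(self, subj, pred, obj):
        # Squared Euclidean distance d(s + p, o) as the dissimilarity function
        return ((self.ent(subj) + self.rel(pred) - self.ent(obj)) ** 2).sum(dim=-1)

    def loss(self, subj, pred, obj):
        # Corrupt either the subject or the object (never both) with a random entity
        corrupt_subj = torch.rand(subj.shape) < 0.5
        rand_ent = torch.randint(0, self.num_entities, subj.shape)
        subj_neg = torch.where(corrupt_subj, rand_ent, subj)
        obj_neg = torch.where(corrupt_subj, obj, rand_ent)
        # Margin-based ranking criterion: [gamma + d(positive) - d(corrupted)]_+
        return torch.relu(self.margin + self.energy(subj, pred, obj)
                          - self.energy(subj_neg, pred, obj_neg)).mean()
```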

3.4. Mapping Module

To consider the visual features of the image and the extra knowledge features comprehensively, the mapping module learns a joint visual and knowledge embedding. In the mapping module, a projection matrix $\mathbf{W}$ maps the visual feature space to the knowledge embedding space:

$$\mathbf{v}_s = \mathbf{W}\mathbf{x}_s, \qquad \mathbf{v}_o = \mathbf{W}\mathbf{x}_o$$

where $\mathbf{v}_s$ and $\mathbf{v}_o$ are the vector representations after the projection of $\mathbf{x}_s$ and $\mathbf{x}_o$, respectively. To guarantee that the corresponding entities remain close to each other after the projection, a modified triplet loss is employed, where the triplet loss [32] encourages matched entities from the two modalities to be closer than mismatched ones by a fixed margin. To this end, two sets of entity triplets are defined for each positive visual-knowledge pair $(\mathbf{v}_i, \mathbf{e}_i)$:

$$\mathcal{T}_i^V = \big\{(\mathbf{e}_i, \mathbf{v}_i, \mathbf{v}_j) \mid j \neq i\big\}, \qquad \mathcal{T}_i^K = \big\{(\mathbf{v}_i, \mathbf{e}_i, \mathbf{e}_j) \mid j \neq i\big\}$$

where $\mathbf{v}$ denotes a projected visual feature, $\mathbf{e}$ denotes a knowledge embedding, and the sets $\mathcal{T}_i^V$ and $\mathcal{T}_i^K$ correspond to triplets with negatives from the visual mapping space and the knowledge space, respectively. Omitting the superscripts for clarity, the triplet loss is the summation of two losses $L_V$ and $L_K$:

$$L_{map} = L_V + L_K, \qquad L_V = \frac{1}{N}\sum_i\sum_{j \neq i}\big[\alpha - s(\mathbf{e}_i, \mathbf{v}_i) + s(\mathbf{e}_i, \mathbf{v}_j)\big]_+, \qquad L_K = \frac{1}{N}\sum_i\sum_{j \neq i}\big[\alpha - s(\mathbf{v}_i, \mathbf{e}_i) + s(\mathbf{v}_i, \mathbf{e}_j)\big]_+$$

where $L_V$ guarantees that entities in the knowledge space are close to the corresponding entities in the visual mapping space, $L_K$ guarantees that entities in the visual mapping space are close to the corresponding entities in the knowledge space, $N$ is the number of entities, $\alpha$ is the margin between the distances of positive and negative pairs, and $s(\cdot, \cdot)$ is a similarity function, chosen as the cosine similarity:

$$s(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a}^\top \mathbf{b}}{\|\mathbf{a}\|\,\|\mathbf{b}\|}$$
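A minimal sketch of the mapping module under the formulation above; drawing the negatives from all non-matching pairs within a batch is an assumption made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MappingModule(nn.Module):
    """Projects d-dim visual features into the k-dim knowledge space and
    trains the projection with a bidirectional cosine triplet loss."""
    def __init__(self, visual_dim, knowledge_dim, margin=0.2):
        super().__init__()
        self.W = nn.Linear(visual_dim, knowledge_dim, bias=False)
        self.margin = margin

    def forward(self, x):
        return self.W(x)

    def triplet_loss(self, v, e):
        """v: projected visual features (N, k); e: the matching knowledge
        embeddings (N, k); row i of v corresponds to row i of e."""
        sim = F.cosine_similarity(v.unsqueeze(1), e.unsqueeze(0), dim=-1)  # sim[i, j] = s(v_i, e_j)
        pos = sim.diag().unsqueeze(1)                                      # s(v_i, e_i), shape (N, 1)
        mask = 1.0 - torch.eye(v.size(0), device=v.device)
        # L_K: anchor v_i, negatives e_j drawn from the knowledge space
        loss_k = (torch.relu(self.margin - pos + sim) * mask).sum()
        # L_V: anchor e_j, negatives v_i drawn from the visual mapping space
        loss_v = (torch.relu(self.margin - pos.t() + sim) * mask).sum()
        return (loss_v + loss_k) / v.size(0)
```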

4. Experiments

Datasets: the visual relationship detection (VRD) [1] dataset contains 5,000 images with 100 object categories and 70 relations. In total, VRD contains 37,993 relationship annotations with 6,672 unique relationships and 24.25 relationships per object category. We follow the same train/test split as in previous works [1], with 4,000 training images and 1,000 test images. To demonstrate that the proposed method works reasonably well on a dataset with a small relationship space, the visual relationship detection experiments are first performed on the VRD dataset.

The visual genome (VG) [12] dataset used here is the latest release (VG v1.4), which contains 108,077 images with 21 relationships per image on average. Each relationship is of the form (subject, relation, object) with annotated subject and object bounding boxes. Since the VG dataset is annotated by crowd workers, the object and relation annotations are noisy. Therefore, we clean them by removing nonalphabetic characters and stop words and use the autocorrect library to correct spelling. Finally, the data are split into 86,462 training images and 21,615 test images. The statistics of the datasets can be found in Table 2.
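A minimal sketch of this cleaning step, assuming the autocorrect package's Speller interface and NLTK's English stop-word list (the exact rules and library versions used by the authors are not stated):

```python
import re

from autocorrect import Speller       # assumed spelling-correction interface
from nltk.corpus import stopwords     # requires the NLTK "stopwords" corpus

spell = Speller(lang="en")
STOP = set(stopwords.words("english"))

def clean_label(label):
    """Keep alphabetic characters, drop stop words, and correct spelling."""
    label = re.sub(r"[^a-z ]", " ", label.lower())
    tokens = [spell(tok) for tok in label.split() if tok not in STOP]
    return " ".join(tokens)

# e.g. clean_label("a  horse1!") -> "horse"
```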

Knowledge graph: in order to take advantage of effective background knowledge, the knowledge graph for visual relationship detection is constructed from the processed image label information and the public knowledge graph WordNet [64]. To build an accurate knowledge graph, the annotation noise in the dataset should first be removed. Firstly, duplicated words are deleted, such as "apple apple" and "dog dog." Secondly, phrases with the same meaning are merged, such as "surfboard" and "surf board"; specifically, the phrase with more occurrences in the dataset is kept and replaces the other phrases with identical meaning. Then, we build the knowledge graph using the object-object relationship annotations in the dataset.

However, this kind of knowledge graph lacks some common-sense information. For instance, it can be helpful to know that a horse is a kind of animal, but if the images containing horses miss the "animal" label, the constructed knowledge graph will also lack this common sense. Thus, it is necessary to combine our constructed knowledge graph with the semantic knowledge graph WordNet. First, we collect the new nodes in WordNet that directly connect to the nodes already in the constructed knowledge graph. Then, we add these new nodes to our knowledge graph. Finally, we take all of the WordNet edges between these nodes and add them to the knowledge graph.
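A minimal sketch of the graph construction and WordNet expansion, using networkx and NLTK's WordNet interface; for brevity only hypernym ("is a") links are added here, whereas the procedure above also keeps every WordNet edge between the collected nodes.

```python
import networkx as nx
from nltk.corpus import wordnet as wn    # requires the NLTK WordNet corpus

def build_graph(relationship_annotations):
    """Build the base knowledge graph from the dataset's
    (subject, predicate, object) annotations."""
    G = nx.MultiDiGraph()
    for subj, pred, obj in relationship_annotations:
        G.add_edge(subj, obj, relation=pred)
    return G

def expand_with_wordnet(G):
    """Add WordNet neighbors of existing nodes (e.g. 'animal' as a hypernym
    of 'horse') together with the connecting edges."""
    for node in list(G.nodes):
        for synset in wn.synsets(node.replace(" ", "_")):
            for hyper in synset.hypernyms():
                name = hyper.lemma_names()[0].replace("_", " ")
                G.add_edge(node, name, relation="is_a")
    return G
```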

Detecting a visual relationship involves classifying both objects, predicting the predicate, and localizing both objects. To study the model's performance, visual relationship detection is measured on three tasks: (1) predicate detection: predict a set of possible predicates between pairs of objects, given the ground-truth object boxes and labels; (2) phrase detection: output a label (subject, predicate, object) and localize the entire relationship as one bounding box; and (3) relationship detection: output a set of (subject, predicate, object) triplets and localize both the subject and the object in the image simultaneously.

Metrics: Recall@50 (R@50) and Recall@100 (R@100) are adopted as the evaluation metrics. R@K computes the fraction of ground-truth relationships that are recovered among the top K most confident relationship predictions in an image. Note that precision and average precision (AP) are also widely used metrics, but they are not appropriate here because visual relationships are labeled incompletely, and these metrics would penalize correct detections that lack the corresponding ground-truth annotation.
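For reference, a simplified sketch of the R@K computation (the bounding-box matching used in phrase and relationship detection, e.g. an IoU threshold, is omitted here):

```python
def recall_at_k(predictions, ground_truth, k):
    """predictions: predicted (subject, predicate, object) triplets for one
    image, sorted by confidence (highest first); ground_truth: the set of
    annotated triplets. Returns the fraction of ground-truth relationships
    recovered among the top-k predictions."""
    top_k = set(predictions[:k])
    hits = sum(1 for rel in ground_truth if rel in top_k)
    return hits / max(len(ground_truth), 1)

# Dataset-level R@K averages this value over all test images.
```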

Compared methods: we compare our model with three representative visual relationship detection models: (1) Lu's-V (V-only in [1]): a two-stage separate model that first uses R-CNN [63] for object detection and then adopts a large-margin JointBox model for predicate classification; (2) Lu's-VLK (V+L+K in [1]): a two-stage separate model that combines Lu's-V with word2vec language priors [65]; (3) VTransE [48]: a fully convolutional visual architecture that draws upon the idea of knowledge embedding for predicate classification.

4.1. Comparison on VRD

The proposed model is first validated on the small VRD dataset, in comparison with similar methods using the metrics described above; the results are reported in Table 3. From the quantitative results, it can be found that the proposed model outperforms the other methods in all tasks. Specifically, our model improves performance by 4.25% in the phrase detection task and by 2.93% in the relationship detection task. These improvements validate the assumption that visual relationships can be helpful for object detection, which can be attributed to the incorporation of the knowledge graph as an additional source of information. In addition, the improvement in predicate detection shows that the incorporation of the knowledge graph can provide more meaningful information than word-level text.

4.2. Comparison on VG

The results of the proposed model on the VG dataset are presented in Table 4. Since VG is a relatively large and newer dataset, some representative models have not been validated on it. In addition, some methods have no public code, so their performance is left blank in Table 4. Even though the variety of possible relationships becomes more diverse, our proposed model still outperforms the other methods in all tasks. Specifically, our model improves performance by 1.07% in the predicate detection task. Since predicate detection isolates the factor of subject/object localization accuracy by using ground-truth subject/object boxes and labels, it focuses more on the relationship recognition ability of a model. Therefore, the improvement of our model on this task shows that the incorporation of the knowledge graph is essentially effective for visual relationship detection. Besides, the performance of our model on the phrase detection and relationship detection tasks is also improved to some extent, but the improvement is not obvious. This may be due to the noisy annotations in the large-scale VG dataset and the limited quality of the constructed knowledge graph.

4.3. Case Study

The VRD and VG datasets have densely annotated relationships for images with a wide range of types. The qualitative results in Figure 2 show that our model can detect a wide variety of visual relationship categories. Specifically, Figures 2(a)–2(c) show the same interactive relationship (person, wear, skis), and Figures 2(d)–2(f) show the same positional relationship (person, ride, skateboard). This shows that our model can detect the same relationship across different instances, even though their visual representations are quite divergent. Moreover, there are more categories of relationships, such as Figure 2(g) (wheel, on, motorcycle), Figure 2(h) (umbrella, cover, person), and Figure 2(i) (person, ride, horse). This shows that the proposed model is able to cover all kinds of relationships in the (subject, predicate, object) form, where the predicate can be a verb, spatial, or a preposition.

5. Conclusion

Visual relationship detection has been treated as a critical task for enhancing the functionalities of IoT with CI tools. Considering the sparsity of multimedia IoT data, this work investigates the improvement of visual relationship detection with a knowledge graph as additional structural semantic information. We have proposed a new model for visual relationship detection that incorporates a knowledge graph. In the proposed model, the Faster R-CNN and TransE models are used for feature learning from the image and the knowledge graph, respectively. A third module is proposed to combine the two parts at the level of low-dimensional vectors, and a corresponding loss function is designed for the whole network. We validate the effectiveness of the proposed model on two datasets, on both the classification and detection tasks, and demonstrate the superiority of our approach over other similar methods. The proposed model can be applied to both knowledge discovery and security analysis for sparse multimedia IoT data. Our future work includes combining other techniques, such as graph neural networks, for visual relationship detection, as well as privacy preservation for these multimedia IoT data.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research work was partly supported by the Sichuan Science and Technology Program (2019YFG0507 and 2020YFG0328) and the National Natural Science Foundation of China (NSFC) (U19A2059). The work was also supported in part by the Young Scientists Fund of the National Natural Science Foundation of China under Grant No. 61802050.