Multi-label image recognition by using semantics consistency, object correlation, and multiple samples

https://doi.org/10.1016/j.jvcir.2021.103067

Abstract

An image can be annotated from the local perspective, based on the objects visually present, or from the global perspective, based on implicit emotions or meanings derived from it. We propose three points that have been relatively little studied before. First, semantics remain the same even if the image is manipulated by certain geometric processes. Second, object correlation is important in image labelling; we propose to use a standard recurrent neural network that takes object sequences in random order. Third, we observe that the same entity can be represented by multiple image samples, and these samples can be jointly considered to improve recognition performance. The three points are implemented in a network that jointly considers global and local information. Through comprehensive evaluation, we verify that a simple network embodying these points is effective and achieves competitive performance compared to the state of the art.

Introduction

An image normally presents several objects, scenes, or attributes, and is usually annotated with multiple labels. Recognizing multiple labels from an image facilitates deeper image understanding and image retrieval, and thus multi-label image classification/recognition has continued to receive attention in recent years [30], [26], [35].

According to the scale of labeling, multi-label image recognition can be roughly grouped into two categories: collections of local labels or global labels. In the local case, the basic approach is detecting and recognizing objects; the semantic or spatial relationships between objects can then be exploited to improve performance [30], [35]. Fig. 1(a) shows a sample image from the PASCAL VOC2007 dataset. The associated labels are chair, diningtable, and bottle, which are exactly the objects shown in the image. In the global case, the whole image must be viewed holistically to infer its global meaning, and the labels may not visually appear as objects in the image. The most typical example of this case may be movie posters [6]. Fig. 1(b) is a poster image of the movie “88 Minutes”, which shows a violent scene with a crashed car on fire and an actor holding a gun. According to [6], this image is labeled with the movie genres crime, drama, and mystery, rather than with a collection of objects.

In this paper, we address both cases in a unified framework. We propose to extract local visual information and model the relationship between local parts with a recurrent neural network (RNN). The processed local information is then integrated with the global information extracted by a convolutional neural network (CNN), and multiple labels are recognized by a dense network. This framework is fairly standard, but we propose three points that were relatively unexplored before.

  • Semantics consistency: Even if an image is manipulated by processes such as rotation, flipping, or noise addition, its semantics remain unchanged. This motivates us to enforce semantics consistency when training our framework (see the sketch after this list). A similar idea was proposed in [11], but that work focused on visual attention consistency.

  • Object correlation: Many recent works model label correlation to improve multi-label recognition results; for example, graph convolutional networks (GCNs) [5] have been proposed for this purpose. In contrast to GCNs, where a graph structure and a data matrix must be constructed and maintained, we adopt a standard long short-term memory (LSTM) network that takes a series of objects in random order to model object correlation.

  • Multiple samples: We found that the same entity is often represented by multiple images. For example, multiple posters are usually designed to promote a movie in different countries or at different places. Different elements may appear in different posters, but they all correspond to the same movie. Fig. 2 shows three poster images for the movie “Captain America: The First Avenger”. By jointly considering multiple samples that correspond to the same entity (see the sketch after this list), we may achieve better multi-label recognition results.
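As a concrete illustration of the first and third points, the following is a minimal PyTorch sketch. The transform set, the L1 penalty, and the averaging rule are our illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, images):
    # images: (batch, 3, H, W); model returns (batch, L) label confidences.
    # Horizontal flipping and additive noise are label-preserving manipulations.
    flipped = torch.flip(images, dims=[-1])
    noisy = images + 0.05 * torch.randn_like(images)
    c = model(images)
    # Semantics consistency: predictions on manipulated copies should match
    # predictions on the original image.
    return F.l1_loss(model(flipped), c) + F.l1_loss(model(noisy), c)

def fuse_entity_predictions(conf_list):
    # Multiple samples of one entity (e.g., several posters of the same movie):
    # a simple joint treatment is to average their confidence vectors.
    return torch.stack(conf_list).mean(dim=0)
```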

Section snippets

Multi-label image classification

An image usually presents rich information, such as objects, scenes, actions, and various attributes, and can be labeled or annotated from various perspectives. Originating from automatic annotation, multi-label classification has attracted much attention over the past decade. Based on a labeled dataset, a weighted nearest-neighbor model with a metric learning scheme was proposed to predict the labels of a test image [10]. With the development of deep neural networks, Gong et al. [9]

Overview

Fig. 3 shows the proposed framework. Given an image, the top branch extracts global visual information by the ResNet-101 model [12], and represents the image as a 2,048-dimensional feature vector g. This vector is then fed to a fully-connected layer with the sigmoid activation function, which outputs preliminary recognition results in the representation of an L-dimensional confidence vector c = (c1, c2, …, cL), where L is the number of labels.
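For concreteness, below is a minimal PyTorch sketch of this global branch; the backbone feature g is stubbed with a random tensor, and the label count is set to 20 (the number of PASCAL VOC classes) purely for illustration.

```python
import torch
import torch.nn as nn

class GlobalBranch(nn.Module):
    # A fully-connected layer with sigmoid turns the pooled ResNet-101
    # feature g (2048-d) into the confidence vector c = (c1, ..., cL).
    def __init__(self, num_labels=20, feat_dim=2048):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_labels)

    def forward(self, g):                 # g: (batch, 2048)
        return torch.sigmoid(self.fc(g))  # c: (batch, L)

g = torch.randn(4, 2048)  # stand-in for ResNet-101 features of 4 images
c = GlobalBranch()(g)     # preliminary label confidences, shape (4, 20)
```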

To gain information from local parts, we adopt the Faster R-CNN detector.
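The local branch and its fusion with the global feature can be sketched as follows. The feature dimensions, the fusion-by-concatenation choice, and the use of the LSTM's final hidden state are our assumptions for illustration; detector outputs are stubbed with random tensors.

```python
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    def __init__(self, obj_dim=1024, hidden=512, global_dim=2048, num_labels=20):
        super().__init__()
        self.lstm = nn.LSTM(obj_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden + global_dim, num_labels)

    def forward(self, obj_feats, g):
        # obj_feats: (batch, num_objects, obj_dim) detected-object features.
        # Shuffle the object order so the LSTM models object correlation
        # without depending on any fixed sequence.
        perm = torch.randperm(obj_feats.size(1))
        _, (h, _) = self.lstm(obj_feats[:, perm])
        fused = torch.cat([h[-1], g], dim=1)  # concatenate local and global cues
        return torch.sigmoid(self.classifier(fused))

objs = torch.randn(4, 6, 1024)  # stand-in features of 6 objects per image
g = torch.randn(4, 2048)        # stand-in global features
print(LocalGlobalFusion()(objs, g).shape)  # torch.Size([4, 20])
```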

Performance metric and datasets

We follow previous works [3], [33] and use the mean average precision (mAP) over all label categories for evaluation. We also report overall precision (OP), recall (OR), and F1-measure (OF), as well as per-class precision (CP), recall (CR), and F1-measure (CF), for further comparison [17], [36]. Note that all experiments below are based on the labels with the top-three scores, consistent with previous works.
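As a reference for these overall and per-class metrics under the top-three rule, the following NumPy sketch is our own illustration (mAP is omitted for brevity):

```python
import numpy as np

def top3_metrics(scores, gt, k=3):
    # scores: (N, L) predicted confidences; gt: (N, L) binary ground truth.
    pred = np.zeros_like(gt)
    top = np.argsort(-scores, axis=1)[:, :k]       # indices of the top-k scores
    np.put_along_axis(pred, top, 1, axis=1)        # predict exactly k labels
    tp = (pred * gt).sum(axis=0)                   # true positives per class
    prec_c = tp / np.maximum(pred.sum(axis=0), 1)  # per-class precision
    rec_c = tp / np.maximum(gt.sum(axis=0), 1)     # per-class recall
    CP, CR = prec_c.mean(), rec_c.mean()
    OP = tp.sum() / pred.sum()                     # overall precision
    OR = tp.sum() / gt.sum()                       # overall recall
    return dict(CP=CP, CR=CR, CF=2 * CP * CR / (CP + CR),
                OP=OP, OR=OR, OF=2 * OP * OR / (OP + OR))
```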

We evaluate the proposed framework on two datasets: PASCAL VOC2007 and a movie poster dataset.

Conclusion

We propose three simple yet previously unexplored points to improve the performance of multi-label image recognition: semantics consistency, object correlation, and multiple samples. By integrating these three points into a framework that takes both global and local visual information into account, we verify their effectiveness on the PASCAL VOC2007 dataset and the movie poster dataset. In contrast to the complex designs of many recent works, these ideas can be easily integrated into existing frameworks.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

This work was funded in part by Qualcomm through a Taiwan University Research Collaboration Project and in part by the Ministry of Science and Technology, Taiwan, under grants 108-2221-E-006-227-MY3, 107-2923-E-194-003-MY3, and 109-2218-E-002-015.

References (36)

  • Rajiv Ratn Shah et al.

    Leveraging multimodal information for event summarization and concept-level sentiment analysis

    Knowl.-Based Syst.

    (2016)
  • Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, VQA:...
  • Tianshui Chen et al.

    Recurrent attentional reinforcement learning for multi-label image recognition

  • Tianshui Chen et al.

    Learning semantic-specific graph representation for multi-label image recognition

  • Zhao-Min Chen et al.

    Multi-label image recognition with graph convolutional networks

  • Wei-Ta Chu et al.

    Movie genre classification based on poster images with deep neural networks

  • Xinmiao Ding, Bing Li, Weihua Xiong, Wen Guo, Weiming Hu, Bo Wang, Multi-instance multi-label learning combining...
  • Mark Everingham et al.

    The PASCAL Visual Object Classes (VOC) challenge

    Int. J. Comput. Vision

    (2010)
  • Yunchao Gong et al.

    Deep convolutional ranking for multilabel image annotation

  • Matthieu Guillaumin et al.

    TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation

  • Hao Guo et al.

    Visual attention consistency under image transforms for multi-label image classification

  • Kaiming He et al.

    Deep residual learning for image recognition

  • Shiyi He et al.

    Reinforced multi-label image classification by exploring curriculum

  • Yan Huang, Wei Wang, Liang Wang, Unconstrained Multimodal Multi-Label Learning. IEEE Trans. Multimedia 17 (11) (2015)...
  • Marina Ivasic-Kos et al.

    Movie posters classification into genres based on low-level features

  • Alex Krizhevsky et al.

    ImageNet classification with deep convolutional neural networks

  • Yuncheng Li et al.

    Improving pairwise ranking for multi-label image classification

This paper has been recommended for acceptance by Zicheng Liu.
