Discriminative semantic region selection for fine-grained recognition

https://doi.org/10.1016/j.jvcir.2021.103084

Highlights

  • We combine DCNNs with image regions for semantic representations, which bridges the semantic gap.

  • Region representations are combined using weights determined by semantic distinctiveness and spatial-semantic correlations.

  • The proposed method can be combined with various pre-learned models to improve recognition accuracy.

Abstract

The performance of fine-grained recognition has been greatly improved by the rapid development of deep convolutional neural networks (DCNNs). However, DCNN-based methods often treat all image regions equally and rely solely on visual information for classification. To address these problems, we propose a novel discriminative semantic region selection method for fine-grained recognition (DSRS). We first select a number of image regions and then use pre-trained DCNN models to predict their semantic correlations with the corresponding classes. Each region is described by both a visual and a semantic representation, and the two are linearly combined into a joint representation. The combination parameters are determined by considering both semantic distinctiveness and spatial-semantic correlations. The joint representations are then used for classifier training. A test image is classified by extracting its visual and semantic representations, encoding them into the joint representation, and applying the trained classifier. Experiments on several publicly available datasets demonstrate the proposed method's superiority.

Introduction

Accurately classifying fine-grained images is a challenging task that has been studied for many years. It aims to automatically predict the classes of visually similar objects, such as different bird species or car models. The problem is more difficult than general image classification because images from different fine-grained classes are visually very similar. One key problem in fine-grained classification is therefore how to obtain discriminative image representations, and researchers have proposed many learning-based methods using labeled images to this end.

Well-designed features [18], [22], [26], [27], [35], [37], [40] have demonstrated their effectiveness for fine-grained recognition, especially for image classes with distinctive characteristics that can be well separated. Both global [27] and local [18], [22], [26], [35], [37], [40] feature based methods have been used. Global feature based methods often have limited discriminative power because spatial context information is missing; they can only handle a small number of classes with large variations. Local feature based methods go one step further by exploiting the distinctive and invariant properties of image regions. Although they improve over global features, local features are hard to design and cannot fully cope with image variations.

With the fast development of deep convolutional neural networks [1], [9], [12], [15], [19], [29], [30], visual representations can be automatically learned from large quantities of labeled images in an end-to-end way. This strategy greatly improves over well-designed features, as it learns the intrinsic relationships within images automatically. However, it also requires large amounts of labeled images for supervision, and obtaining enough labeled images is time-consuming and labor-intensive. Transferring information from other sources [42] becomes necessary when only a few labeled images are available, and transferring useful information rather than noisy correlations is vital for efficient classification. Researchers have proposed many efficient models to solve this problem. Although great progress has been made, these methods still suffer from one problem: different parts of an image are often treated equally, which mixes the influence of discriminative parts with that of the noisy background. This can degrade classification performance, as backgrounds are often irrelevant. To alleviate this problem, detection techniques [43] are used to locate the objects to be classified. Detection models are often trained with bounding box annotations; this strategy helps the model concentrate on the objects to be classified, but bounding boxes are harder to annotate than image labels. Researchers have also combined local feature based methods with deep convolutional neural networks [5], [23], which greatly improves performance. However, each image region is still treated individually and independently, and only visual information is considered. Visual similarity cannot always ensure semantic consistency, especially when only a few labeled images are available.

To bridge this gap, semantic representations are used as an alternative way to obtain representations consistent with human perception [33], [38], [42]. Semantic based methods can be roughly divided into two schemes: the first mines the intrinsic correlations of labeled images [38], while the second leverages information from other datasets or the Internet [33], [42]. Semantic representations can be generated automatically from labeled images using pre-learned models, or labeled manually by humans. However, since different datasets are collected by different people for different applications, their underlying distributions are inconsistent, and noisy information may be introduced when supervision is transferred. Besides, many semantic based methods still treat different image parts equally, without considering their distinctiveness.

To solve the problems mentioned above, in this paper we propose a novel discriminative semantic region selection method for fine-grained recognition (DSRS). Instead of treating each image region equally, we select visually distinctive regions by making use of pre-learned DCNN models. This is achieved by first using the pre-trained DCNN models to predict each image region's semantic similarity to the corresponding classes. We then calculate the semantic representations of the selected regions by predicting their class distributions. To exploit both visual and semantic information, we encode the two kinds of region representations and linearly combine them, with combination parameters determined by both semantic distinctiveness and spatial-semantic correlations. A test image can then be represented and classified accordingly. Image classification experiments on several publicly available datasets are conducted to evaluate the proposed method, and comparisons with baseline methods demonstrate its effectiveness. Fig. 1 shows the flowchart of the proposed discriminative semantic region selection method for fine-grained recognition.
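To make the pipeline concrete, below is a minimal sketch of how region-level visual and semantic representations could be extracted and linearly combined. The paper provides no code, so everything here is an illustrative assumption: a torchvision ResNet-50 stands in for the pre-learned DCNN, softmax class posteriors serve as the semantic representation, penultimate-layer features serve as the visual one, and an entropy-based weight acts as a crude stand-in for the learned combination parameters (it captures only the distinctiveness cue, not the spatial-semantic correlations).

```python
# Illustrative sketch only; not the authors' released code.
import torch
import torch.nn.functional as F
from torchvision import models

cnn = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
cnn.eval()
# Visual branch: everything up to (but excluding) the final classifier.
backbone = torch.nn.Sequential(*list(cnn.children())[:-1])

@torch.no_grad()
def region_representations(regions, top_k=8):
    """regions: (N, 3, 224, 224) crops from one image.
    Returns a joint descriptor built from the top_k most
    semantically distinctive regions."""
    feats = backbone(regions).flatten(1)        # visual: (N, 2048)
    probs = F.softmax(cnn(regions), dim=1)      # semantic: (N, 1000)
    # Distinctiveness proxy: low-entropy (peaky) class posteriors.
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(1)
    keep = entropy.argsort()[:top_k]            # keep the most confident regions
    alpha = torch.sigmoid(-entropy[keep]).unsqueeze(1)  # larger for peakier posteriors
    visual = F.normalize(feats[keep], dim=1)
    semantic = F.normalize(probs[keep], dim=1)
    # Linear combination of the two cues, then average-pool over regions.
    joint = torch.cat([alpha * visual, (1 - alpha) * semantic], dim=1)
    return joint.mean(0)                        # (2048 + 1000,) image descriptor

image_regions = torch.rand(32, 3, 224, 224)     # placeholder crops
descriptor = region_representations(image_regions)
print(descriptor.shape)                         # torch.Size([3048])
```

A real implementation would crop the regions selected in the first stage and would determine the combination parameters jointly from semantic distinctiveness and spatial-semantic correlations, rather than deriving them from entropy alone.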

The proposed method has two advantages over other fine-grained classification methods. First, it has good generalization ability, as it can make use of various efficient pre-learned DCNN architectures to semantically represent image regions; in this way, classification accuracy can be improved consistently. Second, instead of using visual information alone, we jointly consider the semantic distinctiveness and the spatial-semantic correlations of image regions, which yields more discriminative representations for classification.

Section snippets

Related work

Fine-grained recognition has been widely explored in the last few decades, and many methods have been proposed. Both global feature and local feature based methods [18], [22], [26], [27], [35], [37], [40] have been used. Global features were often less effective than local features. Local features were often used in a bag-of-words manner: each local feature is first quantized, and the distribution of quantized features is then used for classifier training and image class prediction. Although effective, …

Fine-grained recognition by discriminative semantic region selection

In this section, we give the details of the proposed discriminative semantic region selection method for fine-grained image classification. DSRS first selects a number of image regions with large class responses; these regions have a high probability of containing object parts. We extract the visual representations of these regions using a pre-learned DCNN model. We then calculate the semantic representations of these regions and encode both visual and semantic information for the region representations.
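The selection step described above keeps regions at local maxima of class response maps. The following is a hedged sketch of one standard way to find such peaks using a max-pooling trick; it is not necessarily the authors' exact procedure, and `response_map` would in practice come from the class responses of a pre-learned DCNN rather than the random placeholder used here.

```python
# Illustrative peak selection on a class response map; assumptions noted above.
import torch
import torch.nn.functional as F

def select_peak_regions(response_map, num_regions=6, window=3):
    """response_map: (H, W) class response scores for one image.
    Returns (row, col) coordinates of the strongest local maxima,
    to be mapped back to image-space region crops."""
    r = response_map.unsqueeze(0).unsqueeze(0)               # (1, 1, H, W)
    pooled = F.max_pool2d(r, window, stride=1, padding=window // 2)
    # A position is a peak if it equals the max of its window and is
    # above the mean response (suppresses flat background).
    is_peak = (r == pooled).squeeze() & (response_map > response_map.mean())
    scores = response_map[is_peak]
    coords = is_peak.nonzero()                               # (P, 2) peak positions
    order = scores.argsort(descending=True)[:num_regions]
    return coords[order]

peaks = select_peak_regions(torch.rand(14, 14))
print(peaks)  # up to 6 (row, col) positions of high-response local maxima
```

Each returned coordinate would then be mapped back to image space and cropped as a candidate region for the representation stage.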

Experiments

To evaluate the proposed method's effectiveness, we conduct fine-grained image classification experiments on the Flower-102 dataset [22], the CUB-200-2011 dataset [31], and the Stanford Cars dataset [45]. The Flower-102 dataset has 102 flower classes with 8,189 images, each class containing 40 to 258 images. The CUB-200-2011 dataset has 200 bird classes with 11,788 images. The Stanford Cars dataset has 196 classes of cars; 8,144 images are used for training, following [45].

Conclusion

In this paper, we proposed a novel visual recognition method based on discriminative semantic region selection (DSRS). Image regions were first selected by choosing local maxima of response maps. Both visual and semantic representations of the selected regions were extracted. We encoded these region representations and linearly combined them, with combination parameters determined by both semantic distinctiveness and spatial-semantic correlations, for recognition. Image classification experiments and detailed analysis on several publicly available datasets demonstrate the proposed method's effectiveness.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (62072026), the Beijing Natural Science Foundation (JQ20022), and the Open Research Fund of the Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing Technology and Business University. It is also supported in part by the Natural Science Foundation of China under Grants 61773325 and 61806173, and the Natural Science Foundation of Fujian Province under Grant 2019J05123.

References (45)

  • C. Zhang et al., Object categorization in sub-semantic space, Neurocomputing (2014)
  • C. Zhang et al., Boosted random contextual semantic space based representation for visual recognition, Inf. Sci. (2016)
  • C. Zhang et al., Image classification by search with explicitly and implicitly semantic representations, Inf. Sci. (2017)
  • S. Branson, G. Horn, S. Belongie, P. Perona, Bird species categorization using pose normalized deep convolutional nets, ...
  • S. Cai et al., Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization, Proc. IEEE Int. Conf. Comput. Vision (2017)
  • Y. Chai et al., TriCoS: A tri-level class-discriminative cosegmentation method for image classification, Eur. Conf. Computer Vis. (2012)
  • Y. Chen et al., Destruction and construction learning for fine-grained image recognition, Proc. IEEE Computer Vis. Pattern Recogn. (2019)
  • M. Cimpoi et al., Deep filter banks for texture recognition, description, and segmentation, Int. J. Comput. Vision (2016)
  • Y. Cui, F. Zhou, Y. Lin, S. Belongie, Fine-grained categorization and dataset bootstrapping using deep metric learning ...
  • J. Fu et al., Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition, Proc. IEEE Computer Vis. Pattern Recogn. (2017)
  • W. Han et al., Sample generation based on a supervised Wasserstein generative adversarial network for high-resolution remote-sensing scene classification, Inform. Sci. (2020)
  • K. He et al., Deep residual learning for image recognition, Proc. IEEE Computer Vis. Pattern Recogn. (2016)
  • X. He et al., Fine-grained image classification via combining vision and language, Proc. IEEE Computer Vis. Pattern Recogn. (2017)
  • X. He et al., Fast fine-grained image classification via weakly supervised discriminative localization, IEEE Trans. Circ. Syst. Video Technol. (2019)
  • S. Huang et al., Part-stacked CNN for fine-grained visual categorization, Proc. IEEE Computer Vis. Pattern Recogn. (2016)
  • M. Jaderberg et al., Spatial transformer networks, Proc. Neural Inform. Process. Syst. (2015)
  • J. Krause et al., Fine-grained recognition without part annotations, Proc. IEEE Computer Vis. Pattern Recogn. (2015)
  • A. Krizhevsky et al., ImageNet classification with deep convolutional neural networks, Proc. Neural Inform. Process. Syst. (2012)
  • S. Kong et al., Low-rank bilinear pooling for fine-grained classification, Proc. IEEE Computer Vis. Pattern Recogn. (2017)
  • M. Lam et al., Fine-grained recognition as HSnet search for informative image parts, Proc. IEEE Computer Vis. Pattern Recogn. (2017)
  • S. Lazebnik et al., Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, Proc. IEEE Computer Vis. Pattern Recogn. (2006)
  • T. Lin et al., Bilinear convolutional neural networks for fine-grained visual recognition, IEEE Trans. Pattern Anal. Mach. Intell. (2018)