Discriminative semantic region selection for fine-grained recognition
Introduction
Accurately classifying fine-grained images is a challenging task that has been studied for many years. It aims to automatically predict the classes of visually similar objects by analyzing the correlations among fine-grained image pixels. The problem is harder than general image classification because fine-grained classes are visually very similar. A key problem, therefore, is how to obtain discriminative image representations, and researchers have proposed many learning-based methods that use labeled images to this end.
Well-designed features [18], [22], [26], [27], [35], [37], [40] have demonstrated their effectiveness for fine-grained recognition, especially for image classes with distinctive characteristics that can be well separated. Both global [27] and local [18], [22], [26], [35], [37], [40] feature based methods are used. Global feature based methods often have limited discriminative power because spatial context information is missing; they can only handle a small number of classes with large variations. Local feature based methods go one step further by considering the distinctive and invariant properties of image regions. Although they improve over global features, local features are hard to design and cannot fully cope with image variations.
With the fast development of deep convolutional neural networks (DCNNs) [1], [9], [12], [15], [19], [29], [30], visual representations can be learned automatically from large quantities of labeled images in an end-to-end way. This strategy greatly improves performance over well-designed features, as it learns the intrinsic relationships of images automatically. However, it also requires large amounts of labeled images for supervision, which are time-consuming and labor-intensive to obtain. Transferring information from other sources [42] is necessary when only a few labeled images are available, but transferring useful information instead of noisy correlations is vital for efficient classification. Researchers have proposed many efficient models to solve this problem. Although great progress has been made, these methods still suffer from one problem: within each image, different parts are often treated equally, which mixes the influence of discriminative parts with that of the noisy background. This may degrade classification performance, since images often contain irrelevant background. To alleviate this problem, detection techniques [43] are used to locate the objects to be classified. Detection models are often trained with bounding box annotations; this strategy helps the model concentrate on the objects of interest. However, bounding boxes are harder to annotate than image labels. Researchers have also combined local feature based methods with deep convolutional neural networks [5], [23], which greatly improves performance. However, each image region is treated individually and independently, and only visual information is considered. Visual similarity cannot always ensure semantic consistency, especially when only a few labeled images are available.
To bridge the gap, semantic representations are used as an alternative way to obtain representations consistent with human perception [33], [38], [42]. Semantic based methods can be roughly divided into two schemes: the first mines the intrinsic correlations of labeled images [38], while the second leverages information from other datasets or the Internet [33], [42]. Semantic representations can be generated automatically from labeled images using pre-learned models, or labeled manually by humans. However, since different datasets are collected by different people for varied applications, their underlying distributions are not consistent with each other, and noisy information may be introduced when supervision information is transferred. Besides, many semantic based methods still treat different image parts equally without considering their distinctiveness.
To solve the problems mentioned above, in this paper we propose a novel discriminative semantic region selection (DSRS) method for fine-grained recognition. Instead of treating all image regions equally, we select visually distinctive regions using pre-learned DCNN models: the pre-trained models first predict each region's semantic similarity to the corresponding classes. We then obtain semantic representations by predicting the class distributions of the selected regions. To exploit both visual and semantic information, we encode the region representations and combine them linearly, considering both semantic distinctiveness and spatial-semantic correlations; a testing image can then be represented and classified accordingly. Image classification experiments on several publicly available datasets, with comparisons against baseline methods, demonstrate the effectiveness of the proposed method. Fig. 1 shows the flowchart of the proposed method.
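The region selection step can be illustrated with a minimal sketch. Here the DCNN class-response map is replaced by a plain NumPy array of per-location scores (an assumption for illustration; the paper derives these responses from pre-learned DCNN models), and the k locations with the largest responses are kept as candidate discriminative regions.

```python
import numpy as np

def select_discriminative_regions(response_map, k=3):
    """Select the k grid locations with the largest class responses.

    `response_map` is a toy stand-in for a DCNN class-response map:
    an H x W array of scores, one per spatial location.
    Returns (row, col) coordinates sorted by descending response.
    """
    h, w = response_map.shape
    flat = response_map.ravel()
    top = np.argsort(flat)[::-1][:k]  # indices of the k largest responses
    return [(int(i // w), int(i % w)) for i in top]

# Toy 3 x 3 response map: the two strongest responses are at (0, 1) and (1, 0).
scores = np.array([[0.1, 0.9, 0.2],
                   [0.8, 0.3, 0.7],
                   [0.0, 0.6, 0.4]])
print(select_discriminative_regions(scores, k=2))  # [(0, 1), (1, 0)]
```

In the actual method the selected coordinates would be mapped back to image-space regions before feature extraction; this sketch only shows the top-k selection itself.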
The proposed method has two advantages over other fine-grained classification methods. First, it has good generalization ability, as it can use various efficient pre-learned DCNN architectures to semantically represent image regions and thus consistently improve classification accuracy. Second, instead of using only visual information, we jointly consider the semantic distinctiveness and the spatial-semantic correlations of image regions, which yields more discriminative representations for classification.
Related work
Fine-grained recognition has been widely explored in the last few decades, and many methods have been proposed. Both global and local feature based methods [18], [22], [26], [27], [35], [37], [40] were used. Global features were often less effective than local features. Local features were often used in a bag-of-words way: each local feature was first quantized, and the quantization distributions were used for classifier training and image class prediction. Although effective,
Fine-grained recognition by discriminative semantic region selection
In this section, we give the details of the proposed discriminative semantic region selection method for fine-grained image classification. DSRS first selects a number of image regions with large class responses; these regions are likely to contain object parts. We extract the visual representations of these regions using pre-learned DCNN models. We then calculate the semantic representations of these regions and encode both visual and semantic information into the region representations.
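The joint visual-semantic representation of a region can be sketched as follows. Here the semantic representation is taken to be the predicted class distribution (a softmax over DCNN class logits, per the description above), while `alpha` is a hypothetical fixed mixing weight standing in for the paper's combination based on semantic distinctiveness and spatial-semantic correlations; the features and logits are toy values, not outputs of a real DCNN.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def region_representation(visual_feat, class_logits, alpha=0.5):
    """Combine a region's visual feature with its semantic representation.

    The semantic part is the class distribution predicted for the region;
    `alpha` is an illustrative mixing weight (an assumption, not the
    paper's learned combination).
    """
    semantic = softmax(class_logits)                         # class distribution
    visual = visual_feat / (np.linalg.norm(visual_feat) + 1e-12)  # L2-normalize
    return np.concatenate([alpha * visual, (1 - alpha) * semantic])

feat = np.array([3.0, 4.0])         # toy 2-D visual feature for one region
logits = np.array([2.0, 0.5, 0.1])  # toy class scores for 3 classes
rep = region_representation(feat, logits)
print(rep.shape)  # (5,)
```

Since the semantic part sums to one before weighting, the two halves of the concatenated vector contribute in a controlled ratio, which is one simple way to balance visual and semantic evidence.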
Experiments
To evaluate the proposed method's effectiveness, we conduct fine-grained image classification experiments on the Flower-102 dataset [22], the CUB-200-2011 dataset [31], and the Stanford Cars dataset [45]. The Flower-102 dataset has 8,189 flower images of 102 classes, with each class having 40 to 258 images. The CUB-200-2011 dataset has 11,788 images of 200 bird species. The Stanford Cars dataset has 196 classes of different cars; 8,144 images are pre-chosen for training, following [45].
Conclusion
In this paper, we proposed a novel visual recognition method based on discriminative semantic region selection (DSRS). Image regions were first selected by choosing local maximums of response maps. Both visual and semantic representations of the selected regions were extracted. We encoded these regions and linearly combined the encoding parameters, considering both semantic correlations and spatial-semantic relationships, for recognition. Image classification experiments and detailed analysis on
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (62072026); the Beijing Natural Science Foundation (JQ20022); and the Open Research Fund of the Beijing Key Laboratory of Big Data Technology for Food Safety, Beijing Technology and Business University. It is also supported in part by the Natural Science Foundation of China under Grants 61773325 and 61806173, and the Natural Science Foundation of Fujian Province under Grant 2019J05123.
References (45)
- et al., Object categorization in sub-semantic space, Neurocomputing (2014)
- et al., Boosted random contextual semantic space based representation for visual recognition, Inf. Sci. (2016)
- et al., Image classification by search with explicitly and implicitly semantic representations, Inf. Sci. (2017)
- S. Branson, G. Horn, S. Belongie, P. Perona, Bird species categorization using pose normalized deep convolutional nets, ...
- et al., Higher-order integration of hierarchical convolutional activations for fine-grained visual categorization, Proc. IEEE Int. Conf. Comput. Vision (2017)
- et al., Tricos: A tri-level class-discriminative cosegmentation method for image classification, Eur. Conf. Computer Vis. (2012)
- et al., Destruction and construction learning for fine-grained image recognition, Proc. IEEE Computer Vis. Pattern Recogn. (2019)
- et al., Deep filter banks for texture recognition, description, and segmentation, Int. J. Comput. Vision (2016)
- Y. Cui, F. Zhou, Y. Lin, S. Belongie, Fine-grained categorization and dataset bootstrapping using deep metric learning, ...
- et al., Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition, Proc. IEEE Computer Vis. Pattern Recogn. (2017)
- Sample generation based on a supervised Wasserstein generative adversarial network for high-resolution remote-sensing scene classification, Inform. Sci.
- Deep residual learning for image recognition, Proc. IEEE Computer Vis. Pattern Recogn.
- Fine-grained image classification via combining vision and language, Proc. IEEE Computer Vis. Pattern Recogn.
- Fast fine-grained image classification via weakly supervised discriminative localization, IEEE Trans. Circ. Syst. Video Technol.
- Part-stacked CNN for fine-grained visual categorization, Proc. IEEE Computer Vis. Pattern Recogn.
- Spatial transformer networks, Proc. Neural Inform. Process. Syst.
- Fine-grained recognition without part annotations, Proc. IEEE Computer Vis. Pattern Recogn.
- ImageNet classification with deep convolutional neural networks, Proc. Neural Inform. Process. Syst.
- Low-rank bilinear pooling for fine-grained classification, Proc. IEEE Computer Vis. Pattern Recogn.
- Fine-grained recognition as HSnet search for informative image parts, Proc. IEEE Computer Vis. Pattern Recogn.
- Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, Proc. IEEE Computer Vis. Pattern Recogn.
- Bilinear convolutional neural networks for fine-grained visual recognition, IEEE Trans. Pattern Anal. Mach. Intell.