A novel feature representation: Aggregating convolution kernels for image retrieval
Introduction
Computer vision aims at the automatic extraction, analysis, and understanding of images, which is usually conducted in two stages: image representation and pattern analysis (Radenovic et al., 2019, Yang et al., 2019). The first stage obtains image representations in vector form that convey discriminative information about the images. Pattern analysis methods for specific tasks, such as classification, detection, and segmentation, are then developed on top of these representations. Deep neural networks unify the two stages, adapting the image representation to the task at hand (Guo et al., 2019, Tian et al., 2020, Wang, Tao, et al., 2019, Zhan and Lu, 2019).
For image representation, early methods index images by visual cues such as texture and color to extract global descriptors. However, these global descriptors cannot cope with image variations such as illumination changes, translation, occlusion, and truncation; these variations compromise retrieval accuracy and limit the applications of global descriptors. The bag-of-words (BoW) model was proposed for image representation (Sivic & Zisserman, 2003) and image classification (Csurka et al., 2004, Jégou et al., 2010), relying on the scale-invariant feature transform (SIFT) descriptor (Lowe, 2004). The seminal deep learning work was presented by Krizhevsky et al. (2012), whose AlexNet achieved state-of-the-art recognition accuracy on ILSVRC 2012. Since then, deep learning based methods (Liu et al., 2018, Ng et al., 2015, Roy and Boddeti, 2019, Wu et al., 2019), especially convolutional neural networks (CNNs), have dominated image representation learning. Although convolutional and fully connected layers are powerful for image representation (Bhat, 2017, Cheng et al., 2018, Vo et al., 2019, Wang et al., 2018), they still suffer from disadvantages. For instance, CNN-based image representations are high-dimensional floating-point vectors, resulting in large memory consumption and computational cost on large-scale data.
Convolution preserves the spatial relationship between pixels by learning image features while traversing the input data, and convolution kernels in different CNNs extract different image features (ElAdel et al., 2017, Koh and Liang, 2017, Zheng et al., 2018). A convolution kernel acts as a filter in the feed-forward pass: it continuously suppresses information that does not match the kernel and purifies the data, until layer-by-layer convolution yields the feature descriptors of the image. To illustrate this, we visualize three convolution kernels of VGG-16 in Fig. 1 according to the maximum activation of their feature maps (Erhan et al., 2009, Szegedy et al., 2014, Zeiler and Fergus, 2014). The kernel in subfigure (b) extracts the columnar structures marked by the yellow rectangles; the kernel in subfigure (c) extracts the triangular structures in the black rectangles; and the kernel in subfigure (d) extracts the arched-door features in the red rectangles. Intuitively, each individual convolution kernel responds to certain features in an image.
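This intuition can be made concrete by reading a per-kernel response intensity off a layer's feature maps: each kernel produces one map, and its maximum activation indicates how strongly that kernel fires on the image. The sketch below uses NumPy on a random stand-in for real feature maps; the function name and the choice of the per-map maximum as the intensity are our own illustrative assumptions, not the paper's definition.

```python
import numpy as np

def kernel_response_intensity(feature_maps):
    """Response intensity of each convolution kernel, taken here as the
    maximum activation in the kernel's feature map (one scalar per kernel).

    feature_maps: array of shape (C, H, W), the output of one
    convolutional layer for a single image (C = number of kernels).
    """
    num_kernels = feature_maps.shape[0]
    return feature_maps.reshape(num_kernels, -1).max(axis=1)

# Stand-in for real feature maps: 4 kernels, 3x3 maps each.
rng = np.random.default_rng(0)
maps = rng.random((4, 3, 3))
intensity = kernel_response_intensity(maps)
strongest = int(np.argmax(intensity))  # index of the most responsive kernel
```

In practice the feature maps would come from a real network (e.g., a VGG-16 convolutional layer), but the ranking step is the same.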
In this paper, we propose aggregating convolution kernels (ACK), which brightens the convolution kernels of a layer and ranks them by response intensity, yielding a new image representation. The brightening operation in ACK maximizes each kernel's response to the intrinsic features of the image. The index numbers of the top-n kernels, ordered by convolution response intensity, then constitute an index sequence that serves as the image representation. Compared with vector representations in floating-point form, ACK extracts semantic features with low computational complexity. To measure the distance between image representations produced by ACK, we propose a similarity measurement that takes into account both the overlap between the index numbers and the positions of these discrete numbers. The contributions of this paper are summarized as follows.
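The representation step can be sketched as follows, under our own simplifying assumptions: response intensity is approximated by the maximum activation of each kernel's feature map (the paper's brightening operation is not reproduced here), and the code is simply the top-n kernel indices in descending order of intensity.

```python
import numpy as np

def ack_representation(feature_maps, n=3):
    """Sketch of an ACK-style code: rank the kernels of a layer by
    response intensity and keep the index numbers of the top-n kernels
    as an ordered, integer-valued image representation."""
    num_kernels = feature_maps.shape[0]
    intensity = feature_maps.reshape(num_kernels, -1).max(axis=1)
    order = np.argsort(-intensity)      # kernel indices, strongest first
    return order[:n].tolist()

# Toy layer output: kernel 2 fires strongest, then kernel 0.
maps = np.zeros((4, 2, 2))
maps[0], maps[1], maps[2] = 3.0, 1.0, 5.0
code = ack_representation(maps, n=2)   # → [2, 0]
```

Because the code is a short ordered list of integers rather than a float vector, it is compact to store; comparing two codes, however, requires a dedicated position-sensitive measurement.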
1. We propose to aggregate convolution kernels (ACK) by brightening each kernel and selecting the ones with maximum response to image intrinsic features, achieving a new image representation consisting of kernel index numbers.
2. We design a position-sensitive similarity measurement for the integer-valued image representations achieved by ACK, taking into account both the overlap and the positions of the convolution kernels.
3. We conduct extensive experiments on public datasets, demonstrating the effectiveness and efficiency of ACK for image retrieval.
The paper is structured as follows. In Section 2, we discuss the related work on image representation and distance measurement. Section 3 presents the proposed ACK method. Section 4 introduces the metric method for the image representation. Section 5 shows the experimental results. Section 6 concludes the paper.
Related work
In this section, we review representative works on image representation and distance measurement. For image representation, we divide the works into two categories (Zheng et al., 2018): SIFT-based representations and CNN-based representations.
Aggregate convolution kernels (ACK) for image representation
A CNN can be viewed as a multi-stage distillation of information, in which information is continuously filtered and purified (Springenberg et al., 2015, Zheng et al., 2018) by convolution kernels. Each convolution kernel can be understood as a feature template. For example, the kernels visualized in Fig. 1(b), (c), and (d) capture the main internal visual features of the image in Fig. 1(a). Generally speaking, a CNN model achieving competitive performance usually has relatively stable
Method
The image representation achieved by ACK is a set of discrete, ordered index numbers for each image; there is no distance relation between the index numbers themselves. Therefore, vector-based metrics such as cosine distance cannot be used directly. In addition, the index numbers in the representation are position-sensitive, so metrics like the Jaccard similarity cannot be adopted either. To this end, we propose a distance measurement according to dissimilarity in different convolution
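To make concrete what a position-sensitive measurement must capture, the sketch below scores two equal-length ACK codes by combining overlap and rank displacement: a kernel index shared by both codes is penalized by how far apart its positions are, while an index missing from one code receives the maximum penalty. This is our own illustrative formulation, not the measurement defined in the paper.

```python
def ack_distance(code_a, code_b):
    """Illustrative position-sensitive distance between two ACK codes
    (equal-length lists of kernel index numbers), normalized to [0, 1]."""
    n = len(code_a)
    pos_b = {k: i for i, k in enumerate(code_b)}
    total = 0.0
    for i, k in enumerate(code_a):
        if k in pos_b:
            # Shared kernel: penalize the displacement of its rank.
            total += abs(i - pos_b[k]) / n
        else:
            # Kernel absent from the other code: maximum penalty.
            total += 1.0
    return total / n

d_same = ack_distance([7, 3, 9], [7, 3, 9])   # 0.0: identical codes
d_disj = ack_distance([1, 2, 3], [4, 5, 6])   # 1.0: no overlapping kernels
```

Note that unlike the Jaccard similarity, swapping two shared indices changes the score, which is exactly the position sensitivity the representation requires.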
Experiments
In this section, we evaluate the proposed ACK method and the distance measurement on image retrieval tasks.
Conclusion
In this work, we propose a new image representation method, aggregating convolution kernels (ACK), inspired by visualizations of CNN convolution kernels, which show strong discrimination and robustness. In detail, the convolution kernels in a layer are ranked by their response intensity to image intrinsic features, obtained by brightening each kernel, i.e., finding the index that maximizes the strength of each kernel's response to the image features.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (No. 61703109, No. 91748107, No. 61902077, No. 61675050), Guangdong Basic and Applied Basic Research Foundation (No. 2020A1515010616), Guangdong Innovative Research Team Program (No. 2014ZT05G157). This study was supported by the Special Research Fund of Hasselt University (No. BOF20BL01).
References (59)
- et al. PolyWaTT: A polynomial water travel time estimator based on Derivative Dynamic Time Warping and Perceptually Important Points. Computers & Geosciences (2018)
- et al. Fast DCNN based on FWT, intelligent dropout and layer skipping for image retrieval. Neural Networks (2017)
- et al. Flexible unsupervised feature extraction for image classification. Neural Networks (2019)
- et al. Differential convolutional neural network. Neural Networks (2019)
- et al. Image denoising using deep CNN with batch renormalization. Neural Networks (2020)
- et al. Robust auto-weighted projective low-rank and sparse recovery for visual representation. Neural Networks (2019)
- et al. BoSR: A CNN-based aurora image retrieval method. Neural Networks (2019)
- et al. Embedding topological features into convolutional neural network salient object detection. Neural Networks (2020)
- et al. Multi-representation adaptation network for cross-domain image classification. Neural Networks (2019)
- et al. Aggregating deep convolutional features for image retrieval (2015)
- Deep progressive hashing for image retrieval
- Makeup invariant face recognition using features from accelerated segment test and eigen vectors. International Journal of Image and Graphics
- BRIEF: Binary robust independent elementary features
- Factor analysis of card sort data: An alternative to hierarchical cluster analysis. Human Factors & Ergonomics Society
- Hybrid-attention based decoupled metric learning for zero-shot image retrieval
- Video2Shop: Exactly matching clothes in videos to online shopping images
- Total recall II: Query expansion revisited
- Visualizing higher-layer features of a deep network. University of Montreal
- ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
- End-to-end learning of keypoint detector and descriptor for pose invariant 3D matching
- Deep image retrieval: Learning global representations for image search
- End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision
- Visual attention consistency under image transforms for multi-label image classification
- Deep residual learning for image recognition
- Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening
- Hamming embedding and weak geometric consistency for large scale image search
- Aggregating local descriptors into a compact image representation
- Cross-dimensional weighting for aggregated deep convolutional features