Neural Networks

Volume 130, October 2020, Pages 1-10

A novel feature representation: Aggregating convolution kernels for image retrieval

https://doi.org/10.1016/j.neunet.2020.06.010

Highlights

  • Proposed a new image representation method based on convolution kernel indices.

  • Each convolution kernel is equivalent to a feature extractor, so its index can be used directly as a descriptor.

  • Explored a position-sensitive similarity measurement for the new representation.

  • Extended a new research direction on representing images as index sequences.

Abstract

Activated hidden units in convolutional neural networks (CNNs), known as feature maps, dominate image representation, which is compact and discriminative. For ultra-large datasets, high-dimensional feature maps in float format not only result in high computational complexity, but also occupy massive memory space. To this end, a new image representation by aggregating convolution kernels (ACK) is proposed, where some convolution kernels capturing certain patterns are activated. The top-n index numbers of the convolution kernels are extracted directly as the image representation in discrete integer values, which rebuilds the relationship between convolution kernels and images. Furthermore, a distance measurement is defined from the perspective of ordered sets to calculate position-sensitive similarities between image representations. Extensive experiments conducted on the Oxford Buildings, Paris, and Holidays datasets, among others, demonstrate that the proposed ACK achieves competitive performance on image retrieval with much lower computational cost, outperforming methods that use feature maps for image representation.

Introduction

Automatic extraction, analysis and understanding of images are the aims of computer vision, which are usually pursued in two stages, i.e., image representation and pattern analysis (Radenovic et al., 2019, Yang et al., 2019). The first stage aims to obtain image representations in vector form that convey discriminative information about images. Pattern analysis methods for specific tasks, such as classification, detection, segmentation, etc., can then be developed on top of these representations. Deep neural networks unify the two stages, adapting the image representation to the specific task (Guo et al., 2019, Tian et al., 2020, Wang, Tao, et al., 2019, Zhan and Lu, 2019).

In terms of image representation, early methods index images by visual cues to extract global descriptors, such as texture and color. However, these global descriptors cannot deal with variations of images, such as illumination, translation, occlusion, and truncation. These variations compromise retrieval accuracy and limit the applications of global descriptors. The Bag-of-Words (BoW) model was proposed for image representation (Sivic & Zisserman, 2003) and image classification (Csurka et al., 2004, Jégou et al., 2010), relying on the scale-invariant feature transform (SIFT) descriptor (Lowe, 2004). The seminal work using deep learning was proposed by Krizhevsky et al. (2012), where AlexNet achieved state-of-the-art recognition accuracy on ILSVRC 2012. Since then, deep learning based methods (Liu et al., 2018, Ng et al., 2015, Roy and Boddeti, 2019, Wu et al., 2019), especially convolutional neural networks (CNNs), have dominated the area of image representation learning. Though convolution and fully connected layers show strong power for image representation (Bhat, 2017, Cheng et al., 2018, Vo et al., 2019, Wang et al., 2018), they still suffer from some disadvantages. For instance, CNN-based image representations are high-dimensional and in float form, resulting in large memory consumption and computational cost for large-scale data.

Convolution preserves the spatial relationship between pixels by learning image features while traversing the input data, and convolution kernels in different CNNs can extract different image features (ElAdel et al., 2017, Koh and Liang, 2017, Zheng et al., 2018). A convolution kernel acts as a filter in the forward pass, continuously filtering out information that does not match the current kernel and purifying the data, so that layer-by-layer convolution finally yields the feature descriptors of the image. Intuitively, we visualize three convolution kernels of VGG-16 in Fig. 1 according to the maximum activation of their feature maps (Erhan et al., 2009, Szegedy et al., 2014, Zeiler and Fergus, 2014). We can observe that the kernel in subfigure (b) extracts the columnar structure features of the image in the yellow rectangles; the kernel in subfigure (c) extracts the triangular structural features in the black rectangles; and the kernel in subfigure (d) extracts the arched door features in the red rectangles. Intuitively, each individual convolution kernel responds to certain features in an image.
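As a concrete illustration of this maximum-activation view (our sketch, not the authors' code), the snippet below upsamples a single kernel's feature map from a pretrained VGG-16 to the input resolution as a coarse heatmap of where that kernel responds; the layer index and kernel index are arbitrary illustrative choices.

```python
# Illustrative sketch: visualize which image regions one VGG-16 convolution
# kernel responds to, by upsampling its feature map to the input resolution.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

def kernel_heatmap(image: Image.Image, layer_idx: int, kernel_idx: int) -> torch.Tensor:
    """Return a 224x224 heatmap of one kernel's activations at a given conv layer."""
    x = preprocess(image).unsqueeze(0)          # (1, 3, 224, 224)
    with torch.no_grad():
        for i, module in enumerate(vgg):
            x = module(x)
            if i == layer_idx:                  # stop at the chosen layer
                break
    fmap = F.relu(x[0, kernel_idx])             # (h, w) positive responses of that kernel
    heat = F.interpolate(fmap[None, None], size=(224, 224),
                         mode="bilinear", align_corners=False)[0, 0]
    return heat / (heat.max() + 1e-8)           # normalize to [0, 1]
```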

In this paper, we propose aggregating convolution kernels (ACK), which brightens the convolution kernels of a layer and ranks them by their response intensity, yielding a new image representation. The brightening operation in ACK maximizes each convolution kernel's response to the intrinsic features of the image. The top-n index numbers of the convolution kernels then constitute an index sequence as the image representation, ordered by their convolution response intensity (a minimal code sketch of this extraction step follows the contribution list below). Compared with vector representations in float form, ACK extracts semantic features with low computational complexity. To measure the distance between image representations achieved by ACK, we propose a similarity measurement that takes into account both the overlap between the index numbers and their positions within the sequences. The contributions of this paper are summarized as follows.

  • 1.

    We propose to aggregate convolution kernels (ACK) by brightening each kernel and selecting the ones with maximum response to the intrinsic features of the image, achieving a new image representation consisting of kernel index numbers.

  • 2.

    We design a position-sensitive similarity measurement for the integer-valued image representations achieved by ACK, taking into account both the overlap and the positions of the convolution kernel indices.

  • 3.

    We conduct extensive experiments on public datasets, demonstrating the effectiveness and efficiency of ACK for image retrieval.
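To make the first contribution concrete, here is a minimal sketch (ours, not the authors' released code) of how such an index-sequence representation could be extracted from a pretrained VGG-16: each kernel of a chosen layer is scored by its response strength to the image, and the indices of the top-n strongest kernels are kept in order. Scoring by global max pooling, the layer index 28, and n = 64 are assumptions of this sketch; the paper's brightening operation is defined in Section 3.

```python
# Hedged sketch of the ACK index-sequence representation (assumptions noted above).
import torch
from torchvision import models

vgg_features = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

def ack_representation(image_tensor: torch.Tensor, layer_idx: int = 28, n: int = 64) -> list[int]:
    """Return the ordered top-n kernel indices for a preprocessed image tensor (1, 3, H, W)."""
    x = image_tensor
    with torch.no_grad():
        for i, module in enumerate(vgg_features):
            x = module(x)
            if i == layer_idx:
                break
    # x: (1, C, h, w) -> one response score per kernel (global max pooling assumed here)
    scores = x.amax(dim=(2, 3)).squeeze(0)      # (C,)
    ranked = torch.argsort(scores, descending=True)
    return ranked[:n].tolist()                  # ordered integer index sequence
```

The result is a short sequence of integers, which is far cheaper to store and compare than a high-dimensional float vector.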

The paper is structured as follows. In Section 2, we discuss related work on image representation and distance measurement. Section 3 presents the proposed ACK method. Section 4 introduces the similarity measurement for the image representation. Section 5 shows the experimental results. Section 6 concludes the paper.

Section snippets

Related work

In this section, we investigate some representative works on image representation and distance measurement. For image representation, we divide the works into two categories (Zheng et al., 2018), including SIFT-based representation, and CNN-based representation.

Aggregate convolution kernels (ACK) for image representation

A convolutional neural network can be seen as a multi-stage distillation of information, in which information is continuously filtered and purified (Springenberg et al., 2015, Zheng et al., 2018) by convolution kernels. Each convolution kernel can be understood as a feature template. For example, the kernels in Fig. 1(b), (c) and (d) can be seen as capturing the main visual features of (a). Generally speaking, a CNN model achieving competitive performance usually comes with relatively stable

Method

The image representation achieved by ACK is a set of discrete and ordered index numbers for each image, i.e., there is no distance relation between the index numbers. Therefore, vector-based metrics, such as cosine distance, cannot be used directly. In addition, the index numbers in the representation are position-sensitive, so metrics like Jaccard similarity cannot be adopted either. To this end, we propose a distance measurement according to dissimilarity in different convolution
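The exact measurement proposed in Section 4 is not reproduced in this snippet. The rank-discounted overlap below is only a hedged sketch of the two stated requirements, rewarding shared kernel indices and rewarding them more when their positions agree; the helper name is hypothetical.

```python
# Hedged sketch of a position-sensitive similarity between two ACK index sequences.
def position_sensitive_similarity(a: list[int], b: list[int]) -> float:
    """Similarity in [0, 1] between two ordered kernel-index sequences of length n."""
    n = len(a)
    pos_b = {k: j for j, k in enumerate(b)}     # kernel index -> rank in b
    score = 0.0
    for i, k in enumerate(a):
        j = pos_b.get(k)
        if j is None:
            continue                            # kernel not among the top-n of both images
        score += 1.0 / (1.0 + abs(i - j))       # full credit only when ranks coincide
    return score / n                            # normalize by sequence length

# Identical sequences give 1.0; disjoint sequences give 0.0.
print(position_sensitive_similarity([5, 17, 3], [5, 17, 3]))   # 1.0
print(position_sensitive_similarity([5, 17, 3], [9, 2, 44]))   # 0.0
```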

Experiments

In this section, we evaluate the proposed ACK method and the distance measurement on image retrieval tasks.

Conclusion

In this work, we propose a new image representation method, aggregating convolution kernels (ACK), inspired by visualizations of CNN convolution kernels, which show strong discrimination and robustness. In detail, the convolution kernels in a layer are ranked based on their response intensity to the intrinsic features of the image, obtained by brightening each kernel, i.e., finding the index that maximizes the strength of each kernel's response to the image features.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61703109, No. 91748107, No. 61902077, No. 61675050), Guangdong Basic and Applied Basic Research Foundation (No. 2020A1515010616), Guangdong Innovative Research Team Program (No. 2014ZT05G157). This study was supported by the Special Research Fund of Hasselt University (No. BOF20BL01).

References (59)

  • Bai, Jiale, et al. Deep progressive hashing for image retrieval.
  • Bhat, Aruna. Makeup invariant face recognition using features from accelerated segment test and eigen vectors. International Journal of Image and Graphics (2017).
  • Calonder, Michael, et al. BRIEF: Binary robust independent elementary features.
  • Capra, Miranda G. Factor analysis of card sort data: An alternative to hierarchical cluster analysis. Human Factors & Ergonomics Society (2005).
  • Chen, Binghui, et al. Hybrid-attention based decoupled metric learning for zero-shot image retrieval.
  • Cheng, Zhi-Qi, et al. Video2Shop: Exactly matching clothes in videos to online shopping images (2018).
  • Chum, Ondrej, et al. Total recall II: Query expansion revisited.
  • Csurka, Gabriella, Dance, Christopher, Fan, Lixin, Willamowski, Jutta, & Bray, Cédric (2004). Visual categorization...
  • Erhan, Dumitru, et al. Visualizing higher-layer features of a deep network. University of Montreal (2009).
  • Geirhos, Robert, et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness (2018).
  • Georgakis, Georgios, et al. End-to-end learning of keypoint detector and descriptor for pose invariant 3D matching.
  • Gordo, Albert, et al. Deep image retrieval: Learning global representations for image search.
  • Gordo, Albert, et al. End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision (2017).
  • Guo, Hao, et al. Visual attention consistency under image transforms for multi-label image classification.
  • He, Kaiming, et al. Deep residual learning for image recognition.
  • Jégou, Hervé, et al. Negative evidences and co-occurrences in image retrieval: The benefit of PCA and whitening.
  • Jégou, Hervé, et al. Hamming embedding and weak geometric consistency for large scale image search.
  • Jégou, Hervé, et al. Aggregating local descriptors into a compact image representation.
  • Kalantidis, Yannis, et al. Cross-dimensional weighting for aggregated deep convolutional features.
