Neural Networks

Volume 130, October 2020, Pages 1-10

A novel feature representation: Aggregating convolution kernels for image retrieval

https://doi.org/10.1016/j.neunet.2020.06.010

Highlights

  • Proposed a new image representation method based on convolution kernel indices.

  • Each convolution kernel is equivalent to a feature extractor, so its index can be used directly as a descriptor.

  • Explored a position-sensitive similarity measurement for the new representation.

  • Extended a new research direction on representing images as index sequences.

Abstract

Activated hidden units in convolutional neural networks (CNNs), known as feature maps, dominate image representation, which is compact and discriminative. For ultra-large datasets, high-dimensional feature maps in float format not only result in high computational complexity, but also occupy massive memory space. To this end, a new image representation by aggregating convolution kernels (ACK) is proposed, where some convolution kernels capturing certain patterns are activated. The top-n index numbers of the convolution kernels are extracted directly as the image representation in discrete integer values, which rebuilds the relationship between convolution kernels and images. Furthermore, a distance measurement is defined from the perspective of ordered sets to calculate position-sensitive similarities between image representations. Extensive experiments conducted on the Oxford Buildings, Paris, and Holidays datasets, among others, demonstrate that the proposed ACK achieves competitive performance on image retrieval with much lower computational cost, outperforming methods that use feature maps for image representation.

Introduction

Automatic extraction, analysis and understanding of images are the aims of computer vision, which are usually pursued in two stages, i.e., image representation and pattern analysis (Radenovic et al., 2019, Yang et al., 2019). The first stage aims to obtain image representations in vector form that convey discriminative information about images. Pattern analysis methods for specific tasks, such as classification, detection, segmentation, etc., can then be developed on top of these representations. Deep neural networks unify the two stages, adapting the image representation to the specific task (Guo et al., 2019, Tian et al., 2020, Wang, Tao, et al., 2019, Zhan and Lu, 2019).

In terms of image representation, early methods index images by visual cues to extract global descriptors, such as texture and color. However, these global descriptors cannot deal with variations of images, such as illumination, translation, occlusion, and truncation. These variations compromise retrieval accuracy and limit the applications of global descriptors. The Bag-of-Words (BoW) model was proposed for image representation (Sivic & Zisserman, 2003) and image classification (Csurka et al., 2004, Jégou et al., 2010), relying on the scale-invariant feature transform (SIFT) descriptor (Lowe, 2004). The seminal work using deep learning was proposed by Krizhevsky et al. (2012), where AlexNet achieved state-of-the-art recognition accuracy on ILSVRC 2012. Since then, deep learning based methods (Liu et al., 2018, Ng et al., 2015, Roy and Boddeti, 2019, Wu et al., 2019), especially convolutional neural networks (CNNs), have dominated the area of image representation learning. Though convolution and fully connected layers show strong power for image representation (Bhat, 2017, Cheng et al., 2018, Vo et al., 2019, Wang et al., 2018), they still suffer from some disadvantages. For instance, CNN-based image representations are high-dimensional and in float form, resulting in large memory consumption and computational cost for large-scale data.

Convolution preserves the spatial relationship between pixels by learning image features while traversing the input data, and convolution kernels in different CNNs can extract different image features (ElAdel et al., 2017, Koh and Liang, 2017, Zheng et al., 2018). A convolution kernel acts as a filter in the forward pass, continuously filtering out information that does not match the current kernel and purifying the data, so that layer-by-layer convolution finally yields the feature descriptors of the image. Intuitively, we visualize three convolution kernels of VGG-16 in Fig. 1 according to the maximum activation of their feature maps (Erhan et al., 2009, Szegedy et al., 2014, Zeiler and Fergus, 2014). We can observe that the kernel in subfigure (b) extracts the columnar structure features of the image in the yellow rectangles; the kernel in subfigure (c) extracts the triangular structural features in the black rectangles; and the kernel in subfigure (d) extracts the arched door features in the red rectangles. Intuitively, each individual convolution kernel responds to certain features in an image.
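As a concrete illustration of this maximum-activation view (our sketch, not the authors' code), the snippet below upsamples a single kernel's feature map from a pretrained VGG-16 to the input resolution as a coarse heatmap of where that kernel responds; the layer index and kernel index are arbitrary illustrative choices.

```python
# Illustrative sketch: visualize which image regions one VGG-16 convolution
# kernel responds to, by upsampling its feature map to the input resolution.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

def kernel_heatmap(image: Image.Image, layer_idx: int, kernel_idx: int) -> torch.Tensor:
    """Return a 224x224 heatmap of one kernel's activations at a given conv layer."""
    x = preprocess(image).unsqueeze(0)          # (1, 3, 224, 224)
    with torch.no_grad():
        for i, module in enumerate(vgg):
            x = module(x)
            if i == layer_idx:                  # stop at the chosen layer
                break
    fmap = F.relu(x[0, kernel_idx])             # (h, w) positive responses of that kernel
    heat = F.interpolate(fmap[None, None], size=(224, 224),
                         mode="bilinear", align_corners=False)[0, 0]
    return heat / (heat.max() + 1e-8)           # normalize to [0, 1]
```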

In this paper, we propose aggregating convolution kernels (ACK), which brightens the convolution kernels of a layer and ranks them by their response intensity, yielding a new image representation. The brightening operation in ACK maximizes each convolution kernel's response to the intrinsic features of the image. The top-n index numbers of the convolution kernels then constitute an index sequence as the image representation, ordered by their convolution response intensity (a minimal code sketch of this extraction step follows the contribution list below). Compared with vector representations in float form, ACK extracts semantic features with low computational complexity. To measure the distance between image representations achieved by ACK, we propose a similarity measurement that takes into account both the overlap between the index numbers and their positions within the sequences. The contributions of this paper are summarized as follows.

  • 1.

    We propose to aggregate convolution kernels (ACK) by brightening each kernel and selecting the ones with maximum response to the intrinsic features of the image, achieving a new image representation consisting of kernel index numbers.

  • 2.

    We design a position-sensitive similarity measurement for the integer-valued image representations achieved by ACK, taking into account both the overlap and the positions of the convolution kernel indices.

  • 3.

    We conduct extensive experiments on public datasets, demonstrating the effectiveness and efficiency of ACK for image retrieval.
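To make the first contribution concrete, here is a minimal sketch (ours, not the authors' released code) of how such an index-sequence representation could be extracted from a pretrained VGG-16: each kernel of a chosen layer is scored by its response strength to the image, and the indices of the top-n strongest kernels are kept in order. Scoring by global max pooling, the layer index 28, and n = 64 are assumptions of this sketch; the paper's brightening operation is defined in Section 3.

```python
# Hedged sketch of the ACK index-sequence representation (assumptions noted above).
import torch
from torchvision import models

vgg_features = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features.eval()

def ack_representation(image_tensor: torch.Tensor, layer_idx: int = 28, n: int = 64) -> list[int]:
    """Return the ordered top-n kernel indices for a preprocessed image tensor (1, 3, H, W)."""
    x = image_tensor
    with torch.no_grad():
        for i, module in enumerate(vgg_features):
            x = module(x)
            if i == layer_idx:
                break
    # x: (1, C, h, w) -> one response score per kernel (global max pooling assumed here)
    scores = x.amax(dim=(2, 3)).squeeze(0)      # (C,)
    ranked = torch.argsort(scores, descending=True)
    return ranked[:n].tolist()                  # ordered integer index sequence
```

The result is a short sequence of integers, which is far cheaper to store and compare than a high-dimensional float vector.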

The paper is structured as follows. In Section 2, we discuss related work on image representation and distance measurement. Section 3 presents the proposed ACK method. Section 4 introduces the similarity measurement for the image representation. Section 5 shows the experimental results. Section 6 concludes the paper.

Section snippets

Related work

In this section, we investigate some representative works on image representation and distance measurement. For image representation, we divide the works into two categories (Zheng et al., 2018), including SIFT-based representation, and CNN-based representation.

Aggregate convolution kernels (ACK) for image representation

A convolutional neural network can be seen as a multi-stage distillation of information, in which information is continuously filtered and purified (Springenberg et al., 2015, Zheng et al., 2018) by convolution kernels. Each convolution kernel can be understood as a feature template. For example, the kernels in Fig. 1(b), (c) and (d) can be seen as capturing the main visual features of (a). Generally speaking, a CNN model achieving competitive performance usually comes with relatively stable

Method

The image representation achieved by ACK is a set of discrete and ordered index numbers for each image, i.e., there is no distance relation between the index numbers. Therefore, vector-based metrics, such as cosine distance, cannot be used directly. In addition, the index numbers in the representation are position-sensitive, so metrics like Jaccard similarity cannot be adopted either. To this end, we propose a distance measurement according to dissimilarity in different convolution
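The exact measurement proposed in Section 4 is not reproduced in this snippet. The rank-discounted overlap below is only a hedged sketch of the two stated requirements, rewarding shared kernel indices and rewarding them more when their positions agree; the helper name is hypothetical.

```python
# Hedged sketch of a position-sensitive similarity between two ACK index sequences.
def position_sensitive_similarity(a: list[int], b: list[int]) -> float:
    """Similarity in [0, 1] between two ordered kernel-index sequences of length n."""
    n = len(a)
    pos_b = {k: j for j, k in enumerate(b)}     # kernel index -> rank in b
    score = 0.0
    for i, k in enumerate(a):
        j = pos_b.get(k)
        if j is None:
            continue                            # kernel not among the top-n of both images
        score += 1.0 / (1.0 + abs(i - j))       # full credit only when ranks coincide
    return score / n                            # normalize by sequence length

# Identical sequences give 1.0; disjoint sequences give 0.0.
print(position_sensitive_similarity([5, 17, 3], [5, 17, 3]))   # 1.0
print(position_sensitive_similarity([5, 17, 3], [9, 2, 44]))   # 0.0
```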

Experiments

In this section, we evaluate the proposed ACK method and the distance measurement on image retrieval tasks.

Conclusion

In this work, we propose a new image representation method, aggregating convolution kernels (ACK), inspired by visualizations of CNN convolution kernels, which show strong discrimination and robustness. In detail, the convolution kernels in a layer are ranked based on their response intensity to the intrinsic features of the image, obtained by brightening each kernel, i.e., finding the index that maximizes the strength of each kernel's response to the image features.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (No. 61703109, No. 91748107, No. 61902077, No. 61675050), Guangdong Basic and Applied Basic Research Foundation (No. 2020A1515010616), Guangdong Innovative Research Team Program (No. 2014ZT05G157). This study was supported by the Special Research Fund of Hasselt University (No. BOF20BL01).

References (59)

  • Bai, Jiale, et al. Deep progressive hashing for image retrieval.
  • Bhat, Aruna. Makeup invariant face recognition using features from accelerated segment test and eigen vectors. International Journal of Image and Graphics (2017).
  • Calonder, Michael, et al. BRIEF: Binary robust independent elementary features.
  • Capra, Miranda G. Factor analysis of card sort data: An alternative to hierarchical cluster analysis. Human Factors & Ergonomics Society (2005).
  • Chen, Binghui, et al. Hybrid-attention based decoupled metric learning for zero-shot image retrieval.
  • Cheng, Zhi-Qi, et al. Video2Shop: Exactly matching clothes in videos to online shopping images (2018).
  • Chum, Ondrej, et al. Total recall II: Query expansion revisited.
  • Csurka, Gabriella, Dance, Christopher, Fan, Lixin, Willamowski, Jutta, & Bray, Cédric (2004). Visual categorization...
  • Erhan, Dumitru, et al. Visualizing higher-layer features of a deep network. University of Montreal (2009).
  • Geirhos, Robert, et al. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness (2018).
  • Georgakis, Georgios, et al. End-to-end learning of keypoint detector and descriptor for pose invariant 3D matching.
  • Gordo, Albert, et al. Deep image retrieval: Learning global representations for image search.
  • Gordo, Albert, et al. End-to-end learning of deep visual representations for image retrieval. International Journal of Computer Vision (2017).
  • Guo, Hao, et al. Visual attention consistency under image transforms for multi-label image classification.
  • He, Kaiming, et al. Deep residual learning for image recognition.
  • Jégou, Hervé, et al. Negative evidences and co-occurrences in image retrieval: The benefit of PCA and whitening.
  • Jégou, Hervé, et al. Hamming embedding and weak geometric consistency for large scale image search.
  • Jégou, Hervé, et al. Aggregating local descriptors into a compact image representation.
  • Kalantidis, Yannis, et al. Cross-dimensional weighting for aggregated deep convolutional features.
