
Pattern Recognition

Volume 111, March 2021, 107732

Cluster-wise unsupervised hashing for cross-modal similarity search

https://doi.org/10.1016/j.patcog.2020.107732

Highlights

  • Proposing Cluster-wise Unsupervised Hashing (CUH) for effective and efficient large-scale retrieval across modalities.

  • Utilizing multi-view clustering to project the multi-modal data into its own low-dimensional spaces, preserving its diversity.

  • The proposed cluster-wise prototypes make different data points in the same cluster share the same binary codes.

  • Little information is lost when transforming the real-valued data into binary codes guided by the cluster-wise prototypes.

  • Designing a discrete optimization framework to directly learn the unified binary codes for heterogeneous modalities.

Abstract

Cross-modal hashing for similarity retrieval plays an important role in various applications, including search engines and autopilot systems. More generally, these methods are also known to reduce computation and memory storage during training. The key limitations of current methods are that: (i) they relax the discrete constraints to solve the optimization problem, which may defeat the purpose of the model; (ii) projecting heterogeneous data into a common latent space may lose the diverse representations contained in such data; and (iii) transforming real-valued data points into binary codes always results in a loss of information and produces a suboptimal continuous latent space. In this paper, we propose a novel framework, Cluster-wise Unsupervised Hashing (CUH), that projects the original data points from different modalities into their own low-dimensional latent spaces and finds the cluster centroid points in each low-dimensional space. In particular, the proposed clustering scheme jointly learns the compact hash codes and the corresponding linear hash functions. A discrete optimization framework is developed to learn the unified binary codes across modalities under the guidance of cluster-wise code-prototypes. Extensive experiments over multiple datasets demonstrate the effectiveness of our proposed model in comparison with the state of the art in unsupervised cross-modal hashing tasks.

Introduction

Due to the explosive growth of diverse, multimodal data in different forms such as images, texts, and videos, data analysis techniques that distill the semantic correlations across heterogeneous modalities have received increasing attention. When relevant data in different modalities exhibit good semantic correlation, it is natural to perform cross-modal search, retrieving similar items across the heterogeneous modalities in response to a query. Taking Wikipedia as an example, we can retrieve images relevant to a query tag, or tags relevant to a query image. However, given the large scale of such databases and the heterogeneity, diversity, and huge semantic gap between modalities, effective and efficient cross-modal retrieval remains a significant challenge.

Because the searchable database typically has a large volume and measuring similarity between the query and database items is computationally expensive, hashing-based methods have gained great popularity for their low storage cost, fast search speed, and impressive retrieval performance. A hashing method aims to find approximate nearest neighbors (ANN) within a reference database for a query item, and is used in tasks across machine learning [1], [2], data mining [3], [4] and computer vision [5], [6], where it balances retrieval efficiency against retrieval accuracy. The basic component of a hashing model is the transformation of high-dimensional data points into compact binary codes, so that relevant data samples in different modalities receive similar binary code sequences.
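As an illustration of this basic component, the following toy sketch (not the paper's method; the random projection `W`, the function names, and the toy data are our own assumptions) binarizes real-valued features with the sign of a linear projection and compares the resulting codes by Hamming distance:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_bits = 128, 32                       # feature dimension, code length
W = rng.standard_normal((d, n_bits))      # hypothetical linear hash projection

def hash_codes(X):
    # Transform real-valued features into compact {0, 1} binary codes.
    return (X @ W > 0).astype(np.uint8)

def hamming(a, b):
    # Hamming distance between two binary codes (XOR, then popcount).
    return int(np.count_nonzero(np.bitwise_xor(a, b)))

X = rng.standard_normal((5, d))
B = hash_codes(X)
# A slightly perturbed copy of item 0 should land closest in Hamming space.
q = hash_codes(X[:1] + 0.01 * rng.standard_normal((1, d)))[0]
dists = [hamming(q, b) for b in B]
```

Because the codes are binary, distances reduce to bitwise XOR plus a population count, which is what makes hash-based retrieval fast and storage-cheap in practice.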

More recently, many efforts have been made to encode the correlation structure between heterogeneous modalities, jointly learning the hash functions and indexing cross-modal data points in a Hamming space [7], [8], [9], [10], [11], [12], [13], [14], [15]. However, these cross-modal hashing methods generally reduce to a two-step scheme: first, project the multiple heterogeneous data modalities into a continuous common latent space by optimizing inter-modal coherence; second, quantize the continuous projections into compact binary sequences using the sign function. That said, research on cross-modal hashing still faces many issues that have to be overcome. We focus on three fundamental ones: first, the transformation from real-valued data to discrete binary codes leads to a loss of information, producing a suboptimal continuous common latent space and suboptimal compact binary codes [7], [8]; second, solving the optimization problem by relaxing the discrete constraints significantly degrades retrieval performance and increases the quantization error [11], [12]; third, projecting heterogeneous data into a single common latent space fails to exploit the diverse representations of the heterogeneous data, which would otherwise help learn better binary codes for cross-modal search models. Learning compact binary codes with strong performance is therefore highly challenging. We address the above challenges as follows: (1) a discrete optimization framework is proposed to directly learn the unified binary codes for heterogeneous modalities; (2) a multi-view clustering scheme is proposed to project the original data points from different modalities into their own low-dimensional latent spaces, preserving the diversity of the heterogeneous data; (3) based on the proposed cluster-wise prototypes, different data points in the same cluster share the same binary code. In this way, the semantic information is encoded in the final binary codes, and little information is lost when transforming the real-valued data points into binary codes. As an aside, existing cross-modal hashing methods can be roughly classified into supervised [7], [8], [9], [10] and unsupervised approaches [11], [12], [13], [14], [15], [16]; the details are presented in Section 2.
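To make the two-step scheme concrete, here is a minimal sketch under our own assumptions: toy paired image/text features sharing a latent signal, and a simplified CCA-like alignment via the SVD of the cross-covariance (a stand-in for the inter-modal coherence objectives used in the cited methods):

```python
import numpy as np

rng = np.random.default_rng(1)
n, dx, dy, k = 200, 64, 32, 16   # samples, image dim, text dim, code length

# Toy paired features sharing a common latent signal Z (illustration only).
Z = rng.standard_normal((n, k))
X = Z @ rng.standard_normal((k, dx)) + 0.1 * rng.standard_normal((n, dx))
Y = Z @ rng.standard_normal((k, dy)) + 0.1 * rng.standard_normal((n, dy))

# Step 1: project both modalities into a continuous common latent space,
# here via the SVD of the cross-covariance (a simplified CCA-like alignment).
C = X.T @ Y / n
U, _, Vt = np.linalg.svd(C, full_matrices=False)
Px, Py = X @ U[:, :k], Y @ Vt[:k].T

# Step 2: quantize the continuous projections with the sign function.
Bx, By = np.sign(Px), np.sign(Py)

# The quantization error the text refers to: the gap between the
# continuous embedding and its binary code, per entry.
q_err = np.linalg.norm(Bx - Px) ** 2 / Bx.size
```

The nonzero `q_err` is exactly the information loss of point (i) above: the sign function discards all magnitude information in the relaxed continuous solution.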

In this paper, we propose Cluster-wise Unsupervised Hashing (CUH), an effective hashing model for cross-modal retrieval problems. In particular, CUH employs multi-view clustering that projects the original data points from different modalities into their own low-dimensional latent semantic spaces, in order to find the cluster centroid points and the common clustering indicators in each space. It thereby jointly learns the compact hash codes and the corresponding linear hash functions. The flowchart of CUH is shown in Fig. 1. Inspired by Huang and Pan [17] and Xu et al. [18], we construct a seamless learning framework based on class-wise and multi-view clustering strategies as a co-training process for learning to hash in an unsupervised manner. Further, we simultaneously carry out the multi-view clustering, the learning of hash codes, and the learning of hash functions. These steps are jointly optimized in a unified learning structure, which preserves both inter-modal semantic coherence and intra-modal similarity while minimizing the least-absolute clustering residual and the quantization error. From the optimization perspective, CUH generates a single compact unified hash code for all observed modalities of any instance, enabling efficient cross-modal search, and it is flexible with respect to dataset size. The effectiveness of CUH is well demonstrated by a set of comprehensive experiments on diverse benchmark datasets.
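The cluster-wise code-prototype idea can be sketched as follows. This is an illustrative toy (a minimal single-view k-means and a thresholded centroid, both our own assumptions), not CUH's actual multi-view objective or its discrete optimizer; it only shows how binarizing centroids makes all points of a cluster share one code:

```python
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=50):
    # Minimal k-means: illustrative stand-in for the paper's multi-view clustering.
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels

# Toy low-dimensional projections of one modality: 3 well-separated clusters.
X = np.concatenate([rng.normal(m, 0.1, size=(30, 8)) for m in (-2.0, 0.0, 2.0)])

centers, labels = kmeans(X, k=3)

# Cluster-wise code prototypes: each centroid is binarized once, and every
# point inherits its cluster's prototype, so same-cluster points share a code.
prototypes = (centers > 0).astype(np.uint8)
codes = prototypes[labels]
```

Because a cluster contributes a single prototype, quantization happens once per centroid rather than once per point, which is why the prototype view loses little information relative to quantizing every relaxed embedding independently.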

The remainder of this paper is structured as follows. In Section 2, we briefly review the related work on cross-modal hashing methods. Section 3 elaborates our proposed cluster-wise unsupervised hashing method, along with an efficient discrete optimization algorithm to tackle the problem. In Section 4, we report the experimental results and extensive evaluations on popular benchmark datasets. Finally, we conclude in Section 5.

Section snippets

Related work

As mentioned above, cross-modal hashing methods fall into two categories, i.e., supervised and unsupervised approaches. The former are able to learn the hash functions in a better way by utilizing the available supervised information, while the latter exploit the intra-modality and inter-modality structure of the features in the training data to learn hash functions. Indeed, supervised methods require label information. However, in large-scale databases it

Proposed algorithm

In this section, we present the details of the CUH algorithm.

Experiments and evaluations

In this section, we conduct a comprehensive retrieval performance evaluation of CUH on three multimodal benchmark data sets against several state-of-the-art unsupervised cross-modal hashing methods. We first present the data sets, evaluation criteria, comparison methods, and implementation details. Next, we investigate the experimental results and discuss fair comparisons. Finally, the convergence and parameter sensitivity of CUH are

Conclusions

In this paper, we proposed Cluster-wise Unsupervised Hashing (CUH), a novel framework for the cross-modal similarity retrieval task. The proposed model integrates multi-view clustering and the learning of hash codes via cluster-wise code-prototypes (i.e., the cluster centroid points in multi-view clustering) into a unified binary optimization framework. We demonstrated that this integration mechanism generates better compact binary codes that

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Lu Wang received the B.S. degree in electronic information engineering from the Harbin Institute of Technology, Harbin, China, in 2016. He is currently pursuing the Ph.D. degree with the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China. His current research interests include machine learning and information retrieval with respect to learning to hash for large-scale cross-modal similarity retrieval, and visual tracking.

References (45)

  • V.E. Liong et al.

    Cross-modal discrete hashing

    Pattern Recognit.

    (2018)
  • H. Peng et al.

    Dual-supervised attention network for deep cross-modal hashing

    Pattern Recognit. Lett.

    (2019)
  • T. Wang et al.

    Unsupervised deep cross-modal hashing with virtual label regression

    Neurocomputing

    (2020)
  • W. Liu et al.

    Hashing with graphs

    Proceedings of the 28th International Conference on International Conference on Machine Learning, Omnipress

    (2011)
  • G. Lin et al.

    A general two-step approach to learning-based hashing

    Proceedings of the IEEE International Conference on Computer Vision

    (2013)
  • W. Xiao et al.

    Fast hash-based inter-block matching for screen content coding

    IEEE Trans. Circuits Syst. Video Technol.

    (2016)
  • C. Wu et al.

    Semi-supervised nonlinear hashing using bootstrap sequential projection learning

    IEEE Trans. Knowl. Data Eng.

    (2012)
  • D.C. Ngo et al.

    Biometric hash: high-confidence face recognition

    IEEE Trans. Circuits Syst. Video Technol.

    (2006)
  • P. Xu et al.

Nested-SIFT for efficient image matching and retrieval

    IEEE Multimedia

    (2013)
  • J. Zhou et al.

    Latent semantic sparse hashing for cross-modal similarity search

    Proceedings of the 37th International ACM SIGIR Conference on Research Development in Information Retrieval, ACM

    (2014)
  • Y. Zhen et al.

    Spectral multimodal hashing and its application to multimedia retrieval

    IEEE Trans. Cybern.

    (2015)
  • J. Song et al.

    Inter-media hashing for large-scale retrieval from heterogeneous data sources

    Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM

    (2013)
  • G. Ding et al.

    Collective matrix factorization hashing for multimodal data

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2014)
  • D. Zhang et al.

    Large-scale supervised multimodal hashing with semantic correlation maximization

    Twenty-Eighth AAAI Conference on Artificial Intelligence

    (2014)
  • Z. Yu et al.

    Discriminative coupled dictionary hashing for fast cross-media retrieval

Proceedings of the 37th International ACM SIGIR Conference on Research Development in Information Retrieval, ACM

    (2014)
  • S. Kumar et al.

    Learning hash functions for cross-view similarity search

    Twenty-Second International Joint Conference on Artificial Intelligence

    (2011)
  • Y. Zhen et al.

    Co-regularized hashing for multimodal data

    Adv. Neural Inf. Process. Syst.

    (2012)
  • M.M. Bronstein et al.

    Data fusion through cross-modality metric learning using similarity-sensitive hashing

    2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE

    (2010)
  • L.-K. Huang et al.

    Class-wise supervised hashing with label embedding and active bits

    IJCAI

    (2016)
  • J. Xu et al.

    Re-weighted discriminatively embedded k-means for multi-view clustering

    IEEE Trans. Image Process.

    (2017)
  • Y. Cao et al.

    Deep visual-semantic hashing for cross-modal retrieval

    Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM

    (2016)
  • Q.-Y. Jiang et al.

    Deep cross-modal hashing

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    (2017)


    Jie Yang received the Ph.D. degree from the Department of Computer Science, Hamburg University, Germany, in 1994. He is currently a Professor with the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China. He has been involved in research projects (e.g., the National Science Foundation and the 863 National High Tech. Plan). He has authored or co-authored one book in Germany and more than 300 journal papers. His current research interests include object detection and recognition, data fusion and data mining, and medical image processing.


    This research is partly supported by National Key R&D Program of China (No. 2019YFB1311503), and NSFC, China (No: 61876107, U1803261).
