Cluster-wise unsupervised hashing for cross-modal similarity search
Introduction
With the explosive growth of diverse, multimodal data in the form of images, texts, and videos, data analysis techniques that uncover semantic correlations across heterogeneous modalities have received increasing attention. When relevant data items in different modalities share a strong semantic correlation, it is natural to perform cross-modal search, that is, to retrieve similar items across heterogeneous modalities in response to a query. Taking Wikipedia as an example, we can retrieve images relevant to a query tag, or tags relevant to a query image. However, given the scale of modern databases and the heterogeneity, diversity, and large semantic gap between modalities, effective and efficient cross-modal retrieval remains a significant challenge.
Because the searchable database is typically large and measuring the similarity between the query and database items is computationally expensive, hashing-based methods have gained great popularity for their low storage cost, fast search speed, and impressive retrieval performance. A hashing method searches for approximate nearest neighbors (ANN) of a query item within the reference database, and has been applied in tasks across machine learning [1], [2], data mining [3], [4], and computer vision [5], [6], where it balances retrieval efficiency against retrieval accuracy. The basic component of a hashing model transforms high-dimensional data points into compact binary codes, producing similar binary code sequences for relevant data samples in different modalities.
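The pipeline just described, binarize high-dimensional features and then compare compact codes by Hamming distance, can be sketched with a random-projection baseline. This is an illustration only, not the method proposed in this paper; the database size, code length, and projection matrix are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy database: 1000 points in a 128-dimensional feature space.
database = rng.standard_normal((1000, 128))
query = rng.standard_normal(128)

# Random-projection hashing (an LSH-style baseline): each of the
# 32 random hyperplanes contributes one bit of the code.
projections = rng.standard_normal((128, 32))
db_codes = (database @ projections > 0).astype(np.uint8)  # (1000, 32)
q_code = (query @ projections > 0).astype(np.uint8)       # (32,)

# Hamming distance is just the number of differing bits.
hamming = np.count_nonzero(db_codes != q_code, axis=1)

# Approximate nearest neighbors: items with the smallest Hamming distance.
top5 = np.argsort(hamming)[:5]
```

Comparing 32-bit codes is far cheaper than computing 128-dimensional Euclidean distances, which is exactly the efficiency/accuracy trade-off the paragraph above refers to.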
More recently, many efforts have been made to encode the correlation structure between heterogeneous modalities, jointly learning hash functions and indexing cross-modal data points in a Hamming space [7], [8], [9], [10], [11], [12], [13], [14], [15]. Most of these cross-modal hashing methods follow a two-step scheme: first, project the multiple heterogeneous data modalities into a continuous common latent space by optimizing inter-modal coherence; second, quantize the continuous projections into compact binary sequences using the sign function. That said, cross-modal hashing still faces several open issues, and we focus on three fundamental ones. First, the transformation from real-valued data to discrete binary codes loses information, producing both a suboptimal continuous common latent space and suboptimal compact binary codes [7], [8]. Second, solving the optimization problem by relaxing the discrete constraints significantly degrades retrieval performance and increases the quantization error [11], [12]. Third, projecting heterogeneous data into a single common latent space fails to exploit the diverse representations of the individual modalities, which could otherwise help learn better binary codes for cross-modal search. Learning compact binary codes with strong performance is therefore highly challenging. We address the above challenges as follows: (1) a discrete optimization framework is proposed to directly learn unified binary codes for heterogeneous modalities; (2) a multi-view clustering scheme is proposed to project the original data points of each modality into its own low-dimensional latent space, which preserves the diversity of the heterogeneous data; (3) based on the proposed cluster-wise prototypes, data points in the same cluster share the same binary code.
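The two-step relax-then-quantize scheme, and the quantization error it incurs, can be made concrete in a few lines. In this sketch the modality projections are random placeholders standing in for learned ones (real methods optimize them for inter-modal coherence), and all dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_img, d_txt, k = 500, 64, 32, 16  # hypothetical sizes

X_img = rng.standard_normal((n, d_img))  # image features
X_txt = rng.standard_normal((n, d_txt))  # text features of the same instances

# Step 1 (relaxation): project both modalities into a shared k-dim
# continuous latent space; W_img/W_txt stand in for learned projections.
W_img = rng.standard_normal((d_img, k))
W_txt = rng.standard_normal((d_txt, k))
Z = 0.5 * (X_img @ W_img + X_txt @ W_txt)  # continuous common representation

# Step 2 (quantization): snap the continuous embedding to binary with sign().
B = np.sign(Z)
B[B == 0] = 1

# The quantization error incurred by this relaxation: ||B - Z||_F.
q_err = np.linalg.norm(B - Z)
```

The gap between `Z` and `B` is exactly the second issue listed above: a code learned only in the relaxed continuous space can be far from its binary quantization, which is why a direct discrete optimization is preferable.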
Under the cluster-wise design above, semantic information is encoded in the final binary codes, and little information is lost when transforming real-valued data points into binary codes. As an aside, existing cross-modal hashing methods can be roughly classified into supervised [7], [8], [9], [10] and unsupervised approaches [11], [12], [13], [14], [15], [16]; details are given in Section 2.
In this paper, we propose Cluster-wise Unsupervised Hashing (CUH), an effective hashing model for cross-modal retrieval. CUH performs multi-view clustering that projects the original data points of each modality into its own low-dimensional latent semantic space, finding the cluster centroids and the common clustering indicators in each space; in this way, it jointly learns the compact hash codes and the corresponding linear hash functions. The flowchart of CUH is shown in Fig. 1. Inspired by Huang and Pan [17] and Xu et al. [18], we construct a seamless learning framework based on class-wise and multi-view clustering strategies as a co-training process for unsupervised learning to hash. The multi-view clustering, the learning of hash codes, and the learning of hash functions are carried out simultaneously and jointly optimized in a unified framework, which preserves both inter-modal semantic coherence and intra-modal similarity while minimizing the least-absolute clustering residual and the quantization error. The proposed CUH generates a single compact unified hash code for all observed modalities of any instance, enabling efficient cross-modal search, and it scales flexibly with the data size. The effectiveness of CUH is demonstrated by comprehensive experiments on diverse benchmark datasets.
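A toy, single-modality sketch of the cluster-wise idea, in which points in the same cluster share one binary code prototype and a linear hash function is regressed onto those codes, might look as follows. This is an illustration only, not CUH's joint multi-view discrete optimization: the plain `kmeans` helper, the projection `P`, and all sizes are made-up stand-ins.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, n_clusters, k = 300, 64, 8, 16  # hypothetical sizes

X = rng.standard_normal((n, d))  # one modality's features


def kmeans(X, c, iters=20):
    """Plain k-means, standing in for the paper's multi-view clustering."""
    centroids = X[rng.choice(len(X), c, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(c):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids


labels, centroids = kmeans(X, n_clusters)

# Cluster-wise code prototypes: every point in a cluster shares one binary
# code, here simply the sign of the centroid's (random) projection.
P = rng.standard_normal((d, k))
prototypes = np.sign(centroids @ P)
prototypes[prototypes == 0] = 1
B = prototypes[labels]  # (n, k) unified binary codes

# A linear hash function fit by least squares, so that unseen queries
# are mapped close to the codes of their cluster.
W, *_ = np.linalg.lstsq(X, B, rcond=None)
```

The key property is visible in `B`: all points with the same cluster label receive identical codes, so the semantic grouping found by clustering survives the quantization step.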
The remainder of this paper is structured as follows. In Section 2, we briefly review related work on cross-modal hashing. Section 3 elaborates our proposed cluster-wise unsupervised hashing method, along with an efficient discrete optimization algorithm. In Section 4, we report experimental results and extensive evaluations on popular benchmark datasets. Finally, we draw conclusions in Section 5.
Related work
As mentioned above, cross-modal hashing methods fall into two categories, i.e., supervised and unsupervised approaches. Unsupervised methods exploit the intra-modality and inter-modality structure of the features in the training data to learn hash functions, while supervised methods can learn hash functions in a better way by utilizing the available supervised information. Indeed, supervised methods require label information. However, in large-scale databases it
Proposed algorithm
In this section, we present the details of the CUH algorithm.
Experiments and evaluations
In this section, we conduct a comprehensive retrieval performance evaluation of CUH on three multimodal benchmark datasets against several state-of-the-art unsupervised cross-modal hashing methods. We first present the datasets, evaluation criteria, comparison methods, and implementation details. Next, we investigate the experimental results and discuss them in terms of fair comparisons. Finally, the convergence and parameter sensitivity of CUH are
Conclusions
In this paper, we proposed cluster-wise unsupervised hashing (CUH), a novel unsupervised framework for the cross-modal similarity retrieval task. The proposed model integrates multi-view clustering and the learning of hash codes via cluster-wise code prototypes (i.e., the cluster centroids in multi-view clustering) into a unified binary optimization framework. We demonstrated that this integration generates better compact binary codes that
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Lu Wang received the B.S. degree in electronic information engineering from the Harbin Institute of Technology, Harbin, China, in 2016. He is currently pursuing the Ph.D. degree with the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China. His current research interests include machine learning and information retrieval, with respect to learning to hash for large-scale cross-modal similarity retrieval and visual tracking.
References (45)
- et al., Cross-modal discrete hashing, Pattern Recognit. (2018)
- et al., Dual-supervised attention network for deep cross-modal hashing, Pattern Recognit. Lett. (2019)
- et al., Unsupervised deep cross-modal hashing with virtual label regression, Neurocomputing (2020)
- et al., Hashing with graphs, Proceedings of the 28th International Conference on Machine Learning, Omnipress (2011)
- et al., A general two-step approach to learning-based hashing, Proceedings of the IEEE International Conference on Computer Vision (2013)
- et al., Fast hash-based inter-block matching for screen content coding, IEEE Trans. Circuits Syst. Video Technol. (2016)
- et al., Semi-supervised nonlinear hashing using bootstrap sequential projection learning, IEEE Trans. Knowl. Data Eng. (2012)
- et al., Biometric hash: high-confidence face recognition, IEEE Trans. Circuits Syst. Video Technol. (2006)
- et al., Nested-SIFT for efficient image matching and retrieval, IEEE Multimedia (2013)
- et al., Latent semantic sparse hashing for cross-modal similarity search, Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2014)
- Spectral multimodal hashing and its application to multimedia retrieval, IEEE Trans. Cybern.
- Inter-media hashing for large-scale retrieval from heterogeneous data sources, Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM
- Collective matrix factorization hashing for multimodal data, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
- Large-scale supervised multimodal hashing with semantic correlation maximization, Twenty-Eighth AAAI Conference on Artificial Intelligence
- Discriminative coupled dictionary hashing for fast cross-media retrieval, Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM
- Learning hash functions for cross-view similarity search, Twenty-Second International Joint Conference on Artificial Intelligence
- Co-regularized hashing for multimodal data, Adv. Neural Inf. Process. Syst.
- Data fusion through cross-modality metric learning using similarity-sensitive hashing, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE
- Class-wise supervised hashing with label embedding and active bits, IJCAI
- Re-weighted discriminatively embedded K-means for multi-view clustering, IEEE Trans. Image Process.
- Deep visual-semantic hashing for cross-modal retrieval, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM
- Deep cross-modal hashing, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Cited by (19)
- Multi-task subspace clustering, Information Sciences (2024)
- Multi-similarity reconstructing and clustering-based contrastive hashing for cross-modal retrieval, Information Sciences (2023)
- EDMH: Efficient discrete matrix factorization hashing for multi-modal similarity retrieval, Information Processing and Management (2023)
- Semi-supervised cross-modal hashing via modality-specific and cross-modal graph convolutional networks, Pattern Recognition (2023). Citation excerpt: "Some unsupervised cross-modal hashing methods have demonstrated that unlabeled multi-media data is also useful for the retrieval task [15,16]. For example, cluster-wise unsupervised hashing (CUH) [17] adopts the multi-view clustering manner to project data of different modalities into latent space to seek cluster centroid points for learning compact hash codes and linear hash functions. Focusing on the unsupervised retrieval task, aggregation-based graph convolutional hashing (AGCH) [18] uses multiple metrics to formulate affinity matrix for hash code learning."
- Discrete online cross-modal hashing, Pattern Recognition (2022). Citation excerpt: "As an alternative, approximate nearest neighbors (ANN), especially learning to hash [7,8], has attracted increasing attention. Cross-modal hashing methods transform high-dimensional multimedia data into compact binary codes while generating similar binary codes for similar data items [9–11]. With low storage cost and fast query speed, hashing-based methods can efficiently calculate the hamming distance by using XOR operation and dramatically reduce storage cost by using binary hash codes to represent data in the Hamming space [12–15]."
- Deep momentum uncertainty hashing, Pattern Recognition (2022). Citation excerpt: "Meanwhile, this paper also explores the uncertainty in the optimization process, which is expected to provide new insights for other combinatorial problems. Hashing aims to project data from high-dimensional pixel space into the low-dimensional binary Hamming space [35,36]. It has drawn substantial attention of researchers due to the low time and space complexity."
Jie Yang received the Ph.D. degree from the Department of Computer Science, Hamburg University, Germany, in 1994. He is currently a Professor with the Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, Shanghai, China. He has been involved in research projects (e.g., National Science Foundation and 863 National High Tech. Plan). He has authored or co-authored one book in Germany and more than 300 journal papers. His current research interests include object detection and recognition, data fusion and data mining, and medical image processing.