Abstract

Tourist image retrieval has attracted increasing attention from researchers. In particular, supervised deep hashing methods have significantly boosted retrieval performance: they take hand-crafted features as inputs and map the high-dimensional feature vectors to compact binary codes, which reduces the complexity of feature search. However, their performance depends on supervised labels, and little labeled temporal and discriminative information is available for tourist images. This paper proposes an improved deep hashing method that learns enhanced hash codes for tourist image retrieval. It jointly learns image representations and hash functions with deep neural networks and simultaneously enhances the discriminative capability of tourist image hash codes with the refined semantics of the accompanying tags. Furthermore, we fine-tune the CNN to implement end-to-end training of the hash mapping and compute the semantic distance between two samples from the obtained binary codes. Experiments on several datasets demonstrate the superiority of the proposed approach over state-of-the-art shallow and deep hashing techniques.

1. Introduction

With the rise of cheap sensors, mobile terminals, and social networks, research on tourist images is making good progress, and the demand for image retrieval in social networks has grown explosively. This trend poses great challenges for developing scalable indexing approaches that support retrieving relevant images from such massive collections of tourist images. However, current tourist image retrieval mainly relies on manual tags for sensor types, tourist sights, and geographical locations, or on hand-crafted local descriptors such as SIFT [1], HOG [2], and BOW [3] that encode image regions of interest. Consequently, tag-based retrieval is highly dependent on the availability and quality of tags.

Owing to its fast query speed and low storage cost, learning-based hashing has attracted research interest and has been applied to large-scale object retrieval [4], image classification [5], and detection [3]. Binary hash codes are very efficient for large-scale image retrieval because Hamming distances are cheap to compute and the codes require little storage. Recently, deep learning-based hashing methods have shown promising performance [6, 7]. Convolutional neural network hashing (CNNH) [8] incorporates deep neural networks into hash coding to learn image representations and hash codes. Network in network hashing (NINH) [9] presents a triplet ranking loss to capture the relative similarities of images. The image representation learning and hash coding can benefit each other within a staged framework. In deep semantic hashing [10], the final hash codes produced by the learned hash functions preserve semantic-level similarity. Other hashing methods have also been proposed [11–13].

Although hashing methods have achieved remarkable performance, they still suffer from the following two problems:
(1) Existing methods learn binary hash codes from hand-crafted feature representations, which cannot accurately capture the inherent semantic similarities of images.
(2) In most existing hashing methods for images, the semantic similarities are defined at the image level, and each picture is represented by a single hash code.

This paper considers large-scale retrieval for multilabel tourist image data, which includes semantic hashing and category-aware hashing. We propose a deep convolutional network architecture designed for hash learning, which achieves substantially better performance on large-scale tourist images by learning discriminative short binary codes end to end. The main contributions of this paper are as follows:
(1) For binary hash optimization, we propose a discrete hash optimization strategy based on the inner relationship, which learns hash codes without relaxation and thus reduces the quantization information loss.
(2) We provide an improved divide-code layer that substitutes for the fully connected layers to learn binary hash codes, which reduces redundancy and the number of parameters in the retrieval task. In addition, we use an improved triplet loss function to preserve feature similarity in the binary codes and to improve training efficiency.
(3) In terms of applications, the deep hashing method is employed for large-scale tourist image retrieval. This paper thus illustrates how to design and train a deep network for large-scale tourist image retrieval.

2. Related Work

This section briefly reviews two topics: (1) tourist image retrieval models and (2) hashing retrieval models.

2.1. Tourist Image Retrieval Model

Numerous tourist image retrieval methods based on landmark datasets have been proposed. They often use visual descriptors to describe images, and the key is how to improve the expressive ability of these descriptors. For example, Hao et al. [14] and Xiao et al. [15] used multidimensional models to rank images by spatial properties and utilized three-dimensional visual phrases to describe landmark images. However, these methods suffer from long modeling times and high retrieval costs. Recently, to reduce the cost of retrieval, many researchers have turned to binary codes that compress the high-dimensional visual-word features of landmarks. Ji et al. [16] proposed a Location Discriminative Vocabulary Coding (LDVC) scheme, which achieves extremely low bit rate query transmission, discriminative landmark description, and scalable descriptor delivery in a unified framework. Duan et al. [17] combined multiple sources of information, such as images, GPS, and crowd-sourced Wi-Fi hotspots, to extract location-discriminative compact image descriptors. Zhou et al. [18] used the scalable cascaded hashing (SCH) method to implement landmark hashing retrieval. Zhu et al. [19] used a discrete multimodal hashing scheme based on canonical views (CV-DMH) to learn binary codes through a new three-stage learning process. Jing et al. [20] investigated the spatiotemporal dynamic patterns of inbound tourism. Cui et al. [21] proposed scalable deep hashing (SCADH) to learn enhanced hash codes for social image retrieval.

Furthermore, complex network theory has been used to mine tourism flow patterns [22]. The retrieval methods above first extract image features and then apply a hashing algorithm through iterative computation. However, none of them learns the hash function end to end. Moreover, most of them still rely on hand-crafted features, which generalize and transfer poorly.

Recent examples in which deep learning has made significant advances in tourist image retrieval include positioning the city [23] and tourist photo classification [24]. In addition, many studies have been conducted to analyze the tourist’s urban image by modifying the classifier part of the CNN model [25] or considering local characteristics [26]. However, these studies are limited in reflecting the unique landscape or regional characteristics in the area.

2.2. Hashing Retrieval Model

Learning-based hashing retrieval methods can be divided into unsupervised methods and supervised methods. Unsupervised learning had a catalytic effect in reviving interest in hashing retrieval but has since been overshadowed by the successes of purely supervised learning. Unsupervised methods use only the information in the image samples and require no supervision information for hashing. Notable examples in this category include locality-sensitive hashing (LSH) [27], iterative quantization (ITQ) [28], discrete graph hashing (DGH) [29], scalable graph hashing (SGH) [30], and spectral hashing (SH) [31]. Unsupervised training for hashing retrieval is regarded as a “pretraining” phase whose role is to discover good features that model the structure of the input domain. In contrast, supervised methods learn hash codes using both feature information and labels; they include minimal loss hashing [32], kernel-based supervised hashing (KSH) [33], ranking-based supervised hashing (RSH) [34], and column generation hashing (CGH) [35].

New advances in machine learning using deep neural networks enable automated learning of hashing functions. Xia et al. [36] applied deep hashing using a similarity matrix and minimized a loss function to find approximate hash codes. Although this dramatically improved retrieval performance, it is still not a true end-to-end method. Zhao et al. [37] proposed a deep hashing algorithm based on ranking tags. Since image retrieval aims to return images ranked by their correlation with the query, this approach is optimized directly for the final evaluation index. Lin et al. [38] proposed a straightforward method to obtain hash values. They added a fixed-length hidden layer to the CNN whose output is bounded by the activation function. After fine-tuning the CNN, the hidden layer values are extracted directly, and the number of nodes in the hidden layer is the length of the hash code. Although the feature values obtained by this method contain the high-level semantics of the image, the process does not consider the correlation of features in the Hamming space. Therefore, it cannot guarantee the retrieval quality in the Hamming space.

Later, Lai et al. [9] proposed a training method based on triplets. The training objective is to make similar images closer in the Hamming space than dissimilar images. Recently, some semisupervised deep hashing models have been proposed that utilize unlabeled data to improve retrieval accuracy. Yan et al. [39] proposed the BGDH method to learn embeddings, features, and hash codes simultaneously. Zhang and Peng [40] developed the deep hashing method SSDH, which preserves the underlying data structures and the semantic similarity simultaneously to learn hash functions. Both methods use a graph to model unlabeled training samples, which is computationally expensive and memory-intensive, especially on large-scale datasets. Shi et al. [41] used a GAN and a discriminative model to learn from both unlabeled and labeled data to augment the training dataset, which may not be well suited to semantic representation. Tu et al. developed RDUH [42], which focuses on reducing noisy points by investigating the various input data structures.

Recently, cross-modal hashing methods have provided insight into capturing the intrinsic relationships between various modalities [43] and quantization-based cross-modal similarity [44]. Furthermore, Deng et al. [45] showed that the semantic similarity of the training data can be used to learn binary hash codes in an unsupervised manner. However, natural images can have large intraclass and small interclass variations. Thus, learning hash codes with class-specific representation centers is required [46]. To further bridge the inherent modality gap, multitask consistency-preserving adversarial hashing (CPAH) [47] was proposed to fully explore the semantic consistency and correlation between different modalities for efficient cross-modal retrieval.

3. The Proposed Method

In this section, we present the details of our proposed method. We first define the notations used in this paper. Then, we introduce our deep feature learning process, deep hash model training process, and hash code learning process. Finally, we present a hash optimization solution for solving the hash codes and functions and analyze its convergence and complexity.

3.1. Notations and Problem Definitions

For a tourist image dataset consisting of $n$ images with $l$ user-provided semantic tags, each image is represented by $x_i$, and the relationships between the image and the tags can be represented as an $l$-dimensional binary-valued vector $y_i \in \{0, 1\}^l$. The image matrix is denoted as $X = [x_1, \dots, x_n]$, and $Y = [y_1, \dots, y_n]$ represents the observed image-tag relation matrix.

We aim to learn a set of hash codes $B = \{b_i\}_{i=1}^{n}$ with $b_i = h(x_i) \in \{0, 1\}^q$, where $q$ is the length of the binary code and $h(\cdot)$ is the hash function. The binary codes should preserve the similarity of the original data space. Generally, the hash function satisfies the following:
(1) $b_i$ and $b_j$ are close in the Hamming space when $x_i$ and $x_j$ are similar.
(2) $b_i$ and $b_j$ are far apart in the Hamming space when $x_i$ and $x_j$ are dissimilar.

From the view of geographical position semantics, tourist images and the accompanying tags are highly correlated. These tags contain explicit semantics that is complementary to the latent image semantics. Hence, it is promising to exploit the refined auxiliary social tags for the semantic enrichment of image hash codes. To this end, we introduce a semantic correlation matrix W that directly correlates hash codes with refined social tags. The dynamic semantics can be directly transferred to hash codes. We aim to minimize the difference between the binary hash codes and the mapped semantic vectors from the refined tags.

We propose a deep convolutional network architecture designed for hash learning, as shown in Figure 1. In detail, we build an end-to-end learning framework that utilizes hash mapping for tourist attraction image retrieval. The method is divided into three parts. The first is a subnetwork with multiple convolution and pooling layers for learning discriminative image features, pretrained on the Place-2 dataset [48]. The second is the hash layer, which consists of a block coding layer and an activation function. The third is the improved triplet loss function that we use as the objective function to optimize the network. The training process is divided into many minibatches for iterative learning. Each minibatch uses multiple images belonging to different categories as input.

3.2. Feature Learning and Deep Convolution Subnetwork Module

Most existing hashing methods adopt hand-crafted features for hash function learning. However, these methods may achieve limited performance because the hand-crafted features might not be optimally compatible with the hash function learning procedure. We propose our deep convolution subnetwork module, which can perform simultaneous feature learning and hash learning in the same framework. The subnetwork is used to learn the image features that can describe the image accurately. After training, the input image is processed through the network to obtain rich semantic descriptors with excellent expressiveness and robustness.

The tags from tourist images are subject to two properties: low rank and error sparsity. We use VGG-16 as the subnetwork and transfer the model parameters trained on the Place-365 dataset to the network as the initial parameters. Since the scene recognition task has some similarities with the tourist attraction recognition task, transferring the parameters from the network trained on Place-365 to the subnetwork can significantly improve the model’s performance. The detailed network configuration is shown in Table 1; the network contains five convolutional blocks, five pooling layers, and two fully connected layers. Each convolutional block is followed by 2 × 2 max pooling with a stride of 2.

3.3. Hash Code Learning

Most existing studies use metric learning to train on positive and negative sample pairs to preserve the similarity relationship of the binary codes [49–52]. However, it is challenging to represent geographic characteristics as a single binary code without losing a significant amount of helpful information. Hence, there is no need to conduct such an evaluation globally, but only among the segments relevant to users’ geographic information needs. For example, a single tourist image can be represented by multiple binary vectors by treating each block as an image feature.

Tourist images and the accompanying tags are positively correlated with each other. Moreover, these tags contain explicit semantics, which is complementary to the latent image semantics. Hence, it is promising to exploit the refined auxiliary social tags for the semantic enrichment of image hash codes. To this end, we aim to minimize the difference between the binary hash codes and the mapped semantic vectors from the refined social tags.

We uncover the intrinsic low-rank matrix by decomposing the image-tag relation matrix into its low-rank and sparse components. The low-rank matrix is then used in semantic enhancement as a semantic source to enhance the discriminative capability of the learned hash codes. In addition, we use a block-coded structure instead of a fully connected layer to implement the hash layer, which consists of a block-coded layer, an activation layer for each subblock, and a concat layer.

Consider a tourist image dataset consisting of $n$ images $\{x_i\}_{i=1}^{n}$; we divide the features of the fully connected layer into $q$ blocks, where $q$ denotes the length of the binary hash code for constructing the block-coded structure. The subfeature $f^{(j)}$ obtained from the $j$-th slice layer is the input to the $j$-th fully connected layer, $j = 1, \dots, q$, and the output of each fully connected layer is 1-dimensional, which is expressed as follows:

$$v_j = W_j^{\top} f^{(j)}, \tag{1}$$

where $W_j$ is the weight matrix of the $j$-th subblock. The output of each subblock is the input of the activation layer, and the sigmoid function is chosen as the activation function, which is denoted as follows:

$$s_j = \operatorname{sigmoid}(v_j) = \frac{1}{1 + e^{-v_j}}, \tag{2}$$

where $s_j \in (0, 1)$. After the feature values are concatenated into a feature vector, the relaxed binary vector $s = (s_1, \dots, s_q)$ is obtained. To improve the performance, we do not directly map the image into binary values in $\{0, 1\}$. Instead, we use the activation function to limit the feature values to $(0, 1)$ and then use thresholding to quantize the relaxed binary vector into the binary code.
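The block-coded hash layer described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function names, the equal-length slicing of the feature vector, the omission of bias terms, and the quantization threshold of 0.5 are all assumptions made for the example.

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def block_hash_layer(features, weights, threshold=0.5):
    """Map one feature vector to (relaxed code, binary code).

    features:  flat list of floats whose length is divisible by q.
    weights:   list of q weight vectors, one per sub-block (hypothetical
               layout; bias terms are omitted for brevity).
    threshold: assumed quantization threshold, not taken from the paper.
    """
    q = len(weights)
    if q == 0 or len(features) % q != 0:
        raise ValueError("feature length must be divisible by the code length q")
    block_len = len(features) // q
    relaxed = []
    for j in range(q):
        # Slice layer: take the j-th sub-feature.
        block = features[j * block_len:(j + 1) * block_len]
        # Per-block fully connected layer with a 1-dimensional output.
        v = sum(w * f for w, f in zip(weights[j], block))
        # Activation layer: squash the output into the unit interval.
        relaxed.append(sigmoid(v))
    # Thresholding: quantize the relaxed values into bits.
    code = [1 if s >= threshold else 0 for s in relaxed]
    return relaxed, code


relaxed, code = block_hash_layer([1.0, 2.0, -1.0, -2.0],
                                 [[1.0, 1.0], [1.0, 1.0]])
print(relaxed, code)
```

Because each bit depends only on its own slice, the layer needs one weight vector per block instead of a dense matrix over the whole feature vector, which is the parameter saving the block-coded structure aims for.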

3.4. Triplet Loss and Optimization

We propose an improved triplet loss function to optimize the network to effectively preserve semantic similarities of images into the binary hash codes.

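Let $x$ be an image. The input to the proposed deep architecture is a triplet of sample images, that is, $\{x^a, x^p, x^n\}$, with $(x^a, x^p) \in S$ and $(x^a, x^n) \notin S$, where $S$ denotes the set of image pairs with the same identity. The optimization of this triplet loss function narrows the distance between samples $x^a$ and $x^p$ and enlarges the distance between samples $x^a$ and $x^n$. We use $D(s^a, s^p)$ and $D(s^a, s^n)$ to represent the Euclidean distances between them, where $s^a$, $s^p$, and $s^n$ are the relaxed binary codes obtained from the samples. As the Euclidean distance can approximately represent the Hamming distance, the optimization goal is $D(s^a, s^p) + \sigma < D(s^a, s^n)$, where $\sigma$ is the margin. In this way, the objective function can be defined as

$$L(x^a, x^p, x^n) = \max\bigl(0,\; \sigma + D(s^a, s^p) - D(s^a, s^n)\bigr). \tag{3}$$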

Because the Euclidean distance is more stable in the training process and the meaning of the function is more consistent with the actual definition [42], we use the Euclidean distance $D$ to measure the distance in the Hamming space rather than the squared Euclidean distance $D^2$, which is used in the classical triplet loss function. The optimization aims to separate similar samples from dissimilar samples by at least the margin, which maps semantically similar pictures to adjacent locations in the Hamming space. Thus, the semantic features of the images extracted by the CNN can be preserved in the hash code.

The basic rule of designing the loss function is to preserve the similarity order, that is, to minimize the gap between the approximate nearest neighbor search result computed from the hash codes and the ideal search result obtained from the input space. A widely used solution is to select, within a minibatch, the triplets in which the distance between $x^a$ and $x^p$ is greater than the distance between $x^a$ and $x^n$. In this work, we choose the hardest positive and negative sample pairs to compute the loss. The function is defined as follows:

$$L_{BH} = \sum_{i=1}^{P} \sum_{a=1}^{K} \max\Bigl(0,\; \sigma + \max_{p = 1, \dots, K} D(s_a^i, s_p^i) - \min_{\substack{j = 1, \dots, P,\ j \ne i \\ n = 1, \dots, K}} D(s_a^i, s_n^j)\Bigr), \tag{4}$$

where $P$ stands for the number of categories in the batch, $K$ stands for the number of images per category, $x_a^i$ means the $a$-th picture in the $i$-th class with relaxed code $s_a^i$, and $\sigma$ is the margin parameter.
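The hardest-pair selection can be illustrated with a small pure-Python sketch. It is a simplified reading of the loss described above, not the authors' code: the function names are hypothetical, the batch is given as plain lists of relaxed codes and class labels, and anchors without any positive or negative partner in the batch are simply skipped.

```python
import math


def euclidean(u, v):
    """Plain (non-squared) Euclidean distance between two code vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def batch_hard_triplet_loss(codes, labels, margin=0.2):
    """Sum of hinge losses over all anchors in one minibatch.

    For every anchor, the hardest positive is the farthest sample with the
    same label, and the hardest negative is the closest sample with a
    different label. The default margin of 0.2 mirrors the initial value
    mentioned in the text.
    """
    total = 0.0
    for a, (code_a, label_a) in enumerate(zip(codes, labels)):
        positives = [euclidean(code_a, c)
                     for i, (c, l) in enumerate(zip(codes, labels))
                     if l == label_a and i != a]
        negatives = [euclidean(code_a, c)
                     for c, l in zip(codes, labels) if l != label_a]
        if not positives or not negatives:
            continue  # assumption: anchors without a valid pair are ignored
        total += max(0.0, margin + max(positives) - min(negatives))
    return total


# Two classes with one-dimensional relaxed codes (toy values).
toy_codes = [[0.0], [1.0], [0.5], [2.0]]
toy_labels = [0, 0, 1, 1]
print(batch_hard_triplet_loss(toy_codes, toy_labels))
```

Restricting the search for hard pairs to the current minibatch keeps the cost quadratic in the batch size rather than in the size of the training set.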

Convergence speed is sensitive to the selection of triplets. Here, we use large minibatches and only compute the hardest positive and negative samples within a minibatch instead of selecting the hardest triplets over all training data. Furthermore, these functions are differentiable almost everywhere, which means they can be used in models trained by stochastic gradient descent. In our implementation, batches of 20–30 exemplars proved more efficient.

Moreover, in equation (4), the manually set margin parameter σ enforces a margin between the hard positive and hard negative pairs. We tune this parameter during training, starting from an initial value of 0.2; in our implementation, margin values between 0.1 and 0.8 proved more efficient. How to automatically determine the margin and incorporate class-specific or sample-specific margins remains challenging.

3.5. Generate Hash Code

When the network training is completed, a given image is mapped to a $q$-bit hash code. We define $\operatorname{sgn}(\cdot)$ as a sign function applied to each component.

If the feature vector of image $x$ extracted from the network merging layer is $s(x)$, then the hash code of this image can be described as $h(x) = \operatorname{sgn}(s(x))$. We can compute the hash codes of all images in the database to build a binary index library. During retrieval, we use the hash code to perform nearest-neighbor search in the Hamming space, which is very efficient because the Hamming distance can be calculated using XOR.
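To make the XOR-based lookup concrete, the following sketch packs each hash code into a Python integer and ranks database entries by Hamming distance. The helper names and the linear scan over the index are illustrative assumptions; a production system would typically use a more elaborate index structure.

```python
def pack_bits(bits):
    """Pack a sequence of 0/1 values into one integer, first bit most significant."""
    value = 0
    for b in bits:
        value = (value << 1) | (1 if b else 0)
    return value


def hamming(a, b):
    """Hamming distance between two packed codes: XOR, then count the set bits."""
    return bin(a ^ b).count("1")


def search(query, index, top_k=3):
    """Return the positions of the top_k database codes closest to the query."""
    ranked = sorted(range(len(index)), key=lambda i: hamming(query, index[i]))
    return ranked[:top_k]


# Toy binary index library with 4-bit codes.
index = [pack_bits(code) for code in ([0, 0, 0, 0], [1, 0, 1, 1],
                                      [1, 1, 1, 1], [0, 0, 1, 1])]
print(search(pack_bits([1, 0, 1, 1]), index))
```

Since Python's `sorted` is stable, database entries at the same Hamming distance keep their original order in the result.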

The main steps of the proposed method are summarized in Algorithm 1.

Input: X, the training image matrix
  q, the hash code length
  j, number of sub-layers
  W, the weight matrix
Output: deep hash functions h(x)
(1) Initialize the deep model with the pre-trained VGG-16 sub-network;
(2) Update W during training according to the loss function;
(3) For each image x in X do
(4)  For iter = 1 to j do
(5)   Compute the sub-block output by equation (1);
(6)   Compute the relaxed binary value by equation (2);
(7)   Quantize the relaxed binary value into binary code by thresholding;
(8)   Return h(x)
(9)  End for
(10) End for

4. Experiments

In this section, we conduct extensive experiments on two tourist image datasets to evaluate the efficiency and effectiveness of the proposed method. The details of the experiments and the results are described in the following sections.

4.1. Datasets and Experimental Settings
4.1.1. Datasets

(1) China-60 Dataset. Most public landmark datasets such as Oxford5K and Paris6K present unrelated images suitable for classification frameworks. However, images representing views of the same scene are needed. Thus, we developed a dataset called China-60, with images randomly selected from Flickr and Baidu Images based on the keywords of 60 popular tourist attractions in China. The variability of the images comes from different viewing scales, angles, lighting conditions, and image clutter. We provide 3–5 tags to describe the contents of each image, such as names and places. Because the primary purpose of our research is tourism image retrieval, we developed this dataset of Chinese tourist attraction images to verify the method’s performance on the image retrieval task.

For each tourist attraction, we crawl 500 to 600 images and remove irrelevant or low-quality photos. The final dataset contains 25,890 images of 60 tourist attractions, including buildings, rivers, forests, mountains, and other types of sights, all photographed under different lighting, seasons, and angles. We divide the dataset into a training set, a test set, and a validation set in a ratio of 8 : 1 : 1. In evaluation, images belonging to the same tourist attraction are considered similar; otherwise, they are deemed dissimilar. Typical images are shown in Figure 2.
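As a rough illustration of the 8 : 1 : 1 protocol, the sketch below shuffles a list of image identifiers and cuts it into three parts. Whether the original split was stratified per attraction is not stated, so this version uses a plain random shuffle; the function name, the fixed seed, and the integer identifiers are assumptions for the example.

```python
import random


def split_dataset(items, ratios=(8, 1, 1), seed=0):
    """Shuffle items and split them into (train, test, validation) lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed for a repeatable split
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_test = len(items) * ratios[1] // total
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]  # remainder goes to validation
    return train, test, validation


train, test, validation = split_dataset(range(25890))
print(len(train), len(test), len(validation))
```

Under this sketch, 25,890 images are divided into 20,712 training, 2,589 test, and 2,589 validation images.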

(2) Public Datasets. For a clear comparison and analysis, we also experiment on the public datasets Cifar-10 and Flickr30k. Cifar-10 contains 60,000 images, which are divided into ten categories, each containing 6,000 images. All images have a resolution of 32 × 32. We also divide them into a training set, a validation set, and a test set in a ratio of 8 : 1 : 1. Flickr30k contains 31,783 images focusing mainly on people and animals. We randomly select 1,000 outdoor images for the test set and use the remaining 30,783 images for the training set.

4.1.2. Baseline and Evaluating Indicators

(1) Baseline. To illustrate the benefits of the proposed method, we compare it with various approaches, including the traditional hashing approaches LSH [27], SH [31], PCAH [53], PCA-ITQ, PCA-RR [28], CBR-rand, CBR-opt [54], and DSH [55]. We also compare it with deep hashing approaches, such as DLBHC [38] and DNNH [9]. For the traditional methods, features extracted from the fine-tuned pretrained VGG network are used as the input of the mapping function instead of hand-crafted features.

(2) Evaluating Indicators. Four evaluation indicators are used to assess the performance of the different methods: (1) the precision at N samples curve, where precision is the proportion of correct samples among the returned images, (2) the recall at N samples curve, where recall is the proportion of correct results among the returned results relative to all correct results, (3) the precision-recall (P-R) curve, which shows how precision changes with recall, and (4) the mean average precision (MAP), which corresponds to the area under the P-R curve.
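These indicators can be computed from a ranked list of relevance flags, as in the following sketch. The function names are hypothetical, and average precision is normalized here by the number of relevant items found in the returned list, which is one common convention and an assumption about the exact protocol used in the experiments.

```python
def precision_at_n(relevance, n):
    """Fraction of correct samples among the first n returned images."""
    top = relevance[:n]
    return sum(top) / len(top) if top else 0.0


def recall_at_n(relevance, n, total_relevant):
    """Fraction of all correct samples that appear in the first n results."""
    return sum(relevance[:n]) / total_relevant if total_relevant else 0.0


def average_precision(relevance):
    """Mean of the precision values taken at each rank holding a correct sample."""
    hits = 0
    score = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / rank
    return score / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Average of the per-query average precision values."""
    if not ranked_lists:
        return 0.0
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


# 1 marks a returned image from the same attraction as the query, 0 otherwise.
ranking = [1, 0, 1, 0]
print(precision_at_n(ranking, 2), recall_at_n(ranking, 2, 2),
      average_precision(ranking))
```

Sweeping n over the length of the ranked list yields the points of the precision at N, recall at N, and P-R curves.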

4.2. Results and Discussion on China-60

We first evaluate the effectiveness by comparing each method’s performance under different hash code lengths. Firstly, we assess the performance in terms of MAP, calculated over all returned samples sorted by Hamming distance. The MAP values are shown in Table 2, where DNNH, DLBHC, and the proposed method are deep hashing methods, while the others are traditional hashing methods. As shown in Table 2, the proposed method performs better than the other methods, and the MAP values of most methods are positively correlated with the length of the hash code. The experiments show that the performance of traditional hashing methods is often highly correlated with the size of the binary feature.

Figure 3 shows the precision-recall (P-R) curves for different methods on the Cifar-10 dataset. We plotted P-R curves for hash codes of four different lengths. It can be seen from the diagram that, when the recall rate is low, our approach always maintains the highest precision and the smallest curve slope for all hash code lengths. This means that our method has better retrieval performance. The graph also shows the gap between the deep hashing algorithms and the traditional algorithms. Most traditional hashing algorithms have a concave curve on short hash codes, signifying that they perform poorly on short hash codes. However, as the length of the hash code increases, some of the P-R curves of traditional hashing algorithms become convex, which signifies that a longer hash code is often required to ensure the retrieval effect of conventional hashing algorithms. This is consistent with the observation above. On the other hand, the curve shape of the deep hashing algorithms varies only slightly under different hash code lengths, showing the stability and superiority of the deep hashing algorithms.

The TOP-K precision rate reflects the proportion of correct results among the first K returned results of a query, which the user can intuitively perceive in the retrieval results. Therefore, the TOP-K precision rate is an important index for evaluating the practical performance of a retrieval algorithm. Figure 4 shows the precision of the TOP-K retrieval results in nearest-neighbor retrieval. The plot shows the precision curves for hash codes of 32 bits (a), 64 bits (b), 128 bits (c), and 256 bits (d), respectively. The horizontal coordinate of the curve is the number of returned samples, and the vertical coordinate is the precision rate. It can be seen from the diagram that the retrieval precision of our approach is the best in all cases, and the precision reaches its highest value when fewer samples are returned. This reflects that the correct samples are usually returned first, which allows our method to meet the requirements of image recognition and retrieval for unknown scenic spot images.

Figure 5 shows the TOP-K curve relating the recall rate to the number of returned samples. The horizontal coordinate is the number of returned samples, and the ordinate is the recall rate, that is, the proportion of correct samples among the returned samples relative to all correct samples in the database. This is an essential evaluation criterion that developers and administrators of retrieval systems care about. In addition, it reflects how successfully the algorithm retrieves from the database. As shown in the figure, our method achieves the best TOP-K recall for all code lengths. Figure 6 exhibits some query examples on the China-60 dataset. For each query, each method returns the top 6 results using the 128-bit hash code, and red marks the incorrectly returned results.

4.3. Generalization to Other Image Data Sources

Although the primary purpose of this article is to explore the effect of retrieval methods on image retrieval tasks for tourist attractions, we also conducted experiments on public image datasets to demonstrate the generality of the method. Considering that the images in the Cifar-10 dataset are 32 × 32, we shortened the generated hash code length to 12 bits, 24 bits, 32 bits, and 48 bits. The same hash code lengths are used for the Flickr30k dataset.

Table 3 shows the MAP values on the two datasets, where CNNH, DNNH, DLBHC, and the proposed method are deep hashing methods, and the others are non-deep methods. It can be seen from the results that our approach has a significant advantage over the non-deep hashing algorithms. The MAP value of most non-deep methods increases dramatically with the length of the hash code. Even compared with the best non-deep hashing method in its best case, the deep hashing algorithms still show a significant advantage. Among the deep hashing approaches, the accuracy of our method is 4% to 8% higher, which shows that the hash code generation strategy proposed in this paper can efficiently improve the retrieval effect.

4.4. Generalization to Cross-Datasets

To verify the generality of our method, we conduct cross-dataset experiments. The aim is to utilize two or more datasets labeled with different classes to train and evaluate a single model. Specifically, we train the proposed model on the Flickr30k dataset and on the Cifar-10 dataset, respectively. The performance of each trained model is then tested on a different dataset, China-60.

The experimental results are shown in Table 4. The overall precision scores are relatively low, indicating that cross-dataset evaluation is more challenging for the retrieval task. Nevertheless, the proposed method achieves competitive performance on the cross-dataset tourist image retrieval task, which supports its effectiveness.

4.5. Time-Cost Analysis

Besides the effectiveness analysis, we also compare the proposed approach with other deep and non-deep methods in terms of computation time. All the experiments are carried out on the same platform with an Intel i7 8700K CPU, an NVIDIA GTX 2080, and 64 GB of RAM. Table 5 shows the average computation times of the different methods. The computation time of the proposed approach is comparable with that of the other methods.

5. Conclusion

In this paper, we proposed a deep hashing method with scalable interblock coding for large-scale tourist attraction image retrieval. The constructed deep hash network is trained end to end and uses the triplet loss function to preserve feature similarity in the hash codes. To enhance the performance and efficiency of function optimization and the descriptive ability of the hash codes, we improved the network and the triplet loss function. We reported a quantitative evaluation of the proposed method under different hash code lengths. Experimental results on social image datasets validate the superiority of the proposed method. However, the relaxed binary code obtained from the network may lose feature information in the thresholding process. In future work, we will improve the activation function to address this problem.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

The work was supported by the National Natural Science Foundation of China (41971365) and the Chongqing Research Program of Basic Science and Frontier Technology (cstc2019jcyj-msxmX0131).