Computer Science Review
Volume 39, February 2021, 100336

Review article
Comparative analysis on cross-modal information retrieval: A review

https://doi.org/10.1016/j.cosrev.2020.100336

Highlights

  • Summary of recent progress in image–text cross-modal retrieval.

  • Broad classification of various cross-modal techniques.

  • Prominent benchmark datasets and evaluation metrics are introduced.

  • Comparative analysis of diverse cross-modal methods.

  • Challenges and open issues are presented in the area of multi-modal retrieval.

Abstract

Human beings experience life through a spectrum of modes such as vision, taste, hearing, smell, and touch. These multiple modes are integrated for information processing in our brain using a complex network of neuron connections. Likewise, for artificial intelligence to mimic the human way of learning and evolve into the next generation, it must handle multi-modal information fusion efficiently. A modality is a channel that conveys information about an object or an event, such as image, text, video, or audio. A research problem is said to be multi-modal when it incorporates information from more than a single modality. A multi-modal system allows a query in one modality to return results in any (same or different) modality, whereas a cross-modal system strictly retrieves information from a modality different from that of the query. As the input and output of a query belong to different modal families, comparing them coherently remains an open challenge, owing to their heterogeneous low-level forms and the subjective definition of content similarity. Researchers have proposed numerous techniques to handle this issue and to reduce the semantic gap of information retrieval among different modalities. This paper focuses on a comparative analysis of various research works in the field of cross-modal information retrieval. A comparative analysis of several cross-modal representations, along with the results of state-of-the-art methods applied to benchmark datasets, is also presented. Finally, open issues are discussed to give researchers a better understanding of the present scenario and to help identify future research directions.

Introduction

When we fail to understand the contents of an image embedded in a text, figure captions and referring text often help. Just by looking at a figure, a person might not understand it exactly, but with the help of collateral text it can be understood efficiently. For instance, when we see a volleyball picture (Fig. 1), we may not be able to understand or know about the game of volleyball. However, the picture can be fully understood with the help of collateral text (such as the caption, figure reference, and related citations) describing the volleyball game. This implies that information from more than one source aids in further understanding and also helps in better information retrieval. This is where cross-modal data fusion and retrieval come into the picture.

Recently, cross-modal retrieval has gained a lot of attention due to the rapid increase in multi-modal data such as images, text, video, and audio. The term modality represents a specific form in which data exists; it is also associated with sensory perception, such as the vision and hearing modalities, which are major sources of communication and responsiveness in humans and animals. Data consisting of more than one modality is known as multi-modal data. It is characterized by high-level semantic homogeneity and low-level representational heterogeneity: the same concept can have diverse representations across modalities. Different forms of representation help people better understand things, as illustrated in the volleyball example above. While searching for something, people often want accurate results in different forms, which creates the need for an efficient multi-media information retrieval platform. Classic approaches to information retrieval are uni-modal in nature. Uni-modal means information derived from just one channel, such as only images or only text (but not both). For example, only a text query is used for information search and retrieval from a text repository. This retrieval approach is of little use these days, when enormous amounts of multimedia data are being generated. Cross-modal and multi-modal systems, on the other hand, are able to link more than one modality, such as image, text, audio, and video. In cross-modal retrieval, the input query mode and the resultant mode are dissimilar: for example, querying text for related images, or querying an image for related text. In a multi-modal system, however, the resultant mode can also be the same as the query mode: for example, querying text to retrieve related images as well as matched text. Cross-modal and multi-modal retrieval are explained with a simple example in Fig. 2, where + indicates that both text and images can be retrieved using an image query, and vice versa, in the multi-modal approach. The toy sketch below makes this input–output distinction concrete.
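
The following minimal Python sketch (the item names and the matching rule are hypothetical, not drawn from any surveyed system) contrasts the two retrieval contracts: a cross-modal query returns only items from modalities other than the query's, while a multi-modal query may return items from any modality.

    # Toy illustration of the cross-modal vs. multi-modal retrieval
    # contracts; item names and the matching rule are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Item:
        modality: str   # "image" or "text"
        content: str

    DATABASE = [
        Item("image", "volleyball_match.jpg"),
        Item("text", "Rules of beach volleyball"),
        Item("image", "beach_game.png"),
        Item("text", "History of volleyball"),
    ]

    def retrieve(query_modality, is_relevant, cross_modal):
        # Cross-modal: keep only items whose modality differs from the
        # query's; multi-modal: keep matching items from any modality.
        results = [item for item in DATABASE if is_relevant(item)]
        if cross_modal:
            results = [r for r in results if r.modality != query_modality]
        return results

    relevant = lambda item: "volleyball" in item.content.lower()
    print(retrieve("text", relevant, cross_modal=True))   # images only
    print(retrieve("text", relevant, cross_modal=False))  # images and text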

Therefore, the fundamental idea of cross-modal retrieval is to integrate numerous modes of information to derive better results than a single channel alone. For instance, an image–text cross-modal system attaches textual information to an image, which is known as image annotation. Conversely, it can also take text keywords as a query to retrieve images, which is known as image retrieval. In simple words, image annotation is the process of explaining an image with appropriate linguistic cues. It is useful in knowledge-transfer sessions in application areas such as medical science, the military, business, education, and sports, to name a few. For example, a CT scan is intelligible to a radiologist but not to an intern or a patient; the expert has to explain it using proper terminology by pointing out key areas on the given image. Image retrieval is the process of retrieving an appropriate image from a database according to a user query, for instance, text keywords. With the evolution of the semantic web and huge data repositories, a major challenge arises: the effective indexing and retrieval of both still and moving images, and the identification of key areas inside images. An image cannot be expressed completely using visual features alone, as they under-constrain the information contained in it. Visual features of an image include color distribution, texture, shape, and edges. Typically, image retrieval systems index images together with their corresponding text/keywords, and retrieve images using both keywords and visual features; a sketch of such hybrid scoring follows below. Cross-modal image retrieval aims to use text to retrieve the images relevant to that text.
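
As a minimal sketch of hybrid keyword-plus-visual-feature indexing (the histogram feature, the blending weight alpha, and all data below are illustrative assumptions, not a method from the surveyed literature), each database image carries both a visual feature vector and a set of keywords, and a query is scored against both:

    import numpy as np

    def color_histogram(pixels, bins=8):
        # Toy visual feature: a normalized grayscale histogram.
        hist, _ = np.histogram(pixels, bins=bins, range=(0, 256))
        return hist / max(hist.sum(), 1)

    def hybrid_score(query_kw, item_kw, query_hist, item_hist, alpha=0.5):
        # Blend keyword overlap (Jaccard) with visual similarity
        # (histogram intersection); alpha weights text vs. visual evidence.
        jaccard = len(query_kw & item_kw) / max(len(query_kw | item_kw), 1)
        intersection = np.minimum(query_hist, item_hist).sum()
        return alpha * jaccard + (1 - alpha) * intersection

    # Hypothetical usage: one query scored against one database item.
    q_hist = color_histogram(np.random.default_rng(0).integers(0, 256, 1000))
    i_hist = color_histogram(np.random.default_rng(1).integers(0, 256, 1000))
    print(hybrid_score({"volleyball", "beach"}, {"volleyball", "net"},
                       q_hist, i_hist))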

Cross-modal learning has become tremendously popular because of its effective information retrieval capability. Numerous cross-modal representation and retrieval methods have been proposed by researchers to resolve the issue of cross-modal retrieval across several modalities, and various appealing surveys have summarized the work done in this field. Image and text are the most widely used modalities, and a number of articles on cross-modal retrieval have been published considering them. However, there is no comprehensive survey focusing specifically on image–text cross-modal retrieval techniques. The objective of this article is to conduct a comprehensive review of cross-modal retrieval incorporating the image and text modalities, with concerns different from those of previous surveys and reviews. The motivation behind this review article is therefore:

  • 1.

    Lack of a full-fledged review article on image and text modalities.

  • 2.

    To present various challenges and open issues in the cross-modal retrieval field.

  • 3.

    Image and text are the most fundamental and widely used modalities; however, their cross-modal retrieval is still far from ideal.

Existing literature reviews related to cross-modal information retrieval have presented the topic quite well to the research community. An overview of cross-modal retrieval was presented in [1] in 2016; however, it does not cover several significant works proposed in recent years. In [2], the authors presented numerous multi-modal techniques, but their focus is only on techniques based on machine learning. [3] is a contemporary survey, but it offers only a brief study of cross-media retrieval methods compared to the vastness of the topic. An overview of different cross-media retrieval techniques incorporating miscellaneous modalities is provided in [4]. The article [5] explores only cross-media retrieval based on joint graph regularization. The focus of [6] is on cross-media analysis and reasoning and the various analysis methods rather than cross-media retrieval. [7] provides a survey on cross-media image and text information fusion, where the main focus is on analyzing two methods of image and text association.

Table 1 compares the current survey with existing reviews related to cross-modal learning, sorted by year. The comparison is performed on the basis of the domain, the modalities incorporated in each paper, comparative analysis, challenges, open issues, benchmark datasets, and evaluation metrics. As the table shows, only one survey focuses on the image and text modalities, but its main concern is image–text association rather than cross-modal retrieval. A blank cell in the table implies that the information is missing for that particular column, and ✓ means that it is present in the article. The Domain column specifies the main focus of the article, and the value all under the Mode column means that the article does not focus on any particular two or three modalities but rather discusses multi-media as a whole. The Comparative analysis column indicates whether the comparison among techniques was performed quantitatively or qualitatively.

The significant contributions of this paper are as follows:

  • 1.

    This review presents a summary of recent progress in cross-modal retrieval considering text and image (image-to-text and text-to-image). It comprises several novel works and references absent from previous surveys, and will serve as a valuable resource for beginners getting acquainted with the topic.

  • 2.

    A broad classification of various cross-modal approaches is presented, and the differences among them are discussed.

  • 3.

    It provides information regarding various prominent benchmark datasets and evaluation metrics utilized for cross-modal method performance estimation.

  • 4.

    It presents a comparative analysis of diverse cross-modal representation techniques applied to benchmark datasets. This analysis will be highly useful for future research.

  • 5.

    The article summarizes various challenges in the field of cross-modal retrieval and open issues for future researchers to work on.

This article starts with an introduction to cross-modal retrieval in Section 1, which includes the motivation for the survey, contributions, comparison with existing surveys, the article road map, and organization. An appropriate review methodology (Section 2) has been followed in writing this survey, comprising five subtopics: research questions, sources of information, search criteria, data extraction, and publication metrics. The inception of cross-modal retrieval, its general architecture, applications, observed challenges in the process, and the early related articles are presented together in the background section (Section 3). Section 4 discusses the diverse cross-modal representation and retrieval techniques, broadly classified into real-valued and binary techniques; the literature related to these techniques is also included in this section. The well-known image–text benchmark datasets widely used by researchers in the cross-modal field are presented in Section 5. Section 6 presents a comparative analysis, introducing different performance evaluation metrics along with a comparison of various cross-modal retrieval methods. A summary of several state-of-the-art cross-modal retrieval works is given in tabular form in Section 7. Miscellaneous open issues in the cross-modal retrieval domain are discussed in Section 8. Finally, Section 9 concludes the survey. Fig. 3 depicts the road map of the article.

Section snippets

Review methodology

The systematic survey methodology described in this article is adapted from the technique of Kitchenham et al. [8], [9]. The distinct stages used in this review are: creating a review protocol, planning an exhaustive survey, executing the survey, comparing results, analyzing the compared results, and exploring open issues. The review technique employed in this categorical survey is illustrated in Fig. 4.

Background

The terms cross-modal and multi-modal originated in neurology and are inspired by multi-sensory integration in the brain [10], [11]. We often need to understand images of objects or scenes through the use of phrases, because an image alone does not contain all the relevant information. Thus, we use one modality of communication to compensate for the absence of information in another mode [12], which implies co-relating text and image.

In simple terms, cross-modal or multi-modal is linking of

Cross-modal representation and retrieval techniques

Cross-modal representation techniques can be broadly classified into two categories: (a) real-valued representation and (b) binary representation. In real-valued representation learning, the learned common representations of the diverse modalities are real-valued. In binary representation learning, the diverse modalities are instead mapped into a common Hamming space. Cross-modal similarity search is faster over binary representations, so the retrieval process also becomes faster. However, the
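
As a minimal illustration of the two families (the random embeddings and the sign-threshold binarization below are stand-in assumptions; actual methods learn these mappings), ranking by cosine similarity in a real-valued common space can be contrasted with ranking by Hamming distance over binary codes:

    # Contrast of real-valued vs. binary common-space retrieval; the
    # embeddings here are random placeholders for learned representations.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 128                                        # shared-space dimension
    image_db = rng.standard_normal((10_000, d))    # image embeddings
    text_query = rng.standard_normal(d)            # text embedding

    # (a) Real-valued representation: rank by cosine similarity.
    db_norm = image_db / np.linalg.norm(image_db, axis=1, keepdims=True)
    scores = db_norm @ (text_query / np.linalg.norm(text_query))
    top_real = np.argsort(-scores)[:5]

    # (b) Binary representation: sign-threshold into Hamming space and
    # rank by Hamming distance (XOR + popcount), compact and fast at scale.
    db_bits = np.packbits(image_db > 0, axis=1)
    q_bits = np.packbits(text_query > 0)
    hamming = np.unpackbits(db_bits ^ q_bits, axis=1).sum(axis=1)
    top_binary = np.argsort(hamming)[:5]

    print("real-valued top-5:", top_real)
    print("binary top-5:", top_binary)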

Benchmark datasets

With the advent of huge multi-modal data generation, cross-modal retrieval has become a crucial and interesting problem. Researchers have composed diverse multi-modal datasets for evaluating proposed cross-modal techniques. Fig. 21 presents the evolution of these datasets over recent years. A summary of prominent multi-modal datasets is given in Table 7, including dataset name, mode, total concepts, dataset size, image representation, text representation, related article, and data source.

Comparative analysis

In this section, the prominent evaluation metrics used for analyzing the performance of cross-modal retrieval methods are defined. Afterward, comparisons of various cross-modal retrieval methods applied to diverse datasets are presented on the basis of the MAP score; a sketch of how this score is computed follows below.
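
For reference, mean average precision (MAP) averages, over all queries, the precision measured at each rank where a relevant item appears. A minimal sketch, assuming binary ground-truth relevance (e.g. query and retrieved item sharing a semantic label) and illustrative ranked lists:

    import numpy as np

    def average_precision(relevance):
        # AP for one query: `relevance` is the ranked 0/1 relevance list.
        relevance = np.asarray(relevance)
        hits = np.cumsum(relevance)               # relevant items seen so far
        ranks = np.arange(1, len(relevance) + 1)
        prec_at_hits = (hits / ranks)[relevance == 1]
        return prec_at_hits.mean() if relevance.any() else 0.0

    def mean_average_precision(ranked_relevances):
        # MAP: mean of the per-query average precisions.
        return float(np.mean([average_precision(r) for r in ranked_relevances]))

    # Two hypothetical text-to-image queries and their ranked results:
    queries = [
        [1, 0, 1, 0, 0],   # AP = (1/1 + 2/3) / 2 ~= 0.833
        [0, 1, 0, 0, 1],   # AP = (1/2 + 2/5) / 2  = 0.450
    ]
    print(mean_average_precision(queries))  # ~= 0.642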

Discussion

Cross-modal information retrieval is a burdensome task because of the semantic gap among modalities, due to which different modalities cannot be compared to each other directly. To handle this issue, researchers have introduced several techniques for multi-modal data representation over the past few years. Table 18 presents a summary of recent literature on state-of-the-art techniques for image–text cross-modal retrieval. It is divided into three parts: the first part contains works

Open issues

The motive of cross-modal learning is to build a model to which one modality is given as a query in order to retrieve results in another modality. For this process, the collected data has to be organized so that retrieval happens in little time, while the results are accurate and semantically related to the queried modality data. Researchers have proposed miscellaneous algorithms for making the cross-modal retrieval task more effective; however, there are a few open issues

Conclusion

This review of cross-modal information retrieval has found that cross-modal retrieval techniques are better than classic uni-modal systems at retrieving multi-modal data and at adding value by complementing meaningful information. The article summarizes the prominent works done by various researchers in the field of image–text cross-modal retrieval. Primary information has been presented with the help of tables, figures, and graphs to make it more understandable. A taxonomy of

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (163)

  • Jiang, B., et al., Internet cross-media retrieval based on deep learning, J. Vis. Commun. Image Represent. (2017)

  • Feng, F., et al., Deep correspondence restricted Boltzmann machine for cross-modal retrieval, Neurocomputing (2015)

  • Cao, W., et al., Hybrid representation learning for cross-modal retrieval, Neurocomputing (2019)

  • Wang, K., et al., A comprehensive survey on cross-modal retrieval (2016)

  • Baltrušaitis, T., et al., Multimodal machine learning: A survey and taxonomy, IEEE Trans. Pattern Anal. Mach. Intell. (2018)

  • Ayyavaraiah, M., et al., Cross media feature retrieval and optimization: A contemporary review of research scope, challenges and objectives

  • Peng, Y., et al., An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges, IEEE Trans. Circuits Syst. Video Technol. (2017)

  • Ayyavaraiah, M., et al., Joint graph regularization based semantic analysis for cross-media retrieval: A systematic review, Int. J. Eng. Technol. (2018)

  • Peng, Y.-X., et al., Cross-media analysis and reasoning: Advances and directions, Front. Inf. Technol. Electron. Eng. (2017)

  • Priyanka, M., et al., Analysis of cross-media web information fusion for text and image association: A survey paper, Global J. Comput. Sci. Technol. (2013)

  • Kitchenham, B., et al., Guidelines for Performing Systematic Literature Reviews in Software Engineering (2007)

  • Stein, B.E., et al., Development of multisensory integration from the perspective of the individual neuron, Nat. Rev. Neurosci. (2014)

  • Miller, R.L., et al., Multisensory integration: How the brain combines information across the senses, Comput. Model. Brain Behav. (2017)

  • Srihari, R.K., Use of captions and other collateral text in understanding photographs

  • Stein, B.E., et al., The Merging of the Senses (1993)

  • Stein, B.E., et al., Behavioral indices of multisensory integration: Orientation to visual cues is affected by auditory stimuli, J. Cogn. Neurosci. (1989)

  • Otoom, M., Beyond von Neumann: Brain-computer structural metaphor

  • Yuhas, B.P., et al., Integration of acoustic and visual speech signals using neural networks, IEEE Commun. Mag. (1989)

  • Saraceno, C., et al., Indexing audiovisual databases through joint audio and video processing, Int. J. Imaging Syst. Technol. (1998)

  • Roy, D., Integration of speech and vision using mutual information

  • McGurk, H., et al., Hearing lips and seeing voices, Nature (1976)

  • Westerveld, T., et al., Extracting bimodal representations for language-based image retrieval

  • Westerveld, T., Image retrieval: Content versus context

  • Xiong, C., et al., Voice-face cross-modal matching and retrieval: A benchmark (2019)

  • Duarte, A.C., Cross-modal neural sign language translation

  • Mariooryad, S., et al., Exploring cross-modality affective reactions for audiovisual emotion recognition, IEEE Trans. Affect. Comput. (2013)

  • Jing, M., et al., Integration of text and image analysis for flood event image recognition

  • Rahman, M.M., et al., Interactive cross and multimodal biomedical image retrieval based on automatic region-of-interest (ROI) identification and classification, Int. J. Multimed. Inf. Retrieval (2014)

  • Cao, D., et al., Video-based cross-modal recipe retrieval

  • Xia, D., et al., A cross-modal multimedia retrieval method using depth correlation mining in big data environment, Multimedia Tools Appl. (2019)

  • Zhai, X., Peng, Y., Xiao, J., Heterogeneous metric learning with joint graph regularization for cross-media retrieval, in: ...

  • Elizalde, B., et al., Cross modal audio search and retrieval with joint embeddings based on text and audio

  • Zeng, D., et al., Deep triplet neural networks with cluster-CCA for audio-visual cross-modal retrieval (2019)

  • Tripathi, P., et al., Discover cross-modal human behavior analysis

  • Imura, J., et al., Efficient multi-modal retrieval in conceptual space

  • Goyal, P., et al., Cross-modal learning for multi-modal video categorization (2020)

  • Pereira, J.C., et al., Cross-modal domain adaptation for text-based regularization of image semantics in image retrieval systems, Comput. Vis. Image Underst. (2014)

  • Gou, T., et al., A new approach to cross-modal retrieval

  • Srivastava, N., Salakhutdinov, R., Learning representations for multimodal data with deep belief nets, in: International ...

  • Habibian, A., et al., Discovering semantic vocabularies for cross-media retrieval