Computer Science Review
Volume 39, February 2021, 100336

Review article
Comparative analysis on cross-modal information retrieval: A review

https://doi.org/10.1016/j.cosrev.2020.100336

Highlights

  • Summary of recent progress in image–text cross-modal retrieval.

  • Broad classification of various cross-modal techniques.

  • Prominent benchmark datasets and evaluation metrics are introduced.

  • Comparative analysis of diverse cross-modal methods.

  • Challenges and open issues are presented in the area of multi-modal retrieval.

Abstract

Human beings experience life through a spectrum of modes such as vision, taste, hearing, smell, and touch. These multiple modes are integrated for information processing in our brain using a complex network of neuron connections. Likewise, for artificial intelligence to mimic the human way of learning and evolve into the next generation, it must handle multi-modal information fusion efficiently. A modality is a channel that conveys information about an object or an event, such as image, text, video, or audio. A research problem is said to be multi-modal when it incorporates information from more than a single modality. A multi-modal system allows a query in one modality to return results in any (same or different) modality, whereas a cross-modal system strictly retrieves information from a modality different from that of the query. As the input and output of a query belong to different modal families, comparing them coherently remains an open challenge, owing to their heterogeneous low-level forms and the subjective definition of content similarity. Researchers have proposed numerous techniques to handle this issue and to reduce the semantic gap of information retrieval among different modalities. This paper focuses on a comparative analysis of various research works in the field of cross-modal information retrieval. A comparative analysis of several cross-modal representations, along with the results of state-of-the-art methods applied to benchmark datasets, is also presented. Finally, open issues are discussed to give researchers a better understanding of the present scenario and to help identify future research directions.

Introduction

When we fail to understand the contents of an image embedded in a text, figure captions and referring text often help. Just by looking at a figure, a person might not understand it exactly, but with the help of collateral text it can be understood efficiently. For instance, when we see a volleyball picture (Fig. 1), we may not be able to understand or know about the game of volleyball. However, the picture can be fully understood with the help of collateral text (such as the caption, figure reference, and related citations) describing the volleyball game. This implies that information from more than one source aids in further understanding and also helps in better information retrieval. This is where cross-modal data fusion and retrieval come into the picture.

Recently, cross-modal retrieval has gained a lot of attention due to the rapid increase in multi-modal data such as images, text, video, and audio. The term modality represents a specific form in which data exists; it is also associated with sensory perception, such as the vision and hearing modalities, which are major sources of communication and responsiveness in humans and animals. Data consisting of more than one modality is known as multi-modal data. It is characterized by high-level semantic homogeneity and low-level representational heterogeneity: the same concept can have diverse representations across modalities. Different forms of representation help people better understand things, as illustrated in the volleyball example above. While searching for something, people often want accurate results in different forms, which creates the need for an efficient multi-media information retrieval platform. Classic approaches to information retrieval are uni-modal in nature. Uni-modal means information derived from just one channel, such as only images or only text (but not both). For example, only a text query is used for information search and retrieval from a text repository. This retrieval approach is of little use these days, when enormous amounts of multimedia data are being generated. Cross-modal and multi-modal systems, on the other hand, are able to link more than one modality, such as image, text, audio, and video. In cross-modal retrieval, the input query mode and the resultant mode are dissimilar: for example, querying text for related images, or querying an image for related text. In a multi-modal system, however, the resultant mode can also be the same as the query mode: for example, querying text to retrieve related images as well as matched text. Cross-modal and multi-modal retrieval are explained with a simple example in Fig. 2, where + indicates that both text and images can be retrieved using an image query, and vice versa, in the multi-modal approach. The toy sketch below makes this input–output distinction concrete.
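
The following minimal Python sketch (the item names and the matching rule are hypothetical, not drawn from any surveyed system) contrasts the two retrieval contracts: a cross-modal query returns only items from modalities other than the query's, while a multi-modal query may return items from any modality.

    # Toy illustration of the cross-modal vs. multi-modal retrieval
    # contracts; item names and the matching rule are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class Item:
        modality: str   # "image" or "text"
        content: str

    DATABASE = [
        Item("image", "volleyball_match.jpg"),
        Item("text", "Rules of beach volleyball"),
        Item("image", "beach_game.png"),
        Item("text", "History of volleyball"),
    ]

    def retrieve(query_modality, is_relevant, cross_modal):
        # Cross-modal: keep only items whose modality differs from the
        # query's; multi-modal: keep matching items from any modality.
        results = [item for item in DATABASE if is_relevant(item)]
        if cross_modal:
            results = [r for r in results if r.modality != query_modality]
        return results

    relevant = lambda item: "volleyball" in item.content.lower()
    print(retrieve("text", relevant, cross_modal=True))   # images only
    print(retrieve("text", relevant, cross_modal=False))  # images and text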

Therefore, the fundamental idea of cross-modal retrieval is to integrate numerous modes of information to derive better results than a single channel alone. For instance, an image–text cross-modal system attaches textual information to an image, which is known as image annotation. Conversely, it can also take text keywords as a query to retrieve images, which is known as image retrieval. In simple words, image annotation is the process of explaining an image with appropriate linguistic cues. It is useful in knowledge-transfer sessions in application areas such as medical science, the military, business, education, and sports, to name a few. For example, a CT scan is intelligible to a radiologist but not to an intern or a patient; the expert has to explain it using proper terminology by pointing out key areas on the given image. Image retrieval is the process of retrieving an appropriate image from a database according to a user query, for instance, text keywords. With the evolution of the semantic web and huge data repositories, a major challenge arises: the effective indexing and retrieval of both still and moving images, and the identification of key areas inside images. An image cannot be expressed completely using visual features alone, as they under-constrain the information contained in it. Visual features of an image include color distribution, texture, shape, and edges. Typically, image retrieval systems index images together with their corresponding text/keywords, and retrieve images using both keywords and visual features; a sketch of such hybrid scoring follows below. Cross-modal image retrieval aims to use text to retrieve the images relevant to that text.
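
As a minimal sketch of hybrid keyword-plus-visual-feature indexing (the histogram feature, the blending weight alpha, and all data below are illustrative assumptions, not a method from the surveyed literature), each database image carries both a visual feature vector and a set of keywords, and a query is scored against both:

    import numpy as np

    def color_histogram(pixels, bins=8):
        # Toy visual feature: a normalized grayscale histogram.
        hist, _ = np.histogram(pixels, bins=bins, range=(0, 256))
        return hist / max(hist.sum(), 1)

    def hybrid_score(query_kw, item_kw, query_hist, item_hist, alpha=0.5):
        # Blend keyword overlap (Jaccard) with visual similarity
        # (histogram intersection); alpha weights text vs. visual evidence.
        jaccard = len(query_kw & item_kw) / max(len(query_kw | item_kw), 1)
        intersection = np.minimum(query_hist, item_hist).sum()
        return alpha * jaccard + (1 - alpha) * intersection

    # Hypothetical usage: one query scored against one database item.
    q_hist = color_histogram(np.random.default_rng(0).integers(0, 256, 1000))
    i_hist = color_histogram(np.random.default_rng(1).integers(0, 256, 1000))
    print(hybrid_score({"volleyball", "beach"}, {"volleyball", "net"},
                       q_hist, i_hist))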

Cross-modal learning has become tremendously popular because of its effective information retrieval capability. Numerous cross-modal representation and retrieval methods have been proposed by researchers to resolve the issue of cross-modal retrieval across several modalities, and various appealing surveys have summarized the work done in this field. Image and text are the most widely used modalities, and a number of articles on cross-modal retrieval have been published considering them. However, there is no comprehensive survey focusing specifically on image–text cross-modal retrieval techniques. The objective of this article is to conduct a comprehensive review of cross-modal retrieval incorporating the image and text modalities, with concerns different from those of previous surveys and reviews. The motivation behind this review article is therefore:

  • 1.

    Lack of a full-fledged review article on image and text modalities.

  • 2.

    To present various challenges and open issues in the cross-modal retrieval field.

  • 3.

    Image and text are the most fundamental and widely used modalities; however, their cross-modal retrieval is still far from ideal.

Existing literature reviews related to cross-modal information retrieval have presented the topic quite well to the research community. An overview of cross-modal retrieval was presented in [1] in 2016; however, it does not cover several significant works proposed in recent years. In [2], the authors presented numerous multi-modal techniques, but their focus is only on techniques based on machine learning. [3] is a contemporary survey, but it offers only a brief study of cross-media retrieval methods compared to the vastness of the topic. An overview of different cross-media retrieval techniques incorporating miscellaneous modalities is provided in [4]. The article [5] explores only cross-media retrieval based on joint graph regularization. The focus of [6] is on cross-media analysis and reasoning and the various analysis methods rather than cross-media retrieval. [7] provides a survey on cross-media image and text information fusion, where the main focus is on analyzing two methods of image and text association.

Table 1 compares the current survey with existing reviews related to cross-modal learning, sorted by year. The comparison is performed on the basis of the domain, the modalities incorporated in each paper, comparative analysis, challenges, open issues, benchmark datasets, and evaluation metrics. As the table shows, only one survey focuses on the image and text modalities, but its main concern is image–text association rather than cross-modal retrieval. A blank cell in the table implies that the information is missing for that particular column, and ✓ means that it is present in the article. The Domain column specifies the main focus of the article, and the value all under the Mode column means that the article does not focus on any particular two or three modalities but rather discusses multi-media as a whole. The Comparative analysis column indicates whether the comparison among techniques was performed quantitatively or qualitatively.

The significant contributions of this paper are as follows:

  • 1.

    This review presents a summary of recent progress in cross-modal retrieval considering text and image (image-to-text and text-to-image). It comprises several novel works and references absent from previous surveys, and will serve as a valuable resource for beginners getting acquainted with the topic.

  • 2.

    A broad classification of various cross-modal approaches is presented, and the differences among them are discussed.

  • 3.

    It provides information regarding various prominent benchmark datasets and evaluation metrics utilized for cross-modal method performance estimation.

  • 4.

    It presents a comparative analysis of diverse cross-modal representation techniques applied to benchmark datasets. This analysis will be highly useful for future research.

  • 5.

    The article summarizes various challenges in the field of cross-modal retrieval and open issues for future researchers to work on.

This article starts with an introduction to cross-modal retrieval in Section 1, which includes the motivation for the survey, contributions, comparison with existing surveys, the article road map, and organization. An appropriate review methodology (Section 2) has been followed in writing this survey, comprising five subtopics: research questions, sources of information, search criteria, data extraction, and publication metrics. The inception of cross-modal retrieval, its general architecture, applications, observed challenges in the process, and the early related articles are presented together in the background section (Section 3). Section 4 discusses the diverse cross-modal representation and retrieval techniques, broadly classified into real-valued and binary techniques; the literature related to these techniques is also included in this section. The well-known image–text benchmark datasets widely used by researchers in the cross-modal field are presented in Section 5. Section 6 presents a comparative analysis, introducing different performance evaluation metrics along with a comparison of various cross-modal retrieval methods. A summary of several state-of-the-art cross-modal retrieval works is given in tabular form in Section 7. Miscellaneous open issues in the cross-modal retrieval domain are discussed in Section 8. Finally, Section 9 concludes the survey. Fig. 3 depicts the road map of the article.

Section snippets

Review methodology

The systematic survey methodology described in this article is adapted from the technique of Kitchenham et al. [8], [9]. The distinct stages used in this review are: creating a review protocol, planning an exhaustive survey, executing the survey, comparing results, analyzing the compared results, and exploring open issues. The review technique employed in this categorical survey is illustrated in Fig. 4.

Background

The terms cross-modal and multi-modal originated in neurology and are inspired by multi-sensory integration in the brain [10], [11]. We often need to understand images of objects or scenes through the use of phrases, because an image alone does not contain all the relevant information. Thus, we use one modality of communication to compensate for the absence of information in another mode [12], which implies co-relating text and image.

In simple terms, cross-modal or multi-modal is linking of

Cross-modal representation and retrieval techniques

Cross-modal representation techniques can be broadly classified into two categories: (a) real-valued representation and (b) binary representation. In real-valued representation learning, the learned common representations of the diverse modalities are real-valued. In binary representation learning, the diverse modalities are instead mapped into a common Hamming space. Cross-modal similarity search is faster over binary representations, so the retrieval process also becomes faster. However, the
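
As a minimal illustration of the two families (the random embeddings and the sign-threshold binarization below are stand-in assumptions; actual methods learn these mappings), ranking by cosine similarity in a real-valued common space can be contrasted with ranking by Hamming distance over binary codes:

    # Contrast of real-valued vs. binary common-space retrieval; the
    # embeddings here are random placeholders for learned representations.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 128                                        # shared-space dimension
    image_db = rng.standard_normal((10_000, d))    # image embeddings
    text_query = rng.standard_normal(d)            # text embedding

    # (a) Real-valued representation: rank by cosine similarity.
    db_norm = image_db / np.linalg.norm(image_db, axis=1, keepdims=True)
    scores = db_norm @ (text_query / np.linalg.norm(text_query))
    top_real = np.argsort(-scores)[:5]

    # (b) Binary representation: sign-threshold into Hamming space and
    # rank by Hamming distance (XOR + popcount), compact and fast at scale.
    db_bits = np.packbits(image_db > 0, axis=1)
    q_bits = np.packbits(text_query > 0)
    hamming = np.unpackbits(db_bits ^ q_bits, axis=1).sum(axis=1)
    top_binary = np.argsort(hamming)[:5]

    print("real-valued top-5:", top_real)
    print("binary top-5:", top_binary)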

Benchmark datasets

With the advent of huge multi-modal data generation, cross-modal retrieval has become a crucial and interesting problem. Researchers have composed diverse multi-modal datasets for evaluating proposed cross-modal techniques. Fig. 21 presents the evolution of these datasets over recent years. A summary of prominent multi-modal datasets is given in Table 7, including dataset name, mode, total concepts, dataset size, image representation, text representation, related article, and data source.

Comparative analysis

In this section, the prominent evaluation metrics used for analyzing the performance of cross-modal retrieval methods are defined. Afterward, comparisons of various cross-modal retrieval methods applied to diverse datasets are presented on the basis of the MAP score; a sketch of how this score is computed follows below.
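
For reference, mean average precision (MAP) averages, over all queries, the precision measured at each rank where a relevant item appears. A minimal sketch, assuming binary ground-truth relevance (e.g. query and retrieved item sharing a semantic label) and illustrative ranked lists:

    import numpy as np

    def average_precision(relevance):
        # AP for one query: `relevance` is the ranked 0/1 relevance list.
        relevance = np.asarray(relevance)
        hits = np.cumsum(relevance)               # relevant items seen so far
        ranks = np.arange(1, len(relevance) + 1)
        prec_at_hits = (hits / ranks)[relevance == 1]
        return prec_at_hits.mean() if relevance.any() else 0.0

    def mean_average_precision(ranked_relevances):
        # MAP: mean of the per-query average precisions.
        return float(np.mean([average_precision(r) for r in ranked_relevances]))

    # Two hypothetical text-to-image queries and their ranked results:
    queries = [
        [1, 0, 1, 0, 0],   # AP = (1/1 + 2/3) / 2 ~= 0.833
        [0, 1, 0, 0, 1],   # AP = (1/2 + 2/5) / 2  = 0.450
    ]
    print(mean_average_precision(queries))  # ~= 0.642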

Discussion

Cross-modal information retrieval is a burdensome task because of the semantic gap among modalities, due to which different modalities cannot be compared to each other directly. To handle this issue, researchers have introduced several techniques for multi-modal data representation over the past few years. Table 18 presents a summary of recent literature on state-of-the-art techniques for image–text cross-modal retrieval. It is divided into three parts: the first part contains works

Open issues

The motive of cross-modal learning is to build a model to which one modality is given as a query in order to retrieve results in another modality. For this process, the collected data has to be organized so that retrieval happens in little time, while the results are accurate and semantically related to the queried modality data. Researchers have proposed miscellaneous algorithms for making the cross-modal retrieval task more effective; however, there are a few open issues

Conclusion

This review of cross-modal information retrieval has found that cross-modal retrieval techniques are better than classic uni-modal systems at retrieving multi-modal data and at adding value by complementing meaningful information. The article summarizes the prominent works done by various researchers in the field of image–text cross-modal retrieval. Primary information has been presented with the help of tables, figures, and graphs to make it more understandable. A taxonomy of

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (163)

  • Jiang, B., et al., Internet cross-media retrieval based on deep learning, J. Vis. Commun. Image Represent. (2017)

  • Feng, F., et al., Deep correspondence restricted Boltzmann machine for cross-modal retrieval, Neurocomputing (2015)

  • Cao, W., et al., Hybrid representation learning for cross-modal retrieval, Neurocomputing (2019)

  • Wang, K., et al., A comprehensive survey on cross-modal retrieval (2016)

  • Baltrušaitis, T., et al., Multimodal machine learning: A survey and taxonomy, IEEE Trans. Pattern Anal. Mach. Intell. (2018)

  • Ayyavaraiah, M., et al., Cross media feature retrieval and optimization: A contemporary review of research scope, challenges and objectives

  • Peng, Y., et al., An overview of cross-media retrieval: Concepts, methodologies, benchmarks, and challenges, IEEE Trans. Circuits Syst. Video Technol. (2017)

  • Ayyavaraiah, M., et al., Joint graph regularization based semantic analysis for cross-media retrieval: A systematic review, Int. J. Eng. Technol. (2018)

  • Peng, Y.-X., et al., Cross-media analysis and reasoning: Advances and directions, Front. Inf. Technol. Electron. Eng. (2017)

  • Priyanka, M., et al., Analysis of cross-media web information fusion for text and image association: A survey paper, Global J. Comput. Sci. Technol. (2013)

  • Kitchenham, B., et al., Guidelines for Performing Systematic Literature Reviews in Software Engineering (2007)

  • Stein, B.E., et al., Development of multisensory integration from the perspective of the individual neuron, Nat. Rev. Neurosci. (2014)

  • Miller, R.L., et al., Multisensory integration: How the brain combines information across the senses, Comput. Model. Brain Behav. (2017)

  • Srihari, R.K., Use of captions and other collateral text in understanding photographs

  • Stein, B.E., et al., The Merging of the Senses (1993)

  • Stein, B.E., et al., Behavioral indices of multisensory integration: Orientation to visual cues is affected by auditory stimuli, J. Cogn. Neurosci. (1989)

  • Otoom, M., Beyond von Neumann: Brain-computer structural metaphor

  • Yuhas, B.P., et al., Integration of acoustic and visual speech signals using neural networks, IEEE Commun. Mag. (1989)

  • Saraceno, C., et al., Indexing audiovisual databases through joint audio and video processing, Int. J. Imaging Syst. Technol. (1998)

  • Roy, D., Integration of speech and vision using mutual information

  • McGurk, H., et al., Hearing lips and seeing voices, Nature (1976)

  • Westerveld, T., et al., Extracting bimodal representations for language-based image retrieval

  • Westerveld, T., Image retrieval: Content versus context

  • Xiong, C., et al., Voice-face cross-modal matching and retrieval: A benchmark (2019)

  • Duarte, A.C., Cross-modal neural sign language translation

  • Mariooryad, S., et al., Exploring cross-modality affective reactions for audiovisual emotion recognition, IEEE Trans. Affect. Comput. (2013)

  • Jing, M., et al., Integration of text and image analysis for flood event image recognition

  • Rahman, M.M., et al., Interactive cross and multimodal biomedical image retrieval based on automatic region-of-interest (ROI) identification and classification, Int. J. Multimed. Inf. Retrieval (2014)

  • Cao, D., et al., Video-based cross-modal recipe retrieval

  • Xia, D., et al., A cross-modal multimedia retrieval method using depth correlation mining in big data environment, Multimedia Tools Appl. (2019)

  • Zhai, X., Peng, Y., Xiao, J., Heterogeneous metric learning with joint graph regularization for cross-media retrieval, in: ...

  • Elizalde, B., et al., Cross modal audio search and retrieval with joint embeddings based on text and audio

  • Zeng, D., et al., Deep triplet neural networks with cluster-CCA for audio-visual cross-modal retrieval (2019)

  • Tripathi, P., et al., Discover cross-modal human behavior analysis

  • Imura, J., et al., Efficient multi-modal retrieval in conceptual space

  • Goyal, P., et al., Cross-modal learning for multi-modal video categorization (2020)

  • Pereira, J.C., et al., Cross-modal domain adaptation for text-based regularization of image semantics in image retrieval systems, Comput. Vis. Image Underst. (2014)

  • Gou, T., et al., A new approach to cross-modal retrieval

  • Srivastava, N., Salakhutdinov, R., Learning representations for multimodal data with deep belief nets, in: International ...

  • Habibian, A., et al., Discovering semantic vocabularies for cross-media retrieval