1 Introduction

The academic discipline that we term today “information retrieval” (IR) goes back, though opinions vary, to at least the seminal position paper by Bush (1945). In the ensuing roughly 70 years of work, some mechanisms that were introduced early on have persisted and proven versatile ever since, e.g. the formulae that govern the ranking of retrieved documents. Amongst these are some of the most popular weighting schemes for (textual) retrieval, which can all be described in terms of how they combine three main components: the term frequency (tf), i.e. how often a term appears in a given document; the document frequency (df), i.e. in how many documents a term appears; and a document length normalization component. Originally developed for retrieval on English language text, these weighting schemes have generalized well to many related tasks, such as multilingual retrieval (Peters et al. 2012), multimedia retrieval (Müller et al. 2010) and others.

Today, we have to deal with increasingly complex document collections and queries (Imhof and Braschler 2015) that no longer consist of textual modalities alone, but also of a large set of non-textual modalities such as visual words in image retrieval (Villegas et al. 2015), locations in geographical IR (Mandl et al. 2009) or timestamps in time-aware IR (Li and Croft 2003). This is particularly true in enterprise search, domain-specific IR and many real IR applications, where it is not an option to simply ignore or discard entire modalities. Therefore, we claim that it becomes crucial to treat the modalities with unified methods instead of devising a new approach for each new modality or training a new model for every combination of modalities. In this paper, we discuss the underpinnings of weighting schemes for textual retrieval and show how they can be applied or adapted methodically to non-textual modalities, such as ratings of books and geographical coordinates, which we understand as a first step towards a unified model.

As a contribution towards establishing best practices for the integration of many modalities into an IR application, we demonstrate that BM25 is a weighting scheme that outperforms its alternatives on non-textual modalities and is suitable for merging them under the so-called raw-score merging hypothesis, which we show by checking the assumptions underlying the BM25 formula. Being able to merge the modalities under the raw-score merging hypothesis with little or no training is particularly important due to the limited generalizability of suitable test collections and training data.

We start by considering an “ideal” robust approach, which is based on term sampling in order to correct the differences in average document length, one of the most obvious collection statistics. Then, we prove that there are cases where BM25 can be interpreted as being identical to this sampling-based approach. Using the sampling approach, we can further correct the difference between the variances of the document lengths. Alongside the investigation of the sampling approach, we further analyze the tf saturation parameter \(k_1\) of BM25 and explain its significance for non-textual modalities. Finally, we present experiments on the effectiveness of merging the results of the individual modalities into a unified multimodal result. We contrast our approach, which avoids learning, with an “optimized” baseline and find encouraging results.

The remainder of this paper is structured as follows. Section 2 outlines the anatomy of multimodal IR systems and describes the challenges faced when dealing with complex multimodal collections. Section 3 reviews related work. We then demonstrate that BM25 is a suitable weighting scheme in multimodal IR systems w.r.t. document length normalization (Sect. 4). Section 5 describes how BM25 can be used for non-textual modalities by redefining the three main components of the weighting scheme. A sampling-based BM25 approach is proposed in Sect. 6, which allows us to prove that BM25 fulfills the raw-score merging hypothesis w.r.t. the average document length and the variance of document lengths. In Sect. 7, we describe the multimodal test collections that we use for evaluation, followed by the experiments and the discussion of their results. Section 8 concludes this paper and discusses future work.

2 The anatomy of a multimodal IR system

2.1 Anatomy

In a multimodal IR system, both the documents and the queries consist of several modalities. Figure 1 shows an exemplary excerpt of four of the modalities of the documents in the social book search (SBS) collection used in the SBS lab at the CLEF evaluation forum (Koolen et al. 2016). The documents (\(d_1, d_2,\ldots , d_D\)) consist of the modalities book title, reviews, binding and ratings, each of which can be treated as a bag of features. Hereby, \(d_j^m\) is the bag of features of modality m of document \(d_j\). The query contains both explicit and implicit modalities; i.e. the textual description of the request is explicit, while other information such as acceptable languages and ratings of the books is implicit. A more detailed description of the collection is given in Sect. 7.1.2. The queries in the SBS task are not particularly complex; in general, information needs embed several implicit and explicit modalities.

Fig. 1 Excerpt of four modalities of a sample document (denoted \(d_j\)) in the SBS collection

During retrieval, weighting schemes define the retrieval score (retrieval status value \({\text {RSV}}(q,d_j^m)\)) of modality m of document \(d_j\) w.r.t. query q. The retrieval scores allow producing a ranked list for each modality according to the estimated probabilities of relevance; the retrieval scores are not necessarily probability values, but they are order-preserving w.r.t. the probabilities of relevance (Robertson and Zaragoza 2009). These ranked lists of all the modalities need to be merged into a single ranked list, similarly to multilingual retrieval. Hence, a function f has to be found that computes the retrieval score for each document from the retrieval scores of all modalities

$$\begin{aligned} {\text {RSV}}(q,d_j) = f({\text {RSV}}(q,d_j^1), {\text {RSV}}(q,d_j^2), \ldots , {\text {RSV}}(q,d_j^M)), \end{aligned}$$
(1)

where M is the number of modalities.
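For illustration, under the raw-score merging hypothesis discussed below, f reduces to an unweighted sum of the per-modality scores. The following is a minimal sketch of this special case; the function and variable names are our own, not taken from a particular system.

```python
def merge_raw_scores(modality_rsvs):
    """Raw-score merging: f in Eq. (1) is simply the unweighted sum of the
    per-modality retrieval status values RSV(q, d_j^m) of one document."""
    return sum(modality_rsvs)

# RSVs of one document for M = 3 modalities, e.g. text, geo and ratings
print(merge_raw_scores([12.3, 0.8, 1.5]))  # -> 14.6
```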

Evaluation has a strong tradition in IR, since information is hard to define in general (Cleverdon 1967). A crucial part of an IR evaluation is the availability of a suitable test collection. However, most of the existing test collections are not representative of multimodal IR systems, and it is clearly not practical to create a test collection that covers all possible modalities and their combinations (Imhof and Braschler 2015). We are convinced that in order to improve and broaden the applicability of multimodal IR, a generalizable method to deal with complex collections with a large number of very different modalities is crucial. Therefore, we claim that we need a unified weighting model for all types of modalities, to avoid the considerable effort of devising a new model for every modality type. Further, a merging strategy that works with little or no training is necessary, both because training can become very complex for a large number of modalities and because in practical applications training data is not always available (Imhof and Braschler 2015).

2.2 Challenges

A multimodal IR system as described in this section comes with several challenges that need to be solved in order to effectively use all the modalities. In pursuit of a suitable weighting scheme for non-textual modalities, we can analyze the most popular textual weighting schemes. These can all be described in terms of how they combine three main components: the term frequency (tf), the document frequency (df) and the document length normalization component (Salton and Buckley 1988). Looking at these three components, we can understand their respective roles as follows: the first two components make sure that “characteristic” terms are weighted heavily. Hereby, a characteristic term is one that appears frequently in the document in consideration (term frequency) and rarely in the remainder of the collection (document frequency). These terms are suitable to distinguish a document from other documents in the collection. The third component, the document length normalization, was introduced to ensure that no documents of a particular length are unduly favored, offsetting the increased probability of observing terms frequently simply due to the verbosity of the document.

The concept of “being characteristic”, embodied through tf as well as df, is quite general and therefore applicable to other non-textual modalities (Robertson and Zaragoza 2009). One basically needs to check the assumption that an “unforeseen” local frequency of a feature hints at relevance. For non-textual modalities, the “term frequency” is usually referred to as “feature frequency” (ff). In the remainder of this paper, we will use the two expressions interchangeably. In Sect. 5, we show how we can define the tf and df for the two non-textual modalities ratings and geographical coordinates.

When analyzing the requirements of a weighting scheme for effective merging of ranked lists, usually the raw-score merging hypothesis is considered. The raw-score merging hypothesis states that similarity values are directly comparable if they are produced by similar search engines operating on underlying collections with similar statistics (Braschler 2004; Kwok et al. 1995; Savoy 2003, 2005). In Appendix 1, we show that it is favorable to use the same weighting scheme for all modalities when using raw-score merging. However, already textual modalities often invalidate the raw-score merging hypothesis w.r.t. similar collection statistics. For non-textual modalities, this is usually even more severe, since they do not follow the language statistics. Therefore, we propose a sampling-based approach in Sect. 6 to eliminate the differences in average and variance of document lengths and show that BM25 satisfies the derived properties, which makes it a viable weighting scheme for raw-score merging.

We can summarize the challenges of building multimodal IR systems discussed in this paper as follows.

  1. Adapt BM25 to non-textual modalities

     (a) Define tf, df and document length

     (b) Validate generalizability of document length normalization

  2. Evaluate merging strategies (raw-score merging hypothesis)

  3. Validate suitability of BM25 for raw-score merging

  4. Evaluate effectiveness of the approach

3 Related work

Much work has been done using additional non-textual modalities in order to improve the retrieval effectiveness of textual IR systems. A famous example is the query-independent modality PageRank (Brin and Page 1998), and it is now an established practice to use modalities such as URL type, anchor text and link indegree in retrieval of Web data (Craswell et al. 2005; Hashemi and Kamps 2014; Macdonald et al. 2015). Many other retrieval research sub-fields, such as geographical IR (Mandl et al. 2009), image retrieval (Villegas et al. 2015), XML retrieval (Kamps et al. 2004) and living labs (Schuth et al. 2015), provide and use a large range of different modalities in order to optimize the retrieval results. Hereby, the additional modalities are often no longer query-independent, but are also explicitly or implicitly (e.g. inside a user profile) part of the query. In contrast to this paper, most of these models have been developed for a specific modality, and generalization to other modalities was not a focus.

For non-textual modalities, document length normalization is particularly important, since items usually have large variances in the “length” of their content in terms of those modalities. In textual retrieval, a number of efforts have investigated the role of document length in ranking documents. Generally, the consensus is that including document length normalization in weighting schemes tends to improve retrieval performance (Amati and Rijsbergen 2002; Chowdhury et al. 2002; Losada and Azzopardi 2008; Singhal et al. 1996). The weighting scheme Lnu.ltn (Singhal et al. 1996) is explicitly based on the idea of revisiting the cosine document length normalization of TF.IDF. Singhal et al. (1996) estimate the likelihood of relevance and the likelihood of retrieval for all document lengths and improve the document length normalization by tilting the slope of the likelihood of retrieval in order to better match the slope of the likelihood of relevance. This tilt of the slopes then results in the new, improved “pivoted document length normalization scheme”. Investigations of the document length normalization of the BM25 weighting scheme have shown that it fails when documents are very long (Lv and Zhai 2011) and that choosing the right document length normalization parameter b in BM25 can increase retrieval performance by 22% (Chowdhury et al. 2002). In XML retrieval, document length normalization is particularly important, since the retrievable items (XML elements) vary greatly in length. Kamps et al. (2004) revisit the role of language model document length normalization in the context of XML retrieval. Amongst others, they found that a combination of restricting the minimal size of the XML elements and length priors results in higher effectiveness.

Oftentimes multiple intermediate result lists, one per modality, are produced when matching on multimodal collections. The problem of merging multiple ranked lists into a single ranked list is known from multilingual, multimedia and distributed retrieval. Fox and Shaw (1994) propose different strategies to fuse the scores; e.g. the sum of the scores or the maximal score. However, as Callan et al. (1995) point out, the scores might not be directly comparable, due to the different ranges of the scores.

The merging problem is very prominently studied in the multimedia IR community. Depeursinge and Müller (2010) show that 62% of the ImageCLEF working notes deal with data fusion; their detailed analysis reveals that, similar to all the other domains, the most used fusion strategy is a linear combination of the scores. Mostly, the weights of the linear combination are either found manually or based on training data. Wilkins et al. (2006), however, describe a method to automatically determine query-dependent modality weights using the score distribution of visual and textual modalities in the context of video retrieval. Another unsupervised method to fuse multiple ranked lists for medical IR is presented by Mourão et al. (2015). Their fusion method combines the inverse rank approach of reciprocal rank fusion (Cormack et al. 2009) with the number of times a document appears at a rank and achieves a high precision. The unsupervised methods proposed in this paper try to fuse the modality scores without any weights, which, we claim, is possible when treating all modalities with the same model.

Robertson et al. (2004) show the problems that arise when using a linear combination of the scores obtained from scoring multiple textual fields individually using BM25. The most important reason why this leads to poor retrieval effectiveness is the non-linear treatment of the term frequencies. This non-linearity is desirable for individual fields, since the information gained on observing a term for the first time is greater than the information gained on subsequently seeing the term. However, when using a linear combination of scores, this non-linearity breaks. Therefore, Robertson et al. (2004) propose a method that uses a linear combination of the term frequencies instead of a linear combination of the scores, which solves the problem. The term frequency is not the only point that has to be considered in a retrieval setup with multi-field documents; the document length and the parameters of the weighting scheme also have to be questioned. When computing a score for each individual field, the weighting scheme parameters, in BM25 the tf saturation parameter \(k_1\) and the document length normalization parameter b, have to be optimized for each field individually, which results in a huge number of optimization parameters. With the method suggested by Robertson et al. (2004), only two weighting scheme parameters have to be optimized. The suggested method also leads to substantially different term frequencies, since the content of the fields is replicated according to the weight; the authors therefore suggest using an adapted \(k_1\) that scales the original \(k_1\) by the ratio between the original and the resulting average term frequency. For our methods, we use this idea of scaling \(k_1\) when sampling all modalities to the same length.

4 Validating the generalizability of document length normalizations

Similar to traditional textual retrieval, special care needs to be taken to handle varying document lengths for non-textual modalities as well. Non-textual modalities can have large variances in document lengths. In order to find a suitable weighting scheme for non-textual modalities, we analyze four of the best-known weighting schemes with respect to the robustness of their document length normalization.

Fig. 2 Likelihood of retrieval/relevance for the TREC 5 / TREC 8 data using 24 bins and the original weighting schemes

The experiments are conducted using the TREC 5 ad hoc collection (Voorhees and Harman 1996) and the TREC 8 ad hoc collection (Voorhees and Harman 1999). The choice of these rather classic test collections is motivated as follows: TREC 5 includes the Federal Register sub-collection that contains very lengthy documents, resulting in a high variance w.r.t. the document lengths of the collection. TREC 8 has been chosen due to its use in earlier literature about document length normalization (Chowdhury et al. 2002; Losada and Azzopardi 2008; Lv and Zhai 2011); however, it has a smaller variance w.r.t. the document lengths than TREC 5, and we therefore expect the effects of the document length component to be less pronounced. We used the full datasets and automatically generated queries from the topic title (T) and the description (D).

We examine the document length normalization and its impact on the retrieval effectiveness using the idea of Singhal et al. (1996). They calculate the likelihood of retrieval and relevance for each document length and employ these to adjust the document length normalization. We use these two likelihoods to visualize the effectiveness of the document length normalization of the four weighting schemes under study. To compute these likelihoods, the documents are binned by their length. For each bin, the likelihood is defined as the ratio between the number of relevant/retrieved documents and the total number of documents in the bin. We then plot the likelihoods against the median document length in the bins.
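A simplified sketch of this computation follows; the function and variable names are our own, and relevance and retrieval are treated as binary per-document flags aggregated over the topic set, which is an assumption of this illustration.

```python
import numpy as np

def likelihoods_by_length(doc_lengths, relevant, retrieved, n_bins=24):
    """Bin documents by length; per bin, return the median length and the
    fraction of relevant and of retrieved documents (the two likelihoods)."""
    doc_lengths = np.asarray(doc_lengths)
    relevant = np.asarray(relevant, dtype=float)
    retrieved = np.asarray(retrieved, dtype=float)
    order = np.argsort(doc_lengths)
    rows = []
    for bin_idx in np.array_split(order, n_bins):  # equally sized bins
        rows.append((np.median(doc_lengths[bin_idx]),
                     relevant[bin_idx].mean(),     # likelihood of relevance
                     retrieved[bin_idx].mean()))   # likelihood of retrieval
    return rows
```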

Figure 2 shows the likelihood of relevance (bold line) and the likelihood of retrieval for all the weighting schemes for the TREC 5 and TREC 8 collections. The documents are divided into 24 bins. As shown in this figure, longer documents have a higher probability of being relevant and retrieved. For both TREC 5 and TREC 8, as well as for the long (TD) and short topics (T), the likelihood-of-retrieval curves of BM25 and DFR match the likelihood of relevance the best, and we conclude that BM25 is able to handle large variances in document length. Since the document length normalization of BM25 is robust, it is suited to be used with non-textual modalities without any restriction regarding the variance of document lengths. Note that we did not include weighting scheme extensions, such as BM25L (Lv and Zhai 2011), that specifically target the robustness of the document length normalization, since they usually come with further assumptions regarding the statistics of the modalities.

5 BM25 model for non-textual modalities

5.1 BM25

Our experiments validating the raw-score merging hypothesis and the generalizability of the document length normalization show that BM25 both works best for raw-score merging and is amongst the most robust weighting schemes under highly varying document lengths. Therefore, we focus our work with non-textual modalities on BM25.

Let us explore multimodal document collections such as those used in GeoCLEF (Mandl et al. 2009) or in the social book search lab (Bogers et al. 2014). In these collections, documents are no longer represented by a set of terms (textual features) alone, but also by geographical features or book ratings that further describe the documents.

Table 1 Notation used for the BM25 for textual and non-textual modalities

In this section, we first recapitulate BM25 for a textual modality and then show how its idea can be adapted to geographical coordinates and to book ratings. Table 1 shows the notation used for BM25 as well as for its non-textual adaptations.

The retrieval status value (RSV) of document \(d_j\) w.r.t. query q when using BM25 can be written as an inner product

$$\begin{aligned} w(\varphi _k, d_j)& := \frac{{\text {ff}}(\varphi _k, d_j)}{k_1 \left( (1 - b) + b\frac{l_j}{\Delta }\right) + {\text {ff}}(\varphi _k, d_j)} \end{aligned}$$
(2)
$$\begin{aligned} w(\varphi _k, q)& := {\text {ff}}(\varphi _k, q) \cdot \log \left( \frac{0.5 + N - {\text {df}}(\varphi _k)}{0.5 + {\text {df}}(\varphi _k)}\right) \end{aligned}$$
(3)
$$\begin{aligned} {\text {RSV}}_{\text {BM25}}(q, d_j)& := \sum _{\varphi _k \in \Phi (q) \cap \Phi (d_j)} w(\varphi _k, d_j) \cdot w(\varphi _k, q), \end{aligned}$$
(4)

where \(k_1\) is the tf saturation parameter and b is the document length normalization parameter.
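For concreteness, the following is a direct transcription of Eqs. (2)–(4) into Python. It is a sketch using our own data structures (dictionaries mapping features to ff and df values), not the API of any particular retrieval engine.

```python
import math

def bm25_rsv(query_ff, doc_ff, df, N, l_j, avg_len, k1=1.2, b=0.75):
    """RSV_BM25(q, d_j) following Eqs. (2)-(4): sum, over the features shared
    by query and document, of the document weight times the query weight."""
    rsv = 0.0
    for feat, ff_q in query_ff.items():
        ff_d = doc_ff.get(feat, 0)
        if ff_d == 0:
            continue  # only features in both q and d_j contribute
        w_d = ff_d / (k1 * ((1 - b) + b * l_j / avg_len) + ff_d)        # Eq. (2)
        w_q = ff_q * math.log((0.5 + N - df[feat]) / (0.5 + df[feat]))  # Eq. (3)
        rsv += w_d * w_q                                                # Eq. (4)
    return rsv
```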

For its document length normalization, BM25 (Robertson and Zaragoza 2009; Robertson et al. 1980) assumes a standard length of a document, represented by the average document length. Hence, an author can decide to write a document longer or shorter than the standard length. Robertson and Zaragoza (2009) and Robertson et al. (1980) describe two reasons why an author might decide to write a long document: either the author is more verbose than others, or the author covers a larger scope. The verbosity assumption would lead to a division of the tf values by the document length. The scope assumption points to an opposite course of action, hence not dividing at all. Normally, the reason for a longer document is a combination of the two; thus, Robertson’s normalization balances the two using a tuning parameter b. Robertson proposed to use the number of tokens in a document as the document length, although he pointed out that BM25 should lead to similar results with slightly different definitions of the document length, such as the number of characters. When using BM25 for non-textual modalities, it needs to be considered whether this assumption holds true for those as well.

Since BM25 was originally designed for textual modalities, the question arises whether its concept depends on the Zipfian distribution of the modalities, as is the case for natural language features. In particular, the heuristic definition of the inverse document frequency (idf) can be motivated by Zipf’s law. However, over the years, several other interpretations of why the idf works as well as it does have been put forward; for example, that the idf corresponds to the probability of a term appearing in a document, or to Shannon’s information theory, as described by Robertson (2004). It is therefore unclear how much the performance of BM25 depends on the Zipfian distribution of the modalities. Although we will not further investigate this question in this paper, we assume that BM25 is generalizable to non-textual modalities with any distribution, as long as the tf and idf can be defined in a way that the characteristic features still emerge.

Apart from the open question of how well BM25 generalizes to modalities with a non-Zipfian distribution, it has been shown that BM25 is indeed generalizable to modalities with a Zipfian distribution, such as a bag of visual words in multimedia retrieval (Yang et al. 2007). The distributions of the modalities we use in our experiments also satisfy Zipf’s law. In the case of the GeoCLEF collection, which we use for our experiments with geographical coordinates, the coordinates have a Zipfian distribution, since they are extracted from the locations mentioned in the textual representation. Further, we analyzed the distribution of the ratings in the social book search collection and found that they also have an approximately Zipfian distribution. The distribution of the ratings in this collection does not seem to be an exception, but appears to be a general phenomenon (Dalvi et al. 2013; Rajaraman 2009; Woolf 2014).

The tf saturation is parametrized by \(k_1\) and makes sure that an increase of an already high tf contributes less to the score than an increase of a smaller tf. The higher the \(k_1\) value, the more an increase of a high tf contributes to the score, i.e. the saturation is less pronounced with high \(k_1\) values.

The optimal choice of \(k_1\) is not simple to make and also depends on the collection (Chowdhury et al. 2002). Further, \(k_1\) needs to be adjusted if documents are replicated (Robertson et al. 2004). When replicating the content of all the documents (concatenating each document with itself, so that all documents have twice the length), neither the informativeness of a single document nor the relevance of the documents to a particular query changes. However, if \(k_1\) is not adjusted, the BM25 weighting scheme will not lead to the same ranked list as without the replication. The BM25 weight for a document \(d'_j\) that is replicated x times is

$$\begin{aligned} w(\varphi _k, d'_j, k_1)&= \frac{x \cdot {\text {ff}}(\varphi _k, d_j)}{k_1 ((1 - b) + b\frac{x \cdot l_j}{x \cdot \Delta }) + x \cdot {\text {ff}}(\varphi _k, d_j)}, \end{aligned}$$
(5)

which is not order preserving. However, if we set \(k'_1 = x \cdot k_1\), we get \(w(\varphi _k, d'_j, k'_1) = w(\varphi _k, d_j, k_1)\), with which we can maintain the original ordering.
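A quick numeric check of this adjustment (our own illustration): under replication, ff, \(l_j\) and \(\Delta\) all scale by x, so the length ratio cancels, and scaling \(k_1\) by x restores the original weight.

```python
def doc_weight(ff, k1, b, l, avg_l):
    """Document-side BM25 weight of Eq. (2)."""
    return ff / (k1 * ((1 - b) + b * l / avg_l) + ff)

ff, k1, b, l, avg_l, x = 3, 1.2, 0.75, 100.0, 120.0, 2
w = doc_weight(ff, k1, b, l, avg_l)
# replicate x times and scale k_1 accordingly (k'_1 = x * k_1)
w_repl = doc_weight(x * ff, x * k1, b, x * l, x * avg_l)
assert abs(w_repl - w) < 1e-12  # the weight, and hence the ordering, is preserved
```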

5.2 Geographical coordinates

For our BM25 model for geographical coordinates, we consider documents that are enriched with a discrete set of geographical coordinates. We model the three main ingredients of our weighting scheme, ff, df and document length, as follows. The ff of a coordinate in a document is defined as the number of occurrences of that coordinate in the document. The df is the number of documents that contain this coordinate, and the document length is the number of locations in a document. Hereby, we assume that a document annotated with many geographical coordinates covers a larger scope than a document with fewer coordinates; thus, the argument of the textual BM25 document length normalization holds. Further, we assume that queries ask for documents in a specific geographical area; therefore, a query is described by a single bounding box that encloses this area. The feature set and the feature frequency of a geographical feature \(\varphi _k\) for a query q are defined as

$$\begin{aligned} \Phi (q)& := {\text {boundingbox}}(q) \end{aligned}$$
(6)
$$\begin{aligned} {\text {ff}}(\varphi _k, q)& := 1. \end{aligned}$$
(7)
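A sketch of how Eqs. (6) and (7) could be realized follows; the naming and the bounding-box containment test are our own assumptions. The query’s feature set contains every indexed coordinate falling inside the query’s bounding box, each with feature frequency 1.

```python
def geo_query_features(bbox, indexed_coords):
    """Phi(q) and ff(phi_k, q) per Eqs. (6)-(7): all indexed coordinates
    inside the query bounding box, each with feature frequency 1."""
    (min_lat, min_lon), (max_lat, max_lon) = bbox
    return {(lat, lon): 1
            for lat, lon in indexed_coords
            if min_lat <= lat <= max_lat and min_lon <= lon <= max_lon}

# bounding box roughly enclosing Northern Europe (illustrative values only)
phi_q = geo_query_features(((54.0, 4.0), (71.0, 32.0)),
                           [(59.3, 18.1), (48.2, 16.4), (60.2, 24.9)])
# -> the second coordinate lies outside the box and is excluded
```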

5.3 Ratings of books

For the ratings, we consider documents that describe books, including ratings given by their readers. When searching for books with a textual query, we do not know any query-specific preference for a rating. However, we assume that in general readers will prefer books with higher ratings. If the ratings are in the range between one and five, we define the query as

$$\begin{aligned} \Phi (q)& := \{1, 2, 3, 4, 5\}\end{aligned}$$
(8)
$$\begin{aligned} {\text {ff}}(\varphi _k, q)& := \varphi _k. \end{aligned}$$
(9)

Hereby, all the possible ratings (1–5) are part of the query, while the weight of a rating is equal to the rating itself; i.e. the weight of the rating 5 is 5 times higher than the weight of the rating 1. The three main ingredients of our weighting scheme, feature frequencies \({\text {ff}}(\varphi _k, d_j)\), document frequencies \({\text {df}}(\varphi _k)\) and document lengths \(l_j\), are defined analogously to their definition for textual modalities. The ff is the number of times a rating occurs in a given document, the df is the number of documents that contain a given rating, and the document length is the number of ratings in a document. We assume that a document with many ratings covers a larger range of opinions, hence covering a larger scope; thus, the argument of the textual BM25 document length normalization holds.
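A minimal sketch (our own naming) of these definitions for one document of the ratings modality; the document frequency would be aggregated over the whole collection analogously.

```python
from collections import Counter

# Eqs. (8)-(9): all ratings 1..5 are query features, weighted by their value
query_ff = {rating: rating for rating in (1, 2, 3, 4, 5)}

def ratings_document(doc_ratings):
    """ff(phi_k, d_j) and document length l_j for the ratings modality."""
    ff = Counter(doc_ratings)  # how often each rating value occurs
    return ff, len(doc_ratings)

ff, l_j = ratings_document([5, 4, 5, 3, 5])
print(ff[5], l_j)  # 3 occurrences of rating 5; document length 5
```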

6 Sampling-based BM25 for modality merging

6.1 Sampling

The proposed BM25 adaptation for non-textual modalities enables us to merge modalities using the same weighting scheme, i.e. a similar search engine as requested by the raw-score merging hypothesis. However, the raw-score merging hypothesis not only demands that similar search engines are used, but also that the collection statistics are similar. Note that the raw-score merging hypothesis is a rather old concept that was introduced for merging multiple, possibly distributed textual document collections. In retrieval tasks with multiple modalities, the “collections” are no longer a set of textual documents but the different modalities. We have seen that the non-textual modalities have vastly different collection statistics, which invalidates the raw-score merging hypothesis. Therefore, we suggest a sampling-based approach that allows us to adjust some properties of the collection statistics in order to reduce the difference. In particular, we adjust the average document length and the variance of the document lengths.

Our proposed sampling approach is similar to what is done in image retrieval when using dense or random feature sampling, where the same number of features is used for each image regardless of the pixel density and the number of concepts shown in the image (Moulin et al. 2010). The idea is to sample all modalities in all documents to a fixed document length, as illustrated in Fig. 3 for a single modality, before BM25 is applied. Hereby, we use the number of tokens as the document length, although different definitions can be used. This results in the same collection statistics for all the modalities with respect to the average document length and the variance of document lengths. Namely, the average document length is the sampling size and the variance is zero. Since all documents have the same length, no BM25 document length normalization is necessary; thus, we choose \(\textit{b}=0\).

Fig. 3 Visualization of sampling three documents to the sampling size 5

The randomized sampling, however, leads to data loss due to downsampling, and to non-deterministic results. Therefore, we idealize the sampling idea by not sampling the documents but simply simulating the resulting term statistics. This can be done by scaling the feature frequencies by the relative change of the document length that would result from sampling. For a single document \(d_j\) and a single modality with length \(l_j\) and a token \(\varphi _k\) with the feature frequency \({\text {ff}}(\varphi _k, d_j)\), the scaled feature frequency \({\text {ff}}'(\varphi _k, d_j)\) is

$$\begin{aligned} {\text {ff}}'(\varphi _k, d_j) = {\text {ff}}(\varphi _k, d_j) \cdot \frac{s}{l_j}, \end{aligned}$$
(10)

where s is the sampling size (the fixed length of all documents). For example, if s is \(3l_j\), all term frequencies are multiplied by 3.

We denote our idealized sampling-based BM25 adaptation BM25*S, where S stands for the sampling and the asterisk shows that no traditional document length normalization is applied; i.e. \(\textit{b}=0\). The resulting BM25*S weight for document \(d_j\) with sampling size s is

$$\begin{aligned} w_{\text {BM25*S}}(\varphi _k, d_j)&= \frac{ {\text {ff}}(\varphi _k, d_j) \cdot \frac{s}{l_j} }{ k_1 + {\text {ff}}(\varphi _k, d_j) \cdot \frac{s}{l_j}}. \end{aligned}$$
(11)

Our sampling approach is a form of document replication, and thus the ff saturation parameter \(k_1\) is no longer optimal, as described in Sect. 5 and by Robertson et al. (2004). In order to achieve the same retrieval effectiveness as without the sampling, the \(k_1\) parameter needs to be adjusted. Since not all documents are replicated with the same factor, the optimal adjustment of the \(k_1\) parameter cannot simply be the replication factor as in Sect. 5. However, we observed an approximately linear dependency of the optimal \(k_1\) parameter on the average document length. Therefore, we set

$$\begin{aligned} k'_1 = \frac{\Delta '}{\Delta } \cdot k_1, \end{aligned}$$
(12)

where \(\Delta\) is the average document length of the original documents and \(\Delta '\) is the average document length of the sampled documents. This adjustment is slightly different from the adjustment suggested by Robertson et al. (2004), who used the ratio between the average term frequencies rather than the average document lengths. With their setup, the two ratios are equivalent. With the sampling, the two ratios are not exactly equal, although quite similar; therefore, both options seem valid. Further, when sampling, calculating the ratio between the average document lengths is a lot simpler than between the average term frequencies, since the average document length after the sampling is equal to the sampling size (\(\Delta ' = s\)), while the new average term frequencies are only known after the sampling is performed.

The weight for a document \(d_j\), when using the combination of the idealized sampling and the \(k_1\) adjustment (BM25-sampled), is calculated as

$$\begin{aligned} w_{{\text {BM25-sampled}}}(\varphi _k, d_j)&= \frac{ {\text {ff}}(\varphi _k, d_j) \cdot \frac{s}{l_j} }{ k_1 \cdot \frac{s}{\Delta } + {\text {ff}}(\varphi _k, d_j) \cdot \frac{s}{l_j}}. \end{aligned}$$
(13)

We now have a sampling method, BM25-sampled, that can be applied to all modalities. We suggest using the same sampling length for all modalities, which results in the same collection statistics for all modalities with respect to the average document length and the variance in document lengths. Hence, the raw-score merging hypothesis is fulfilled with respect to these two properties.

We can prove that this sampling method results in exactly the same weights as BM25 with the normalization parameter b set to one.

Proof

$$\begin{aligned} w_{{\text {BM25-sampled}}}(\varphi _k, d_j)&= \frac{ {\text {ff}}(\varphi _k, d_j) \cdot \frac{s}{l_j} }{ k_1 \cdot \frac{s}{\Delta } + {\text {ff}}(\varphi _k, d_j) \cdot \frac{s}{l_j}}\\&= \frac{ {\text {ff}}(\varphi _k, d_j) }{ k_1 \cdot \frac{s}{\Delta } \cdot \frac{l_j}{s} + {\text {ff}}(\varphi _k, d_j) }\\&= \frac{ {\text {ff}}(\varphi _k, d_j) }{ k_1 \cdot \frac{l_j}{\Delta } + {\text {ff}}(\varphi _k, d_j) }\\&= w_{{\text {BM25}}(\textit{b}=1)}(\varphi _k, d_j). \end{aligned}$$

\(\square\)
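The identity can also be checked numerically; the following is a small sketch of our own, comparing Eq. (13) with the plain BM25 document weight at \(b=1\).

```python
def w_bm25(ff, l, avg_l, k1=1.2, b=1.0):
    """Document-side BM25 weight of Eq. (2)."""
    return ff / (k1 * ((1 - b) + b * l / avg_l) + ff)

def w_bm25_sampled(ff, l, avg_l, s, k1=1.2):
    ff_s = ff * s / l                      # Eq. (10): idealized sampling
    return ff_s / (k1 * s / avg_l + ff_s)  # Eq. (13): adjusted k_1

for ff, l in [(1, 10.0), (3, 50.0), (7, 400.0)]:
    assert abs(w_bm25(ff, l, 120.0) -
               w_bm25_sampled(ff, l, 120.0, s=1000.0)) < 1e-12
```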

This proof shows that BM25 with full document length normalization (\(\textit{b}=1\)) already guarantees that the raw-score merging hypothesis is fulfilled with respect to the average document length and the variance in document lengths. Therefore, BM25 seems well suited for use in a multimodal retrieval task. It has, however, been shown that using \(\textit{b}=1\) for BM25 tends to underestimate the relevance of long documents, and therefore usually a smaller b is used, e.g. \(\textit{b}=0.75\). In the following, we show how the sampling idea can be extended to allow arbitrary document length normalization parameters \(\textit{b}\).

6.2 Scope-aware sampling

Sampling all documents to the same length, which is equal to using BM25 with full document length normalization (\(\textit{b}=1\)), assumes that all documents have the same scope. However, some documents might discuss more topics than others and thus should indeed be represented with more tokens, as described in Sect. 5. Similarly to BM25, we assume that the original document lengths give an indication of the documents’ scope. Thus, we can account for different document scopes by sampling the documents to different lengths based on their original length.

Many different definitions of a scope-aware sampling length using a document length normalization parameter bs are possible. We can, however, choose a definition so that the sampling-based approach is identical to the traditional BM25 with parameter \(b=bs\). We therefore define the adjusted number of sampled tokens \(s'\) for a document \(d_j\) as

$$\begin{aligned} s'(d_j) = l_j \cdot \frac{s}{\left(1-bs + bs \cdot \frac{l_j}{\Delta }\right) \cdot \Delta }. \end{aligned}$$
(14)

All documents are now sampled to their corresponding sampling size \(s'(d_j)\) rather than the same sampling size s for all documents. The adjusted feature frequencies therefore are

$$\begin{aligned} {\text {ff}}'(\varphi _k, d_j)& = {\text {ff}}(\varphi _k, d_j) \cdot \frac{s'(d_j)}{l_j}\nonumber \\& = {\text {ff}}(\varphi _k, d_j) \cdot \frac{s}{\left(1-bs + bs \cdot \frac{l_j}{\Delta }\right) \cdot \Delta }. \end{aligned}$$
(15)

Unfortunately, this non-linear transformation of the document lengths does not result in exactly the same average document length for each modality, which would be necessary to fulfill the raw-score merging hypothesis. However, we found that the sampled average document lengths of the modalities are close to each other, and in practice it is a valid assumption that they are equal.

Further, we have found that the optimal \(k_1\) no longer has a linear dependency on the new average document length \(\Delta '\), as we found for the sampling with a fixed sampling size s (BM25-sampled) described in Sect. 6.1. It rather has a linear dependency on the sampling size s. Thus, for the scope-aware sampling, we adjust the \(k_1\) parameter as

$$\begin{aligned} k'_1 = \frac{s}{\Delta } \cdot k_1. \end{aligned}$$
(16)

We denote this scope-aware sampling with the \(k_1\) adjustment and the non-normalized BM25 as BM25-scope. Its weight for a document \(d_j\) is calculated as

$$\begin{aligned} w_{{\text {BM25-scope}}} = \frac{ {\text {ff}}(\varphi _k, d_j) \cdot \frac{s}{\left(1-bs + bs \cdot \frac{l_j}{\Delta }\right) \cdot \Delta } }{ k_1 \cdot \frac{s}{\Delta } + {\text {ff}}(\varphi _k, d_j) \cdot \frac{s}{\left(1-bs + bs \cdot \frac{l_j}{\Delta }\right) \cdot \Delta }}. \end{aligned}$$
(17)

With the scope-aware sampling, it is possible to achieve approximately the same average document length for all modalities by using the same sampling size parameter s for all modalities, while documents with a large scope are still represented by more tokens.

We can show that this scope-aware sampling is identical to the traditional BM25 for any document length parameter bs.

Proof

$$\begin{aligned} w_{{\text {BM25-scope}}}&= \frac{ {\text {ff}}(\varphi _k, d_j) \cdot \frac{s}{\left(1-bs + bs \cdot \frac{l_j}{\Delta }\right) \cdot \Delta } }{ k_1 \cdot \frac{s}{\Delta } + {\text {ff}}(\varphi _k, d_j) \cdot \frac{s}{\left(1-bs + bs \cdot \frac{l_j}{\Delta }\right) \cdot \Delta }}\nonumber \\&= \frac{ {\text {ff}}(\varphi _k, d_j) }{ k_1 \cdot \frac{s}{\Delta } \cdot \frac{(1-bs + bs \cdot \frac{l_j}{\Delta }) \cdot \Delta }{s} + {\text {ff}}(\varphi _k, d_j) }\nonumber \\&= \frac{ {\text {ff}}(\varphi _k, d_j) }{ k_1 \cdot (1-bs + bs \cdot \frac{l_j}{\Delta }) + {\text {ff}}(\varphi _k, d_j) }\nonumber \\&= w_{{\text {BM25}}(\textit{b=bs})}(\varphi _k, d_j). \end{aligned}$$
(18)

\(\square\)
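Again, the identity is easy to verify numerically; the following sketch is our own, analogous to the one for BM25-sampled.

```python
def w_bm25(ff, l, avg_l, k1=1.2, b=0.75):
    """Document-side BM25 weight of Eq. (2)."""
    return ff / (k1 * ((1 - b) + b * l / avg_l) + ff)

def w_bm25_scope(ff, l, avg_l, s, k1=1.2, bs=0.75):
    ff_s = ff * s / ((1 - bs + bs * l / avg_l) * avg_l)  # Eq. (15)
    return ff_s / (k1 * s / avg_l + ff_s)                # Eq. (17)

for ff, l in [(1, 10.0), (3, 50.0), (7, 400.0)]:
    assert abs(w_bm25(ff, l, 120.0) -
               w_bm25_scope(ff, l, 120.0, s=1000.0)) < 1e-12
```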

Since BM25 is identical to our sampling approach BM25-scope, BM25 also fulfills the raw-score merging hypothesis with respect to the average document length for any document length normalization parameter. We can therefore conclude that differences between average document lengths can be ignored when using raw-score merging with BM25. Hence, we can use BM25 with the same document length normalization parameter b for all modalities. The sampling approach is not needed in practice, since we have shown that it is identical to BM25.

Unlike with BM25 with full document length normalization (\(\textit{b}=1\)), however, the variances of the document lengths are not necessarily the same. Using our sampling idea, we can further adjust the definition of the sampled number of tokens in order to compensate for the different variances of document lengths. We first apply a transformation to the document lengths to adjust the variance and then adjust the average document lengths as in Eq. 14 using the transformed document lengths. Thus, we do not ensure that all variances in document length are the same, but we ensure that the ratio between the standard deviation and the average document length is the same for all modalities. The adjusted number of tokens \(s''\) with the adjustment for the variance of document lengths is

$$\begin{aligned} l_j'&= (l_j - \Delta ) \cdot rs \cdot \frac{\Delta }{\sigma } + \Delta \end{aligned}$$
(19)
$$\begin{aligned} s''(d_j)&= l_j' \cdot \frac{s}{\left(1-bs + bs \cdot \frac{l_j'}{\Delta }\right) \cdot \Delta }, \end{aligned}$$
(20)

where \(\sigma\) is the standard deviation of the document lengths and rs is the variance parameter that defines the target ratio between the standard deviation and the mean. We denote this sampling variation as BM25-var.
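A sketch of Eqs. (19)–(20), with our own naming, computing the variance-adjusted sampling length for one document:

```python
def sampling_size_var(l_j, avg_l, sigma, s, bs=0.75, rs=1.0):
    """Eqs. (19)-(20): rescale the deviation of l_j from the mean so that the
    ratio of standard deviation to mean becomes rs, then apply the
    scope-aware sampling length to the transformed document length."""
    l_adj = (l_j - avg_l) * rs * avg_l / sigma + avg_l          # Eq. (19)
    return l_adj * s / ((1 - bs + bs * l_adj / avg_l) * avg_l)  # Eq. (20)
```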

7 Experiments

The focus of our evaluation lies on measuring the effectiveness of a multimodal IR system built according to our guidelines (consistent treatment of the modalities, little or no training). In the scenarios we are interested in, the system needs to incorporate all modalities; ignoring modalities is not an option.

Our test system is built on top of Lucene and uses the built-in weighting schemes wherever possible. For the scaled feature frequency and the \(k_1\) adjustment, we adapted the built-in BM25 implementation. The merging of the modalities is performed using raw-score merging (“.raw”) or a linear combination of the scores (“.opt”). By using the latter, we violate our goal of using no training phase. Indeed, we use the opt-variant only for comparison purposes, as a benchmark. In line with this role as a sort of “upper bound” on performance, we train the optimal weights using the same collection as used for testing. In essence, for the opt-variant, we are only interested in showing that the effectiveness can be improved using BM25 on multiple textual as well as non-textual modalities.

Our experiments use two multimodal test collections, GeoCLEF and SBS.

7.1 Test collections

7.1.1 GeoCLEF

For the experiments with the geographical modality, we use the topics and collection of the GeoCLEF 2008 (Mandl et al. 2009) monolingual English search task. The collection is composed of news articles from the British newspaper The Glasgow Herald (1995) and the American newspaper The Los Angeles Times (1994). In this task, 24 geographically challenging topics have been defined, e.g. “Nobel prize winners from Northern European countries”. Here, we can differentiate between the textual information “Nobel prize winners” and the geographical information “from Northern European countries”. One of the challenges of geographical IR is that relevant documents not only contain the textual representation of geographical information, “Northern European countries”, but also concepts such as unions, countries or cities inside the geographical region.

Overell et al. (2008) and Buscaldi and Rosso (2008) proposed to separate the geographical information from the textual information, so that the two modalities (geographical and textual) can be treated differently. This allows the additional information about geographical regions to be considered. Buscaldi and Rosso (2008) extracted location names from the documents and topics and mapped them to their geographical coordinates (longitude, latitude) using GeoWordNet. D. Buscaldi provided us with a preprocessed, geotagged version of the GeoCLEF 2008 collection. Further, we preprocessed the title fields of the topics by manually extracting a geographical bounding box for each topic. This could also be done automatically using the convex hull of the locations found with GeoWordNet (Buscaldi and Rosso 2008).

An important characteristic of the collection and task described above is the overlap of the textual and geographical modalities, since the geographical modality is extracted from the text. Therefore, we also created a second modified version of the GeoCLEF 2008 test collection, which separates the geographical and textual information. For this, we removed the textual description of the geographical region from the queries; e.g. the query “Nobel prize winners from Northern European countries” becomes “Nobel prize winners” with the geographical bounding box that includes all Northern European countries. In the experiments, we refer to this task as “geoCLEFmod”.

7.1.2 Social book search

For the experiments using the ratings as an additional modality, we use the social book search (SBS) 2016 lab task (Koolen et al. 2016). The collection consists of 2.8 million books from Amazon, extended with social meta-data from LibraryThing. For each book, the fields ISBN, title, review, summary, ratings and tags are given. Each query is constructed from a real user request on LibraryThing. The query not only includes the title of the request and the description of the request itself, but also example books mentioned by the user. Additionally, the personal catalog of each topic creator is available, which includes a list of the books the user has archived on LibraryThing along with their personal ratings. The relevance assessments are based on the actual suggestions to the original query on the LibraryThing forum. Forum suggestions normally get a relevance value of 1; however, if the suggested book is already in the personal catalog of the topic creator, the relevance value is 0. When the topic creator actually adds a suggested book to their library, it is considered highly relevant and receives a relevance value of 4.

For the textual modality, we use the textual baseline established in our SBS participation (Imhof 2016; Imhof et al. 2015). We combine all textual fields of the documents into a single textual index field. The queries are constructed from the two textual topic fields title and request that are analogously combined into a single textual representation. Further, we expand the query text with the 35 most characteristic terms (determined by BM25) from the textual representation of the content of the example books given by the topic creator. All books already read by the topic creator are filtered from the result list.

7.2 Results

Following our own guidelines on how to build a multimodal IR system, we sample the non-textual modalities to the same length as the textual modality. For the GeoCLEF 2008 collection, we therefore sample the geographical modality from an average document length of 7.4 to the sampling length of 357.7. Analogously, the ratings in the SBS collection, with an average document length of 5.05, are sampled to the sampling length of 674.7. The target standard deviation ratio parameter rs is chosen based on the textual modality as well: 1.01 for GeoCLEF 2008 and 2.75 for SBS. This results in a reduction of the standard deviation of the non-textual modalities to 83% and 93%, respectively. For the runs using the scope-aware sampling (BM25-scope and BM25-var), the normalization parameter bs is 0.75. Note that the scope-aware sampling BM25-scope is identical to BM25, and BM25-sampled is identical to BM25 with document length normalization parameter \(\textit{b}=1\).

As mentioned, the goal of this paper is to establish a baseline for a multimodal IR system that involves all the given modalities and merges the scores generated by a unified model under the raw-score merging hypothesis. Hereby, we require all the modalities to be considered in the result list. We argue that in practice it is not possible, for many reasons, including e.g. regulatory ones, to simply ignore or discard entire modalities or parts of the document collection. For example, a book selling company might find that good ratings of books positively influence the purchase behavior of their customers, and thus the ratings have to be included in the search engine.

Building an effective multimodal IR system that integrates all modalities with little or no training remains a hard challenge. Wildly different characteristics and wildly different degrees of informativeness across the modalities mean that the average retrieval effectiveness, as evaluated through popular measures like MAP, may drop when integrating all modalities. We advise caution in overinterpreting such a result. Firstly, the average hides many meaningful changes in system behavior, and secondly, user perception will likely differ from the measured average improvement if a user realizes that parts of his query or of the documents are ignored. For the time being, a lower retrieval effectiveness of an experiment integrating all modalities versus an experiment discarding some modalities thus mainly serves to highlight how far we still are from finding the perfect recipe for multimodal retrieval, but not to point to a reduced system as a viable, practical alternative.

In the following experiments, we show the effectiveness of our multimodal baseline using the three derived versions of BM25 as the unified weighting scheme for all the modalities, merged under the raw-score merging hypothesis. In each of the following Tables 2, 3, 4 and 5, we compare two runs on the same collection. We underline any statistically significant differences in MAP relative to the first run, as determined by a paired randomization test (Smucker et al. 2007) at significance level \(\alpha = 5\%\). For the GeoCLEF 2008 collection, we removed the outlier query 79-GC to calculate the significance. In Appendix 2, we additionally show the same runs evaluated using the nDCG@10 measure. The conclusions drawn from the results using MAP are all supported by the results using nDCG@10.

7.2.1 Base performance of systems integrating non-overlapping modalities

We start our experiments by establishing the base performance of multimodal systems that integrate all non-overlapping modalities as built according to our guidelines.

Table 2 Retrieval results (MAP) for the runs with the textual modalities and the raw-score merging of both modalities for the SBS 2016 and the GeoCLEFmod 2008 collection using the three BM25 versions

To this end, Table 2 shows the MAP for the SBS 2016 and the GeoCLEFmod 2008 collection, both for the multimodal baseline (denoted “.raw”) and for the runs with the textual modalities alone (denoted “.text”). As a consequence of our discussion above, the “.text” run can only serve as a yardstick: it violates the rule that we want to integrate all modalities. Effectively, it gives us a “lower bound” of performance to compare to. For the SBS collection, the multimodal baseline achieves a significantly higher MAP than the textual run. For the GeoCLEFmod 2008 collection, the run with BM25 with full document length normalization (BM25 (\(b=1\))), which is identical to BM25-sampled, achieves a MAP in the range of the textual run. The BM25-scope and BM25-var runs with raw-score merging achieve a lower MAP than the run with text only.

7.2.2 Analysis of individual modalities

It is helpful to look further into the contributions of individual modalities to the overall result. Table 3 shows the retrieval effectiveness of each modality individually. Neither the geographical modality nor the ratings achieve the same retrieval effectiveness as the textual modality. This was expected for both, since intuitively the textual description of a book is more important than its ratings, and the textual content of a newspaper article is more important than the geographical locations mentioned in it.

Merging under the raw-score hypothesis suggests adding the scores of the different modalities into a single score without any weights. However, as shown in Table 2, even though we proved that the raw-score merging hypothesis is fulfilled w.r.t. the average document length as well as the variance of the document lengths (for BM25-var), the merged result list is only better than the textual run for the SBS task, not for the GeoCLEF task. We claim that this is because the method so far cannot properly capture the difference in informativeness of the modalities.

Table 3 Retrieval results (MAP) for the runs with the textual modalities and the non-textual modalities (geographical coordinates and ratings) for the SBS 2016 and the GeoCLEFmod 2008 collection using the three BM25 versions

7.2.3 Dealing with overlapping modalities

We next want to explore to what extent overlapping content across modalities has an impact on the overall effectiveness. Table 4 shows the MAP of the textual run and the multimodal baseline for the GeoCLEFmod 2008 task as well as the GeoCLEF 2008 task.

As expected, the textual modality in the GeoCLEFmod task achieves a lower MAP than the textual modality in the original GeoCLEF task. This is due to the deletion of the geographical information from the textual modality, as described in Sect. 7.1.1. The modalities in the GeoCLEFmod 2008 task therefore have no information overlap, while the modalities in the GeoCLEF 2008 task do contain overlapping information: all the information present in the geographical modality is also present in the textual modality. The experiments that merge the two modalities under the raw-score merging hypothesis show that without the information overlap between the modalities, the MAP of the merged run (“geoCLEFmod.text+geo.raw”) is within the range of the textual modality alone. However, when merging modalities with an information overlap (“geoCLEF.text+geo.raw”), the MAP drops significantly; it is much harder to merge the modalities so that only the “additional” contribution makes a beneficial impact.

Table 4 Retrieval results (MAP) for the runs with the textual modalities and the raw-score merging of both modalities for the GeoCLEFmod 2008 and the GeoCLEF 2008 collection using the three BM25 versions

7.2.4 Optimal merging potential due to training

We argue that much of the drop in retrieval effectiveness from the “.text” to the “.text+geo.raw” experiment is due to the inherent difficulty of appropriately merging the contributions of the individual modalities into the overall result. The closest method to raw-score merging that allows us to weight the contributions of the individual modalities is a linear combination of the scores. Therefore, we try to verify this assumption by comparing the multimodal baseline (“.raw”) with an approximate upper bound using a linear combination of the scores with trained weights (“.opt”) (see Table 5). The optimal weights are trained on the information available in the relevance assessments of the test collection. Clearly, this information is not available in practice. Furthermore, training the optimal weights on the same queries as were tested turns this into a retrospective evaluation. As the obtained result is merely a data point to compare our results to, we accept these limitations. For SBS, there is no significant difference between merging the modality scores under the raw-score hypothesis and merging using the optimal linear combination. However, for the GeoCLEFmod 2008 collection, merging the scores of the textual and the non-textual modalities using the optimal linear combination has a significantly higher MAP than merging under the raw-score merging hypothesis. Consider, however, that the opt-variants only serve as a yardstick: they can only be used when training data is available, which is often missing in practical applications and which was not the goal of this paper. The optimal run also shows that the usage of BM25 for the non-textual modalities not only leads to good results when merging under the raw-score merging hypothesis, but also when training optimal weights. The traditional BM25, which is identical to BM25-scope, already seems to be a good choice, since the variance adjustment does not lead to a significantly better result, neither for raw-score merging nor for the optimal linear combination of the scores.

Table 5 Retrieval results (MAP) for the runs with the raw-score merging of the modalities and the optimized linear combination of the modality scores for the SBS 2016 and the GeoCLEFmod 2008 collection using the three BM25 versions

To get more context for judging the performance of our “.raw” runs, we have also explored the use of reciprocal rank fusion (Cormack et al. 2009), another well-known unsupervised fusion method. These runs are denoted “.rcpr” in Table 5, where we underline the runs that are significantly different from the “.raw” runs. For the SBS collection, reciprocal rank fusion leads to a significantly lower MAP for all BM25 variants. However, for the GeoCLEFmod 2008 collection, the MAP is in the same range as the raw-score merging run with BM25-sampled, and significantly better with BM25-scope and BM25-var, although still significantly lower than the optimal linear combination (“.opt”).
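For reference, reciprocal rank fusion scores each document by summing reciprocal ranks over the input rankings; a minimal sketch follows (k = 60 is the constant suggested by Cormack et al. 2009; the function name is ours).

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """RRF: score(d) = sum over the input rankings of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d3", "d1"]]))
# -> ['d2', 'd1', 'd3']
```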

7.2.5 Summary of results

We can summarize the results of our experiments with the following questions.

  1. Can we produce a multimodal baseline with an effectiveness in the range of the textual run? Yes, we find better retrieval effectiveness for the SBS collection and retrieval effectiveness in the same range (within statistical significance) for the GeoCLEF collection without overlapping modalities.

  2. Do modalities differ with respect to their contribution to relevance? Yes, in both collections the contribution by the textual modality is by far the greatest, thus turning the “.text” yardstick into a challenging lower bound.

  3. Does it matter that modalities have overlapping information? Yes, it is much harder to merge individual contributions by modalities in case they are overlapping.

  4. Is it possible to get competitive performance without training? Yes and no. We have found competitive performance in the case of the SBS collection, where we have no overlapping modalities. We are still a long way from matching the performance of the opt-variant on the GeoCLEF collection, however.

8 Conclusions

In this paper, we demonstrated best practices for the integration of many modalities into an IR application without the use of training data. We claimed that in complex multimodal collections with a large number of diverse modalities, it becomes crucial to treat the modalities with a unified model, due to the quickly increasing complexity. We started by analyzing the requirements for such a unified model and showed that BM25 is a suitable weighting scheme for this purpose and for merging the modalities under the raw-score merging hypothesis. We proposed an adaptation of the BM25 weighting scheme for the two non-textual modalities, ratings and geographical coordinates, and established a multimodal baseline that uses all the modalities and merges them under the raw-score merging hypothesis without any training.

In order to show the suitability of BM25 scores for merging under the raw-score merging hypothesis, a sampling-based approach for BM25 was introduced to deal with the different collection statistics, in particular the average document length and the variance of the document lengths of the modalities. We proved that applying BM25 with full document length normalization (\(\textit{b}=1\)) to all modalities already ensures that the raw-score merging hypothesis w.r.t. the average document lengths and the variance of document lengths is fulfilled, since it is identical to the sampling approach. Analogously, we proved that the raw-score merging hypothesis w.r.t. the average document length also holds for BM25 with a general document length normalization parameter \(b\ne 1\), however not w.r.t. the variance of document lengths. Our experiments show that adhering to the raw-score merging hypothesis is indeed beneficial.

In our experiments, we established a multimodal baseline that involves all the given modalities and merges the scores generated by a unified model under the raw-score merging hypothesis. We showed that by following our approach the multimodal baseline reaches a significantly better retrieval effectiveness than the textual run for the SBS collection and lies within the same range (within statistical significance) for the GeoCLEF 2008 collection without overlapping modalities. Further, we analyzed the contribution of the individual modalities to relevance and found that the contribution of the textual modalities is the greatest. Also, we saw in the experiments that dealing with modalities with overlapping information is a hard problem. Finally, we found similar performance of our multimodal baseline when comparing it to a trained linear combination of the scores in case of the SBS collection, which we consider to be very encouraging.

The multimodal baseline presented in this paper merges the modality scores under the raw-score merging hypothesis and therefore assumes that each modality is equally important for the overall relevance of a document. However, in the experiments we saw that there are wildly different degrees of informativeness across the modalities. As a next step towards best practices for multimodal IR systems, we will investigate extending the proposed methods to incorporate the informativeness of the different modalities without the use of any training data.