Trans-Media Pseudo-Relevance Feedback Methods in Multimedia Retrieval


Stephane Clinchant, Jean-Michel Renders, and Gabriela Csurka
Xerox Research Centre Europe, 6 ch. de Maupertuis, Meylan, France
FirstName.LastName@xrce.xerox.com

Abstract. We present transmedia similarity measures that we recently designed by adopting intermediate-level fusion approaches. The main idea is to use principles from pseudo-relevance feedback and, more specifically, transmedia pseudo-relevance feedback to enrich the mono-media representation of an object with features coming from the other medium. One issue that arises with such a strategy is how to compute the mono-media similarity between an aggregate of objects coming from a first (pseudo-)feedback step and one single multimodal object. We propose two alternative ways of addressing this issue, which result in what we call the transmedia document reranking and complementary feedback methods, respectively. For the ImageCLEF - Photo Retrieval Task, mono-media retrieval performance appears to be more or less equivalent for pure image and pure text content (around 20% MAP). Using our transmedia pseudo-feedback-based similarity measures allowed us to dramatically increase the performance by 50% (relative). From a cross-lingual perspective, the use of domain-specific, corpus-adapted probabilistic dictionaries seems to offer better results than the use of a broader, more general standard dictionary. With respect to the monolingual baselines, multilingual runs show a slight degradation of retrieval performance (6 to 10% relative).

Keywords: hybrid retrieval, trans-media relevance feedback.

1 Introduction and Related Works

Up to now, many standard methods for defining efficient trans-modal similarity measures and for bridging the associated semantic gap have used late fusion strategies.
Basically, they rely on mono-modal analysis of multifaceted objects, computing mono-modal similarities independently and then merging them with a simple aggregation operator. In contrast to these strategies, we propose here two intermediate-level fusion approaches. Both approaches, similarly to [1,2,3], are based on Transmedia Relevance Pseudo-Feedback, i.e. they are mixed-modality extensions of Relevance Models in which the modality of the data is switched during the (pseudo-)feedback process, from image to text or from text to image.

C. Peters et al. (Eds.): CLEF 2007, LNCS 5152, pp , © Springer-Verlag Berlin Heidelberg 2008

Our first approach, called Complementary Feedback (Section 3.1), is similar to the approach suggested by [3]. However, while [3] uses classical text-based feedback algorithms (like Rocchio), we use a pseudo-feedback method from the language modelling approach to information retrieval, namely the mixture model method of Zhai and Lafferty [4], originally designed to enrich textual queries. In our second approach, called Transmedia Document Reranking (Section 3.2), we neither extract a new query nor enrich an existing one. Instead, this approach uses the similarity computed in the other mode as a component of feedback in order to rerank documents. It is therefore a one-step retrieval, unlike the first approach (and the related works). This method is quite general, since it can be applied to any textual or visual similarities or, equivalently, with any mono-modal (textual or visual) retrieval engine. This is not the case for the other methods: [1], for instance, is based on a specific similarity model for both texts and images. Moreover, as the alternative methods require a second retrieval step, the choice of a particular text feedback method depends implicitly on the underlying text retrieval model. Our method is free from such dependencies, since it works on similarities as basic components. Even if both approaches appear simpler than most alternative state-of-the-art approaches, they turned out to give superior results in the ImageCLEF Photo Retrieval track [5].

2 Monomedia Similarities

2.1 Cross-Entropy between Texts

Starting from a traditional bag-of-words representation of pre-processed texts (here, preprocessing includes tokenization, lemmatization, word decompounding and standard stopword removal), we adopt the language modelling approach to information retrieval and use the (asymmetric) cross-entropy function as the similarity.
Particular details of this textual similarity measure are given in [6].

2.2 Fisher Vectors for Images

To compute the similarity between images I and J, we simply use the L1-norm of the difference between their Fisher vectors (the normalised gradient vector of the corresponding generative model, with unit L1-norm; see details in [6,7]).

3 Cross-Media Similarities Based on Transmedia Relevance Feedback

The main idea is the following: for a given image i, consider as new features the (textual) terms of the texts associated with the most similar images (from a purely visual viewpoint). We denote this neighbouring set by $N_{img}(i)$. Its size is fixed a priori: it is typically the top-n objects returned from a retrieval system

(CBIR) or the N nearest neighbours according to some predefined visual similarity measure. Then, we can compute a new similarity with respect to any multimodal object j of the collection O as the textual similarity of this new representation of the image i with the textual part of j. There are three families of approaches to compute the mono-media similarity between an aggregate of objects $N_{img}(i)$ and one single multimodal object:

1. aggregating $N_{img}(i)$ to form a single object (typically by concatenation) and then computing a standard similarity between two objects;
2. using a pseudo-feedback algorithm (for instance Rocchio's algorithm) to extract relevant, meaningful features of the aggregate and then using a mono-media similarity;
3. aggregating all similarity measures (assuming we can) between all possible pairs of objects.

Methods of families 1 and 2 therefore involve the creation of a new single object (in this case, a text) and a new retrieval step (this time using a text retrieval system). The third family does not.

3.1 Complementary Pseudo-Feedback

This approach, like the work presented in [3], belongs to the second family of aggregation strategies mentioned in the previous section but, unlike [3], our method uses the Language Modelling framework to realize the pseudo-feedback. Recall that the fundamental problem in transmedia feedback is to define how to compute the mono-media similarity between an aggregate of objects $N_{img}(i)$ (or $N_{txt}(i)$) and one single multimodal object. The main idea here is to consider the set $N_{img}(i)$ as the relevance concept F and derive its corresponding language model (LM) $\theta_F$. Afterwards, we can use the cross-entropy function between $\theta_F$ and the LM of the textual part of any object j in O as the new transmedia similarity.
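As a rough illustration of this scoring step, the Python sketch below computes a cross-entropy similarity between a feedback language model and the text of a candidate object. It rests on assumptions that are not taken from the paper: $\theta_F$ is estimated by plain maximum likelihood over the concatenated neighbour texts (a simple stand-in, not the mixture-model feedback of [4]), document models are smoothed with Jelinek-Mercer interpolation using an arbitrary weight `lam`, and all function names and toy texts are hypothetical.

```python
import math
from collections import Counter


def language_model(tokens):
    """Maximum-likelihood unigram model: maps each term to its relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def smoothed_prob(term, doc_lm, corpus_lm, lam=0.5, floor=1e-12):
    # Jelinek-Mercer smoothing (an assumption of this sketch): mix the document
    # model with a corpus model so that unseen terms keep a non-zero probability.
    return (1.0 - lam) * doc_lm.get(term, 0.0) + lam * corpus_lm.get(term, floor)


def cross_entropy_similarity(theta_f, doc_lm, corpus_lm, lam=0.5):
    """Return sum_w p(w | theta_F) * log p(w | document).

    This is the negated cross-entropy, so higher values mean more similar.
    """
    return sum(
        p * math.log(smoothed_prob(term, doc_lm, corpus_lm, lam))
        for term, p in theta_f.items()
        if p > 0.0
    )


# Toy data (hypothetical): texts of the visual neighbours N_img(i) and two candidates.
neighbour_texts = [["church", "tower"], ["church", "square"]]
doc_a = ["church", "tower", "city"]
doc_b = ["beach", "sea", "sand"]

all_tokens = [t for text in neighbour_texts + [doc_a, doc_b] for t in text]
corpus_lm = language_model(all_tokens)
# Stand-in for theta_F: maximum likelihood over the concatenated neighbour texts.
theta_f = language_model([t for text in neighbour_texts for t in text])

score_a = cross_entropy_similarity(theta_f, language_model(doc_a), corpus_lm)
score_b = cross_entropy_similarity(theta_f, language_model(doc_b), corpus_lm)
print(score_a > score_b)
```

In this toy setting the candidate that shares vocabulary with the neighbour texts receives the higher score, which is the behaviour the transmedia similarity is meant to exploit.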
We adopt the framework given by the mixture model method of Zhai and Lafferty [4] (originally designed to enrich textual queries) to derive the LM associated with F. See [6] for practical details. Once $\theta_F$ has been estimated, a new query LM can be obtained through interpolation:

$$\theta_{\text{new query}} = \alpha\,\theta_{\text{old query}} + (1-\alpha)\,\theta_F \qquad (1)$$

where $\theta_{\text{old query}}$ corresponds to the LM of the textual part of the query i. In a nearly dual way, starting from the textual part of the query, a similar scheme using $N_{txt}(i)$ can be adopted to derive a new visual representation (actually some generalized Fisher vectors) of the relevance concept, this time relying on Rocchio's method, which is better suited to continuous feature representations.

3.2 Transmedia Document Reranking

Unlike Complementary Feedback, the Transmedia Document Reranking approach belongs to the third family of aggregation strategies mentioned in Section 3.

The main idea is to define a new cross-media similarity measure by aggregating all similarity measures (assuming we can) between all possible pairs of objects retrieved by the Transmedia Relevance Feedback. More formally, if we denote by $T(u)$ the text associated with multimodal object u and by $\hat{T}(i)$ the new textual representation of image i, then the new cross-media similarity measure with respect to the multimodal object j is:

$$\mathrm{sim}_{ImgTxt}(i,j) = \mathrm{sim}_{txt}(\hat{T}(i), T(j)) = \sum_{d \in N_{img}(i)} \mathrm{sim}_{txt}(T(d), T(j)) \qquad (2)$$

where $\mathrm{sim}_{txt}$ is any textual similarity measure; in a particular embodiment, we propose to use the cross-entropy function (e.g. the one based on Language Modelling, even if it is asymmetric), which appears to be one of the most effective measures in purely textual information retrieval systems. This method can be seen as a reranking method. Suppose that q is some image query; if $T(d)$ is the text of an image belonging to the initial feedback set $N_{img}(q)$, then the rank of the textual neighbours of $T(d)$ will be increased, even if they are not so similar from a purely visual viewpoint. In particular, this makes it possible to define a similarity between a pure image query and a simple textual object without visual counterpart. By duality, we can define another cross-media similarity measure: for a given text i, we consider as new features the Fisher vectors of the images associated with the most similar texts (from a purely textual viewpoint) in the multimodal database. We denote this neighbouring set by $N_{txt}(i)$.
If we denote by $I(u)$ the image associated with multimodal object u and by $\hat{I}(i)$ the new visual representation of text i, then the new cross-media similarity measure is:

$$\mathrm{sim}_{TxtImg}(i,j) = \mathrm{sim}_{img}(\hat{I}(i), I(j)) = \sum_{d \in N_{txt}(i)} \mathrm{sim}_{img}(I(d), I(j)) \qquad (3)$$

Finally, we can combine all the similarities to define a global similarity measure between two multimodal objects i and j, for instance using a linear combination:

$$\mathrm{sim}_{glob}(i,j) = \lambda_1\,\mathrm{sim}_{txt}(T(i), T(j)) + \lambda_2\,\mathrm{sim}_{img}(I(i), I(j)) + \lambda_3\,\mathrm{sim}_{ImgTxt}(i,j) + \lambda_4\,\mathrm{sim}_{TxtImg}(i,j)$$

In one embodiment, we use a simple weighted average of these similarities and optimize the weights using a labelled/annotated training set. The main advantage of this method is that, using an aggregation strategy that belongs to family 3, it does not require any further retrieval step. Furthermore, it exploits all trans-modal paths (TXT-TXT, TXT-IMG, IMG-TXT and IMG-IMG) and combines them. Finally, we can pre-compute the mono-modal similarities (textual and visual) between all pairs of objects in the multimedia reference repository, as these computations are independent of the objects in the run-time application; once stored, these values can be re-injected into the cross-media similarity equations at run-time, greatly reducing the computation time.
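The aggregation in Eqs. (2) and (3) and the linear combination above can be sketched in a few lines of Python on top of pre-computed similarity tables. This is only an illustrative sketch under stated assumptions: the tables `S_img` and `S_txt` are hypothetical toy values in which larger means more similar (a distance such as the L1-norm of Section 2.2 would first have to be turned into a similarity), the query is itself an object of the collection and is excluded from its own neighbour set, and the equal weights are placeholders rather than the weights tuned on training data.

```python
def top_n(sim_row, n, exclude=None):
    """Indices of the n most similar objects in one row of a similarity table."""
    candidates = [k for k in range(len(sim_row)) if k != exclude]
    return sorted(candidates, key=lambda k: sim_row[k], reverse=True)[:n]


def sim_img_txt(i, j, S_img, S_txt, n):
    """Eq. (2): sum of sim_txt(T(d), T(j)) over the visual neighbours d of i."""
    return sum(S_txt[d][j] for d in top_n(S_img[i], n, exclude=i))


def sim_txt_img(i, j, S_img, S_txt, n):
    """Eq. (3): sum of sim_img(I(d), I(j)) over the textual neighbours d of i."""
    return sum(S_img[d][j] for d in top_n(S_txt[i], n, exclude=i))


def sim_glob(i, j, S_img, S_txt, n, weights=(0.25, 0.25, 0.25, 0.25)):
    """Linear combination of the four similarities (placeholder weights)."""
    l1, l2, l3, l4 = weights
    return (
        l1 * S_txt[i][j]
        + l2 * S_img[i][j]
        + l3 * sim_img_txt(i, j, S_img, S_txt, n)
        + l4 * sim_txt_img(i, j, S_img, S_txt, n)
    )


# Hypothetical pre-computed similarities for four multimodal objects.
# Object 0 looks like object 1, and the text of object 1 resembles that of object 3.
S_img = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.3],
    [0.0, 0.1, 0.3, 1.0],
]
S_txt = [
    [1.0, 0.2, 0.1, 0.0],
    [0.2, 1.0, 0.1, 0.8],
    [0.1, 0.1, 1.0, 0.2],
    [0.0, 0.8, 0.2, 1.0],
]

ranking = sorted([1, 2, 3], key=lambda j: sim_img_txt(0, j, S_img, S_txt, n=1), reverse=True)
print(ranking)
```

With a single visual neighbour, object 3 is ranked above object 2 for query 0 even though its direct visual similarity to the query is lower, which is the reranking effect described in the text. Because only table lookups are involved at query time, the sketch also reflects why pre-computing the mono-modal similarities keeps the run-time cost low.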

3.3 Experimental Results

Table 1 shows the names of our ImageCLEF runs and the corresponding mean average precision measures. For a description of the task, the corpus and the queries, refer to [5].

Table 1. Official Runs

Run   Txt   Img   CF1   CF2   CF3   TR1   TR2   TR3
MAP

Below is a detailed description of all the methods we used for the runs:

Txt: This run is a pure text run: documents were preprocessed and each document was enriched using the Flickr database. For each term of a document, its top 20 related tags from Flickr were added to the document (see details in [6]). Then, a unigram language model is estimated for each document, giving more weight to the original document terms. An additional step of pseudo-relevance feedback using the method explained in [4] is then performed.
Img: This run is a pure image run: it uses the Fisher Kernel metric to define the image similarity. As a query encompasses 3 visual sub-queries, we have to combine the similarity scores with respect to these 3 sub-queries. To this end, the result lists from the image sub-queries are renormalized (by subtracting the mean and dividing by the standard deviation) and merged by simple sum.
CF1: This run uses both texts and images: it starts from query images only, to determine the relevance set $N_{img}(i)$ for each query i, and then implements the complementary (intermedia) feedback described in Section 3.1. The size of the neighbouring set is 15. Referring to the notation of Section 3.1, the value of α is 0.5.
CF2: This run works on the same principle as the previous run CF1. The main difference is that the (target) English documents have been enriched with Flickr and that the initial query in German was translated by multiplying its language model by the probabilistic translation matrix extracted from the (small) parallel part of the corpus. Otherwise, it uses the same parameters as CF1.
CF3: This run uses the same process as in CF1, except that it uses English queries to search German annotations. English queries are translated with the probabilistic translation matrix extracted from the (small) parallel part of the corpus, and the translated queries follow the same process as in CF1 but with different parameters: the size of the neighbouring set is 10 and the value of α is 0.7.
TR1: This run uses both texts and images: it starts from query images only, to determine $N_{img}(i)$ for each query i (as in the previous runs), and then implements the Transmedia Reranking method described in Section 3.2. The size of the neighbouring set is 5.

TR2: This is basically the same algorithm as the preceding run TR1, except that the textual part of the data (annotations) is enriched with Flickr tags.
TR3: This run uses the TR algorithm as in TR1, but we merge the result lists from TR1 and from the purely textual queries (Txt) by summing the relevance scores after normalisation (subtracting the mean and dividing by the standard deviation for each list).

3.4 Topic-Based Analysis of Results

In order to better understand the possible correlations between the different methods and/or the systematic superiority of some of them, Figure 1 compares the Average Precisions for each pair of methods and for each topic. The methods are: text-only (TXT), image-only (IMG), our best Complementary Feedback (CF) and our best Transmedia Reranking (TR) approaches.

Fig. 1. Average Precision values per topic, for six pairs of methods

Fig. 2. Left column: query images from topics for which the retrieval worked best: (a) 22 tennis player during rally, (b) 55 drawings in Peruvian desert and (c) 51 photos of goddaughters from Brazil. Right column: query images from topics for which the retrieval worked worst: (d) 9 tourist accommodation near Lake Titicaca, (e) 10 destinations in Venezuela and (f) 39 people in bad weather.

Fig. 3. Left column: query images from topics with the best hybrid combinations: (a) 21 accommodation provided by host families, (b) 44 mountains in mainland Australia and (c) 48 vehicle in South Korea. Right column: query images from topics with the worst hybrid combinations: (d) 38 Machu Picchu and Huayna Picchu in bad weather, (e) 11 black and white photos from Russia and (f) 3 religious statue in the foreground.

A deeper analysis of the individual topics leads to the following conclusions:

From a purely visual aspect, search performance is better when the example images of the query are similar to one another; search results degrade in the opposite case. See examples in Figure 2.
The combination of text and image works better if the text query is complementary to the visual information (see for instance the left column of Figure 3).
The combination does not perform well when either one of the media works very badly, especially the image, which is not surprising as the images were used for transmedia pseudo-relevance feedback (e.g. topics 3 and 32).
There were also examples in which multimedia retrieval performance was poor, while individual mono-media retrieval worked reasonably well (e.g. right column of Figure 3). The reason might be that the retrieved images were incorrectly reranked based on their textual similarity with the query text. For example, for topic 38, non-relevant images of Machu Picchu and Huayna Picchu (non-relevant because they do not show bad weather conditions) got a better ranking,

with the effect of decreasing the precision (P20 falling from 0.7 to 0.21 for TR and to 0.3 for CF).

4 Conclusion

With a lightly annotated corpus of images, also characterised by an abstraction level in the textual description that is significantly different from the one used in the queries, it appears that mono-media retrieval performance is more or less equivalent for pure image and pure text content (around 20% MAP). Using our transmedia pseudo-feedback-based similarity measures allowed us to dramatically increase the performance by 50% (relative). Trying to model the textual relevance concept present in the top-ranked documents issued from a first (purely visual) retrieval and combining this with the textual part of the original query turns out to be the best strategy, being slightly superior to our transmedia document reranking method. From a cross-lingual perspective, the use of domain-specific, corpus-adapted probabilistic dictionaries seems to offer better results than the use of a broader, more general standard dictionary. With respect to the monolingual baseline, multilingual runs show a slight degradation of retrieval performance (6 to 10% relative).

Acknowledgments. This work was partly funded by the French Government under the Infomagic project, part of the Pole CAP DIGITAL (IMVN) de Paris, Ile-de-France. The authors also want to thank Florent Perronnin for his greatly appreciated help in applying some of the Generic Visual Categorizer components in our experiments.

References

1. Lavrenko, V., Manmatha, R., Jeon, J.: A model for learning the semantics of pictures. In: NIPS (2003)
2. Chang, Y.C., Chen, H.H.: Approaches of using a word-image ontology and an annotated image corpus as intermedia for cross-language image retrieval. In: Working Notes of the CLEF Workshop, Alicante, Spain (2006)
3. Maillot, N., Chevallet, J.-P., Valea, V., Lim, J.H.: IPAL inter-media pseudo-relevance feedback approach to ImageCLEF 2006 photo retrieval. In: Working Notes of the CLEF Workshop, Alicante, Spain (2006)
4. Zhai, C., Lafferty, J.D.: Model-based feedback in the language modeling approach to information retrieval. In: CIKM (2001)
5. Grubinger, M., Clough, P., Hanbury, A., Müller, H.: Overview of the ImageCLEFphoto 2007 photographic retrieval task. In: Working Notes of the CLEF Workshop, Budapest, Hungary (2007)
6. Clinchant, S., Renders, J.-M., Csurka, G.: XRCE's participation to ImageCLEF. In: Working Notes of the CLEF Workshop, Budapest, Hungary (2007)
7. Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: CVPR (2007)
