Multimodal Information Spaces for Content-based Image Retrieval

Research Proposal

Abstract

Image retrieval by content is currently a research problem of great interest in academia and industry, owing to the large collections of images available in different contexts. One of the main challenges in developing effective image retrieval systems is the automatic identification of semantic image contents. This research proposal aims to design an image retrieval model that takes advantage of different data sources, i.e. multimodal information, to improve the response of an image retrieval system. In particular, two data modalities associated with the contents and context of images are considered: visual features and unstructured text annotations. The proposed framework is based on kernel methods, which provide two important advantages over traditional multimodal approaches: first, the structure of each modality is preserved in a high-dimensional feature space, and second, they provide natural ways to fuse feature spaces into a single information space. This document presents the research agenda to build a Multimodal Information Space for searching images by content.

Presented by: Juan Carlos Caicedo Rueda
Research Advisor: Prof. Fabio A. González O., Ph.D.
Area: Computer Science
Research Fields: Information Retrieval and Machine Learning

1 INTRODUCTION

Content-Based Image Retrieval (CBIR) is an active research discipline focused on computational strategies to search for relevant images based on visual content analysis. In this proposal, multimodal analysis is considered for developing CBIR systems, especially for image collections in which some text is associated with the images.

Multimodality in Information Retrieval sometimes refers to the interaction mechanisms and devices used to query the system. From the Multimedia Information Retrieval perspective, however, multimodality refers to methods that take advantage of different data modalities to provide access to a digital library or a multimedia collection [1, 2]. Different data modalities in multimedia are used to better understand document contents, including textual annotations, audio, images and video. In this proposal, multimodal will refer to the ability to represent, process and analyze two data modalities simultaneously: unstructured texts and images.

This document proposes the study of multimodal information retrieval systems. In particular, it proposes the design of computational strategies that exploit multimodal interactions between image contents and text descriptions to improve the response of an image retrieval system. In addition, the evaluation of different query paradigms is proposed, including query by example, keyword-based queries and multimodal queries.

A unified framework is proposed to manage data representation, search algorithms and query resolution, based on the study and evaluation of kernel methods to generate Multimodal Information Spaces. How kernel methods can be adapted to address the problems of a multimodal information retrieval system is one of the main questions of this research. This proposal aims to address both practical and theoretical aspects of a multimodal information representation for image retrieval systems.

Kernel methods provide foundations for including structure in the data representation and for combining heterogeneous data sources. Kernel methods for pattern analysis have been studied to design machine learning algorithms, and have been widely used for non-vectorial data such as strings, trees and graphs, among others [3]. Adapting such a framework for information retrieval, and especially for multimodal information retrieval, may lead to more effective systems and may also contribute to the understanding of the relationships between information retrieval and machine learning.

2 OBJECTIVES

Main goal: To design and evaluate a model to build multimodal information spaces for content-based image retrieval.

Specific goals:
- To define strategies to extract and represent visual and text contents separately using kernel functions.
- To propose a method for combining visual and text kernels to represent image contents together with text semantics in a multimodal information space.
- To design a ranking algorithm to search for images using different query paradigms in the multimodal information space induced by kernels.
- To evaluate the performance of the system using standard information retrieval measures.

3 PROPOSED RESEARCH

The proposed research focuses on strategies for early multimodal data fusion to model interactions between different data modalities. A multimodal IR system under that approach has three main associated issues: content representation of each modality, information fusion and multimodal retrieval algorithms.

The construction of a Multimodal Information Space for content-based image retrieval is proposed using kernel methods. Kernel methods have had a great impact on machine learning and pattern recognition, since they provide effective algorithms and strong theoretical properties. Taking advantage of these properties for image retrieval is the main problem of this research. The following subsections outline the main phases considered to tackle the problems of how to represent image and text document contents, how to address the fusion and combination problem using kernels, and how to solve queries in a Multimodal Information Space.

3.1 Phase 1: Content Representation

Content representation involves the analysis and extraction of information from each modality separately. The processing of image contents and text documents is the main task in this step. It filters out non-useful data and captures the most discriminative content, as is usually done in information retrieval systems.

3.1.1 Image Content Representation

Different feature sets will be considered to represent image contents. Color, texture and edge histograms will be used to characterize global image properties, and the bag-of-words approach will be considered to model local features [4]. Since all these features are histograms, similarity measures for histograms may be applied to evaluate feature likeness. The histogram intersection is proposed as the similarity measure, since it satisfies the properties required of a kernel function.
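As a minimal sketch of the histogram intersection mentioned above, the following Python fragment computes the sum of bin-wise minima of two normalized histograms. The function names and the 4-bin toy histograms are illustrative assumptions, not part of the proposal.

```python
def normalize(histogram):
    """Scale a histogram of non-negative counts so that its bins sum to 1."""
    total = sum(histogram)
    return [value / total for value in histogram] if total > 0 else list(histogram)


def histogram_intersection(h1, h2):
    """Histogram intersection: the sum of bin-wise minima of two histograms."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical 4-bin colour histograms for two images.
image_a = normalize([4, 2, 2, 0])  # [0.5, 0.25, 0.25, 0.0]
image_b = normalize([2, 2, 0, 4])  # [0.25, 0.25, 0.0, 0.5]

k_ab = histogram_intersection(image_a, image_b)  # 0.25 + 0.25 + 0 + 0 = 0.5
k_aa = histogram_intersection(image_a, image_a)  # 1.0 for a normalized histogram
```

With normalized histograms the value lies between 0 (no overlap) and 1 (identical histograms), so it can be read directly as a similarity score.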

The combination of all available visual features will be studied to obtain a unified image representation. A strategy to combine different visual features has already been explored by the proponent [5], in which an optimal linear combination of features is obtained using kernel methods. In this research proposal, different visual features may also be treated as additional data modalities that have to be combined with text features. In that case, the combination of visual features may follow the strategies defined in Subsection 3.2, Information Fusion.

3.1.2 Text Content Representation

The text associated with images will be represented using a Vector Space Model, as is usually done in text information retrieval [6]. Standard operations will be applied to remove stop-words (usually conjunctions or articles that carry little semantic meaning, such as and, or, the, to) and to perform stemming (finding the stem of each word, such as work in working), so that a text vocabulary is identified in the text collection. This leads to a sparse representation for unstructured texts, since each text contains only a small set of words; the representation is also a histogram that counts word occurrences. Additional adaptations of this vector space model will be considered, such as inverse document frequency weighting to give more weight to the more discriminative words in the collection. The kernel function used on this representation will be the cosine similarity.

3.1.3 Deliverables

A set of programs to automatically process an image collection with its associated texts. A conference paper will be written with an evaluation of image retrieval using text data and image features independently, using a collection of medical images.
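The text representation of Section 3.1.2 can be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny hypothetical one, stemming is omitted for brevity, the idf weight is taken as log(N / df), and the three captions are invented examples.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"and", "or", "the", "to", "of", "a", "in"}


def tokenize(text):
    """Lower-case, split on whitespace and drop stop-words (no stemming here)."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def tfidf_vectors(texts):
    """Return one sparse {term: tf * idf} dictionary per text."""
    counts = [Counter(tokenize(text)) for text in texts]
    n_docs = len(counts)
    doc_freq = Counter(term for doc in counts for term in doc)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in doc.items()} for doc in counts]


def cosine_kernel(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


captions = [
    "chest x-ray of the lung",
    "x-ray of the hand",
    "microscopy image of lung tissue",
]
vectors = tfidf_vectors(captions)
k_shared = cosine_kernel(vectors[0], vectors[1])    # both captions mention "x-ray"
k_disjoint = cosine_kernel(vectors[1], vectors[2])  # no terms in common
```

Captions that share no vocabulary get a kernel value of exactly zero, which is one reason the later fusion and latent semantic steps are of interest for short annotations.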

3.2 Phase 2: Information Fusion

The information fusion step, a particular aspect of Multimodal Information Retrieval systems, leads to the design of methods to find and represent the relationships between both modalities. How to discover the most meaningful associations between images and text, and how to complete missing data or unclear relationships, are the main problems in this step. In this research, the design of early fusion methods is proposed. At the end of this step, a new document representation is obtained containing both visual and textual information.

3.2.1 Low-level Kernel construction

To fuse visual and text properties, operations and algorithms on both content representations will be evaluated. Given two kernel functions, one for each data modality, a set of operations on these kernel functions may be applied to combine information. Operations such as addition, multiplication and composition may be used to rescale and modify the geometry of a new combined feature space. For example, a linear combination of two kernel functions, one textual and one visual, will lead to a new feature space that shares all the information from each original feature space. The resulting feature space depends on the coefficients of the linear combination, which may be chosen in an optimal fashion [5]. More complex combinations may be modeled using different kernel function operations.

Other kernel-based strategies for information fusion will be studied. For instance, an ANOVA kernel may be designed to evaluate the joint occurrence of different feature sets, i.e. visual patterns and text words. Following the same family of algorithms, a graph kernel may be designed to model interactions between visual and textual features, or a convolution kernel to evaluate structure and contents [7]. All these approaches lead to a Multimodal Information Space in which the data of visual and text contents is represented.
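The kernel operations of Section 3.2.1 can be illustrated on precomputed kernel matrices. In the sketch below, the two 3x3 matrices are hypothetical values for three documents, and the weight alpha is fixed by hand purely for illustration; the proposal foresees choosing the coefficients in an optimal fashion [5], which is not shown here.

```python
def combine_linear(k_visual, k_text, alpha):
    """Convex combination alpha * K_visual + (1 - alpha) * K_text, entry by entry."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * v + (1.0 - alpha) * t for v, t in zip(row_v, row_t)]
        for row_v, row_t in zip(k_visual, k_text)
    ]


def combine_product(k_visual, k_text):
    """Element-wise product of two kernel matrices."""
    return [
        [v * t for v, t in zip(row_v, row_t)]
        for row_v, row_t in zip(k_visual, k_text)
    ]


# Hypothetical kernel matrices for the same three documents.
K_VISUAL = [[1.0, 0.5, 0.2],
            [0.5, 1.0, 0.4],
            [0.2, 0.4, 1.0]]
K_TEXT = [[1.0, 0.0, 0.6],
          [0.0, 1.0, 0.0],
          [0.6, 0.0, 1.0]]

k_sum = combine_linear(K_VISUAL, K_TEXT, alpha=0.5)
k_prod = combine_product(K_VISUAL, K_TEXT)
```

Both results are again valid kernel matrices, because sums with non-negative coefficients and element-wise products of positive semi-definite matrices remain positive semi-definite. The sum keeps a pair similar when either modality says so, whereas the product requires agreement in both: documents 0 and 1 look alike visually (0.5) but share no text, so the product assigns them 0.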

3.2.2 Semantic Kernel construction

Once the Multimodal Information Space has been obtained by operating on basic low-level data, it can be enhanced using pattern analysis algorithms to discover relationships and complex interactions between features and objects. Latent Semantic Analysis will be considered to re-embed multimodal information in a space in which different features are fused into a set of representative latent concepts. This approach may be applied to a kernel-based Multimodal Information Space, since the dual algorithm for Latent Semantic Analysis has been developed by the machine learning community [8].

Other families of pattern analysis algorithms may be applied to the kernel-generated space to analyze multimodal interactions. For instance, Canonical Correlation Analysis between visual and text contents may be applied using two different kernel functions, each related to one data modality.

3.2.3 Deliverables

Design and definition of the proposed algorithms. A software implementation of the algorithms to generate kernel functions will be released. Two conference papers will be written describing the low-level construction of kernel functions. A journal paper will be written considering the evaluation of these kernels for semantic analysis and multimodal retrieval.
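As a rough illustration of the dual formulation referred to in Section 3.2.2, the sketch below extracts only the first latent concept of a kernel matrix. It uses power iteration as a simple stand-in for a full eigendecomposition or SVD, applies no centering, and works on a hypothetical 3x3 kernel matrix; all names are assumptions made for this example.

```python
import math


def leading_eigenpair(kernel_matrix, iterations=200):
    """Power iteration: approximate the largest eigenvalue and its unit eigenvector."""
    n = len(kernel_matrix)
    vector = [1.0 / math.sqrt(n)] * n
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [
            sum(kernel_matrix[i][j] * vector[j] for j in range(n)) for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vector = [x / norm for x in product]
        eigenvalue = norm
    return eigenvalue, vector


def project(kernel_values, eigenvalue, eigenvector):
    """Coordinate of a document on the latent concept, computed only from its
    kernel values against the training documents: (v . k) / sqrt(lambda)."""
    dot = sum(v * k for v, k in zip(eigenvector, kernel_values))
    return dot / math.sqrt(eigenvalue)


# Hypothetical fused kernel matrix: documents 0 and 1 are related, 2 stands apart.
K = [[2.0, 1.0, 0.0],
     [1.0, 2.0, 0.0],
     [0.0, 0.0, 1.0]]

top_value, top_vector = leading_eigenpair(K)

# Two new documents described only by their kernel values against the three above.
related = project([1.0, 1.0, 0.0], top_value, top_vector)
unrelated = project([0.0, 0.0, 1.0], top_value, top_vector)
```

The point of the dual form is that a new document is embedded using kernel evaluations alone, without ever building explicit feature vectors. Further latent concepts would require deflation or a full eigen-solver.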

3.3 Phase 3: Information Retrieval

Multimodal retrieval algorithms on the fused representation will be designed to identify the most relevant results for the user. The main research questions in this step concern the query representation and how to solve unimodal and multimodal queries.

3.3.1 Ranking algorithms

Since kernel functions may be interpreted as similarity functions, they provide a natural ranking for retrieving images. Depending on the particular kernel function and retrieval task, the ranking may be more or less effective. Kernel functions will be the main strategy to rank images under the proposed framework; however, other ranking algorithms may be considered according to the results obtained. For instance, learning to rank [9] using multimodal kernel functions is a potential approach to searching for images.

3.3.2 Solving queries

The proposed system to retrieve images using multimodal data may be queried using keywords, example images or both, so the query representation will be studied in this research. When both data modalities are available in the query, the application of the proposed methods is straightforward. Solving unimodal queries, however, requires defining how to complete the missing modality or how to avoid using it.

3.3.3 Deliverables

A prototype system to rank a collection of images given a query will be implemented. This prototype will support experiments and evaluations of the proposed methods to find relevant images using different query paradigms. A journal paper will be written presenting the architecture of the complete system, the proposed methods and the results of this evaluation.

3.4 Phase 4: Evaluation

Many document collections include both images and texts, and their users need to find information that is either illustrated in images or described in texts. The goal of this project is to index the information of images and texts simultaneously, so that relevant information can be found independently of its original format. Although the kinds of collections to which such a system may be applied are very diverse, this project aims to evaluate the proposed system on a collection of medical information, including images, medical records and scholarly papers. In particular, the collections provided in the ImageCLEFmed competition are planned to be used, as well as the datasets collected by the Bioingenium Research Group in the course of its work.

A prototype system will be implemented to operate with the proposed methods, particularly to search for relevant documents given multimodal or unimodal queries. The response of the system will be assessed using standard IR measures to compare results with reported baselines and state-of-the-art methods. The response of the proposed Multimodal Information Retrieval system will also be compared with that of a standard text search engine and a standard image retrieval system to evaluate their relative performance. It is expected that the Multimodal Information Space will provide more accurate results.
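The ranking and evaluation steps described in Sections 3.3.1 and 3.4 can be sketched together. The fragment below orders images by their kernel value against a query and scores the ranking with average precision, used here as one example of a standard IR measure. The image identifiers, kernel values and relevance judgments are hypothetical.

```python
def rank_by_kernel(query_scores):
    """Order image ids by decreasing kernel value k(query, image)."""
    return sorted(query_scores, key=query_scores.get, reverse=True)


def average_precision(ranking, relevant):
    """Mean of the precision values measured at the rank of each relevant image."""
    hits = 0
    precision_sum = 0.0
    for position, image_id in enumerate(ranking, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / len(relevant) if relevant else 0.0


# Hypothetical kernel values between one query and four indexed images.
scores = {"img1": 0.9, "img2": 0.2, "img3": 0.7, "img4": 0.4}
relevant_images = {"img1", "img4"}

ranking = rank_by_kernel(scores)                      # img1, img3, img4, img2
ap = average_precision(ranking, relevant_images)      # (1/1 + 2/3) / 2
```

Averaging this value over a set of queries gives mean average precision, which allows a text-only, a visual-only and a fused kernel to be compared on the same queries.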

3.5 Performance Considerations

The proposed research is mainly based on kernel methods that may work on very high-dimensional spaces. Kernel-based algorithms do not need to operate explicitly in the high-dimensional space, which allows fast similarity measures between structured data to be implemented. For example, the Pyramid Match Kernel [10], used to approximate the matching between two sets of image features, provides high accuracy at a low computational cost compared with computing the optimal correspondences between the sets of features.

However, some learning algorithms need to process a kernel matrix that grows quadratically with the size of the sample. For instance, a Singular Value Decomposition (SVD) of the kernel matrix is useful for principal component analysis or latent semantic analysis [3], but the SVD algorithm is O(n^3), so for large data collections it would demand substantial computational resources or take a long time to run. The complexity of the proposed algorithms will be studied to evaluate their impact on system performance.

The majority of the algorithms that need to process a kernel matrix are training algorithms that can be executed offline. Moreover, training algorithms do not need to be applied to the complete document collection. That is, a representative sample may be taken from the collection to analyze patterns, structure and relationships, and the models obtained may later be generalized to the whole collection. When possible, parallel or distributed implementations will be considered for algorithms with high complexity.

3.6 Research Activities

The research plan is presented in Figure 1, with the activities to be carried out over two years. It comprises the four phases previously discussed: (1) Content Representation, (2) Information Fusion, (3) Information Retrieval and (4) Evaluation.

Fig. 1: Research plan for 2 years.
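To make the growth rates of Section 3.5 concrete, the short calculation below compares a kernel matrix built on a full collection with one built on a sample. The collection size of 100,000 documents, the sample size of 5,000 and the 8 bytes per entry are assumptions chosen for illustration, and the operation count is only the n^3 order of growth with constant factors ignored.

```python
def kernel_matrix_cost(n_documents, bytes_per_entry=8):
    """Rough cost figures for a dense n x n kernel matrix."""
    entries = n_documents * n_documents
    return {
        "entries": entries,                          # grows as n^2
        "memory_bytes": entries * bytes_per_entry,   # dense storage
        "svd_operations": n_documents ** 3,          # order of growth only
    }


full = kernel_matrix_cost(100_000)   # hypothetical complete collection
sample = kernel_matrix_cost(5_000)   # hypothetical representative sample

memory_full_gb = full["memory_bytes"] / 1e9      # 80.0 GB
memory_sample_mb = sample["memory_bytes"] / 1e6  # 200.0 MB
speedup = full["svd_operations"] // sample["svd_operations"]  # 20^3 = 8000
```

Under these assumptions, training on a sample 20 times smaller reduces storage by a factor of 400 and the cubic term by a factor of 8000, which is the motivation for sample-based offline training.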

4 SUMMARY

This document has presented a research agenda to study and evaluate Multimodal Information Spaces for Content-Based Image Retrieval. The main research question is how to retrieve visual information from a large multimodal document collection, taking into account that both visual and textual contents may provide useful information to improve retrieval performance. The use of kernel functions to construct Multimodal Information Spaces is proposed, and a framework based on kernel methods will be followed. Under the proposed framework, different image and text features may be fused in a high-dimensional space, in which a search algorithm may be designed. Each data modality in an image collection will be processed independently and then integrated using the proposed framework. The image collection to be used is taken from the medical domain, in which multimodal structure may be found in health records and scholarly articles. The evaluation and analysis of standard information retrieval measures is also proposed to assess the contribution of the proposed research.

References

[1] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, "Content-based multimedia information retrieval: State of the art and challenges," ACM Trans. Multimedia Comput. Commun. Appl., vol. 2, no. 1, pp. 1-19, February 2006.
[2] R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Image retrieval: Ideas, influences, and trends of the new age," ACM Comput. Surv., vol. 40, no. 2, pp. 1-60, April 2008.
[3] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[4] J. C. Caicedo, A. Cruz, and F. Gonzalez, "Histopathology image classification using bag of features and kernel functions," Artificial Intelligence in Medicine Conference, AIME 2009, vol. LNAI 5651, pp. 126-135, 2009.
[5] J. C. Caicedo, F. A. Gonzalez, and E. Romero, "Content-based medical image retrieval using a kernel-based semantic annotation framework," National University of Colombia, Tech. Rep. UN-BI-2009-01, 2009. Submitted to the Artificial Intelligence in Medicine Journal.
[6] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.
[7] T. Gärtner, J. W. Lloyd, and P. A. Flach, "Kernels and distances for structured data," Machine Learning, vol. 57, no. 3, pp. 205-232, December 2004.
[8] N. Cristianini, J. Shawe-Taylor, and H. Lodhi, "Latent semantic kernels," Journal of Intelligent Information Systems, vol. 18, no. 2, pp. 127-152, March 2002.
[9] Z. Cao, T. Qin, T. Y. Liu, M. F. Tsai, and H. Li, "Learning to rank: from pairwise approach to listwise approach," in ICML '07: Proceedings of the 24th International Conference on Machine Learning. New York, NY, USA: ACM, 2007, pp. 129-136.
[10] K. Grauman and T. Darrell, "The pyramid match kernel: discriminative classification with sets of image features," in Computer Vision, 2005. ICCV 2005. Tenth IEEE International Conference on, vol. 2, 2005.