Metric inverted - an efficient inverted indexing method for metric spaces

Size: px

Start display at page:

Download "Metric inverted - an efficient inverted indexing method for metric spaces"

Brian Campbell
6 years ago
Views:

1 Metric inverted - an efficient inverted indexing method for metric spaces Benjamin Sznajder, Jonathan Mamou, Yosi Mass, and Michal Shmueli-Scheuer IBM Haifa Research Laboratory, Mount Carmel Campus, Haifa, Israel {benjams,mamou,yosimass,shmueli}@il.ibm.com Abstract. The increasing amount of digital audio-visual content accessible today calls for scalable solutions for content based search. The state of the art solutions reveal linear scalability in respect to the collection size due to the large number of distance computations needed for comparing low level audio-visual features. As a result, search in large audio-visual collections is limited to associated metadata only done by text retrieval methods that are proven to be scalable. Search in audio-visual content can be generalized to search in metric spaces by assuming some distance function on low-level features. In this paper we propose a framework for efficient indexing and retrieval of metric spaces by extending classical techniques taken from the textual IR methods such as lexicon, posting lists and boolean constraints- thus enable scalable search in any metric space and in particular in audiovisual content. We show the efficiency and effectiveness of our method by experiments on a large image collection. 1 Introduction Recently, with the proliferation of Web 2.0 applications, we experience a new trend in multimedia production where users become mass producers of audiovisual content. Websites like Flickr, YouTube and Facebook 1 are just some examples of websites with large collections of audio-visual content. Still, search in such collections is limited to associated metadata that was added manually by users and thus search results are quite dependant on the quality of the metadata. The problem of audio-visual content based search can be generalized to a search in metric spaces by assuming some distance function on low-level features. For example, for CBIR (Content Based Image Retrieval) it is common to use low level features such as Scalable Color, Textures, Edge Histograms etc. For example, the Scalable Color feature which describes the image color histogram can be represented by a vector of 64 bytes. Distance between two Scalable Colors vectors can be computed by the L 1 norm thus inducing a metric space over the domain of all Scalable Colors. Similarly, we can have another metric space for other low level image features [9]. This work was partially funded by the European Commission Sixth Framework Programme project SAPIR - Search in Audiovisual content using P2P IR

2 State of the art solutions for search in metric spaces such as [2, 4, 14] reveal linear scalability in the collection size. This is mainly because of high number of distance computations needed for comparing low level features. Thus, some approximation is needed. On the other hand, text search is known to be scalable since text IR methods uses efficient data structures such as lexicon, posting lists and boolean operations that make it very efficient for searching in large collections. In this work, we try to bridge the performance gap between text IR and metric IR by introducing an approximation for top-k using MII (Metric Inverted Index) - an inverted index implementation for metric spaces. We show that by using text IR methods such as lexicon, posting lists and boolean operators we can achieve high scalability for search in metric spaces. Using the MII we can tradeoff between efficiency and effectiveness of our approximated top-k. We show an implementation of a system based on Lucene 2 and report experiments on a real data collection that shows the efficiency of our method compared to other methods. The paper is organized as follows. We review related work in Section 2. In Section 3, we define metric spaces and state the problem. In Section 4, we describe a text IR method for indexing metric spaces. Then, in Section 5, we show how to use them efficiently for top-k retrieval. Experimental setup and results on a large image collection are given in Section 6 and we conclude with future work in Section 7. 2 Related work The idea of using inverted index to represent audio-visual was already suggested by some works. The serious of papers by Müller et al. [13, 7, 12] is one of the early works that suggested the use of inverted index for visual features. The terms in the lexicon are fixed and it contains pre-defined set of around features which correspond to human perception features. In [12] they show how to improve the search by pruning. Specifically, they define lossy and lossless reductions, both use the knowledge of the score and score increasing. However, it is not clear if and how this approach is scalable (their experiments are on limited collection of about 2500 images). The work in [17] shares similar motivation to our work. They facilitate the use of inverted index to index visual features by using N-grams approach on the visual descriptors. Their approach is thus limited only to the vector space, whereas, our approach is more general in the sense that it can handle any type of objects with metric distance. Recently, [1, 8] showed how to generate tf idf for image visual descriptors using segmentation. Specifically, in [1], each image was segmented into several segments and like-tf is generated for the image. Our focus is different, we do not attempt to use tf idf scoring but rather efficient filtering method, which implies different indexing and querying methods. The focus in [20, 3] is how to combine low level features 2

3 with text, semantic keywords etc. In [11], they use the tf idf measure for index and search in videos. Their goal is to efficiently recognize scenes and objects in videos. For scenes search, they indexed continuous frames into regions preferring Shape Adapted and Maximally Stable regions, To locate on object in the frame, they mainly look on the spatial consistency (other objects in the frame) and look for the best match. Another line of related works discuss methodologies of indexing and accessing data in metric spaces and are covered in details in [9, 19]. These techniques are general and work for every metric space, however, there main disadvantage is that complexity increases linearly with the collection size, thus, they are not scalable and not appropriate to large-scale collections. Each of the techniques partition the metric space differently, specifically, the work in [18] considered the space as a ball and applied ball-partitioning. An orthogonal approach is the GHT by [14], where two reference objects (pivots) are arbitrarily chosen and assigned to two distinct objects subset. All objects are assigned to the subset containing the pivot which is nearest to the object itself. In contrast to ball partitioning the generalized hyperplane does not guarantee a balanced split. Finally, the works in [2, 4] present another way of partitioning the data having balanced structure. Merging techniques are also relevant to our problem since we want to do efficient merging of different index lists. Approaches like mergesort and more sophisticated, like TA [5] exists, our work, however, uses boolean query techniques. 3 Definitions and Problem Statement 3.1 Definitions Metric Spaces A metric space is an ordered pair (S, d), where S is a domain and d is a distance function d : S S R. d calculates the distance between two objects from the set S. Usually, the smaller the distance is, the more similar the objects are. The distance function d must satisfy non-negativity, reflexibility, symmetry and triangle inequality. Assume a metric space (S, d), a collection C S of objects, and a query Q S. The best-k results for the query are the k objects with the shortest distance to the query object. Top-k problem Assume m metric spaces {(S i, d i ))} m i=1, a query Q m i=1 S i, and an aggregate function f : R m R. The problem of top-k is to retrieve the best k objects that have the lower aggreagte value, namely those objects D with the lower f(d 1 (Q, D), d 2 (Q, D),..., d m (Q, D)) Measures effectiveness To measure effectiveness of a retrieval engine we use the standard Precision and MAP (Mean average Precision) [16] measures. The precision is used to measure the percentage of documents retrieved that are

4 relevant. In metric spaces each object has some distance to the query (and thus some level of relevancy). Thus, we define which is the same as MAP but we take the relevant results to be the real top-k objects. efficiency Let us assume that the cost of a single distance computation between any two objects in any of the metric spaces is c d. The efficiency of a retrieval engine on metric spaces is measured by the total computation cost (c d N d ), where N d is the total number of distance computations needed to answer a top-k query. 3.2 Problem Statement Assume {(S i, d i ))} m i=1 metric spaces and a cost c d for a single distance computation. Our goal is to return the best approximation (measured by effectiveness) to the top-ḱ query while minimizing the total computation cost (efficiency measure). 4 Indexing In this section, we describe the indexing stage. We assume m metric spaces denoted by {(S 1, d 1 ) (S m, d m )} and a collection of N objects each having m features F 1 F m such that i [1, m], F i (S i, d i ). An inverted text index contains two main parts: a lexicon of terms in the collection and posting lists containing for each term the list of documents in which the term appear. For metric spaces, a term is a feature names associated with its value. Indexing stage is done in two steps: first, we create a lexicon; then, we index each object in the corresponding entries of the lexicon. 4.1 Lexicon candidate selection In text IR, the size of the lexicon is substantially smaller than the size of the collection since terms tend to appear in many documents. This is not always the case for metric spaces. For example, low level image features such as e.g. a Scalable Color can be reprensented by a vector of 256 integers. Since the space of possible values for each feature is huge, the size of a lexicon will be in the order of the collection size. Consequently, it would be impossible to store all features in the lexicon as in text IR. Thus we would like to select candidates for the lexicon such that the max number of lexicon entries will be at most l. Note that we keep all the different features from all the metric spaces in the same lexicon. A naive approach would consist by ramdomly selecting l/m documents per feature. The m features extracted from these documents will constitute the lexicon terms. This approach can be widely improved by applying some clustering algorithms on the extracted features using the distance function such that selected candidates spans the metric space. Such algorithms are described in [10] and [15]. In the present work, we use K-Means clustering algorithm [6].

5 For each feature, we choose randomly M = l/m centroids. Note that generally M << N. K-Means algorithm is run iteratively by mapping each feature to the closest centroid using the distance function. Then new centroids are selected and the process continues until fixed clusters are reached, namely no feature was moved to another centroid in a full iteration. The lexicon elements are then the clusters centroids. K-Means is run for each of the m features and we get a lexicon of l terms. 4.2 Index creation Once the lexicon entries are selected we need to create the posting lists. Let us denote the terms of the lexicon by (F : v) where F is the feature name and v is its value. Assume an object D with features D = {(F 1 : v 1 ), (F 2 : v 2 ),..., (F m : v m )}. Most of the features (F i : v i ) are not part of the lexicon so we need to select from the lexicon terms that best represent our object. This is done by selecting for each feature (F i : v i ), the n nearest lexicon terms (n is a parameter) according to the distance function. In order to determine these nearest terms efficiently, the lexicon is stored in an M-tree [2]. After this stage, D can be represented as: D = {(F 1 : v 11 ), (F 1 : v 12 ),... (F 1 : v 1n ), (F 2 : v 21 ), (F 2 : v 22 ),... (F 2 : v 2n )..., (F m : v m1 ), (F m : v m2 ),..., (F m : v mn )}, where (F i : v ij ) are 3 lexicon terms. In other words, each document is transformed to a bag of n m lexicon terms and therefor it is indexed in n m posting lists. Note that the n nearest terms selection can be compared to the stemming step in text IR. 5 Retrieval stage Given a query Q = {F 1 : qv 1, F 2 : qv 2,..., F m : qv m } we describe now the query processing done to retrieve the best approximation of the top-k while minimizing the computation cost as defined in Section 3. Query processing is done in three steps: term selection, Boolean constraint filtering, and scoring. 5.1 Term selection: Similarly to the indexing process in section 4 above, we first find for each query feature, the n closest lexicon terms that can represent it. This results in Q = {F 1 : qv 11, F 1 : qv 12,... F 1 : v 1n,... F m : qv m1, F m : qv m2,..., F m : qv mn }. Q contains m n terms all from the lexicon. 3 BS: maybe add all

6 5.2 Boolean constraint Filtering: We look now at the posting lists of the m n in Q and we want to select the best approximation to the top-k out of the objects in those posting lists. We present two strategies to select the best results: Strict-Query mode - recall that we have m features in the query and for each feature we have n posting lists of the closest lexicon entries. In the strict mode we would like to have results only if they appear in all m features. To achieve that we create a conjunction of m clauses C i, where clause C i is a disjunction of the n terms found for query feature (F i, qv i ). Experiments demonstrate that this occurs very rarely in real configuration. Formally, m n j=1(f i : qv ij ) (1) i=1 Fuzzy-Query mode - in this mode, we consider all objects that appear in at least 2 features. To achive that we create ( ) m 2 conjunction clauses for each pair of features and then we create a disjunction between those clauses. Formally, m ( ( n j=1(f i : v ij ), n j=1(f k : v kj )), i k) (2) i=1 5.3 Scoring At the end of the filtering step we have a list of documents and we would like to rank them by relevance to the query. Given an aggregate function f, we compute for each returned document D, its aggregate score to the query Q given by f(d 1 (F 1 : qv 1, F 1 : v 1 ), d 2 (F 2 : qv 2, F 2 : v 2 ),..., d m (F m : qv m, F m : v m )), where F i : v i and F i : qv i are the original values of the document and the query. Our top-k result is the k documents with the lower aggregate score. 6 Experiments In this section, we show thorough analysis of the MII under various settings with focus on the effectiveness which is measured by the number of distance computations and the efficiency which we measured by MAP@k. Specifically, we varied the lexicon size and number of nearest terms. In addition, we compare the MII to the M-Tree data structure [2], which is one of the state of the art data structure for metric spaces 4. 4 The M-Tree implementation we used has been taken from

7 6.1 Collection description Our experiments were done on a collection of images crawled from Flickr 5, for each image, three MPEG7 features have been extracted, namely, ScalableColor, EdgeHistogram and ColorLayout. The crawling and extraction are described in [?]. The distance function that we used for ScalableColor and EdgeHistogram is the L 1 (Manhattan distance), and L 2 (Euclidean distance) for ColorLayout. The aggregate function f is simple sum. We randomly chosen 180 images from the collection to serve as the queries, for each we collected the MAP and the number of comparisons. In addition, we used the Fuzzy-Query mode. 6.2 Results Like all pruning algorithms, the experiments point on a tension between the precision of the results and the strictness of the pruning: more the pruning is strict, more the precision will be low and the efficiency of the computing high. Effectiveness We present the effectiveness of the MII approach using the MAP@k measure. Figure 1(a) shows the MAP@k values vs. the number of nearest terms for index with lexicon size of l = When the number of nearest terms increases, the MAP@k increases as well, for example, when the number of nearest terms is 20 we got MAP@30=0.75 and MAP@10=0.83. Increasing the number of nearest terms implies more accesses to entries in the lexicon and our filtering becomes lazier, eventually, more documents need to pass the full ranking. For 30 nearest terms the MAP value is around the 90%. Figure 1(b) shows the MAP@k values vs. the lexicons size for fixed number of nearest terms (30). Reducing the size of the lexicon enhanced the MAP@k. This result was expected, since the posting lists of each term became longer and the Boolean filtering need to validate more documents. (a) (b) Fig. 1. MAP value for different k s.(a) Lexicons size =16000, varied number of nearest terms(b) number of nearest term =30, varied size of lexicon Efficiency The efficiency is measured by the number of distance computations. We calculate the number of computations as follow; for each feature fet, we use M-Tree to found the n nearest terms in the lexicon. Let us designate this cost as lookup fet. After this step, we merge the different lists using Boolean filtering which does not imply additional computation costs. Finally, assume the number of objects that passed the filtering step to be m and the number of feature that 5

8 we used to be x, in order to get the top-k results, each object need additional m x comparisons. Formally, the total distance computations is: x (lookup i ) + x m i=1 Figure 2 shows the efficiency results for varying sizes of lexicon. Intuitively, reducing the size of the lexicon should increase the number of computations, however, we get that the cost for a size of lexicon l = is lower than for a size of lexicon equals to l = This phenomenon can be explained as follow, the lookup in the lexicon for selecting the Visual Vocabulary is cheap and compensates the manipulation of long returned results list. This result is interesting and promising, we can find an optimal size of lexicon that insures us a lazier pruning, and does not cost more in computations. Fig. 2. # comparisons vs. # nearest terms for varied lexicons size To validate the goodness of our efficiency results, we compare it with the number of computations involved in a lookup via M-Tree. Figure 3 shows the number of computations for MII and M-Tree for different values of ks. Fig. 3. # comparisons vs. k 7 Conclusion and future works References 1. Giuseppe Amato, Pasquale Savino, and Vanessa Magionami. Image indexing and retrieval using visual terms and text-like weighting. In DELOS Conference, pages 11 21, Paolo Ciaccia, Marco Patella, and Pavel Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Matthias Jarke, Michael Carey, Klaus R. Dittrich, Fred Lochovsky, Pericles Loucopoulos, and Manfred A. Jeusfeld, editors, Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB 97), pages , Athens, Greece, August Morgan Kaufmann Publishers, Inc. 3. T. Deselaers, T. Weyand, D. Keysers, W. Macherey, and H. Ney. FIRE in ImageCLEF 2005: Combining content-based image retrieval with textual information retrieval. In Working notes of the CLEF 2005 Workshop, Vienna, Austria, September 2005.

9 4. Vlastislav Dohnal, Claudio Gennaro, Pasquale Savino, and Pavel Zezula. D-index: Distance searching index for metric data sets. Multimedia Tools Appl., 21(1):9 33, Ronald Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algorithms for middleware. In PODS 01: Proceedings of the twentieth ACM SIGMOD-SIGACT- SIGART symposium on Principles of database systems, pages , New York, NY, USA, ACM Press. 6. J. A. Hartigan and M. A. Wong. A K-means clustering algorithm. Applied Statistics, 28: , Henning Muller, David McG. Squire, Wolfgang Muller, and Thierry Pun. Efficient access methods for content-based image retrieval with inverted files. 8. Rute Frias Rui Jesus, Ricardo Dias and Nuno Correia. Multimedia information retrieval: new challenges in audio visual search. SIGIR Forum, 41(2):77 82, Hanan Samet. Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, Gholamhosein Sheikholeslami, Wendy Chang, and Aidong Zhang. Semantic clustering and querying on heterogeneous features for visual data. In MULTIMEDIA 98: Proceedings of the sixth ACM international conference on Multimedia, pages 3 12, New York, NY, USA, ACM. 11. Josef Sivic and Andrew Zisserman. Video google: A text retrieval approach to object matching in videos. In ICCV 03: Proceedings of the Ninth IEEE International Conference on Computer Vision, page 1470, Washington, DC, USA, IEEE Computer Society. 12. D. Squire, H. Muller, and W. Muller. Improving response time by search pruning in a content-based image retrieval system. 13. D. Squire, W. Muller, H. Muller, and J. Raki. Content-based query of image databases, inspirations from text retrieval: inverted files, frequency-based weights and relevance feedback. In Proceedings of the 10th Scandinavian Conference on Image Analysis (SCIA 99), June Jeffrey K. Ulmann. Satisfying general proximity/similarity queries with metric trees. Inf. Process. Lett., 40(4): , A. Vailaya, M. Figueiredo, A. Jain, and H. Zhang. Image classification for contentbased indexing, C. J. Van Rijsbergen. Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, Peter Wilkins, Paul Ferguson, Alan F. Smeaton, and Cathal Gurrin. Text based approaches for content-based image retrieval on large image collections. In 2nd IEE European Workshop on the Integration of Knowledge, Semantic and Digital Media Technologies, P. Yianilos. Excluded middle vantage point forests for nearest neighbor search, Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, and Michal Batko. Similarity Search: The Metric Space Approach (Advances in Database Systems). Springer- Verlag New York, Inc., Secaucus, NJ, USA, Xiang Sean Zhou and Thomas S. Huang. Unifying keywords and visual contents in image retrieval. IEEE MultiMedia, 9(2):23 33, 2002.

MESSIF: Metric Similarity Search Implementation Framework

MESSIF: Metric Similarity Search Implementation Framework Michal Batko David Novak Pavel Zezula {xbatko,xnovak8,zezula}@fi.muni.cz Masaryk University, Brno, Czech Republic Abstract The similarity search