PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search

Size: px
Start display at page:

Download "PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search"

Transcription

1 PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search Andrea Esuli Istituto di Scienza e Tecnologie dell Informazione A. Faedo Consiglio Nazionale delle Ricerche Via Giuseppe Moruzzi, Pisa, Italy ISTI:Science seminar, May 12, 2009 Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 1 / 48

2 Outline 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 2 / 48

3 Outline Introduction 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 3 / 48

4 Outline Introduction Similarity search 1 Introduction Similarity search Permutation based methods Local similarity hashing methods 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 4 / 48

5 Introduction Similarity search Similarity search The similarity search model involves: A collection of objects D, belonging to a domain O; a query object q O; a distance function d : O O R +. The goal is to sort the objects in D by their distance with respect to q, returning the objects that are closer to q, which are considered to be the most similar. Typically only the k-top ranked objects are returned (k-nn query), or those within a maximum distance value r (range query). The determination of a meaningful r value is often a non-easy task. k-nn queries are usually preferred, specially in end-user applications, also for the direct control on the result set size. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 5 / 48

6 Similarity search Introduction Similarity search Example (R 2, L 2 ): o 2 o 2 o 1 o 1 o 3 o 3 o 10 o 10 o 0 r o 5 q o 4 o 8 o 9 o 0 o 5 q o 4 o o 11 7 o 8 o 9 o o 11 7 o 6 o 6 o 12 o 12 Figure 1: Range query. Figure 2: k-nn query (k = 5). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 6 / 48

7 Introduction Approximate similarity search Similarity search Exhaustive search: for all o i D compute the distance d(q, o i ), while keeping track of which objects satisfy the query. It does not scale to large collections. Exact methods: equivalent to exhaustive search, but using data structures that leverage on the properties of the observed similarity space (e.g., vectorial spaces, metric spaces) in order to reduce the number of objects of D to be compared with the query. Usually efficient but still not enough for huge collections. Approximate methods: accepting that the results could contain errors (e.g., d(q, o 1 ) < d(q, o 2 ), o 2 is in the results and o 1 is not), gaining efficiency. Approximation is acceptable, e.g., when d is an approximation of a complex, human-perceived concept of similarity. It (obviously) scales! Typically derived from relaxed exact methods. Natively approximated proposals, e.g.: local similarity hashing (LSH) index and permutation-based index (the PP-Index takes inspiration from both). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 7 / 48

8 Introduction Approximate similarity search Similarity search Approximation quality: What have we missed? What have we included? How much have we saved? o 2 o 1 o 3 o 10 o 0 o 5 q o 4 o 8 o 9 o o 11 7 o 6 o 12 Figure 3: Approximate result for a k-nn query (k = 5). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 8 / 48

9 Outline Introduction Permutation based methods 1 Introduction Similarity search Permutation based methods Local similarity hashing methods 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 9 / 48

10 Introduction Permutation based methods Permutation based methods Independently proposed by Amato and Savino 1 and Chavez et al. 2, using different data structures. The idea: an object is represented by its view of the surrounding world. Intuively, if two objects see the elements of a set of reference objects R in the same order of (increasing) distance, they are likely to be close one to the other. Example Where am I likely to live if I see the main European cities in the following order? Rome, Milan, Bern, Marseilles, Munich, Luxembourg, Bonn, Vienna, Belgrade, Brussels, Barcelona, Paris, Berlin, Amsterdam, London, Copenhagen, Madrid, Istanbul, Dublin, Athens, Oslo, Stockholm, Lisbon, Helsinki. 1 G. Amato and P. Savino, Approximate similarity search in metric spaces using inverted files, INFOSCALE 2008, pages E. Chavez, K. Figueroa, and G. Navarro, Effective proximity retrieval by ordering permutations, IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9): , Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 10 / 48

11 Introduction Permutation based methods Permutation based methods The method: A set of reference objects R = {r 0,..., r R 1 } O is defined (e.g., by randomly selecting R objects from D). Every object o i D is then represented by a permutation Π oi of 0,..., R 1, i.e., the list of the identifiers of reference objects, so that the identifiers are sorted by the distance of their relative reference objects with respect to o i. The search process mainly consists in computing Π q and estimating the true distance d(q, o i ) using a permutation-based distance d (Π q, Π oi ), e.g., the Spearman s footrule distance. Amato and Savino have shown that using only the prefix Π l o i of the permutation Π oi (e.g., l = 100 when R = 500) improves both efficiency and effectiveness. The PP-Index adopts a permutation-based data representation model, using very short prefixes (e.g., l = 6 when R = 1000). Differently from previous approaches, the permutation prefixes are used just to quickly find a small set of candidate objects from D for inclusion into results, not to estimate their relative order. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 11 / 48

12 Introduction Permutation based methods Permutation based methods r 0 r 0 <0,1,3> <1,3,0> <1,3,4> <0,2,3,1,5,4> <1,3,2,0,4,5> r 1 <1,4,3,2,0,5> <0,2,3> <1,3,2> r 1 <1,4,3> <5,2,3,0,1,4> r2 r 3 <4,1,3,2,5,0> r 4 <2,3,0> <3,2,1> <2,0,3> <2,5,3> <2,3,5> r 2 r 3 <3,2,5> <4,1,3> r 4 r 5 <4,3,1,2,5,0> r 5 <5,2,3> <4,3,1> <4,3,5> <5,2,3,4,1,0> Figure 4: Regions of the 2-dimensional space identified by 6 randomly selected reference points, using the Euclidean distance, and full-lenght permutations (left) or permutation prefixes of lenght 3. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 12 / 48

13 Outline Introduction Local similarity hashing methods 1 Introduction Similarity search Permutation based methods Local similarity hashing methods 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 13 / 48

14 Introduction Local similarity hashing methods Local similarity hashing methods A family H of hash functions f : O U is called (r, ɛ, p 1, p 2 )-sensitive, with r, ɛ > 0, p 1 > p 2 > 0, if for any p, q O: if d(p, q) r then P[h(p) = h(q)] p 1 if d(p, q) > r(1 + ɛ) then P[h(p) = h(q)] p 2 for any function h randomly selected from H. Intuitively: two objects have a (high) probability x 1 p 1 to collide if they are closer than r, and a (low) probability x 2 p 2 if they are more distant than r(1 + ɛ). LSH-Index 3 : j randomly chosen functions h i H define a hash function g(x) = (h 1 (x)h 2 (x)... h j (x)), i.e. bad collision probability is significantly lowered to p j 2. t different hash tables are built, based on randomly generated g 1... g t functions, in order to increase good collision probability. 3 P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, STOC 1998, pages Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 14 / 48

15 Introduction Local similarity hashing methods Local similarity hashing methods It is hard to tune LSH-Index (length of hash keys) in order to obtain good efficacy, due to the dependence between data distribution and hash length. LSH-Forest 4 : Use of variable length hash keys. Long hash key are indexed in a prefix tree (LSH-Tree). At search time the key length is varied in order to retrieve a given number of candidate objects. Candidate objects are retrieved sequentially from a data storage on disk. Multiple LSH-Tree, i.e., a forest, are used to improve effectiveness. The PP-Index uses similar data structure. 4 M. Bawa, T. Condie, and P. Ganesan, LSH-Forest: self-tuning indexes for similarity search, WWW 2005, pages Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 15 / 48

16 Outline The PP-Index 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 16 / 48

17 Outline The PP-Index Data structures 1 Introduction 2 The PP-Index Data structures Algorithms 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 17 / 48

18 The PP-Index Data structures PP-Index: data structures The PP-Index represents each indexed object with a permutation prefix of length l. Data structures: a prefix tree kept in main memory, indexing the permutation prefixes, and a data storage kept on disk, storing the information required to compute real distances between objects in D and any object in O. The prefix tree is used in order to rapidly identify a set of at least z candidates (z k), leaving to the original distance function the task of determining the final k-nn result from such set of candidates. Candidates are retrieved from the data storage with a few sequential disk accesses. The PP-Index adopts a bulk data processing model, similar to the one used for text-based inverted list indexes (assumption on the static nature of data). It is easy to provide update capabilities (i.e., insert, delete, modify). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 18 / 48

19 Outline The PP-Index Algorithms 1 Introduction 2 The PP-Index Data structures Algorithms 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 19 / 48

20 The PP-Index PP-Index: building the index Algorithms BuildIndex(D, d, R, l) 1 prefixt ree EmptyPrefixTree() 2 datastorage EmptyDataStorage() 3 for i 0 to D 1 4 do o i GetObject(D, i) 5 datablock oi GetDataBlock(o i) 6 p oi Append(dataBlock oi, datastorage) 7 w oi ComputePrefix(o i, R, d, l) 8 h oi i 9 Insert(w oi, h oi, p oi, prefixt ree) 10 L ListPointersByOrderedVisit(prefixT ree) 11 P CreateInvertedList(L) 12 ReorderStorage(dataStorage, P ) 13 CorrectLeafValues(prefixT ree, datastorage) 14 index NewIndex(d, R, l, prefixt ree, datastorage) 15 return index Figure 5: The BuildIndex function. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 20 / 48

21 The PP-Index Algorithms PP-Index: building the index Input: dataset D, distance function d, reference objects R, prefix length l. Indexing process: Main loop: permutation prefixes are inserted into the prefix tree, data blocks are appended to data storage. Data storage reordering: data blocks are sorted to reflect the order of prefixes. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 21 / 48

22 The PP-Index PP-Index: building the index Algorithms Index characteristics D =10, R =6, l=3 Permutation prefixes w o 0 =<1, 3, 2> w o 1 =<2, 3, 0> w o 2 =<5, 2, 3> w o 3 =<4, 1, 3> w o 4 =<1, 3, 2> w o 5 =<4, 1, 3> w o 6 =<1, 3, 4> w o 7 =<5, 2, 3> w =<1, 3, 2> w =<4, 3, 5> o 8 o 9 Prefix tree root h p 0 h 0 p 4 h 4 p 8 h 8 p 6 h 6 p 1 h h h p p p 9 h h p p main memory secondary memory Figure 6: Sample data. o 0... o o... 5 o... o... 4 o... o 6... o 7... o 8... o 9... Data storage Figure 7: Index data structure after the first phase of object insertion. Main loop: permutation prefixes are inserted into the prefix tree, data blocks are appended to data storage. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 22 / 48

23 The PP-Index PP-Index: building the index Algorithms Prefix tree root Prefix tree root h p 0 h 0 p 4 h 4 p 8 h 8 p 6 h 6 p 1 h h h p p p 9 h h p p h h start p start end p end p h p h h h start p start end p end p h h h start p start end p end main memory secondary memory main memory secondary memory o 0... o o... 5 o... o... 4 o... o 6... o 7... o 8... o 9... Data storage o 0... o o... 3 o... o... 1 o... o 5... o 9... o 2... o 7... Data storage Figure 8: Index data structure after the first phase of object insertion. Figure 9: Index data structure after the first phase of object insertion. Data storage reordering: data blocks are sorted to reflect the order of prefixes. The leaves of the final prefix tree point to intervals of the data storage. Efficiency alert: performed using a m-way merge sort algorithm. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 23 / 48

24 PP-Index: search function The PP-Index Algorithms FindCandidates(q, prefixt ree, R, d, l, z) 1 w q ComputePrefix(q, R, d, l) 2 for i l to 1 3 do wq i SubPrefix(wq, i) 4 node SearchPath(wq i, prefixt ree) 5 if node nil 6 then minleaf GetMin(node, prefixt ree) 7 maxleaf GetMax(node, prefixt ree) 8 if (maxleaf.h end minleaf.h start + 1) z i = 1 9 then return (minleaf.p start, maxleaf.h end ) 10 return (0, 0) Figure 10: The FindCandidates function. Given the prefix representing the query, FindCandidates searches for the smallest subtree of the prefix tree pointing to at least z data blocks (z ). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 24 / 48

25 PP-Index: search function The PP-Index Algorithms Search(q, k, z, index) 1 (p start, p end ) FindCandidates(q, index.prefixt ree, index.r, index.d, index.l, z) 2 resultsheap EmptyHeap() 3 cursor p start 4 while cursor p end 5 do datablock Read(cursor, index.datastorage) 6 AdvanceCursor(cursor) 7 distance index.d(q, datablock.data) 8 if resultsheap.size < k 9 then Insert(resultsHeap, distance, datablock.id) 10 else if distance < resultsheap.top.distance 11 then ReplaceTop(resultsHeap, distance, datablock.id) 12 Sort(resultsHeap) 13 return resultsheap Figure 11: The Search function. The z candidate data blocks are sequentially read from the data storage. A heap is used to keep track of the best k results. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 25 / 48

26 The PP-Index Algorithms PP-Index: improving the search effectiveness The basic search strategy is designed for efficiency. Effectiveness can be boosted using various search strategies: Multiple index: building n PP-Index using different R sets, R 1... R n (LSH-Forest style). Projecting different R-induced grids on the objects, helps to approximate a better (less skewed) partitioning of the space. Can be implemented using data replication (faster/more storage) or using data referencing (slower/less storage). k-nn results from the various indexes are merged together in the final one. Multiple query: generating m perturbed versions of w q in order to explore the neighborhood of w q. The perturbed w i q prefixes are generated by swapping pairs of elements of w q, first selecting those with the smaller distance difference with respect to q. All the w i q prefixes are used to find candidates on the same index. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 26 / 48

27 The PP-Index Algorithms PP-Index: prefix tree optimizations Prefix tree root Prefix tree root 1, h h start p start end p end p h p h h h start p start end p end p h h h start p start end p end h h start p start end p end p h p h h h start p start end p end p h h h start p start end p end main memory secondary memory main memory secondary memory o 0... o o... 3 o... o... 1 o... o 5... o 9... o 2... o 7... Data storage o 0... o o... 3 o... o... 1 o... o 5... o 9... o 2... o 7... Data storage Figure 12: Pruning of only-child paths to leaves. Figure 13: Only-child paths compression. Scalability alert: reducing to a single leaf any subtree pointing to less than z data blocks. Applicable when z is hardcoded into the search function. Does not affect search results quality. Lossy with respect to index update operations. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 27 / 48

28 The PP-Index Algorithms PP-Index: merging (and updating) the index Scalability alert: the index (prefix tree) reaches its maximum memory requirement at the end of the main loop of the indexing process. Could not fit into memory, when the final index will (after optimizations). Strategy: building many smaller indexes, using the same R set, then merging them together. The merge process is efficient: The source prefix trees are merged into the final prefix tree by performing a parallel ordered visit on them. Data storages are merged into the final data storage while building the final prefix tree. Can be done from-disk-to-disk, minimum memory occupation. Linear cost with respect to index size (if not done in an m-way style). Uses only sequential reads/writes. Update operations can be supported by keeping track of such operations by using a small all-in-memory index and performing periodic merge operations. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 28 / 48

29 Outline Experiments 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 29 / 48

30 Outline Experiments The CoPhIR collection 1 Introduction 2 The PP-Index 3 Experiments The CoPhIR collection Evaluation measures Results 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 30 / 48

31 Experiments The CoPhIR collection The CoPhIR collection The CoPhIR 5 consists of a crawl of 106 millions images from the Flickr photo sharing website. Textual data + five MPEG-7 visual descriptors (240 GB of XML description data). Visual similarity measure: linear combination of distance functions defined on the MPEG-7 descriptors. MPEG-7 Visual Descriptor Distance type Dimension Weight Scalable Color L Color Structure L Color Layout sum of L Edge Histogram L Homogeneous Texture L Table 1: Details on the five MPEG-7 visual descriptors used in CoPhIR, and the weights used in the linear combination. The Dim. column refer to the specific dimension for visual descriptors adopted by the CoPhIR data set. Experiments made on 1, 10, and 100 millions images, using 100 randomly selected images (excluded from indexes) Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 31 / 48

32 Outline Experiments Evaluation measures 1 Introduction 2 The PP-Index 3 Experiments The CoPhIR collection Evaluation measures Results 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 32 / 48

33 Evaluation measures Experiments Evaluation measures Effectiveness measures: Recall (ranking-based 6 ): Recall(k) = Dk q P k q k Relative Distance Error (distance-based): RDE(k) = 1 k d(q, Pq k (i)) k d(q, Dq k (i)) 1 (2) where Dq k is the list of the k closest elements of D to q, sorted by their distance with respect to q, and Pq k is the list returned by the algorithm. Efficiency measures: index time. index size (RAM, disk). i=1 number of candidates retrieved from disk (z ). average search time. 6 M. Patella and P. Ciaccia, The many facets of approximate similarity search, SISAP 2008, pages Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 33 / 48 (1)

34 Outline Experiments Results 1 Introduction 2 The PP-Index 3 Experiments The CoPhIR collection Evaluation measures Results 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 34 / 48

35 Experiments Results Results Indexing time (s) M 10M 1M Figure 14: R Indexing time w.r.t. to the size of R and the data set size. D indexing prefix tree size data l time (sec) full comp. storage 1M MB 91 kb 349 MB M MB 848 kb 3.4 GB M MB 6.5 MB 34 GB 3.5 Table 2: Indexing times (with R = 100), resulting index sizes, and average prefix tree depth l (after prefix tree compression with z = 1, 000), for the various data set sizes. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 35 / 48

36 Results Experiments Results Search time (s) M 10M 1M R Figure 15: Search time w.r.t. to the size of R and the data set size. Search performed with z = 1, 000 and k = 100 (single index, single query). R D 1M 10M 100M 100 4,075 5,817 7, ,320 5,571 7, ,803 5,065 6,853 1,000 1,091 4,748 6,644 Table 3: Average z value (z = 1, 000), i.e., average number of retrieved candidate objects for a query, with respect to the size of the reference objects set and data set size. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 36 / 48

37 Results 0.20 Experiments Results M 10M 1M 0.15 k=100 k=10 k=1 RDE(k) 0.10 RDE(k) R R Recall(k) M 10M 1M Recall(k) k=100 k=10 k= R R Figure 16: Effectiveness with respect of the size of R set, on various index sizes, using k = 100, and z = 1, 000 (single index, single query). Figure 17: Effectiveness with respect of the size of R set, on the 100M index, using z = 1, 000 (single index, single query). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 37 / 48

38 Results 0.10 Experiments Results 0.10 RDE(k) k=100 k=10 k=1 RDE(k) k=100 k=10 k= indexes queries Recall(k) k=100 k=10 k=1 Recall(k) k=100 k=10 k= indexes queries Figure 18: Effectiveness of the multiple index search strategy on the 100M index, using R = 1, 000 and z = 1, 000. Figure 19: Effectiveness of the multiple query search strategy on the 100M index, using R = 1, 000 and z = 1, 000. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 38 / 48

39 Results 0.01 Experiments Results 1e-3 RDE(k) 8e-3 6e-3 4e-3 100M 10M 1M RDE(k) 8e-4 6e-4 4e-4 100M 10M 1M 2e-3 2e-4 Recall(k) k 100M 10M 1M k Recall(k) k 100M 10M 1M k Figure 20: Effectiveness of the combined multiple Figure 21: Effectiveness of the combined multiple query and multiple index search strategies, using eight queries and eight indexes, on various data set sizes, using R = 100, and z = 1, 000. query and multiple index search strategies, using eight queries and eight indexes, on various data set sizes, using R = 1, 000, and z = 1, 000. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 39 / 48

40 Outline Demo 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 40 / 48

41 Demo Demo Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 41 / 48

42 Outline Conclusions 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 42 / 48

43 Outline Conclusions Summary 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Summary Questions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 43 / 48

44 Conclusions Summary Summary The PP-Index: is a simple but effective data structure for approximate similarity search. scales well, both at indexing time and at search time. can be kept updated with minor additional effort. has good parallelization properties. relates well with other data structures (i.e., inverted lists). There is a lot still to investigate: policies for reference points selection. studing the relations between l, R, z, k, l, and z. giving a theoretical foundation to the permutation based methods. applicability to other domains and similarity space types. policies for data partitioning Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 44 / 48

45 Outline Conclusions Questions 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Summary Questions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 45 / 48

46 Questions? Conclusions Questions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 46 / 48

47 FAQ Conclusions Questions Q: How does the PP-Index differ from the orthodox metric approach X? A: Please help yourself: No explicit requirement of metric properties. Use of a predetermined (i.e., fixed) set of reference points. Any reference point has a global influence. Different data access model. Different data update model.... Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 47 / 48

48 FAQ Conclusions Questions Q: What are the key differences between the permutation-based methods and the LSH-based methods? A: The permutation-based methods are mostly based on geometrical considerations, while LSH-based methods are mostly based on probabilistic considerations. The permutation-based methods are able to take into account how data is distributed in the similarity space (by means of R), while the LSH hash functions are derived only from the distance function. Each element of the hash key generated by an LSH hash function is independent from the others, while the order relation between the elements of a permutation is the crucial information for a permutation-based method. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 48 / 48

Pivot Selection Strategies for Permutation-Based Similarity Search

Pivot Selection Strategies for Permutation-Based Similarity Search Pivot Selection Strategies for Permutation-Based Similarity Search Giuseppe Amato, Andrea Esuli, and Fabrizio Falchi Istituto di Scienza e Tecnologie dell Informazione A. Faedo, via G. Moruzzi 1, Pisa

More information

An approach to Content-Based Image Retrieval based on the Lucene search engine library (Extended Abstract)

An approach to Content-Based Image Retrieval based on the Lucene search engine library (Extended Abstract) An approach to Content-Based Image Retrieval based on the Lucene search engine library (Extended Abstract) Claudio Gennaro, Giuseppe Amato, Paolo Bolettieri, Pasquale Savino {claudio.gennaro, giuseppe.amato,

More information

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013 Lecture 24: Image Retrieval: Part II Visual Computing Systems Review: K-D tree Spatial partitioning hierarchy K = dimensionality of space (below: K = 2) 3 2 1 3 3 4 2 Counts of points in leaf nodes Nearest

More information

Large-scale visual recognition Efficient matching

Large-scale visual recognition Efficient matching Large-scale visual recognition Efficient matching Florent Perronnin, XRCE Hervé Jégou, INRIA CVPR tutorial June 16, 2012 Outline!! Preliminary!! Locality Sensitive Hashing: the two modes!! Hashing!! Embedding!!

More information

Geometric data structures:

Geometric data structures: Geometric data structures: Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Sham Kakade 2017 1 Announcements: HW3 posted Today: Review: LSH for Euclidean distance Other

More information

Searching in one billion vectors: re-rank with source coding

Searching in one billion vectors: re-rank with source coding Searching in one billion vectors: re-rank with source coding Hervé Jégou INRIA / IRISA Romain Tavenard Univ. Rennes / IRISA Laurent Amsaleg CNRS / IRISA Matthijs Douze INRIA / LJK ICASSP May 2011 LARGE

More information

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague Praha & EU: Investujeme do vaší budoucnosti Evropský sociální fond course: Searching the Web and Multimedia Databases (BI-VWM) Tomáš Skopal, 2011 SS2010/11 doc. RNDr. Tomáš Skopal, Ph.D. Department of

More information

Crawling, Indexing, and Similarity Searching Images on the Web

Crawling, Indexing, and Similarity Searching Images on the Web Crawling, Indexing, and Similarity Searching Images on the Web (Extended Abstract) M. Batko 1 and F. Falchi 2 and C. Lucchese 2 and D. Novak 1 and R. Perego 2 and F. Rabitti 2 and J. Sedmidubsky 1 and

More information

Distributed Multi-modal Similarity Retrieval

Distributed Multi-modal Similarity Retrieval Distributed Multi-modal Similarity Retrieval David Novak Seminar of DISA Lab, October 14, 2014 David Novak Multi-modal Similarity Retrieval DISA Seminar 1 / 17 Outline of the Talk 1 Motivation Similarity

More information

If you wish to cite this paper, please use the following reference:

If you wish to cite this paper, please use the following reference: This is an accepted version of a paper published in Proceedings of the 4th ACM International Conference on SImilarity Search and Applications (SISAP 11). If you wish to cite this paper, please use the

More information

Branch and Bound. Algorithms for Nearest Neighbor Search: Lecture 1. Yury Lifshits

Branch and Bound. Algorithms for Nearest Neighbor Search: Lecture 1. Yury Lifshits Branch and Bound Algorithms for Nearest Neighbor Search: Lecture 1 Yury Lifshits http://yury.name Steklov Institute of Mathematics at St.Petersburg California Institute of Technology 1 / 36 Outline 1 Welcome

More information

Algorithms for Nearest Neighbors

Algorithms for Nearest Neighbors Algorithms for Nearest Neighbors Classic Ideas, New Ideas Yury Lifshits Steklov Institute of Mathematics at St.Petersburg http://logic.pdmi.ras.ru/~yura University of Toronto, July 2007 1 / 39 Outline

More information

CLSH: Cluster-based Locality-Sensitive Hashing

CLSH: Cluster-based Locality-Sensitive Hashing CLSH: Cluster-based Locality-Sensitive Hashing Xiangyang Xu Tongwei Ren Gangshan Wu Multimedia Computing Group, State Key Laboratory for Novel Software Technology, Nanjing University xiangyang.xu@smail.nju.edu.cn

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part V Lecture 13, March 10, 2014 Mohammad Hammoud Today Welcome Back from Spring Break! Today Last Session: DBMS Internals- Part IV Tree-based (i.e., B+

More information

MUFIN Basics. MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic SEMWEB 1

MUFIN Basics. MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic SEMWEB 1 MUFIN Basics MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic mufin@fi.muni.cz SEMWEB 1 Search problem SEARCH index structure data & queries infrastructure SEMWEB 2 The thesis

More information

Outline. Other Use of Triangle Inequality Algorithms for Nearest Neighbor Search: Lecture 2. Orchard s Algorithm. Chapter VI

Outline. Other Use of Triangle Inequality Algorithms for Nearest Neighbor Search: Lecture 2. Orchard s Algorithm. Chapter VI Other Use of Triangle Ineuality Algorithms for Nearest Neighbor Search: Lecture 2 Yury Lifshits http://yury.name Steklov Institute of Mathematics at St.Petersburg California Institute of Technology Outline

More information

Algorithms for Nearest Neighbors

Algorithms for Nearest Neighbors Algorithms for Nearest Neighbors State-of-the-Art Yury Lifshits Steklov Institute of Mathematics at St.Petersburg Yandex Tech Seminar, April 2007 1 / 28 Outline 1 Problem Statement Applications Data Models

More information

Database Applications (15-415)

Database Applications (15-415) Database Applications (15-415) DBMS Internals- Part V Lecture 15, March 15, 2015 Mohammad Hammoud Today Last Session: DBMS Internals- Part IV Tree-based (i.e., B+ Tree) and Hash-based (i.e., Extendible

More information

Succinct Nearest Neighbor Search

Succinct Nearest Neighbor Search Succinct Nearest Neighbor Search Eric Sadit Tellez a, Edgar Chavez b, Gonzalo Navarro c,1 a División de Estudios de Posgrado, Facultad de Ingeniería Eléctrica, Universidad Michoacana de San Nicolas de

More information

Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search

Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search Dror Aiger, Efi Kokiopoulou, Ehud Rivlin Google Inc. aigerd@google.com, kokiopou@google.com, ehud@google.com Abstract

More information

Full-Text Search on Data with Access Control

Full-Text Search on Data with Access Control Full-Text Search on Data with Access Control Ahmad Zaky School of Electrical Engineering and Informatics Institut Teknologi Bandung Bandung, Indonesia 13512076@std.stei.itb.ac.id Rinaldi Munir, S.T., M.T.

More information

VK Multimedia Information Systems

VK Multimedia Information Systems VK Multimedia Information Systems Mathias Lux, mlux@itec.uni-klu.ac.at Dienstags, 16.oo Uhr c.t., E.1.42 This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Indexing

More information

Proximity Searching in High Dimensional Spaces with a Proximity Preserving Order

Proximity Searching in High Dimensional Spaces with a Proximity Preserving Order Proximity Searching in High Dimensional Spaces with a Proximity Preserving Order E. Chávez 1, K. Figueroa 1,2, and G. Navarro 2 1 Facultad de Ciencias Físico-Matemáticas, Universidad Michoacana, México.

More information

On the Least Cost For Proximity Searching in Metric Spaces

On the Least Cost For Proximity Searching in Metric Spaces On the Least Cost For Proximity Searching in Metric Spaces Karina Figueroa 1,2, Edgar Chávez 1, Gonzalo Navarro 2, and Rodrigo Paredes 2 1 Universidad Michoacana, México. {karina,elchavez}@fismat.umich.mx

More information

Permutation Search Methods are Efficient, Yet Faster Search is Possible

Permutation Search Methods are Efficient, Yet Faster Search is Possible Permutation Search Methods are Efficient, Yet Faster Search is Possible Bilegsaikhan Naidan Norwegian University of Science and Technology Trondheim, Norway bileg@idi.ntnu.no Leonid Boytsov Carnegie Mellon

More information

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan

THE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan THE B+ TREE INDEX CS 564- Spring 2018 ACKs: Jignesh Patel, AnHai Doan WHAT IS THIS LECTURE ABOUT? The B+ tree index Basics Search/Insertion/Deletion Design & Cost 2 INDEX RECAP We have the following query:

More information

Secure Metric-Based Index for Similarity Cloud

Secure Metric-Based Index for Similarity Cloud Secure Metric-Based Index for Similarity Cloud Stepan Kozak, David Novak, and Pavel Zezula Masaryk University, Brno, Czech Republic {xkozak1,david.novak,zezula}@fi.muni.cz Abstract. We propose a similarity

More information

Kathleen Durant PhD Northeastern University CS Indexes

Kathleen Durant PhD Northeastern University CS Indexes Kathleen Durant PhD Northeastern University CS 3200 Indexes Outline for the day Index definition Types of indexes B+ trees ISAM Hash index Choosing indexed fields Indexes in InnoDB 2 Indexes A typical

More information

Task Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval

Task Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document

More information

Locality- Sensitive Hashing Random Projections for NN Search

Locality- Sensitive Hashing Random Projections for NN Search Case Study 2: Document Retrieval Locality- Sensitive Hashing Random Projections for NN Search Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 18, 2017 Sham Kakade

More information

Clustering Billions of Images with Large Scale Nearest Neighbor Search

Clustering Billions of Images with Large Scale Nearest Neighbor Search Clustering Billions of Images with Large Scale Nearest Neighbor Search Ting Liu, Charles Rosenberg, Henry A. Rowley IEEE Workshop on Applications of Computer Vision February 2007 Presented by Dafna Bitton

More information

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19

Treaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19 CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types

More information

1 o Semestre 2007/2008

1 o Semestre 2007/2008 Efficient Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 6 7 Outline 1 2 3 4 5 6 7 Text es An index is a mechanism to locate a given term in

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 9 2. Information Retrieval:

More information

IEEE Infocom 2005 Demonstration ITEA TBONES Project

IEEE Infocom 2005 Demonstration ITEA TBONES Project IEEE Infocom 2005 Demonstration ITEA TBONES Project A GMPLS Unified Control Plane for Multi-Area Networks D.Papadimitriou (Alcatel), B.Berde (Alcatel), K.Casier (IMEC), and R.Theillaud (Atos Origin) ITEA

More information

Predictive Indexing for Fast Search

Predictive Indexing for Fast Search Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive

More information

Indexing and Searching

Indexing and Searching Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:

More information

over Multi Label Images

over Multi Label Images IBM Research Compact Hashing for Mixed Image Keyword Query over Multi Label Images Xianglong Liu 1, Yadong Mu 2, Bo Lang 1 and Shih Fu Chang 2 1 Beihang University, Beijing, China 2 Columbia University,

More information

CRE investment weakens in Q as investors struggle to find product in prime markets

CRE investment weakens in Q as investors struggle to find product in prime markets 2006.3 2007.1 2007.3 2008.1 2008.3 2009.1 2009.3 2010.1 2010.3 2011.1 2011.3 2012.1 2012.3 2013.1 2013.3 2014.1 2014.3 2015.1 2015.3 2016.1 Europe Capital Markets, Q1 2016 CRE investment weakens in Q1

More information

Datenbanksysteme II: Multidimensional Index Structures 2. Ulf Leser

Datenbanksysteme II: Multidimensional Index Structures 2. Ulf Leser Datenbanksysteme II: Multidimensional Index Structures 2 Ulf Leser Content of this Lecture Introduction Partitioned Hashing Grid Files kdb Trees kd Tree kdb Tree R Trees Example: Nearest neighbor image

More information

Adaptive Binary Quantization for Fast Nearest Neighbor Search

Adaptive Binary Quantization for Fast Nearest Neighbor Search IBM Research Adaptive Binary Quantization for Fast Nearest Neighbor Search Zhujin Li 1, Xianglong Liu 1*, Junjie Wu 1, and Hao Su 2 1 Beihang University, Beijing, China 2 Stanford University, Stanford,

More information

A GPGPU Algorithm for c-approximate r-nearest Neighbor Search in High Dimensions

A GPGPU Algorithm for c-approximate r-nearest Neighbor Search in High Dimensions A GPGPU Algorithm for c-approximate r-nearest Neighbor Search in High Dimensions Lee A Carraher, Philip A Wilsey, and Fred S Annexstein School of Electronic and Computing Systems University of Cincinnati

More information

Large scale object/scene recognition

Large scale object/scene recognition Large scale object/scene recognition Image dataset: > 1 million images query Image search system ranked image list Each image described by approximately 2000 descriptors 2 10 9 descriptors to index! Database

More information

Online Document Clustering Using the GPU

Online Document Clustering Using the GPU Online Document Clustering Using the GPU Benjamin E. Teitler, Jagan Sankaranarayanan, Hanan Samet Center for Automation Research Institute for Advanced Computer Studies Department of Computer Science University

More information

Nearest Neighbor Search by Branch and Bound

Nearest Neighbor Search by Branch and Bound Nearest Neighbor Search by Branch and Bound Algorithmic Problems Around the Web #2 Yury Lifshits http://yury.name CalTech, Fall 07, CS101.2, http://yury.name/algoweb.html 1 / 30 Outline 1 Short Intro to

More information

Trees. Reading: Weiss, Chapter 4. Cpt S 223, Fall 2007 Copyright: Washington State University

Trees. Reading: Weiss, Chapter 4. Cpt S 223, Fall 2007 Copyright: Washington State University Trees Reading: Weiss, Chapter 4 1 Generic Rooted Trees 2 Terms Node, Edge Internal node Root Leaf Child Sibling Descendant Ancestor 3 Tree Representations n-ary trees Each internal node can have at most

More information

Course Content. Objectives of Lecture? CMPUT 391: Spatial Data Management Dr. Jörg Sander & Dr. Osmar R. Zaïane. University of Alberta

Course Content. Objectives of Lecture? CMPUT 391: Spatial Data Management Dr. Jörg Sander & Dr. Osmar R. Zaïane. University of Alberta Database Management Systems Winter 2002 CMPUT 39: Spatial Data Management Dr. Jörg Sander & Dr. Osmar. Zaïane University of Alberta Chapter 26 of Textbook Course Content Introduction Database Design Theory

More information

CSE 530A. B+ Trees. Washington University Fall 2013

CSE 530A. B+ Trees. Washington University Fall 2013 CSE 530A B+ Trees Washington University Fall 2013 B Trees A B tree is an ordered (non-binary) tree where the internal nodes can have a varying number of child nodes (within some range) B Trees When a key

More information

Pivoting M-tree: A Metric Access Method for Efficient Similarity Search

Pivoting M-tree: A Metric Access Method for Efficient Similarity Search Pivoting M-tree: A Metric Access Method for Efficient Similarity Search Tomáš Skopal Department of Computer Science, VŠB Technical University of Ostrava, tř. 17. listopadu 15, Ostrava, Czech Republic tomas.skopal@vsb.cz

More information

Multidimensional Indexing The R Tree

Multidimensional Indexing The R Tree Multidimensional Indexing The R Tree Module 7, Lecture 1 Database Management Systems, R. Ramakrishnan 1 Single-Dimensional Indexes B+ trees are fundamentally single-dimensional indexes. When we create

More information

Spatial Data Management

Spatial Data Management Spatial Data Management [R&G] Chapter 28 CS432 1 Types of Spatial Data Point Data Points in a multidimensional space E.g., Raster data such as satellite imagery, where each pixel stores a measured value

More information

Principles of Data Management. Lecture #14 (Spatial Data Management)

Principles of Data Management. Lecture #14 (Spatial Data Management) Principles of Data Management Lecture #14 (Spatial Data Management) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s Notable News v Project

More information

Image Analysis & Retrieval. CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W Lec 18.

Image Analysis & Retrieval. CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W Lec 18. Image Analysis & Retrieval CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W 4-5:15pm@Bloch 0012 Lec 18 Image Hashing Zhu Li Dept of CSEE, UMKC Office: FH560E, Email: lizhu@umkc.edu, Ph:

More information

Quantifying the Specificity of Near-Duplicate Image Classification Functions

Quantifying the Specificity of Near-Duplicate Image Classification Functions Quantifying the Specificity of Near-Duplicate Image Classification Functions Richard Connor 1 and Franco Alberto Cardillo 2 1 Department of Computer and Information Sciences, University of Strathclyde,

More information

Spatial Data Management

Spatial Data Management Spatial Data Management Chapter 28 Database management Systems, 3ed, R. Ramakrishnan and J. Gehrke 1 Types of Spatial Data Point Data Points in a multidimensional space E.g., Raster data such as satellite

More information

Locality-Sensitive Hashing

Locality-Sensitive Hashing Locality-Sensitive Hashing & Image Similarity Search Andrew Wylie Overview; LSH given a query q (or not), how do we find similar items from a large search set quickly? Can t do all pairwise comparisons;

More information

External Memory Algorithms and Data Structures Fall Project 3 A GIS system

External Memory Algorithms and Data Structures Fall Project 3 A GIS system External Memory Algorithms and Data Structures Fall 2003 1 Project 3 A GIS system GSB/RF November 17, 2003 1 Introduction The goal of this project is to implement a rudimentary Geographical Information

More information

Combining Local and Global Visual Feature Similarity using a Text Search Engine

Combining Local and Global Visual Feature Similarity using a Text Search Engine Combining Local and Global Visual Feature Similarity using a Text Search Engine Giuseppe Amato, Paolo Bolettieri, Fabrizio Falchi, Claudio Gennaro, Fausto Rabitti {giuseppe.amato, fabrizio.falchi, paolo.bolettieri,

More information

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions

More information

A Metric Cache for Similarity Search

A Metric Cache for Similarity Search A Metric Cache for Similarity Search Fabrizio Falchi ISTI-CNR Pisa, Italy fabrizio.falchi@isti.cnr.it Raffaele Perego ISTI-CNR Pisa, Italy raffaele.perego@isti.cnr.it Claudio Lucchese ISTI-CNR Pisa, Italy

More information

CSIT5300: Advanced Database Systems

CSIT5300: Advanced Database Systems CSIT5300: Advanced Database Systems L08: B + -trees and Dynamic Hashing Dr. Kenneth LEUNG Department of Computer Science and Engineering The Hong Kong University of Science and Technology Hong Kong SAR,

More information

Panel. Progress in Description and Retrieval of Complex Multimedia Information

Panel. Progress in Description and Retrieval of Complex Multimedia Information Panel Progress in Description and Retrieval of Complex Multimedia Information The Fifth International Conferences on Advances in Multimedia The Fourth International Conference on Models and Ontology-based

More information

Engineering Efficient Similarity Search Methods

Engineering Efficient Similarity Search Methods Doctoral thesis Bilegsaikhan Naidan Doctoral theses at NTNU, 2017:84 NTNU Norwegian University of Science and Technology Thesis for the Degree of Philosophiae Doctor Faculty of Information Technology and

More information

Compressed local descriptors for fast image and video search in large databases

Compressed local descriptors for fast image and video search in large databases Compressed local descriptors for fast image and video search in large databases Matthijs Douze2 joint work with Hervé Jégou1, Cordelia Schmid2 and Patrick Pérez3 1: INRIA Rennes, TEXMEX team, France 2:

More information

Physical Level of Databases: B+-Trees

Physical Level of Databases: B+-Trees Physical Level of Databases: B+-Trees Adnan YAZICI Computer Engineering Department METU (Fall 2005) 1 B + -Tree Index Files l Disadvantage of indexed-sequential files: performance degrades as file grows,

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016

Machine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016 Machine Learning 10-701, Fall 2016 Nonparametric methods for Classification Eric Xing Lecture 2, September 12, 2016 Reading: 1 Classification Representing data: Hypothesis (classifier) 2 Clustering 3 Supervised

More information

SISTEMI PER LA RICERCA

SISTEMI PER LA RICERCA SISTEMI PER LA RICERCA DI DATI MULTIMEDIALI Claudio Lucchese e Salvatore Orlando Terzo Workshop di Dipartimento Dipartimento di Informatica Università degli studi di Venezia Ringrazio le seguenti persone,

More information

Improved nearest neighbor search using auxiliary information and priority functions

Improved nearest neighbor search using auxiliary information and priority functions Improved nearest neighbor search using auxiliary information and priority functions Omid Keivani 1 Kaushik Sinha 1 Abstract Nearest neighbor search using random projection trees has recently been shown

More information

Large Scale Nearest Neighbor Search Theories, Algorithms, and Applications. Junfeng He

Large Scale Nearest Neighbor Search Theories, Algorithms, and Applications. Junfeng He Large Scale Nearest Neighbor Search Theories, Algorithms, and Applications Junfeng He Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School

More information

Nearest neighbors. Focus on tree-based methods. Clément Jamin, GUDHI project, Inria March 2017

Nearest neighbors. Focus on tree-based methods. Clément Jamin, GUDHI project, Inria March 2017 Nearest neighbors Focus on tree-based methods Clément Jamin, GUDHI project, Inria March 2017 Introduction Exact and approximate nearest neighbor search Essential tool for many applications Huge bibliography

More information

Algorithms and Data Structures

Algorithms and Data Structures Algorithms and Data Structures Priority Queues Ulf Leser Special Scenarios for Searching Up to now, we assumed that all elements of a list are equally important and that any of them could be searched next

More information

Data Mining and Machine Learning: Techniques and Algorithms

Data Mining and Machine Learning: Techniques and Algorithms Instance based classification Data Mining and Machine Learning: Techniques and Algorithms Eneldo Loza Mencía eneldo@ke.tu-darmstadt.de Knowledge Engineering Group, TU Darmstadt International Week 2019,

More information

DL-Media: An Ontology Mediated Multimedia Information Retrieval System

DL-Media: An Ontology Mediated Multimedia Information Retrieval System DL-Media: An Ontology Mediated Multimedia Information Retrieval System ISTI-CNR, Pisa, Italy straccia@isti.cnr.it What is DLMedia? Multimedia Information Retrieval (MIR) Retrieval of those multimedia objects

More information

Query-Driven Document Partitioning and Collection Selection

Query-Driven Document Partitioning and Collection Selection Query-Driven Document Partitioning and Collection Selection Diego Puppin, Fabrizio Silvestri and Domenico Laforenza Institute for Information Science and Technologies ISTI-CNR Pisa, Italy May 31, 2006

More information

Spatial Queries. Nearest Neighbor Queries

Spatial Queries. Nearest Neighbor Queries Spatial Queries Nearest Neighbor Queries Spatial Queries Given a collection of geometric objects (points, lines, polygons,...) organize them on disk, to answer efficiently point queries range queries k-nn

More information

Data Structure. A way to store and organize data in order to support efficient insertions, queries, searches, updates, and deletions.

Data Structure. A way to store and organize data in order to support efficient insertions, queries, searches, updates, and deletions. DATA STRUCTURES COMP 321 McGill University These slides are mainly compiled from the following resources. - Professor Jaehyun Park slides CS 97SI - Top-coder tutorials. - Programming Challenges book. Data

More information

High Dimensional Indexing by Clustering

High Dimensional Indexing by Clustering Yufei Tao ITEE University of Queensland Recall that, our discussion so far has assumed that the dimensionality d is moderately high, such that it can be regarded as a constant. This means that d should

More information

Clustered Pivot Tables for I/O-optimized Similarity Search

Clustered Pivot Tables for I/O-optimized Similarity Search Clustered Pivot Tables for I/O-optimized Similarity Search Juraj Moško Charles University in Prague, Faculty of Mathematics and Physics, SIRET research group mosko.juro@centrum.sk Jakub Lokoč Charles University

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree

More information

Approximate Similarity Search in Metric Spaces using Inverted Files

Approximate Similarity Search in Metric Spaces using Inverted Files Approximate Similarity Search in Metric Spaces using Inverted Files Giuseppe Amato ISTI-CNR Via G. Moruzzi, 56, Pisa, Italy giuseppe.amato@isti.cnr.it ABSTRACT We propose a new approach to perform approximate

More information

The B-Tree. Yufei Tao. ITEE University of Queensland. INFS4205/7205, Uni of Queensland

The B-Tree. Yufei Tao. ITEE University of Queensland. INFS4205/7205, Uni of Queensland Yufei Tao ITEE University of Queensland Before ascending into d-dimensional space R d with d > 1, this lecture will focus on one-dimensional space, i.e., d = 1. We will review the B-tree, which is a fundamental

More information

Chapter 17 Indexing Structures for Files and Physical Database Design

Chapter 17 Indexing Structures for Files and Physical Database Design Chapter 17 Indexing Structures for Files and Physical Database Design We assume that a file already exists with some primary organization unordered, ordered or hash. The index provides alternate ways to

More information

Indexing and Hashing

Indexing and Hashing C H A P T E R 1 Indexing and Hashing This chapter covers indexing techniques ranging from the most basic one to highly specialized ones. Due to the extensive use of indices in database systems, this chapter

More information

EXTERNAL SORTING. Sorting

EXTERNAL SORTING. Sorting EXTERNAL SORTING 1 Sorting A classic problem in computer science! Data requested in sorted order (sorted output) e.g., find students in increasing grade point average (gpa) order SELECT A, B, C FROM R

More information

Hash-Based Indexing 165

Hash-Based Indexing 165 Hash-Based Indexing 165 h 1 h 0 h 1 h 0 Next = 0 000 00 64 32 8 16 000 00 64 32 8 16 A 001 01 9 25 41 73 001 01 9 25 41 73 B 010 10 10 18 34 66 010 10 10 18 34 66 C Next = 3 011 11 11 19 D 011 11 11 19

More information

Spatial Data Structures

Spatial Data Structures CSCI 420 Computer Graphics Lecture 17 Spatial Data Structures Jernej Barbic University of Southern California Hierarchical Bounding Volumes Regular Grids Octrees BSP Trees [Angel Ch. 8] 1 Ray Tracing Acceleration

More information

Real-time Collaborative Filtering Recommender Systems

Real-time Collaborative Filtering Recommender Systems Real-time Collaborative Filtering Recommender Systems Huizhi Liang 1,2 Haoran Du 2 Qing Wang 2 1 Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia Email:

More information

IMAGE MATCHING - ALOK TALEKAR - SAIRAM SUNDARESAN 11/23/2010 1

IMAGE MATCHING - ALOK TALEKAR - SAIRAM SUNDARESAN 11/23/2010 1 IMAGE MATCHING - ALOK TALEKAR - SAIRAM SUNDARESAN 11/23/2010 1 : Presentation structure : 1. Brief overview of talk 2. What does Object Recognition involve? 3. The Recognition Problem 4. Mathematical background:

More information

Multidimensional Indexes [14]

Multidimensional Indexes [14] CMSC 661, Principles of Database Systems Multidimensional Indexes [14] Dr. Kalpakis http://www.csee.umbc.edu/~kalpakis/courses/661 Motivation Examined indexes when search keys are in 1-D space Many interesting

More information

The Boundary Graph Supervised Learning Algorithm for Regression and Classification

The Boundary Graph Supervised Learning Algorithm for Regression and Classification The Boundary Graph Supervised Learning Algorithm for Regression and Classification! Jonathan Yedidia! Disney Research!! Outline Motivation Illustration using a toy classification problem Some simple refinements

More information

DSH: Data Sensitive Hashing for High-Dimensional k-nn Search

DSH: Data Sensitive Hashing for High-Dimensional k-nn Search DSH: Data Sensitive Hashing for High-Dimensional k-nn Search Jinyang Gao, H. V. Jagadish, Wei Lu, Beng Chin Ooi School of Computing, National University of Singapore, Singapore Department of Electrical

More information

Problem 1: Complexity of Update Rules for Logistic Regression

Problem 1: Complexity of Update Rules for Logistic Regression Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1

More information

Spatial Data Structures

Spatial Data Structures CSCI 480 Computer Graphics Lecture 7 Spatial Data Structures Hierarchical Bounding Volumes Regular Grids BSP Trees [Ch. 0.] March 8, 0 Jernej Barbic University of Southern California http://www-bcf.usc.edu/~jbarbic/cs480-s/

More information

Introduction to Indexing R-trees. Hong Kong University of Science and Technology

Introduction to Indexing R-trees. Hong Kong University of Science and Technology Introduction to Indexing R-trees Dimitris Papadias Hong Kong University of Science and Technology 1 Introduction to Indexing 1. Assume that you work in a government office, and you maintain the records

More information

SparkBOOST, an Apache Spark-based boosting library

SparkBOOST, an Apache Spark-based boosting library SparkBOOST, an Apache Spark-based boosting library Tiziano Fagni (tiziano.fagni@isti.cnr.it) Andrea Esuli (andrea.esuli@isti.cnr.it) Istituto di Scienze e Tecnologie dell Informazione (ISTI) Italian National

More information

Using Natural Clusters Information to Build Fuzzy Indexing Structure

Using Natural Clusters Information to Build Fuzzy Indexing Structure Using Natural Clusters Information to Build Fuzzy Indexing Structure H.Y. Yue, I. King and K.S. Leung Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, New Territories,

More information

Instances on a Budget

Instances on a Budget Retrieving Similar or Informative Instances on a Budget Kristen Grauman Dept. of Computer Science University of Texas at Austin Work with Sudheendra Vijayanarasimham Work with Sudheendra Vijayanarasimham,

More information

Unsupervised Learning Partitioning Methods

Unsupervised Learning Partitioning Methods Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form

More information

University of Florida CISE department Gator Engineering. Clustering Part 4

University of Florida CISE department Gator Engineering. Clustering Part 4 Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of

More information