PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search

Size: px

Start display at page:

Download "PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search"

Branden Martin
6 years ago
Views:

1 PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search Andrea Esuli Istituto di Scienza e Tecnologie dell Informazione A. Faedo Consiglio Nazionale delle Ricerche Via Giuseppe Moruzzi, Pisa, Italy ISTI:Science seminar, May 12, 2009 Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 1 / 48

2 Outline 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 2 / 48

3 Outline Introduction 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 3 / 48

4 Outline Introduction Similarity search 1 Introduction Similarity search Permutation based methods Local similarity hashing methods 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 4 / 48

5 Introduction Similarity search Similarity search The similarity search model involves: A collection of objects D, belonging to a domain O; a query object q O; a distance function d : O O R +. The goal is to sort the objects in D by their distance with respect to q, returning the objects that are closer to q, which are considered to be the most similar. Typically only the k-top ranked objects are returned (k-nn query), or those within a maximum distance value r (range query). The determination of a meaningful r value is often a non-easy task. k-nn queries are usually preferred, specially in end-user applications, also for the direct control on the result set size. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 5 / 48

6 Similarity search Introduction Similarity search Example (R 2, L 2 ): o 2 o 2 o 1 o 1 o 3 o 3 o 10 o 10 o 0 r o 5 q o 4 o 8 o 9 o 0 o 5 q o 4 o o 11 7 o 8 o 9 o o 11 7 o 6 o 6 o 12 o 12 Figure 1: Range query. Figure 2: k-nn query (k = 5). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 6 / 48

7 Introduction Approximate similarity search Similarity search Exhaustive search: for all o i D compute the distance d(q, o i ), while keeping track of which objects satisfy the query. It does not scale to large collections. Exact methods: equivalent to exhaustive search, but using data structures that leverage on the properties of the observed similarity space (e.g., vectorial spaces, metric spaces) in order to reduce the number of objects of D to be compared with the query. Usually efficient but still not enough for huge collections. Approximate methods: accepting that the results could contain errors (e.g., d(q, o 1 ) < d(q, o 2 ), o 2 is in the results and o 1 is not), gaining efficiency. Approximation is acceptable, e.g., when d is an approximation of a complex, human-perceived concept of similarity. It (obviously) scales! Typically derived from relaxed exact methods. Natively approximated proposals, e.g.: local similarity hashing (LSH) index and permutation-based index (the PP-Index takes inspiration from both). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 7 / 48

8 Introduction Approximate similarity search Similarity search Approximation quality: What have we missed? What have we included? How much have we saved? o 2 o 1 o 3 o 10 o 0 o 5 q o 4 o 8 o 9 o o 11 7 o 6 o 12 Figure 3: Approximate result for a k-nn query (k = 5). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 8 / 48

9 Outline Introduction Permutation based methods 1 Introduction Similarity search Permutation based methods Local similarity hashing methods 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 9 / 48

10 Introduction Permutation based methods Permutation based methods Independently proposed by Amato and Savino 1 and Chavez et al. 2, using different data structures. The idea: an object is represented by its view of the surrounding world. Intuively, if two objects see the elements of a set of reference objects R in the same order of (increasing) distance, they are likely to be close one to the other. Example Where am I likely to live if I see the main European cities in the following order? Rome, Milan, Bern, Marseilles, Munich, Luxembourg, Bonn, Vienna, Belgrade, Brussels, Barcelona, Paris, Berlin, Amsterdam, London, Copenhagen, Madrid, Istanbul, Dublin, Athens, Oslo, Stockholm, Lisbon, Helsinki. 1 G. Amato and P. Savino, Approximate similarity search in metric spaces using inverted files, INFOSCALE 2008, pages E. Chavez, K. Figueroa, and G. Navarro, Effective proximity retrieval by ordering permutations, IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9): , Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 10 / 48

11 Introduction Permutation based methods Permutation based methods The method: A set of reference objects R = {r 0,..., r R 1 } O is defined (e.g., by randomly selecting R objects from D). Every object o i D is then represented by a permutation Π oi of 0,..., R 1, i.e., the list of the identifiers of reference objects, so that the identifiers are sorted by the distance of their relative reference objects with respect to o i. The search process mainly consists in computing Π q and estimating the true distance d(q, o i ) using a permutation-based distance d (Π q, Π oi ), e.g., the Spearman s footrule distance. Amato and Savino have shown that using only the prefix Π l o i of the permutation Π oi (e.g., l = 100 when R = 500) improves both efficiency and effectiveness. The PP-Index adopts a permutation-based data representation model, using very short prefixes (e.g., l = 6 when R = 1000). Differently from previous approaches, the permutation prefixes are used just to quickly find a small set of candidate objects from D for inclusion into results, not to estimate their relative order. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 11 / 48

12 Introduction Permutation based methods Permutation based methods r 0 r 0 <0,1,3> <1,3,0> <1,3,4> <0,2,3,1,5,4> <1,3,2,0,4,5> r 1 <1,4,3,2,0,5> <0,2,3> <1,3,2> r 1 <1,4,3> <5,2,3,0,1,4> r2 r 3 <4,1,3,2,5,0> r 4 <2,3,0> <3,2,1> <2,0,3> <2,5,3> <2,3,5> r 2 r 3 <3,2,5> <4,1,3> r 4 r 5 <4,3,1,2,5,0> r 5 <5,2,3> <4,3,1> <4,3,5> <5,2,3,4,1,0> Figure 4: Regions of the 2-dimensional space identified by 6 randomly selected reference points, using the Euclidean distance, and full-lenght permutations (left) or permutation prefixes of lenght 3. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 12 / 48

13 Outline Introduction Local similarity hashing methods 1 Introduction Similarity search Permutation based methods Local similarity hashing methods 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 13 / 48

14 Introduction Local similarity hashing methods Local similarity hashing methods A family H of hash functions f : O U is called (r, ɛ, p 1, p 2 )-sensitive, with r, ɛ > 0, p 1 > p 2 > 0, if for any p, q O: if d(p, q) r then P[h(p) = h(q)] p 1 if d(p, q) > r(1 + ɛ) then P[h(p) = h(q)] p 2 for any function h randomly selected from H. Intuitively: two objects have a (high) probability x 1 p 1 to collide if they are closer than r, and a (low) probability x 2 p 2 if they are more distant than r(1 + ɛ). LSH-Index 3 : j randomly chosen functions h i H define a hash function g(x) = (h 1 (x)h 2 (x)... h j (x)), i.e. bad collision probability is significantly lowered to p j 2. t different hash tables are built, based on randomly generated g 1... g t functions, in order to increase good collision probability. 3 P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, STOC 1998, pages Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 14 / 48

15 Introduction Local similarity hashing methods Local similarity hashing methods It is hard to tune LSH-Index (length of hash keys) in order to obtain good efficacy, due to the dependence between data distribution and hash length. LSH-Forest 4 : Use of variable length hash keys. Long hash key are indexed in a prefix tree (LSH-Tree). At search time the key length is varied in order to retrieve a given number of candidate objects. Candidate objects are retrieved sequentially from a data storage on disk. Multiple LSH-Tree, i.e., a forest, are used to improve effectiveness. The PP-Index uses similar data structure. 4 M. Bawa, T. Condie, and P. Ganesan, LSH-Forest: self-tuning indexes for similarity search, WWW 2005, pages Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 15 / 48

16 Outline The PP-Index 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 16 / 48

17 Outline The PP-Index Data structures 1 Introduction 2 The PP-Index Data structures Algorithms 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 17 / 48

18 The PP-Index Data structures PP-Index: data structures The PP-Index represents each indexed object with a permutation prefix of length l. Data structures: a prefix tree kept in main memory, indexing the permutation prefixes, and a data storage kept on disk, storing the information required to compute real distances between objects in D and any object in O. The prefix tree is used in order to rapidly identify a set of at least z candidates (z k), leaving to the original distance function the task of determining the final k-nn result from such set of candidates. Candidates are retrieved from the data storage with a few sequential disk accesses. The PP-Index adopts a bulk data processing model, similar to the one used for text-based inverted list indexes (assumption on the static nature of data). It is easy to provide update capabilities (i.e., insert, delete, modify). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 18 / 48

19 Outline The PP-Index Algorithms 1 Introduction 2 The PP-Index Data structures Algorithms 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 19 / 48

20 The PP-Index PP-Index: building the index Algorithms BuildIndex(D, d, R, l) 1 prefixt ree EmptyPrefixTree() 2 datastorage EmptyDataStorage() 3 for i 0 to D 1 4 do o i GetObject(D, i) 5 datablock oi GetDataBlock(o i) 6 p oi Append(dataBlock oi, datastorage) 7 w oi ComputePrefix(o i, R, d, l) 8 h oi i 9 Insert(w oi, h oi, p oi, prefixt ree) 10 L ListPointersByOrderedVisit(prefixT ree) 11 P CreateInvertedList(L) 12 ReorderStorage(dataStorage, P ) 13 CorrectLeafValues(prefixT ree, datastorage) 14 index NewIndex(d, R, l, prefixt ree, datastorage) 15 return index Figure 5: The BuildIndex function. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 20 / 48

21 The PP-Index Algorithms PP-Index: building the index Input: dataset D, distance function d, reference objects R, prefix length l. Indexing process: Main loop: permutation prefixes are inserted into the prefix tree, data blocks are appended to data storage. Data storage reordering: data blocks are sorted to reflect the order of prefixes. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 21 / 48

22 The PP-Index PP-Index: building the index Algorithms Index characteristics D =10, R =6, l=3 Permutation prefixes w o 0 =<1, 3, 2> w o 1 =<2, 3, 0> w o 2 =<5, 2, 3> w o 3 =<4, 1, 3> w o 4 =<1, 3, 2> w o 5 =<4, 1, 3> w o 6 =<1, 3, 4> w o 7 =<5, 2, 3> w =<1, 3, 2> w =<4, 3, 5> o 8 o 9 Prefix tree root h p 0 h 0 p 4 h 4 p 8 h 8 p 6 h 6 p 1 h h h p p p 9 h h p p main memory secondary memory Figure 6: Sample data. o 0... o o... 5 o... o... 4 o... o 6... o 7... o 8... o 9... Data storage Figure 7: Index data structure after the first phase of object insertion. Main loop: permutation prefixes are inserted into the prefix tree, data blocks are appended to data storage. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 22 / 48

23 The PP-Index PP-Index: building the index Algorithms Prefix tree root Prefix tree root h p 0 h 0 p 4 h 4 p 8 h 8 p 6 h 6 p 1 h h h p p p 9 h h p p h h start p start end p end p h p h h h start p start end p end p h h h start p start end p end main memory secondary memory main memory secondary memory o 0... o o... 5 o... o... 4 o... o 6... o 7... o 8... o 9... Data storage o 0... o o... 3 o... o... 1 o... o 5... o 9... o 2... o 7... Data storage Figure 8: Index data structure after the first phase of object insertion. Figure 9: Index data structure after the first phase of object insertion. Data storage reordering: data blocks are sorted to reflect the order of prefixes. The leaves of the final prefix tree point to intervals of the data storage. Efficiency alert: performed using a m-way merge sort algorithm. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 23 / 48

24 PP-Index: search function The PP-Index Algorithms FindCandidates(q, prefixt ree, R, d, l, z) 1 w q ComputePrefix(q, R, d, l) 2 for i l to 1 3 do wq i SubPrefix(wq, i) 4 node SearchPath(wq i, prefixt ree) 5 if node nil 6 then minleaf GetMin(node, prefixt ree) 7 maxleaf GetMax(node, prefixt ree) 8 if (maxleaf.h end minleaf.h start + 1) z i = 1 9 then return (minleaf.p start, maxleaf.h end ) 10 return (0, 0) Figure 10: The FindCandidates function. Given the prefix representing the query, FindCandidates searches for the smallest subtree of the prefix tree pointing to at least z data blocks (z ). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 24 / 48

25 PP-Index: search function The PP-Index Algorithms Search(q, k, z, index) 1 (p start, p end ) FindCandidates(q, index.prefixt ree, index.r, index.d, index.l, z) 2 resultsheap EmptyHeap() 3 cursor p start 4 while cursor p end 5 do datablock Read(cursor, index.datastorage) 6 AdvanceCursor(cursor) 7 distance index.d(q, datablock.data) 8 if resultsheap.size < k 9 then Insert(resultsHeap, distance, datablock.id) 10 else if distance < resultsheap.top.distance 11 then ReplaceTop(resultsHeap, distance, datablock.id) 12 Sort(resultsHeap) 13 return resultsheap Figure 11: The Search function. The z candidate data blocks are sequentially read from the data storage. A heap is used to keep track of the best k results. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 25 / 48

26 The PP-Index Algorithms PP-Index: improving the search effectiveness The basic search strategy is designed for efficiency. Effectiveness can be boosted using various search strategies: Multiple index: building n PP-Index using different R sets, R 1... R n (LSH-Forest style). Projecting different R-induced grids on the objects, helps to approximate a better (less skewed) partitioning of the space. Can be implemented using data replication (faster/more storage) or using data referencing (slower/less storage). k-nn results from the various indexes are merged together in the final one. Multiple query: generating m perturbed versions of w q in order to explore the neighborhood of w q. The perturbed w i q prefixes are generated by swapping pairs of elements of w q, first selecting those with the smaller distance difference with respect to q. All the w i q prefixes are used to find candidates on the same index. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 26 / 48

27 The PP-Index Algorithms PP-Index: prefix tree optimizations Prefix tree root Prefix tree root 1, h h start p start end p end p h p h h h start p start end p end p h h h start p start end p end h h start p start end p end p h p h h h start p start end p end p h h h start p start end p end main memory secondary memory main memory secondary memory o 0... o o... 3 o... o... 1 o... o 5... o 9... o 2... o 7... Data storage o 0... o o... 3 o... o... 1 o... o 5... o 9... o 2... o 7... Data storage Figure 12: Pruning of only-child paths to leaves. Figure 13: Only-child paths compression. Scalability alert: reducing to a single leaf any subtree pointing to less than z data blocks. Applicable when z is hardcoded into the search function. Does not affect search results quality. Lossy with respect to index update operations. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 27 / 48

28 The PP-Index Algorithms PP-Index: merging (and updating) the index Scalability alert: the index (prefix tree) reaches its maximum memory requirement at the end of the main loop of the indexing process. Could not fit into memory, when the final index will (after optimizations). Strategy: building many smaller indexes, using the same R set, then merging them together. The merge process is efficient: The source prefix trees are merged into the final prefix tree by performing a parallel ordered visit on them. Data storages are merged into the final data storage while building the final prefix tree. Can be done from-disk-to-disk, minimum memory occupation. Linear cost with respect to index size (if not done in an m-way style). Uses only sequential reads/writes. Update operations can be supported by keeping track of such operations by using a small all-in-memory index and performing periodic merge operations. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 28 / 48

29 Outline Experiments 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 29 / 48

30 Outline Experiments The CoPhIR collection 1 Introduction 2 The PP-Index 3 Experiments The CoPhIR collection Evaluation measures Results 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 30 / 48

31 Experiments The CoPhIR collection The CoPhIR collection The CoPhIR 5 consists of a crawl of 106 millions images from the Flickr photo sharing website. Textual data + five MPEG-7 visual descriptors (240 GB of XML description data). Visual similarity measure: linear combination of distance functions defined on the MPEG-7 descriptors. MPEG-7 Visual Descriptor Distance type Dimension Weight Scalable Color L Color Structure L Color Layout sum of L Edge Histogram L Homogeneous Texture L Table 1: Details on the five MPEG-7 visual descriptors used in CoPhIR, and the weights used in the linear combination. The Dim. column refer to the specific dimension for visual descriptors adopted by the CoPhIR data set. Experiments made on 1, 10, and 100 millions images, using 100 randomly selected images (excluded from indexes) Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 31 / 48

32 Outline Experiments Evaluation measures 1 Introduction 2 The PP-Index 3 Experiments The CoPhIR collection Evaluation measures Results 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 32 / 48

33 Evaluation measures Experiments Evaluation measures Effectiveness measures: Recall (ranking-based 6 ): Recall(k) = Dk q P k q k Relative Distance Error (distance-based): RDE(k) = 1 k d(q, Pq k (i)) k d(q, Dq k (i)) 1 (2) where Dq k is the list of the k closest elements of D to q, sorted by their distance with respect to q, and Pq k is the list returned by the algorithm. Efficiency measures: index time. index size (RAM, disk). i=1 number of candidates retrieved from disk (z ). average search time. 6 M. Patella and P. Ciaccia, The many facets of approximate similarity search, SISAP 2008, pages Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 33 / 48 (1)

34 Outline Experiments Results 1 Introduction 2 The PP-Index 3 Experiments The CoPhIR collection Evaluation measures Results 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 34 / 48

35 Experiments Results Results Indexing time (s) M 10M 1M Figure 14: R Indexing time w.r.t. to the size of R and the data set size. D indexing prefix tree size data l time (sec) full comp. storage 1M MB 91 kb 349 MB M MB 848 kb 3.4 GB M MB 6.5 MB 34 GB 3.5 Table 2: Indexing times (with R = 100), resulting index sizes, and average prefix tree depth l (after prefix tree compression with z = 1, 000), for the various data set sizes. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 35 / 48

36 Results Experiments Results Search time (s) M 10M 1M R Figure 15: Search time w.r.t. to the size of R and the data set size. Search performed with z = 1, 000 and k = 100 (single index, single query). R D 1M 10M 100M 100 4,075 5,817 7, ,320 5,571 7, ,803 5,065 6,853 1,000 1,091 4,748 6,644 Table 3: Average z value (z = 1, 000), i.e., average number of retrieved candidate objects for a query, with respect to the size of the reference objects set and data set size. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 36 / 48

37 Results 0.20 Experiments Results M 10M 1M 0.15 k=100 k=10 k=1 RDE(k) 0.10 RDE(k) R R Recall(k) M 10M 1M Recall(k) k=100 k=10 k= R R Figure 16: Effectiveness with respect of the size of R set, on various index sizes, using k = 100, and z = 1, 000 (single index, single query). Figure 17: Effectiveness with respect of the size of R set, on the 100M index, using z = 1, 000 (single index, single query). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 37 / 48

38 Results 0.10 Experiments Results 0.10 RDE(k) k=100 k=10 k=1 RDE(k) k=100 k=10 k= indexes queries Recall(k) k=100 k=10 k=1 Recall(k) k=100 k=10 k= indexes queries Figure 18: Effectiveness of the multiple index search strategy on the 100M index, using R = 1, 000 and z = 1, 000. Figure 19: Effectiveness of the multiple query search strategy on the 100M index, using R = 1, 000 and z = 1, 000. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 38 / 48

39 Results 0.01 Experiments Results 1e-3 RDE(k) 8e-3 6e-3 4e-3 100M 10M 1M RDE(k) 8e-4 6e-4 4e-4 100M 10M 1M 2e-3 2e-4 Recall(k) k 100M 10M 1M k Recall(k) k 100M 10M 1M k Figure 20: Effectiveness of the combined multiple Figure 21: Effectiveness of the combined multiple query and multiple index search strategies, using eight queries and eight indexes, on various data set sizes, using R = 100, and z = 1, 000. query and multiple index search strategies, using eight queries and eight indexes, on various data set sizes, using R = 1, 000, and z = 1, 000. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 39 / 48

40 Outline Demo 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 40 / 48

41 Demo Demo Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 41 / 48

42 Outline Conclusions 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 42 / 48

43 Outline Conclusions Summary 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Summary Questions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 43 / 48

44 Conclusions Summary Summary The PP-Index: is a simple but effective data structure for approximate similarity search. scales well, both at indexing time and at search time. can be kept updated with minor additional effort. has good parallelization properties. relates well with other data structures (i.e., inverted lists). There is a lot still to investigate: policies for reference points selection. studing the relations between l, R, z, k, l, and z. giving a theoretical foundation to the permutation based methods. applicability to other domains and similarity space types. policies for data partitioning Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 44 / 48

45 Outline Conclusions Questions 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Summary Questions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 45 / 48

46 Questions? Conclusions Questions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 46 / 48

47 FAQ Conclusions Questions Q: How does the PP-Index differ from the orthodox metric approach X? A: Please help yourself: No explicit requirement of metric properties. Use of a predetermined (i.e., fixed) set of reference points. Any reference point has a global influence. Different data access model. Different data update model.... Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 47 / 48

48 FAQ Conclusions Questions Q: What are the key differences between the permutation-based methods and the LSH-based methods? A: The permutation-based methods are mostly based on geometrical considerations, while LSH-based methods are mostly based on probabilistic considerations. The permutation-based methods are able to take into account how data is distributed in the similarity space (by means of R), while the LSH hash functions are derived only from the distance function. Each element of the hash key generated by an LSH hash function is independent from the others, while the order relation between the elements of a permutation is the crucial information for a permutation-based method. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 48 / 48

Pivot Selection Strategies for Permutation-Based Similarity Search

Pivot Selection Strategies for Permutation-Based Similarity Search Giuseppe Amato, Andrea Esuli, and Fabrizio Falchi Istituto di Scienza e Tecnologie dell Informazione A. Faedo, via G. Moruzzi 1, Pisa