PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search
|
|
- Branden Martin
- 6 years ago
- Views:
Transcription
1 PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search Andrea Esuli Istituto di Scienza e Tecnologie dell Informazione A. Faedo Consiglio Nazionale delle Ricerche Via Giuseppe Moruzzi, Pisa, Italy ISTI:Science seminar, May 12, 2009 Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 1 / 48
2 Outline 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 2 / 48
3 Outline Introduction 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 3 / 48
4 Outline Introduction Similarity search 1 Introduction Similarity search Permutation based methods Local similarity hashing methods 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 4 / 48
5 Introduction Similarity search Similarity search The similarity search model involves: A collection of objects D, belonging to a domain O; a query object q O; a distance function d : O O R +. The goal is to sort the objects in D by their distance with respect to q, returning the objects that are closer to q, which are considered to be the most similar. Typically only the k-top ranked objects are returned (k-nn query), or those within a maximum distance value r (range query). The determination of a meaningful r value is often a non-easy task. k-nn queries are usually preferred, specially in end-user applications, also for the direct control on the result set size. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 5 / 48
6 Similarity search Introduction Similarity search Example (R 2, L 2 ): o 2 o 2 o 1 o 1 o 3 o 3 o 10 o 10 o 0 r o 5 q o 4 o 8 o 9 o 0 o 5 q o 4 o o 11 7 o 8 o 9 o o 11 7 o 6 o 6 o 12 o 12 Figure 1: Range query. Figure 2: k-nn query (k = 5). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 6 / 48
7 Introduction Approximate similarity search Similarity search Exhaustive search: for all o i D compute the distance d(q, o i ), while keeping track of which objects satisfy the query. It does not scale to large collections. Exact methods: equivalent to exhaustive search, but using data structures that leverage on the properties of the observed similarity space (e.g., vectorial spaces, metric spaces) in order to reduce the number of objects of D to be compared with the query. Usually efficient but still not enough for huge collections. Approximate methods: accepting that the results could contain errors (e.g., d(q, o 1 ) < d(q, o 2 ), o 2 is in the results and o 1 is not), gaining efficiency. Approximation is acceptable, e.g., when d is an approximation of a complex, human-perceived concept of similarity. It (obviously) scales! Typically derived from relaxed exact methods. Natively approximated proposals, e.g.: local similarity hashing (LSH) index and permutation-based index (the PP-Index takes inspiration from both). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 7 / 48
8 Introduction Approximate similarity search Similarity search Approximation quality: What have we missed? What have we included? How much have we saved? o 2 o 1 o 3 o 10 o 0 o 5 q o 4 o 8 o 9 o o 11 7 o 6 o 12 Figure 3: Approximate result for a k-nn query (k = 5). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 8 / 48
9 Outline Introduction Permutation based methods 1 Introduction Similarity search Permutation based methods Local similarity hashing methods 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 9 / 48
10 Introduction Permutation based methods Permutation based methods Independently proposed by Amato and Savino 1 and Chavez et al. 2, using different data structures. The idea: an object is represented by its view of the surrounding world. Intuively, if two objects see the elements of a set of reference objects R in the same order of (increasing) distance, they are likely to be close one to the other. Example Where am I likely to live if I see the main European cities in the following order? Rome, Milan, Bern, Marseilles, Munich, Luxembourg, Bonn, Vienna, Belgrade, Brussels, Barcelona, Paris, Berlin, Amsterdam, London, Copenhagen, Madrid, Istanbul, Dublin, Athens, Oslo, Stockholm, Lisbon, Helsinki. 1 G. Amato and P. Savino, Approximate similarity search in metric spaces using inverted files, INFOSCALE 2008, pages E. Chavez, K. Figueroa, and G. Navarro, Effective proximity retrieval by ordering permutations, IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(9): , Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 10 / 48
11 Introduction Permutation based methods Permutation based methods The method: A set of reference objects R = {r 0,..., r R 1 } O is defined (e.g., by randomly selecting R objects from D). Every object o i D is then represented by a permutation Π oi of 0,..., R 1, i.e., the list of the identifiers of reference objects, so that the identifiers are sorted by the distance of their relative reference objects with respect to o i. The search process mainly consists in computing Π q and estimating the true distance d(q, o i ) using a permutation-based distance d (Π q, Π oi ), e.g., the Spearman s footrule distance. Amato and Savino have shown that using only the prefix Π l o i of the permutation Π oi (e.g., l = 100 when R = 500) improves both efficiency and effectiveness. The PP-Index adopts a permutation-based data representation model, using very short prefixes (e.g., l = 6 when R = 1000). Differently from previous approaches, the permutation prefixes are used just to quickly find a small set of candidate objects from D for inclusion into results, not to estimate their relative order. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 11 / 48
12 Introduction Permutation based methods Permutation based methods r 0 r 0 <0,1,3> <1,3,0> <1,3,4> <0,2,3,1,5,4> <1,3,2,0,4,5> r 1 <1,4,3,2,0,5> <0,2,3> <1,3,2> r 1 <1,4,3> <5,2,3,0,1,4> r2 r 3 <4,1,3,2,5,0> r 4 <2,3,0> <3,2,1> <2,0,3> <2,5,3> <2,3,5> r 2 r 3 <3,2,5> <4,1,3> r 4 r 5 <4,3,1,2,5,0> r 5 <5,2,3> <4,3,1> <4,3,5> <5,2,3,4,1,0> Figure 4: Regions of the 2-dimensional space identified by 6 randomly selected reference points, using the Euclidean distance, and full-lenght permutations (left) or permutation prefixes of lenght 3. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 12 / 48
13 Outline Introduction Local similarity hashing methods 1 Introduction Similarity search Permutation based methods Local similarity hashing methods 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 13 / 48
14 Introduction Local similarity hashing methods Local similarity hashing methods A family H of hash functions f : O U is called (r, ɛ, p 1, p 2 )-sensitive, with r, ɛ > 0, p 1 > p 2 > 0, if for any p, q O: if d(p, q) r then P[h(p) = h(q)] p 1 if d(p, q) > r(1 + ɛ) then P[h(p) = h(q)] p 2 for any function h randomly selected from H. Intuitively: two objects have a (high) probability x 1 p 1 to collide if they are closer than r, and a (low) probability x 2 p 2 if they are more distant than r(1 + ɛ). LSH-Index 3 : j randomly chosen functions h i H define a hash function g(x) = (h 1 (x)h 2 (x)... h j (x)), i.e. bad collision probability is significantly lowered to p j 2. t different hash tables are built, based on randomly generated g 1... g t functions, in order to increase good collision probability. 3 P. Indyk and R. Motwani, Approximate nearest neighbors: towards removing the curse of dimensionality, STOC 1998, pages Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 14 / 48
15 Introduction Local similarity hashing methods Local similarity hashing methods It is hard to tune LSH-Index (length of hash keys) in order to obtain good efficacy, due to the dependence between data distribution and hash length. LSH-Forest 4 : Use of variable length hash keys. Long hash key are indexed in a prefix tree (LSH-Tree). At search time the key length is varied in order to retrieve a given number of candidate objects. Candidate objects are retrieved sequentially from a data storage on disk. Multiple LSH-Tree, i.e., a forest, are used to improve effectiveness. The PP-Index uses similar data structure. 4 M. Bawa, T. Condie, and P. Ganesan, LSH-Forest: self-tuning indexes for similarity search, WWW 2005, pages Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 15 / 48
16 Outline The PP-Index 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 16 / 48
17 Outline The PP-Index Data structures 1 Introduction 2 The PP-Index Data structures Algorithms 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 17 / 48
18 The PP-Index Data structures PP-Index: data structures The PP-Index represents each indexed object with a permutation prefix of length l. Data structures: a prefix tree kept in main memory, indexing the permutation prefixes, and a data storage kept on disk, storing the information required to compute real distances between objects in D and any object in O. The prefix tree is used in order to rapidly identify a set of at least z candidates (z k), leaving to the original distance function the task of determining the final k-nn result from such set of candidates. Candidates are retrieved from the data storage with a few sequential disk accesses. The PP-Index adopts a bulk data processing model, similar to the one used for text-based inverted list indexes (assumption on the static nature of data). It is easy to provide update capabilities (i.e., insert, delete, modify). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 18 / 48
19 Outline The PP-Index Algorithms 1 Introduction 2 The PP-Index Data structures Algorithms 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 19 / 48
20 The PP-Index PP-Index: building the index Algorithms BuildIndex(D, d, R, l) 1 prefixt ree EmptyPrefixTree() 2 datastorage EmptyDataStorage() 3 for i 0 to D 1 4 do o i GetObject(D, i) 5 datablock oi GetDataBlock(o i) 6 p oi Append(dataBlock oi, datastorage) 7 w oi ComputePrefix(o i, R, d, l) 8 h oi i 9 Insert(w oi, h oi, p oi, prefixt ree) 10 L ListPointersByOrderedVisit(prefixT ree) 11 P CreateInvertedList(L) 12 ReorderStorage(dataStorage, P ) 13 CorrectLeafValues(prefixT ree, datastorage) 14 index NewIndex(d, R, l, prefixt ree, datastorage) 15 return index Figure 5: The BuildIndex function. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 20 / 48
21 The PP-Index Algorithms PP-Index: building the index Input: dataset D, distance function d, reference objects R, prefix length l. Indexing process: Main loop: permutation prefixes are inserted into the prefix tree, data blocks are appended to data storage. Data storage reordering: data blocks are sorted to reflect the order of prefixes. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 21 / 48
22 The PP-Index PP-Index: building the index Algorithms Index characteristics D =10, R =6, l=3 Permutation prefixes w o 0 =<1, 3, 2> w o 1 =<2, 3, 0> w o 2 =<5, 2, 3> w o 3 =<4, 1, 3> w o 4 =<1, 3, 2> w o 5 =<4, 1, 3> w o 6 =<1, 3, 4> w o 7 =<5, 2, 3> w =<1, 3, 2> w =<4, 3, 5> o 8 o 9 Prefix tree root h p 0 h 0 p 4 h 4 p 8 h 8 p 6 h 6 p 1 h h h p p p 9 h h p p main memory secondary memory Figure 6: Sample data. o 0... o o... 5 o... o... 4 o... o 6... o 7... o 8... o 9... Data storage Figure 7: Index data structure after the first phase of object insertion. Main loop: permutation prefixes are inserted into the prefix tree, data blocks are appended to data storage. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 22 / 48
23 The PP-Index PP-Index: building the index Algorithms Prefix tree root Prefix tree root h p 0 h 0 p 4 h 4 p 8 h 8 p 6 h 6 p 1 h h h p p p 9 h h p p h h start p start end p end p h p h h h start p start end p end p h h h start p start end p end main memory secondary memory main memory secondary memory o 0... o o... 5 o... o... 4 o... o 6... o 7... o 8... o 9... Data storage o 0... o o... 3 o... o... 1 o... o 5... o 9... o 2... o 7... Data storage Figure 8: Index data structure after the first phase of object insertion. Figure 9: Index data structure after the first phase of object insertion. Data storage reordering: data blocks are sorted to reflect the order of prefixes. The leaves of the final prefix tree point to intervals of the data storage. Efficiency alert: performed using a m-way merge sort algorithm. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 23 / 48
24 PP-Index: search function The PP-Index Algorithms FindCandidates(q, prefixt ree, R, d, l, z) 1 w q ComputePrefix(q, R, d, l) 2 for i l to 1 3 do wq i SubPrefix(wq, i) 4 node SearchPath(wq i, prefixt ree) 5 if node nil 6 then minleaf GetMin(node, prefixt ree) 7 maxleaf GetMax(node, prefixt ree) 8 if (maxleaf.h end minleaf.h start + 1) z i = 1 9 then return (minleaf.p start, maxleaf.h end ) 10 return (0, 0) Figure 10: The FindCandidates function. Given the prefix representing the query, FindCandidates searches for the smallest subtree of the prefix tree pointing to at least z data blocks (z ). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 24 / 48
25 PP-Index: search function The PP-Index Algorithms Search(q, k, z, index) 1 (p start, p end ) FindCandidates(q, index.prefixt ree, index.r, index.d, index.l, z) 2 resultsheap EmptyHeap() 3 cursor p start 4 while cursor p end 5 do datablock Read(cursor, index.datastorage) 6 AdvanceCursor(cursor) 7 distance index.d(q, datablock.data) 8 if resultsheap.size < k 9 then Insert(resultsHeap, distance, datablock.id) 10 else if distance < resultsheap.top.distance 11 then ReplaceTop(resultsHeap, distance, datablock.id) 12 Sort(resultsHeap) 13 return resultsheap Figure 11: The Search function. The z candidate data blocks are sequentially read from the data storage. A heap is used to keep track of the best k results. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 25 / 48
26 The PP-Index Algorithms PP-Index: improving the search effectiveness The basic search strategy is designed for efficiency. Effectiveness can be boosted using various search strategies: Multiple index: building n PP-Index using different R sets, R 1... R n (LSH-Forest style). Projecting different R-induced grids on the objects, helps to approximate a better (less skewed) partitioning of the space. Can be implemented using data replication (faster/more storage) or using data referencing (slower/less storage). k-nn results from the various indexes are merged together in the final one. Multiple query: generating m perturbed versions of w q in order to explore the neighborhood of w q. The perturbed w i q prefixes are generated by swapping pairs of elements of w q, first selecting those with the smaller distance difference with respect to q. All the w i q prefixes are used to find candidates on the same index. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 26 / 48
27 The PP-Index Algorithms PP-Index: prefix tree optimizations Prefix tree root Prefix tree root 1, h h start p start end p end p h p h h h start p start end p end p h h h start p start end p end h h start p start end p end p h p h h h start p start end p end p h h h start p start end p end main memory secondary memory main memory secondary memory o 0... o o... 3 o... o... 1 o... o 5... o 9... o 2... o 7... Data storage o 0... o o... 3 o... o... 1 o... o 5... o 9... o 2... o 7... Data storage Figure 12: Pruning of only-child paths to leaves. Figure 13: Only-child paths compression. Scalability alert: reducing to a single leaf any subtree pointing to less than z data blocks. Applicable when z is hardcoded into the search function. Does not affect search results quality. Lossy with respect to index update operations. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 27 / 48
28 The PP-Index Algorithms PP-Index: merging (and updating) the index Scalability alert: the index (prefix tree) reaches its maximum memory requirement at the end of the main loop of the indexing process. Could not fit into memory, when the final index will (after optimizations). Strategy: building many smaller indexes, using the same R set, then merging them together. The merge process is efficient: The source prefix trees are merged into the final prefix tree by performing a parallel ordered visit on them. Data storages are merged into the final data storage while building the final prefix tree. Can be done from-disk-to-disk, minimum memory occupation. Linear cost with respect to index size (if not done in an m-way style). Uses only sequential reads/writes. Update operations can be supported by keeping track of such operations by using a small all-in-memory index and performing periodic merge operations. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 28 / 48
29 Outline Experiments 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 29 / 48
30 Outline Experiments The CoPhIR collection 1 Introduction 2 The PP-Index 3 Experiments The CoPhIR collection Evaluation measures Results 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 30 / 48
31 Experiments The CoPhIR collection The CoPhIR collection The CoPhIR 5 consists of a crawl of 106 millions images from the Flickr photo sharing website. Textual data + five MPEG-7 visual descriptors (240 GB of XML description data). Visual similarity measure: linear combination of distance functions defined on the MPEG-7 descriptors. MPEG-7 Visual Descriptor Distance type Dimension Weight Scalable Color L Color Structure L Color Layout sum of L Edge Histogram L Homogeneous Texture L Table 1: Details on the five MPEG-7 visual descriptors used in CoPhIR, and the weights used in the linear combination. The Dim. column refer to the specific dimension for visual descriptors adopted by the CoPhIR data set. Experiments made on 1, 10, and 100 millions images, using 100 randomly selected images (excluded from indexes) Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 31 / 48
32 Outline Experiments Evaluation measures 1 Introduction 2 The PP-Index 3 Experiments The CoPhIR collection Evaluation measures Results 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 32 / 48
33 Evaluation measures Experiments Evaluation measures Effectiveness measures: Recall (ranking-based 6 ): Recall(k) = Dk q P k q k Relative Distance Error (distance-based): RDE(k) = 1 k d(q, Pq k (i)) k d(q, Dq k (i)) 1 (2) where Dq k is the list of the k closest elements of D to q, sorted by their distance with respect to q, and Pq k is the list returned by the algorithm. Efficiency measures: index time. index size (RAM, disk). i=1 number of candidates retrieved from disk (z ). average search time. 6 M. Patella and P. Ciaccia, The many facets of approximate similarity search, SISAP 2008, pages Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 33 / 48 (1)
34 Outline Experiments Results 1 Introduction 2 The PP-Index 3 Experiments The CoPhIR collection Evaluation measures Results 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 34 / 48
35 Experiments Results Results Indexing time (s) M 10M 1M Figure 14: R Indexing time w.r.t. to the size of R and the data set size. D indexing prefix tree size data l time (sec) full comp. storage 1M MB 91 kb 349 MB M MB 848 kb 3.4 GB M MB 6.5 MB 34 GB 3.5 Table 2: Indexing times (with R = 100), resulting index sizes, and average prefix tree depth l (after prefix tree compression with z = 1, 000), for the various data set sizes. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 35 / 48
36 Results Experiments Results Search time (s) M 10M 1M R Figure 15: Search time w.r.t. to the size of R and the data set size. Search performed with z = 1, 000 and k = 100 (single index, single query). R D 1M 10M 100M 100 4,075 5,817 7, ,320 5,571 7, ,803 5,065 6,853 1,000 1,091 4,748 6,644 Table 3: Average z value (z = 1, 000), i.e., average number of retrieved candidate objects for a query, with respect to the size of the reference objects set and data set size. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 36 / 48
37 Results 0.20 Experiments Results M 10M 1M 0.15 k=100 k=10 k=1 RDE(k) 0.10 RDE(k) R R Recall(k) M 10M 1M Recall(k) k=100 k=10 k= R R Figure 16: Effectiveness with respect of the size of R set, on various index sizes, using k = 100, and z = 1, 000 (single index, single query). Figure 17: Effectiveness with respect of the size of R set, on the 100M index, using z = 1, 000 (single index, single query). Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 37 / 48
38 Results 0.10 Experiments Results 0.10 RDE(k) k=100 k=10 k=1 RDE(k) k=100 k=10 k= indexes queries Recall(k) k=100 k=10 k=1 Recall(k) k=100 k=10 k= indexes queries Figure 18: Effectiveness of the multiple index search strategy on the 100M index, using R = 1, 000 and z = 1, 000. Figure 19: Effectiveness of the multiple query search strategy on the 100M index, using R = 1, 000 and z = 1, 000. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 38 / 48
39 Results 0.01 Experiments Results 1e-3 RDE(k) 8e-3 6e-3 4e-3 100M 10M 1M RDE(k) 8e-4 6e-4 4e-4 100M 10M 1M 2e-3 2e-4 Recall(k) k 100M 10M 1M k Recall(k) k 100M 10M 1M k Figure 20: Effectiveness of the combined multiple Figure 21: Effectiveness of the combined multiple query and multiple index search strategies, using eight queries and eight indexes, on various data set sizes, using R = 100, and z = 1, 000. query and multiple index search strategies, using eight queries and eight indexes, on various data set sizes, using R = 1, 000, and z = 1, 000. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 39 / 48
40 Outline Demo 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 40 / 48
41 Demo Demo Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 41 / 48
42 Outline Conclusions 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 42 / 48
43 Outline Conclusions Summary 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Summary Questions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 43 / 48
44 Conclusions Summary Summary The PP-Index: is a simple but effective data structure for approximate similarity search. scales well, both at indexing time and at search time. can be kept updated with minor additional effort. has good parallelization properties. relates well with other data structures (i.e., inverted lists). There is a lot still to investigate: policies for reference points selection. studing the relations between l, R, z, k, l, and z. giving a theoretical foundation to the permutation based methods. applicability to other domains and similarity space types. policies for data partitioning Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 44 / 48
45 Outline Conclusions Questions 1 Introduction 2 The PP-Index 3 Experiments 4 Demo 5 Conclusions Summary Questions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 45 / 48
46 Questions? Conclusions Questions Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 46 / 48
47 FAQ Conclusions Questions Q: How does the PP-Index differ from the orthodox metric approach X? A: Please help yourself: No explicit requirement of metric properties. Use of a predetermined (i.e., fixed) set of reference points. Any reference point has a global influence. Different data access model. Different data update model.... Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 47 / 48
48 FAQ Conclusions Questions Q: What are the key differences between the permutation-based methods and the LSH-based methods? A: The permutation-based methods are mostly based on geometrical considerations, while LSH-based methods are mostly based on probabilistic considerations. The permutation-based methods are able to take into account how data is distributed in the similarity space (by means of R), while the LSH hash functions are derived only from the distance function. Each element of the hash key generated by an LSH hash function is independent from the others, while the order relation between the elements of a permutation is the crucial information for a permutation-based method. Andrea Esuli (ISTI-CNR) PP-Index ISTI:Science 48 / 48
Pivot Selection Strategies for Permutation-Based Similarity Search
Pivot Selection Strategies for Permutation-Based Similarity Search Giuseppe Amato, Andrea Esuli, and Fabrizio Falchi Istituto di Scienza e Tecnologie dell Informazione A. Faedo, via G. Moruzzi 1, Pisa
More informationAn approach to Content-Based Image Retrieval based on the Lucene search engine library (Extended Abstract)
An approach to Content-Based Image Retrieval based on the Lucene search engine library (Extended Abstract) Claudio Gennaro, Giuseppe Amato, Paolo Bolettieri, Pasquale Savino {claudio.gennaro, giuseppe.amato,
More informationLecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013
Lecture 24: Image Retrieval: Part II Visual Computing Systems Review: K-D tree Spatial partitioning hierarchy K = dimensionality of space (below: K = 2) 3 2 1 3 3 4 2 Counts of points in leaf nodes Nearest
More informationLarge-scale visual recognition Efficient matching
Large-scale visual recognition Efficient matching Florent Perronnin, XRCE Hervé Jégou, INRIA CVPR tutorial June 16, 2012 Outline!! Preliminary!! Locality Sensitive Hashing: the two modes!! Hashing!! Embedding!!
More informationGeometric data structures:
Geometric data structures: Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Sham Kakade 2017 1 Announcements: HW3 posted Today: Review: LSH for Euclidean distance Other
More informationSearching in one billion vectors: re-rank with source coding
Searching in one billion vectors: re-rank with source coding Hervé Jégou INRIA / IRISA Romain Tavenard Univ. Rennes / IRISA Laurent Amsaleg CNRS / IRISA Matthijs Douze INRIA / LJK ICASSP May 2011 LARGE
More informationdoc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague
Praha & EU: Investujeme do vaší budoucnosti Evropský sociální fond course: Searching the Web and Multimedia Databases (BI-VWM) Tomáš Skopal, 2011 SS2010/11 doc. RNDr. Tomáš Skopal, Ph.D. Department of
More informationCrawling, Indexing, and Similarity Searching Images on the Web
Crawling, Indexing, and Similarity Searching Images on the Web (Extended Abstract) M. Batko 1 and F. Falchi 2 and C. Lucchese 2 and D. Novak 1 and R. Perego 2 and F. Rabitti 2 and J. Sedmidubsky 1 and
More informationDistributed Multi-modal Similarity Retrieval
Distributed Multi-modal Similarity Retrieval David Novak Seminar of DISA Lab, October 14, 2014 David Novak Multi-modal Similarity Retrieval DISA Seminar 1 / 17 Outline of the Talk 1 Motivation Similarity
More informationIf you wish to cite this paper, please use the following reference:
This is an accepted version of a paper published in Proceedings of the 4th ACM International Conference on SImilarity Search and Applications (SISAP 11). If you wish to cite this paper, please use the
More informationBranch and Bound. Algorithms for Nearest Neighbor Search: Lecture 1. Yury Lifshits
Branch and Bound Algorithms for Nearest Neighbor Search: Lecture 1 Yury Lifshits http://yury.name Steklov Institute of Mathematics at St.Petersburg California Institute of Technology 1 / 36 Outline 1 Welcome
More informationAlgorithms for Nearest Neighbors
Algorithms for Nearest Neighbors Classic Ideas, New Ideas Yury Lifshits Steklov Institute of Mathematics at St.Petersburg http://logic.pdmi.ras.ru/~yura University of Toronto, July 2007 1 / 39 Outline
More informationCLSH: Cluster-based Locality-Sensitive Hashing
CLSH: Cluster-based Locality-Sensitive Hashing Xiangyang Xu Tongwei Ren Gangshan Wu Multimedia Computing Group, State Key Laboratory for Novel Software Technology, Nanjing University xiangyang.xu@smail.nju.edu.cn
More informationDatabase Applications (15-415)
Database Applications (15-415) DBMS Internals- Part V Lecture 13, March 10, 2014 Mohammad Hammoud Today Welcome Back from Spring Break! Today Last Session: DBMS Internals- Part IV Tree-based (i.e., B+
More informationMUFIN Basics. MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic SEMWEB 1
MUFIN Basics MUFIN team Faculty of Informatics, Masaryk University Brno, Czech Republic mufin@fi.muni.cz SEMWEB 1 Search problem SEARCH index structure data & queries infrastructure SEMWEB 2 The thesis
More informationOutline. Other Use of Triangle Inequality Algorithms for Nearest Neighbor Search: Lecture 2. Orchard s Algorithm. Chapter VI
Other Use of Triangle Ineuality Algorithms for Nearest Neighbor Search: Lecture 2 Yury Lifshits http://yury.name Steklov Institute of Mathematics at St.Petersburg California Institute of Technology Outline
More informationAlgorithms for Nearest Neighbors
Algorithms for Nearest Neighbors State-of-the-Art Yury Lifshits Steklov Institute of Mathematics at St.Petersburg Yandex Tech Seminar, April 2007 1 / 28 Outline 1 Problem Statement Applications Data Models
More informationDatabase Applications (15-415)
Database Applications (15-415) DBMS Internals- Part V Lecture 15, March 15, 2015 Mohammad Hammoud Today Last Session: DBMS Internals- Part IV Tree-based (i.e., B+ Tree) and Hash-based (i.e., Extendible
More informationSuccinct Nearest Neighbor Search
Succinct Nearest Neighbor Search Eric Sadit Tellez a, Edgar Chavez b, Gonzalo Navarro c,1 a División de Estudios de Posgrado, Facultad de Ingeniería Eléctrica, Universidad Michoacana de San Nicolas de
More informationRandom Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search
Random Grids: Fast Approximate Nearest Neighbors and Range Searching for Image Search Dror Aiger, Efi Kokiopoulou, Ehud Rivlin Google Inc. aigerd@google.com, kokiopou@google.com, ehud@google.com Abstract
More informationFull-Text Search on Data with Access Control
Full-Text Search on Data with Access Control Ahmad Zaky School of Electrical Engineering and Informatics Institut Teknologi Bandung Bandung, Indonesia 13512076@std.stei.itb.ac.id Rinaldi Munir, S.T., M.T.
More informationVK Multimedia Information Systems
VK Multimedia Information Systems Mathias Lux, mlux@itec.uni-klu.ac.at Dienstags, 16.oo Uhr c.t., E.1.42 This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Indexing
More informationProximity Searching in High Dimensional Spaces with a Proximity Preserving Order
Proximity Searching in High Dimensional Spaces with a Proximity Preserving Order E. Chávez 1, K. Figueroa 1,2, and G. Navarro 2 1 Facultad de Ciencias Físico-Matemáticas, Universidad Michoacana, México.
More informationOn the Least Cost For Proximity Searching in Metric Spaces
On the Least Cost For Proximity Searching in Metric Spaces Karina Figueroa 1,2, Edgar Chávez 1, Gonzalo Navarro 2, and Rodrigo Paredes 2 1 Universidad Michoacana, México. {karina,elchavez}@fismat.umich.mx
More informationPermutation Search Methods are Efficient, Yet Faster Search is Possible
Permutation Search Methods are Efficient, Yet Faster Search is Possible Bilegsaikhan Naidan Norwegian University of Science and Technology Trondheim, Norway bileg@idi.ntnu.no Leonid Boytsov Carnegie Mellon
More informationTHE B+ TREE INDEX. CS 564- Spring ACKs: Jignesh Patel, AnHai Doan
THE B+ TREE INDEX CS 564- Spring 2018 ACKs: Jignesh Patel, AnHai Doan WHAT IS THIS LECTURE ABOUT? The B+ tree index Basics Search/Insertion/Deletion Design & Cost 2 INDEX RECAP We have the following query:
More informationSecure Metric-Based Index for Similarity Cloud
Secure Metric-Based Index for Similarity Cloud Stepan Kozak, David Novak, and Pavel Zezula Masaryk University, Brno, Czech Republic {xkozak1,david.novak,zezula}@fi.muni.cz Abstract. We propose a similarity
More informationKathleen Durant PhD Northeastern University CS Indexes
Kathleen Durant PhD Northeastern University CS 3200 Indexes Outline for the day Index definition Types of indexes B+ trees ISAM Hash index Choosing indexed fields Indexes in InnoDB 2 Indexes A typical
More informationTask Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval
Case Study 2: Document Retrieval Task Description: Finding Similar Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 11, 2017 Sham Kakade 2017 1 Document
More informationLocality- Sensitive Hashing Random Projections for NN Search
Case Study 2: Document Retrieval Locality- Sensitive Hashing Random Projections for NN Search Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 18, 2017 Sham Kakade
More informationClustering Billions of Images with Large Scale Nearest Neighbor Search
Clustering Billions of Images with Large Scale Nearest Neighbor Search Ting Liu, Charles Rosenberg, Henry A. Rowley IEEE Workshop on Applications of Computer Vision February 2007 Presented by Dafna Bitton
More informationTreaps. 1 Binary Search Trees (BSTs) CSE341T/CSE549T 11/05/2014. Lecture 19
CSE34T/CSE549T /05/04 Lecture 9 Treaps Binary Search Trees (BSTs) Search trees are tree-based data structures that can be used to store and search for items that satisfy a total order. There are many types
More information1 o Semestre 2007/2008
Efficient Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Outline 1 2 3 4 5 6 7 Outline 1 2 3 4 5 6 7 Text es An index is a mechanism to locate a given term in
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 9 2. Information Retrieval:
More informationIEEE Infocom 2005 Demonstration ITEA TBONES Project
IEEE Infocom 2005 Demonstration ITEA TBONES Project A GMPLS Unified Control Plane for Multi-Area Networks D.Papadimitriou (Alcatel), B.Berde (Alcatel), K.Casier (IMEC), and R.Theillaud (Atos Origin) ITEA
More informationPredictive Indexing for Fast Search
Predictive Indexing for Fast Search Sharad Goel, John Langford and Alex Strehl Yahoo! Research, New York Modern Massive Data Sets (MMDS) June 25, 2008 Goel, Langford & Strehl (Yahoo! Research) Predictive
More informationIndexing and Searching
Indexing and Searching Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Modern Information Retrieval, chapter 8 2. Information Retrieval:
More informationover Multi Label Images
IBM Research Compact Hashing for Mixed Image Keyword Query over Multi Label Images Xianglong Liu 1, Yadong Mu 2, Bo Lang 1 and Shih Fu Chang 2 1 Beihang University, Beijing, China 2 Columbia University,
More informationCRE investment weakens in Q as investors struggle to find product in prime markets
2006.3 2007.1 2007.3 2008.1 2008.3 2009.1 2009.3 2010.1 2010.3 2011.1 2011.3 2012.1 2012.3 2013.1 2013.3 2014.1 2014.3 2015.1 2015.3 2016.1 Europe Capital Markets, Q1 2016 CRE investment weakens in Q1
More informationDatenbanksysteme II: Multidimensional Index Structures 2. Ulf Leser
Datenbanksysteme II: Multidimensional Index Structures 2 Ulf Leser Content of this Lecture Introduction Partitioned Hashing Grid Files kdb Trees kd Tree kdb Tree R Trees Example: Nearest neighbor image
More informationAdaptive Binary Quantization for Fast Nearest Neighbor Search
IBM Research Adaptive Binary Quantization for Fast Nearest Neighbor Search Zhujin Li 1, Xianglong Liu 1*, Junjie Wu 1, and Hao Su 2 1 Beihang University, Beijing, China 2 Stanford University, Stanford,
More informationA GPGPU Algorithm for c-approximate r-nearest Neighbor Search in High Dimensions
A GPGPU Algorithm for c-approximate r-nearest Neighbor Search in High Dimensions Lee A Carraher, Philip A Wilsey, and Fred S Annexstein School of Electronic and Computing Systems University of Cincinnati
More informationLarge scale object/scene recognition
Large scale object/scene recognition Image dataset: > 1 million images query Image search system ranked image list Each image described by approximately 2000 descriptors 2 10 9 descriptors to index! Database
More informationOnline Document Clustering Using the GPU
Online Document Clustering Using the GPU Benjamin E. Teitler, Jagan Sankaranarayanan, Hanan Samet Center for Automation Research Institute for Advanced Computer Studies Department of Computer Science University
More informationNearest Neighbor Search by Branch and Bound
Nearest Neighbor Search by Branch and Bound Algorithmic Problems Around the Web #2 Yury Lifshits http://yury.name CalTech, Fall 07, CS101.2, http://yury.name/algoweb.html 1 / 30 Outline 1 Short Intro to
More informationTrees. Reading: Weiss, Chapter 4. Cpt S 223, Fall 2007 Copyright: Washington State University
Trees Reading: Weiss, Chapter 4 1 Generic Rooted Trees 2 Terms Node, Edge Internal node Root Leaf Child Sibling Descendant Ancestor 3 Tree Representations n-ary trees Each internal node can have at most
More informationCourse Content. Objectives of Lecture? CMPUT 391: Spatial Data Management Dr. Jörg Sander & Dr. Osmar R. Zaïane. University of Alberta
Database Management Systems Winter 2002 CMPUT 39: Spatial Data Management Dr. Jörg Sander & Dr. Osmar. Zaïane University of Alberta Chapter 26 of Textbook Course Content Introduction Database Design Theory
More informationCSE 530A. B+ Trees. Washington University Fall 2013
CSE 530A B+ Trees Washington University Fall 2013 B Trees A B tree is an ordered (non-binary) tree where the internal nodes can have a varying number of child nodes (within some range) B Trees When a key
More informationPivoting M-tree: A Metric Access Method for Efficient Similarity Search
Pivoting M-tree: A Metric Access Method for Efficient Similarity Search Tomáš Skopal Department of Computer Science, VŠB Technical University of Ostrava, tř. 17. listopadu 15, Ostrava, Czech Republic tomas.skopal@vsb.cz
More informationMultidimensional Indexing The R Tree
Multidimensional Indexing The R Tree Module 7, Lecture 1 Database Management Systems, R. Ramakrishnan 1 Single-Dimensional Indexes B+ trees are fundamentally single-dimensional indexes. When we create
More informationSpatial Data Management
Spatial Data Management [R&G] Chapter 28 CS432 1 Types of Spatial Data Point Data Points in a multidimensional space E.g., Raster data such as satellite imagery, where each pixel stores a measured value
More informationPrinciples of Data Management. Lecture #14 (Spatial Data Management)
Principles of Data Management Lecture #14 (Spatial Data Management) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s Notable News v Project
More informationImage Analysis & Retrieval. CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W Lec 18.
Image Analysis & Retrieval CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W 4-5:15pm@Bloch 0012 Lec 18 Image Hashing Zhu Li Dept of CSEE, UMKC Office: FH560E, Email: lizhu@umkc.edu, Ph:
More informationQuantifying the Specificity of Near-Duplicate Image Classification Functions
Quantifying the Specificity of Near-Duplicate Image Classification Functions Richard Connor 1 and Franco Alberto Cardillo 2 1 Department of Computer and Information Sciences, University of Strathclyde,
More informationSpatial Data Management
Spatial Data Management Chapter 28 Database management Systems, 3ed, R. Ramakrishnan and J. Gehrke 1 Types of Spatial Data Point Data Points in a multidimensional space E.g., Raster data such as satellite
More informationLocality-Sensitive Hashing
Locality-Sensitive Hashing & Image Similarity Search Andrew Wylie Overview; LSH given a query q (or not), how do we find similar items from a large search set quickly? Can t do all pairwise comparisons;
More informationExternal Memory Algorithms and Data Structures Fall Project 3 A GIS system
External Memory Algorithms and Data Structures Fall 2003 1 Project 3 A GIS system GSB/RF November 17, 2003 1 Introduction The goal of this project is to implement a rudimentary Geographical Information
More informationCombining Local and Global Visual Feature Similarity using a Text Search Engine
Combining Local and Global Visual Feature Similarity using a Text Search Engine Giuseppe Amato, Paolo Bolettieri, Fabrizio Falchi, Claudio Gennaro, Fausto Rabitti {giuseppe.amato, fabrizio.falchi, paolo.bolettieri,
More informationNear Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri
Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions
More informationA Metric Cache for Similarity Search
A Metric Cache for Similarity Search Fabrizio Falchi ISTI-CNR Pisa, Italy fabrizio.falchi@isti.cnr.it Raffaele Perego ISTI-CNR Pisa, Italy raffaele.perego@isti.cnr.it Claudio Lucchese ISTI-CNR Pisa, Italy
More informationCSIT5300: Advanced Database Systems
CSIT5300: Advanced Database Systems L08: B + -trees and Dynamic Hashing Dr. Kenneth LEUNG Department of Computer Science and Engineering The Hong Kong University of Science and Technology Hong Kong SAR,
More informationPanel. Progress in Description and Retrieval of Complex Multimedia Information
Panel Progress in Description and Retrieval of Complex Multimedia Information The Fifth International Conferences on Advances in Multimedia The Fourth International Conference on Models and Ontology-based
More informationEngineering Efficient Similarity Search Methods
Doctoral thesis Bilegsaikhan Naidan Doctoral theses at NTNU, 2017:84 NTNU Norwegian University of Science and Technology Thesis for the Degree of Philosophiae Doctor Faculty of Information Technology and
More informationCompressed local descriptors for fast image and video search in large databases
Compressed local descriptors for fast image and video search in large databases Matthijs Douze2 joint work with Hervé Jégou1, Cordelia Schmid2 and Patrick Pérez3 1: INRIA Rennes, TEXMEX team, France 2:
More informationPhysical Level of Databases: B+-Trees
Physical Level of Databases: B+-Trees Adnan YAZICI Computer Engineering Department METU (Fall 2005) 1 B + -Tree Index Files l Disadvantage of indexed-sequential files: performance degrades as file grows,
More informationChapter 11: Indexing and Hashing
Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL
More informationMachine Learning. Nonparametric methods for Classification. Eric Xing , Fall Lecture 2, September 12, 2016
Machine Learning 10-701, Fall 2016 Nonparametric methods for Classification Eric Xing Lecture 2, September 12, 2016 Reading: 1 Classification Representing data: Hypothesis (classifier) 2 Clustering 3 Supervised
More informationSISTEMI PER LA RICERCA
SISTEMI PER LA RICERCA DI DATI MULTIMEDIALI Claudio Lucchese e Salvatore Orlando Terzo Workshop di Dipartimento Dipartimento di Informatica Università degli studi di Venezia Ringrazio le seguenti persone,
More informationImproved nearest neighbor search using auxiliary information and priority functions
Improved nearest neighbor search using auxiliary information and priority functions Omid Keivani 1 Kaushik Sinha 1 Abstract Nearest neighbor search using random projection trees has recently been shown
More informationLarge Scale Nearest Neighbor Search Theories, Algorithms, and Applications. Junfeng He
Large Scale Nearest Neighbor Search Theories, Algorithms, and Applications Junfeng He Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Graduate School
More informationNearest neighbors. Focus on tree-based methods. Clément Jamin, GUDHI project, Inria March 2017
Nearest neighbors Focus on tree-based methods Clément Jamin, GUDHI project, Inria March 2017 Introduction Exact and approximate nearest neighbor search Essential tool for many applications Huge bibliography
More informationAlgorithms and Data Structures
Algorithms and Data Structures Priority Queues Ulf Leser Special Scenarios for Searching Up to now, we assumed that all elements of a list are equally important and that any of them could be searched next
More informationData Mining and Machine Learning: Techniques and Algorithms
Instance based classification Data Mining and Machine Learning: Techniques and Algorithms Eneldo Loza Mencía eneldo@ke.tu-darmstadt.de Knowledge Engineering Group, TU Darmstadt International Week 2019,
More informationDL-Media: An Ontology Mediated Multimedia Information Retrieval System
DL-Media: An Ontology Mediated Multimedia Information Retrieval System ISTI-CNR, Pisa, Italy straccia@isti.cnr.it What is DLMedia? Multimedia Information Retrieval (MIR) Retrieval of those multimedia objects
More informationQuery-Driven Document Partitioning and Collection Selection
Query-Driven Document Partitioning and Collection Selection Diego Puppin, Fabrizio Silvestri and Domenico Laforenza Institute for Information Science and Technologies ISTI-CNR Pisa, Italy May 31, 2006
More informationSpatial Queries. Nearest Neighbor Queries
Spatial Queries Nearest Neighbor Queries Spatial Queries Given a collection of geometric objects (points, lines, polygons,...) organize them on disk, to answer efficiently point queries range queries k-nn
More informationData Structure. A way to store and organize data in order to support efficient insertions, queries, searches, updates, and deletions.
DATA STRUCTURES COMP 321 McGill University These slides are mainly compiled from the following resources. - Professor Jaehyun Park slides CS 97SI - Top-coder tutorials. - Programming Challenges book. Data
More informationHigh Dimensional Indexing by Clustering
Yufei Tao ITEE University of Queensland Recall that, our discussion so far has assumed that the dimensionality d is moderately high, such that it can be regarded as a constant. This means that d should
More informationClustered Pivot Tables for I/O-optimized Similarity Search
Clustered Pivot Tables for I/O-optimized Similarity Search Juraj Moško Charles University in Prague, Faculty of Mathematics and Physics, SIRET research group mosko.juro@centrum.sk Jakub Lokoč Charles University
More informationChapter 11: Indexing and Hashing
Chapter 11: Indexing and Hashing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree
More informationApproximate Similarity Search in Metric Spaces using Inverted Files
Approximate Similarity Search in Metric Spaces using Inverted Files Giuseppe Amato ISTI-CNR Via G. Moruzzi, 56, Pisa, Italy giuseppe.amato@isti.cnr.it ABSTRACT We propose a new approach to perform approximate
More informationThe B-Tree. Yufei Tao. ITEE University of Queensland. INFS4205/7205, Uni of Queensland
Yufei Tao ITEE University of Queensland Before ascending into d-dimensional space R d with d > 1, this lecture will focus on one-dimensional space, i.e., d = 1. We will review the B-tree, which is a fundamental
More informationChapter 17 Indexing Structures for Files and Physical Database Design
Chapter 17 Indexing Structures for Files and Physical Database Design We assume that a file already exists with some primary organization unordered, ordered or hash. The index provides alternate ways to
More informationIndexing and Hashing
C H A P T E R 1 Indexing and Hashing This chapter covers indexing techniques ranging from the most basic one to highly specialized ones. Due to the extensive use of indices in database systems, this chapter
More informationEXTERNAL SORTING. Sorting
EXTERNAL SORTING 1 Sorting A classic problem in computer science! Data requested in sorted order (sorted output) e.g., find students in increasing grade point average (gpa) order SELECT A, B, C FROM R
More informationHash-Based Indexing 165
Hash-Based Indexing 165 h 1 h 0 h 1 h 0 Next = 0 000 00 64 32 8 16 000 00 64 32 8 16 A 001 01 9 25 41 73 001 01 9 25 41 73 B 010 10 10 18 34 66 010 10 10 18 34 66 C Next = 3 011 11 11 19 D 011 11 11 19
More informationSpatial Data Structures
CSCI 420 Computer Graphics Lecture 17 Spatial Data Structures Jernej Barbic University of Southern California Hierarchical Bounding Volumes Regular Grids Octrees BSP Trees [Angel Ch. 8] 1 Ray Tracing Acceleration
More informationReal-time Collaborative Filtering Recommender Systems
Real-time Collaborative Filtering Recommender Systems Huizhi Liang 1,2 Haoran Du 2 Qing Wang 2 1 Department of Computing and Information Systems, The University of Melbourne, Victoria 3010, Australia Email:
More informationIMAGE MATCHING - ALOK TALEKAR - SAIRAM SUNDARESAN 11/23/2010 1
IMAGE MATCHING - ALOK TALEKAR - SAIRAM SUNDARESAN 11/23/2010 1 : Presentation structure : 1. Brief overview of talk 2. What does Object Recognition involve? 3. The Recognition Problem 4. Mathematical background:
More informationMultidimensional Indexes [14]
CMSC 661, Principles of Database Systems Multidimensional Indexes [14] Dr. Kalpakis http://www.csee.umbc.edu/~kalpakis/courses/661 Motivation Examined indexes when search keys are in 1-D space Many interesting
More informationThe Boundary Graph Supervised Learning Algorithm for Regression and Classification
The Boundary Graph Supervised Learning Algorithm for Regression and Classification! Jonathan Yedidia! Disney Research!! Outline Motivation Illustration using a toy classification problem Some simple refinements
More informationDSH: Data Sensitive Hashing for High-Dimensional k-nn Search
DSH: Data Sensitive Hashing for High-Dimensional k-nn Search Jinyang Gao, H. V. Jagadish, Wei Lu, Beng Chin Ooi School of Computing, National University of Singapore, Singapore Department of Electrical
More informationProblem 1: Complexity of Update Rules for Logistic Regression
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 16 th, 2014 1
More informationSpatial Data Structures
CSCI 480 Computer Graphics Lecture 7 Spatial Data Structures Hierarchical Bounding Volumes Regular Grids BSP Trees [Ch. 0.] March 8, 0 Jernej Barbic University of Southern California http://www-bcf.usc.edu/~jbarbic/cs480-s/
More informationIntroduction to Indexing R-trees. Hong Kong University of Science and Technology
Introduction to Indexing R-trees Dimitris Papadias Hong Kong University of Science and Technology 1 Introduction to Indexing 1. Assume that you work in a government office, and you maintain the records
More informationSparkBOOST, an Apache Spark-based boosting library
SparkBOOST, an Apache Spark-based boosting library Tiziano Fagni (tiziano.fagni@isti.cnr.it) Andrea Esuli (andrea.esuli@isti.cnr.it) Istituto di Scienze e Tecnologie dell Informazione (ISTI) Italian National
More informationUsing Natural Clusters Information to Build Fuzzy Indexing Structure
Using Natural Clusters Information to Build Fuzzy Indexing Structure H.Y. Yue, I. King and K.S. Leung Department of Computer Science and Engineering The Chinese University of Hong Kong Shatin, New Territories,
More informationInstances on a Budget
Retrieving Similar or Informative Instances on a Budget Kristen Grauman Dept. of Computer Science University of Texas at Austin Work with Sudheendra Vijayanarasimham Work with Sudheendra Vijayanarasimham,
More informationUnsupervised Learning Partitioning Methods
Unsupervised Learning Partitioning Methods Road Map 1. Basic Concepts 2. K-Means 3. K-Medoids 4. CLARA & CLARANS Cluster Analysis Unsupervised learning (i.e., Class label is unknown) Group data to form
More informationUniversity of Florida CISE department Gator Engineering. Clustering Part 4
Clustering Part 4 Dr. Sanjay Ranka Professor Computer and Information Science and Engineering University of Florida, Gainesville DBSCAN DBSCAN is a density based clustering algorithm Density = number of
More information