Branch and Bound Algorithms for Nearest Neighbor Search: Lecture 1

Yury Lifshits
http://yury.name
Steklov Institute of Mathematics at St.Petersburg
California Institute of Technology

Outline

1. Welcome to Nearest Neighbors!
2. Branch and Bound Methodology
3. Around Vantage-Point Trees
4. Generalized Hyperplane Trees and Relatives
5. M-Trees

Chapter I
Welcome to Nearest Neighbors!

Informal Statement

To preprocess a database of n objects so that, given a query object, one can effectively determine its nearest neighbors in the database.

More Formally

- Search space: object domain $U$, similarity function $\sigma$
- Input: database $S = \{p_1, \ldots, p_n\} \subseteq U$
- Query: $q \in U$
- Task: find $\arg\max_i \sigma(p_i, q)$

Applications (1/5): Information Retrieval

- Content-based retrieval (magnetic resonance images, tomography, CAD shapes, time series, texts)
- Spelling correction
- Geographic databases (post-office problem)
- Searching for similar DNA sequences
- Related pages web search
- Semantic search, concept matching

Applications (2/5): Machine Learning

- kNN classification rule: classify by majority of the k nearest training examples. E.g. recognition of faces, fingerprints, speaker identity, optical characters
- Nearest-neighbor interpolation

Applications (3/5): Data Mining

- Near-duplicate detection
- Plagiarism detection
- Computing co-occurrence similarity (for detecting synonyms, query extension, machine translation...)

Key difference: mostly off-line problems

Applications (4/5): Bipartite Problems

- Recommendation systems (most relevant movie to a set of already watched ones)
- Personalized news aggregation (most relevant news articles to a given user's profile of interests)
- Behavioral targeting (most relevant ad for displaying to a given user)

Key differences:
- Query and database objects have different nature
- Objects are described by features and connections

Applications (5/5): As a Subroutine

- Coding theory (maximum likelihood decoding)
- MPEG compression (searching for similar fragments in the already compressed part)
- Clustering

Variations of the Computation Task

Solution aspects:
- Approximate nearest neighbors
- Dynamic nearest neighbors: moving objects, deletes/inserts, changing similarity function

Related problems:
- Nearest neighbor: nearest museum to my hotel
- Reverse nearest neighbor: all museums for which my hotel is the nearest one
- Range queries: all museums within a given distance of my hotel
- Closest pair: closest pair of museum and hotel
- Spatial join: pairs of hotels and museums which are at most a given distance apart
- Multiple nearest neighbors: nearest museums for each of these hotels
- Metric facility location: how to build hotels to minimize the sum of "museum to nearest hotel" distances

Brief History

- 1908: Voronoi diagram
- 1967: kNN classification rule by Cover and Hart
- 1973: Post-office problem posed by Knuth
- 1997: The paper by Kleinberg, beginning of provable upper/lower bounds
- 2006: Similarity Search book by Zezula, Amato, Dohnal and Batko
- 2008: First International Workshop on Similarity Search. Consider submitting!

Tutorial Outline

Four lectures:
1. Branch-and-bound: various tree-based data structures for general metric space
2. Other use of triangle inequality: walks, matrix methods, specific tricks for Euclidean space
3. Mapping-based techniques: locality-sensitive hashing, random projections
4. Restrictions on input: intrinsic dimension, probabilistic analysis and open problems

Not covered: low-dimensional solutions, experimental results, parallelization, I/O complexity, lower bounds, applications

Chapter II
Branch and Bound Methodology

General Metric Space

Tell me the definition of a metric space.

$M = (U, d)$, where the distance function $d$ satisfies:
- Non-negativity: $\forall s, t \in U:\ d(s, t) \ge 0$
- Symmetry: $\forall s, t \in U:\ d(s, t) = d(t, s)$
- Identity: $d(s, t) = 0 \Leftrightarrow s = t$
- Triangle inequality: $\forall r, s, t \in U:\ d(r, t) \le d(r, s) + d(s, t)$

Basic examples:
- Arbitrary metric space, oracle access to distance function
- k-dimensional Euclidean space with Euclidean, weighted Euclidean, Manhattan or $L_p$ metric
- Strings with Hamming or Levenshtein distance

Metric Spaces: More Examples

- Finite sets with Jaccard metric $d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
- Correlated dimensions: $\bar{x} M \bar{y}$ distance
- Hausdorff distance for sets

Similarity spaces (no triangle inequality):
- Multidimensional vectors with scalar product similarity
- Bipartite graph, co-citations similarity for vertices in one part
- Social networks with "number of joint friends" similarity
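
The metric axioms and the Jaccard example above can be checked by brute force on a small collection of objects. The following is only a sketch: the names `jaccard_distance` and `first_violated_axiom` and the tolerance `eps` are assumptions made for illustration and are not part of the lecture. A check on a finite sample can expose a violation but of course proves nothing in general.

```python
from itertools import product


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """d(A, B) = 1 - |A ∩ B| / |A ∪ B|; two empty sets are taken to be at distance 0."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def first_violated_axiom(points, d, eps=1e-12):
    """Return the name of the first metric axiom violated on `points`, or None.

    `eps` is an assumed tolerance for floating-point rounding.
    """
    for s, t in product(points, repeat=2):
        if d(s, t) < 0:
            return "non-negativity"
        if abs(d(s, t) - d(t, s)) > eps:
            return "symmetry"
        if (d(s, t) <= eps) != (s == t):
            return "identity"
    for r, s, t in product(points, repeat=3):
        if d(r, t) > d(r, s) + d(s, t) + eps:
            return "triangle inequality"
    return None
```

For instance, running `first_violated_axiom` on all subsets of a three-element set with `jaccard_distance` finds no violation, while squared Euclidean distance on the numbers 0, 1, 2 fails the triangle inequality, since 4 > 1 + 1.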

Branch and Bound: Search Hierarchy

Database $S = \{p_1, \ldots, p_n\}$ is represented by a tree:
- Every node corresponds to a subset of $S$
- Root corresponds to $S$ itself
- Children's sets cover the parent's set
- Every node contains a description of its subtree providing an easily computable lower bound for $d(q, \cdot)$ in the corresponding subset

[Figure: example search tree with root $(p_1, p_2, p_3, p_4, p_5)$, children $(p_1, p_2, p_3)$ and $(p_4, p_5)$, and a further internal node $(p_1, p_2)$ above the leaves]

Branch and Bound: Range Search

Task: find all $i$ such that $d(p_i, q) \le r$:
1. Make a depth-first traversal of the search hierarchy
2. At every node compute the lower bound for its subtree
3. Prune branches with lower bounds above $r$

B&B: Nearest Neighbor Search

Task: find $\arg\min_i d(p_i, q)$:
1. Pick a random $p_i$, set $p_{NN} := p_i$, $r_{NN} := d(p_i, q)$
2. Start range search with range $r_{NN}$
3. Whenever we meet $p$ such that $d(p, q) < r_{NN}$, update $p_{NN} := p$, $r_{NN} := d(p, q)$

B&B: Best Bin First

Task: find $\arg\min_i d(p_i, q)$:
1. Pick a random $p_i$, set $p_{NN} := p_i$, $r_{NN} := d(p_i, q)$
2. Put the root node into the inspection queue
3. Every time: take the node with the smallest lower bound from the inspection queue, compute lower bounds for its children subtrees
4. Insert children with lower bound below $r_{NN}$ into the inspection queue; prune other children branches
5. Whenever we meet $p$ such that $d(p, q) < r_{NN}$, update $p_{NN} := p$, $r_{NN} := d(p, q)$
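
The range search and best-bin-first procedures above can be sketched on one concrete hierarchy. The slides leave the node description abstract; here, as an assumption, every node is described by a ball (a center taken from the database plus a covering radius), which gives the lower bound max(0, d(q, center) - radius) by the triangle inequality. The names `Node`, `build`, `lower_bound`, `range_search` and `best_first_nn`, the Euclidean distance, and the `leaf_size` parameter are all illustrative choices. The "random" starting object of the slides is replaced by the root center to keep the sketch deterministic.

```python
import heapq
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    center: tuple                                  # a database object describing the subtree
    radius: float                                  # every object in the subtree is within radius of center
    points: list = field(default_factory=list)     # non-empty only in leaves
    children: list = field(default_factory=list)   # non-empty only in internal nodes


def dist(a, b):
    return math.dist(a, b)  # Euclidean distance (assumed metric for this sketch)


def build(points, leaf_size=2):
    """Recursively halve the point set by distance to the node's center."""
    center = points[0]
    radius = max(dist(center, p) for p in points)
    if len(points) <= leaf_size:
        return Node(center, radius, points=list(points))
    ordered = sorted(points, key=lambda p: dist(center, p))
    mid = len(ordered) // 2
    return Node(center, radius, children=[build(ordered[:mid], leaf_size),
                                          build(ordered[mid:], leaf_size)])


def lower_bound(node, q):
    """Lower bound for d(q, x) over all x in the subtree (triangle inequality)."""
    return max(0.0, dist(q, node.center) - node.radius)


def range_search(node, q, r):
    """Depth-first traversal; branches whose lower bound exceeds r are pruned."""
    if lower_bound(node, q) > r:
        return []
    found = [p for p in node.points if dist(p, q) <= r]
    for child in node.children:
        found.extend(range_search(child, q, r))
    return found


def best_first_nn(root, q):
    """Best bin first: always inspect the queued node with the smallest lower bound."""
    best, r_nn = root.center, dist(root.center, q)
    counter = 0                                    # tie-breaker so nodes are never compared
    queue = [(lower_bound(root, q), counter, root)]
    while queue:
        bound, _, node = heapq.heappop(queue)
        if bound >= r_nn:                          # no queued node can improve the answer
            break
        for p in node.points:
            d = dist(p, q)
            if d < r_nn:
                best, r_nn = p, d
        for child in node.children:
            b = lower_bound(child, q)
            if b < r_nn:                           # otherwise prune this child branch
                counter += 1
                heapq.heappush(queue, (b, counter, child))
    return best, r_nn
```

A typical call would be `root = build(points)` followed by `best_first_nn(root, q)` or `range_search(root, q, r)`, with `points` a list of coordinate tuples. Any other node description works as long as `lower_bound` never exceeds the true distance to any object in the subtree.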

Some Tree-Based Data Structures

Sphere Rectangle Tree, k-d-b tree, Geometric near-neighbor access tree, Excluded middle vantage point forest, mvp-tree, Fixed-height fixed-queries tree, Vantage-point tree, R*-tree, Burkhard-Keller tree, BBD tree, Voronoi tree, Balanced aspect ratio tree, Metric tree, vps-tree, M-tree, SS-tree, R-tree, Spatial approximation tree, Multi-vantage point tree, Bisector tree, mb-tree, Generalized hyperplane tree, Hybrid tree, Slim tree, Spill Tree, Fixed queries tree, X-tree, k-d tree, Balltree, Quadtree, Octree, SR-tree, Post-office tree

Chapter III
Vantage-Point Trees and Relatives

Vantage-Point Partitioning

Uhlmann '91, Yianilos '93:
1. Choose some object $p$ in the database (called pivot)
2. Choose partitioning radius $r_p$
3. Put all $p_i$ such that $d(p_i, p) \le r_p$ into the inner part, others into the outer part
4. Recursively repeat

Pruning Conditions

For r-range search:
- If $d(q, p) > r_p + r$, prune the inner branch
- If $d(q, p) < r_p - r$, prune the outer branch
- For $r_p - r \le d(q, p) \le r_p + r$ we have to inspect both branches
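
The partitioning steps and the two pruning conditions above can be put together in a short sketch. The slides do not say how the pivot or the partitioning radius should be chosen; picking a random pivot and the median distance as $r_p$ is an assumption made here, as are the names `VPNode`, `build_vp` and `vp_range_search` and the use of Euclidean distance.

```python
import math
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class VPNode:
    pivot: tuple
    radius: float = 0.0                 # partitioning radius r_p
    inner: Optional["VPNode"] = None    # objects with d(p_i, pivot) <= r_p
    outer: Optional["VPNode"] = None    # objects with d(p_i, pivot) >  r_p


def build_vp(points, rng):
    """Build a vantage-point tree; rng is a random.Random used to pick pivots."""
    points = list(points)               # copy so the caller's list is not modified
    if not points:
        return None
    pivot = points.pop(rng.randrange(len(points)))
    node = VPNode(pivot)
    if points:
        dists = sorted(math.dist(pivot, p) for p in points)
        node.radius = dists[len(dists) // 2]   # assumed choice: median distance
        node.inner = build_vp([p for p in points if math.dist(pivot, p) <= node.radius], rng)
        node.outer = build_vp([p for p in points if math.dist(pivot, p) > node.radius], rng)
    return node


def vp_range_search(node, q, r, out):
    """Append to `out` every stored object within distance r of q."""
    if node is None:
        return out
    d = math.dist(q, node.pivot)
    if d <= r:
        out.append(node.pivot)
    if d <= node.radius + r:            # d > r_p + r would prune the inner branch
        vp_range_search(node.inner, q, r, out)
    if d >= node.radius - r:            # d < r_p - r would prune the outer branch
        vp_range_search(node.outer, q, r, out)
    return out
```

A typical call is `root = build_vp(points, random.Random(0))` followed by `vp_range_search(root, q, r, [])`. When the query ball straddles the partitioning sphere, both recursive calls are made, which is exactly the third case on the slide.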

Variations of Vantage-Point Trees

- Burkhard-Keller tree: pivot used to divide the space into m rings (Burkhard & Keller '73)
- MVP-tree: use the same pivot for different nodes in one level (Bozkaya & Ozsoyoglu '97)
- Post-office tree: use $r_p + \delta$ for the inner branch, $r_p - \delta$ for the outer branch (McNutt '72)

Chapter IV
Generalized Hyperplane Trees and Relatives

Generalized Hyperplane Tree

Partitioning technique (Uhlmann '91):
- Pick two objects (called pivots) $p_1$ and $p_2$
- Put all objects that are closer to $p_1$ than to $p_2$ into the left branch, others into the right branch
- Recursively repeat

GH-Tree: Pruning Conditions

For r-range search:
- If $d(q, p_1) > d(q, p_2) + 2r$, prune the left branch
- If $d(q, p_1) < d(q, p_2) - 2r$, prune the right branch
- For $|d(q, p_1) - d(q, p_2)| \le 2r$ we have to inspect both branches
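
The generalized hyperplane partitioning and its pruning conditions can be sketched in the same style. Choosing both pivots at random, sending ties to the right branch, and the names `GHNode`, `build_gh` and `gh_range_search` are assumptions made for illustration. The margin of $2r$ follows from applying the triangle inequality twice: if some $x$ in the left branch has $d(q, x) \le r$, then $d(q, p_1) \le d(x, p_1) + r \le d(x, p_2) + r \le d(q, p_2) + 2r$.

```python
import math
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class GHNode:
    p1: tuple
    p2: Optional[tuple] = None          # None when the node holds a single object
    left: Optional["GHNode"] = None     # objects strictly closer to p1 than to p2
    right: Optional["GHNode"] = None    # all other objects (ties included)


def build_gh(points, rng):
    """Build a generalized hyperplane tree; rng is a random.Random used to pick pivots."""
    points = list(points)               # copy so the caller's list is not modified
    if not points:
        return None
    p1 = points.pop(rng.randrange(len(points)))
    if not points:
        return GHNode(p1)
    p2 = points.pop(rng.randrange(len(points)))
    left = [x for x in points if math.dist(x, p1) < math.dist(x, p2)]
    right = [x for x in points if math.dist(x, p1) >= math.dist(x, p2)]
    return GHNode(p1, p2, build_gh(left, rng), build_gh(right, rng))


def gh_range_search(node, q, r, out):
    """Append to `out` every stored object within distance r of q."""
    if node is None:
        return out
    d1 = math.dist(q, node.p1)
    if d1 <= r:
        out.append(node.p1)
    if node.p2 is None:
        return out
    d2 = math.dist(q, node.p2)
    if d2 <= r:
        out.append(node.p2)
    if d1 <= d2 + 2 * r:                # d1 > d2 + 2r would prune the left branch
        gh_range_search(node.left, q, r, out)
    if d1 >= d2 - 2 * r:                # d1 < d2 - 2r would prune the right branch
        gh_range_search(node.right, q, r, out)
    return out
```

Unlike the vantage-point tree, no radius is stored at a node: the decision uses only the two distances from the query to the pivots.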

Bisector trees

Let's keep the covering radius for $p_1$ and the left branch, and for $p_2$ and the right branch: useful information for stronger pruning conditions.

Variation: the monotonous bisector tree (Noltemeier, Verbarg, Zirkelbach '92) always uses the parent pivot as one of the two children pivots.

Exercise: prove that covering radii decrease monotonically in mb-trees.

Geometric Near-Neighbor Access Tree

Brin '95:
- Use m pivots
- Branch $i$ consists of objects for which $p_i$ is the closest pivot
- Stores minimal and maximal distances from pivots to all "brother" branches

Chapter V
M-trees

M-tree: Data structure

Ciaccia, Patella, Zezula '97:
- All database objects are stored in leaf nodes (buckets of fixed size)
- Every internal node has an associated pivot, a covering radius and a legal range for the number of children
- Usual depth-first or best-first search
- Special algorithms for insertions and deletions a la B-tree

M-tree: Insertions

All insertions happen at the leaf nodes:
1. Choose the leaf node using the "minimal expansion of covering radius" principle
2. If the leaf node contains fewer than the maximum legal number of elements, there is room for one more. Insert; update all covering radii
3. Otherwise the leaf node is split into two nodes:
   1. Use two-pivot generalized hyperplane partitioning
   2. Both pivots are added to the node's parent, which may cause it to be split, and so on

Exercises

1. Prove that the Jaccard distance $d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$ satisfies the triangle inequality
2. Prove that covering radii decrease monotonically in mb-trees
3. Construct a database and a set of potential queries in some multidimensional Euclidean space for which all described data structures require $\Omega(n)$ nearest neighbor search time

Highlights

- Nearest neighbor search is fundamental for information retrieval, data mining, machine learning and recommendation systems
- Balls, generalized hyperplanes and Voronoi cells are used for space partitioning
- Depth-first and best-first strategies are used for search

Thanks for your attention! Questions?

References

Course homepage: http://simsearch.yury.name/tutorial.html

- Y. Lifshits. The Homepage of Nearest Neighbors and Similarity Search. http://simsearch.yury.name
- P. Zezula, G. Amato, V. Dohnal, M. Batko. Similarity Search: The Metric Space Approach. Springer, 2006. http://www.nmis.isti.cnr.it/amato/similarity-search-book/
- E. Chávez, G. Navarro, R. Baeza-Yates, J. L. Marroquín. Searching in Metric Spaces. ACM Computing Surveys, 2001. http://www.cs.ust.hk/~leichen/courses/com60j/readings/acm-survey/searchinmetric.pdf
- G. R. Hjaltason, H. Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 2003. http://www.cs.utexas.edu/~abhinay/ee8v/project/papers/ft gateway.cfm.pdf