Nearest Neighbor Search by Branch and Bound

Nearest Neighbor Search by Branch and Bound
Algorithmic Problems Around the Web #2
Yury Lifshits, http://yury.name
CalTech, Fall 07, CS101.2, http://yury.name/algoweb.html

Outline
1. Short Intro to Nearest Neighbors
2. Branch and Bound Methodology
3. Around Vantage-Point Trees
4. Generalized Hyperplane Trees and Relatives

Part I: Short Intro to Nearest Neighbors

Problem Statement
Search space: object domain U, similarity function σ
Input: database S = {p_1, ..., p_n} ⊆ U
Query: q ∈ U
Task: find argmax_{p_i} σ(p_i, q)
(Figure: example database points p_1, ..., p_6 and a query q in the plane)
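
Before any indexing, the task above can be solved by a linear scan; every tree method in this lecture is measured against this baseline. A minimal Python sketch (the names nearest_neighbor and similarity are illustrative, not from the slides):

```python
def nearest_neighbor(database, query, similarity):
    """Brute-force baseline: one similarity evaluation per database object, O(n) per query."""
    best, best_score = None, float("-inf")
    for p in database:
        s = similarity(p, query)
        if s > best_score:
            best, best_score = p, s
    return best

# Illustrative use with 2-D points and negative squared distance as the similarity
points = [(1.0, 2.0), (3.0, 1.0), (0.5, 0.5)]
sim = lambda p, q: -((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)
print(nearest_neighbor(points, (1.0, 1.0), sim))  # -> (0.5, 0.5)
```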

Applications (1/5): Information Retrieval
Content-based retrieval (magnetic resonance images, tomography, CAD shapes, time series, texts)
Spelling correction
Geographic databases (post-office problem)
Searching for similar DNA sequences
Related-pages web search
Semantic search, concept matching

Applications (2/5): Machine Learning
kNN classification rule: classify by the majority of the k nearest training examples, e.g. recognition of faces, fingerprints, speaker identity, optical characters
Nearest-neighbor interpolation
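
As an aside, the kNN rule itself takes only a few lines once distances can be computed; a tiny Python sketch (knn_classify and the toy data are illustrative):

```python
from collections import Counter

def knn_classify(train, query, k, dist):
    """train: list of (point, label) pairs; label query by majority vote of its k nearest points."""
    nearest = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Illustrative use with 1-D points
d = lambda a, b: abs(a - b)
train = [(1.0, "A"), (1.2, "A"), (2.9, "B"), (3.0, "B"), (3.1, "B")]
print(knn_classify(train, 1.1, 3, d))  # -> "A"
```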

Applications (3/5): Data Mining
Near-duplicate detection
Plagiarism detection
Computing co-occurrence similarity (for detecting synonyms, query extension, machine translation, ...)
Key difference: these are mostly off-line problems

Applications (4/5): Bipartite Problems
Recommendation systems (most relevant movie to a set of already watched ones)
Personalized news aggregation (most relevant news articles to a given user's profile of interests)
Behavioral targeting (most relevant ad for displaying to a given user)
Key differences: query and database objects have different natures; objects are described by features and connections

Applications (5/5): As a Subroutine
Coding theory (maximum-likelihood decoding)
MPEG compression (searching for similar fragments in the already compressed part)
Clustering

Variations of the Computation Task
Solution aspects:
Approximate nearest neighbors
Dynamic nearest neighbors: moving objects, deletes/inserts, changing similarity function

Related problems:
Nearest neighbor: nearest museum to my hotel
Reverse nearest neighbor: all museums for which my hotel is the nearest one
Range queries: all museums up to 2 km from my hotel
Closest pair: the closest pair of a museum and a hotel
Spatial join: pairs of hotels and museums that are at most 1 km apart
Multiple nearest neighbors: nearest museums for each of these hotels
Metric facility location: how to build hotels to minimize the sum of museum-to-nearest-hotel distances

Brief History
1908: Voronoi diagram
1967: kNN classification rule by Cover and Hart
1973: Post-office problem posed by Knuth
1997: The paper by Kleinberg, beginning of provable upper/lower bounds
2006: Similarity Search book by Zezula, Amato, Dohnal and Batko
2008: First International Workshop on Similarity Search. Consider submitting!

Part II: Branch and Bound Methodology

General Metric Space
Tell me the definition of a metric space.
M = (U, d), where the distance function d satisfies:
Non-negativity: for all s, t ∈ U: d(s, t) ≥ 0
Symmetry: for all s, t ∈ U: d(s, t) = d(t, s)
Identity: d(s, t) = 0 ⇔ s = t
Triangle inequality: for all r, s, t ∈ U: d(r, t) ≤ d(r, s) + d(s, t)

Basic examples:
An arbitrary metric space with oracle access to the distance function
k-dimensional Euclidean space with the Euclidean, weighted Euclidean, Manhattan or L_p metric
Strings with Hamming or Levenshtein distance

Metric Spaces: More Examples
Finite sets with the Jaccard metric d(A, B) = 1 - |A ∩ B| / |A ∪ B|
Correlated dimensions: quadratic-form (x̄ M ȳ) distance
Hausdorff distance for sets

Similarity spaces (no triangle inequality):
Multidimensional vectors with scalar-product similarity
Bipartite graphs, co-citation similarity for vertices in one part
Social networks with number-of-joint-friends similarity
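
The Jaccard metric is easy to compute directly from the definition; a small Python sketch (the function name is illustrative):

```python
def jaccard_distance(a: set, b: set) -> float:
    """Jaccard distance d(A, B) = 1 - |A ∩ B| / |A ∪ B| on finite sets."""
    if not a and not b:
        return 0.0  # convention: two empty sets are identical
    return 1.0 - len(a & b) / len(a | b)

print(jaccard_distance({"cat", "dog"}, {"dog", "fox"}))  # 1 - 1/3 ≈ 0.667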

Branch and Bound: Search Hierarchy
The database S = {p_1, ..., p_n} is represented by a tree:
Every node corresponds to a subset of S
The root corresponds to S itself
Children's sets cover the parent's set
Every node stores a description of its subtree that provides an easily computable lower bound for d(q, ·) over the corresponding subset
(Figure: example hierarchy with root (p_1, ..., p_5) and children (p_1, p_2, p_3), (p_4, p_5), etc.)

Branch and Bound: Range Search
Task: find all p_i with d(p_i, q) ≤ r:
1. Make a depth-first traversal of the search hierarchy
2. At every node, compute the lower bound for its subtree
3. Prune branches whose lower bounds exceed r
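
One concrete way to realize the "easily computable lower bound" is to store at every node a pivot object and a covering radius of its subtree; by the triangle inequality, max(0, d(q, pivot) - radius) then lower-bounds d(q, ·) inside the subtree. A minimal Python sketch of the depth-first range search under this assumption (the Node layout and all names are illustrative, not fixed by the slides):

```python
import math
from dataclasses import dataclass, field

def dist(a, b):
    """Euclidean distance; any metric obeying the triangle inequality would do."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

@dataclass
class Node:
    pivot: tuple                                   # representative object of this subtree
    radius: float                                  # covering radius of the subtree around the pivot
    points: list = field(default_factory=list)     # objects stored directly at this node
    children: list = field(default_factory=list)   # child nodes

def lower_bound(node, q):
    """Lower bound on d(q, p) for every p in node's subtree (triangle inequality)."""
    return max(0.0, dist(q, node.pivot) - node.radius)

def range_search(node, q, r, out):
    """Depth-first traversal; prune subtrees whose lower bound exceeds r."""
    if lower_bound(node, q) > r:
        return                                     # the whole subtree lies outside the query ball
    out.extend(p for p in node.points if dist(p, q) <= r)
    for child in node.children:
        range_search(child, q, r, out)
```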

B&B: Nearest Neighbor Search
Task: find argmin_{p_i} d(p_i, q):
1. Pick a random p_i, set p_NN := p_i, r_NN := d(p_i, q)
2. Start a range search with range r_NN
3. Whenever an object p with d(p, q) < r_NN is met, update p_NN := p, r_NN := d(p, q)
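
Reusing Node, dist and lower_bound from the range-search sketch above, the shrinking-radius nearest-neighbor search can be written as follows (a sketch; it initializes with the first database object rather than a random one):

```python
def nn_search(root, q, all_points):
    best = all_points[0]                 # step 1: start from some database object
    best_dist = dist(best, q)

    def visit(node):                     # step 2: range search with the current radius best_dist
        nonlocal best, best_dist
        if lower_bound(node, q) >= best_dist:
            return                       # prune: nothing here can beat the current best
        for p in node.points:
            d = dist(p, q)
            if d < best_dist:            # step 3: shrink the radius whenever a closer object is met
                best, best_dist = p, d
        for child in node.children:
            visit(child)

    visit(root)
    return best
```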

B&B: Best Bin First
Task: find argmin_{p_i} d(p_i, q):
1. Pick a random p_i, set p_NN := p_i, r_NN := d(p_i, q)
2. Put the root node into the inspection queue
3. Repeatedly take the node with the smallest lower bound out of the inspection queue and compute lower bounds for its children's subtrees
4. Insert children with lower bound below r_NN into the inspection queue; prune the other children's branches
5. Whenever an object p with d(p, q) < r_NN is met, update p_NN := p, r_NN := d(p, q)
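
The best-first variant replaces the depth-first traversal with a priority queue ordered by lower bound; again reusing Node, dist and lower_bound from the range-search sketch, and Python's heapq (a sketch, not the slides' own code):

```python
import heapq
from itertools import count

def best_bin_first(root, q, all_points):
    best = all_points[0]
    best_dist = dist(best, q)
    tie = count()                                       # tie-breaker so the heap never compares Nodes
    queue = [(lower_bound(root, q), next(tie), root)]
    while queue:
        lb, _, node = heapq.heappop(queue)              # node with the smallest lower bound
        if lb >= best_dist:
            break                                       # every remaining node is prunable
        for p in node.points:
            d = dist(p, q)
            if d < best_dist:
                best, best_dist = p, d
        for child in node.children:
            clb = lower_bound(child, q)
            if clb < best_dist:                         # insert only promising children
                heapq.heappush(queue, (clb, next(tie), child))
    return best
```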

Part III: Vantage-Point Trees and Relatives

Vantage-Point Partitioning
Uhlmann '91, Yianilos '93:
1. Choose some object p in the database (called the pivot)
2. Choose a partitioning radius r_p
3. Put every p_i with d(p_i, p) ≤ r_p into the inner part, the others into the outer part
4. Recursively repeat
(Figure: a ball of radius r_p around the pivot p)
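
A minimal sketch of the recursive vantage-point partitioning, using a random pivot and the median distance as the partitioning radius (both are common heuristics, not prescribed by the slide; dist is the metric from the earlier sketch):

```python
import random
from statistics import median

def build_vp_tree(points, leaf_size=4):
    """Nested dicts: {'pivot', 'radius', 'inner', 'outer'} or {'leaf': [...]} at the bottom."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    pivot = random.choice(points)                        # 1. choose a pivot
    rest = [p for p in points if p is not pivot]
    r_p = median(dist(pivot, p) for p in rest)           # 2. choose the partitioning radius
    inner = [p for p in rest if dist(pivot, p) <= r_p]   # 3. inner part: within radius r_p
    outer = [p for p in rest if dist(pivot, p) > r_p]    #    outer part: the remaining objects
    return {"pivot": pivot, "radius": r_p,
            "inner": build_vp_tree(inner, leaf_size),    # 4. recursively repeat
            "outer": build_vp_tree(outer, leaf_size)}
```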

Pruning Conditions
For r-range search:
If d(q, p) > r_p + r, prune the inner branch
If d(q, p) < r_p - r, prune the outer branch
For r_p - r ≤ d(q, p) ≤ r_p + r we have to inspect both branches
(Figure: the three positions of the query q relative to the ball of radius r_p around p)
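
Combining the construction above with these pruning rules gives the r-range search on a vantage-point tree; a sketch using the same dict layout and dist as before:

```python
def vp_range_search(node, q, r, out):
    if "leaf" in node:
        out.extend(p for p in node["leaf"] if dist(p, q) <= r)
        return
    d_qp = dist(q, node["pivot"])
    if d_qp <= r:
        out.append(node["pivot"])              # the pivot itself may be an answer
    if d_qp <= node["radius"] + r:             # otherwise d(q, p) > r_p + r: inner branch pruned
        vp_range_search(node["inner"], q, r, out)
    if d_qp >= node["radius"] - r:             # otherwise d(q, p) < r_p - r: outer branch pruned
        vp_range_search(node["outer"], q, r, out)
```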

Variations of Vantage-Point Trees
Burkhard-Keller tree: the pivot is used to divide the space into m rings (Burkhard & Keller '73)
MVP-tree: use the same pivot for different nodes on one level (Bozkaya & Ozsoyoglu '97)
Post-office tree: use r_p + δ for the inner branch and r_p - δ for the outer branch (McNutt '72)

Part IV: Generalized Hyperplane Trees and Relatives

Generalized Hyperplane Tree
Partitioning technique (Uhlmann '91):
Pick two objects (called pivots) p_1 and p_2
Put all objects that are closer to p_1 than to p_2 into the left branch, the others into the right branch
Recursively repeat
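
A corresponding sketch of generalized hyperplane partitioning, with pivots chosen at random (the slide does not fix a selection rule; dist is the metric from the earlier sketches):

```python
import random

def build_gh_tree(points, leaf_size=4):
    """Nested dicts: {'p1', 'p2', 'left', 'right'} or {'leaf': [...]} at the bottom."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    p1, p2 = random.sample(points, 2)                        # pick two pivots
    rest = [p for p in points if p is not p1 and p is not p2]
    left = [p for p in rest if dist(p, p1) <= dist(p, p2)]   # closer to p1 (ties go left)
    right = [p for p in rest if dist(p, p1) > dist(p, p2)]   # closer to p2
    return {"p1": p1, "p2": p2,
            "left": build_gh_tree(left, leaf_size),          # recursively repeat
            "right": build_gh_tree(right, leaf_size)}
```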

GH-Tree: Pruning Conditions
For r-range search:
If d(q, p_1) > d(q, p_2) + 2r, prune the left branch
If d(q, p_1) < d(q, p_2) - 2r, prune the right branch
For |d(q, p_1) - d(q, p_2)| ≤ 2r we have to inspect both branches
(Figure: the query q near the generalized hyperplane between p_1 and p_2, at distance at least (d(p_1, q) - d(p_2, q))/2 from it)
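
And a sketch of the r-range search with these hyperplane pruning rules (same dict layout and dist as above):

```python
def gh_range_search(node, q, r, out):
    if "leaf" in node:
        out.extend(p for p in node["leaf"] if dist(p, q) <= r)
        return
    d1, d2 = dist(q, node["p1"]), dist(q, node["p2"])
    for pivot, d in ((node["p1"], d1), (node["p2"], d2)):
        if d <= r:
            out.append(pivot)                  # the pivots themselves may be answers
    if d1 <= d2 + 2 * r:                       # otherwise d(q, p1) > d(q, p2) + 2r: left pruned
        gh_range_search(node["left"], q, r, out)
    if d1 >= d2 - 2 * r:                       # otherwise d(q, p1) < d(q, p2) - 2r: right pruned
        gh_range_search(node["right"], q, r, out)
```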

Bisector Trees
Keep the covering radius for p_1 and the left branch and for p_2 and the right branch: useful information for stronger pruning conditions.
Variation: the monotonous bisector tree (Noltemeier, Verbarg, Zirkelbach '92) always uses the parent pivot as one of the two children's pivots.
Exercise: prove that covering radii monotonically decrease in mb-trees.

Geometric Near-Neighbor Access Tree (GNAT)
Brin '95:
Use m pivots
Branch i consists of the objects for which p_i is the closest pivot
Store the minimal and maximal distances from each pivot to every "brother" branch
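
A rough sketch of GNAT construction under the same conventions as the earlier sketches (random pivots instead of Brin's spread-out selection; only the min/max distance table is built here, the pruning that uses it is omitted):

```python
import random

def build_gnat(points, m=3, leaf_size=8):
    """Nested dicts: {'pivots', 'ranges', 'branches'} or {'leaf': [...]}.
    ranges[i][j] = (min, max) distance from pivot i to the objects of branch j."""
    if len(points) <= leaf_size or len(points) <= m:
        return {"leaf": points}
    pivots = random.sample(points, m)
    rest = [p for p in points if all(p is not piv for piv in pivots)]
    branches = [[] for _ in range(m)]
    for p in rest:
        i = min(range(m), key=lambda i: dist(p, pivots[i]))   # assign p to its closest pivot
        branches[i].append(p)
    ranges = [[(min((dist(pivots[i], p) for p in grp), default=0.0),
                max((dist(pivots[i], p) for p in grp), default=0.0))
               for grp in branches]
              for i in range(m)]
    return {"pivots": pivots, "ranges": ranges,
            "branches": [build_gnat(grp, m, leaf_size) for grp in branches]}
```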

Exercises
Prove that the Jaccard distance d(A, B) = 1 - |A ∩ B| / |A ∪ B| satisfies the triangle inequality
Prove that covering radii monotonically decrease in mb-trees
Construct a database and a set of potential queries in some multidimensional Euclidean space for which all described data structures require Ω(n) nearest-neighbor search time

Highlights
Nearest neighbor search is fundamental for information retrieval, data mining, machine learning and recommendation systems
Balls, generalized hyperplanes and Voronoi cells are used for space partitioning
Depth-first and best-first strategies are used for search
Thanks for your attention! Questions?

References
Course homepage: http://yury.name/algoweb.html
P. Zezula, G. Amato, V. Dohnal, M. Batko. Similarity Search: The Metric Space Approach. Springer, 2006. http://www.nmis.isti.cnr.it/amato/similarity-search-book/
E. Chávez, G. Navarro, R. Baeza-Yates, J. L. Marroquín. Searching in Metric Spaces. ACM Computing Surveys, 2001. http://www.cs.ust.hk/~leichen/courses/comp630j/readings/acm-survey/searchinmetric.pdf
G. R. Hjaltason, H. Samet. Index-Driven Similarity Search in Metric Spaces. ACM Transactions on Database Systems, 2003. http://www.cs.utexas.edu/~abhinay/ee382v/project/papers/ft gateway.cfm.pdf