
Branch and Bound Algorithms for Nearest Neighbor Search: Lecture 1

Yury Lifshits, http://yury.name
Steklov Institute of Mathematics at St.Petersburg
California Institute of Technology

Outline
1. Welcome to Nearest Neighbors!
2. Branch and Bound Methodology
3. Around Vantage-Point Trees
4. Generalized Hyperplane Trees and Relatives
5. M-Trees

Chapter I: Welcome to Nearest Neighbors!

Informal Statement
To preprocess a database of n objects so that, given a query object, one can effectively determine its nearest neighbors in the database.

More Formally
- Search space: object domain $U$, similarity function $\sigma$
- Input: database $S = \{p_1, \ldots, p_n\} \subseteq U$
- Query: $q \in U$
- Task: find $\arg\max_{p_i} \sigma(p_i, q)$

Applications (1/5): Information Retrieval
- Content-based retrieval (magnetic resonance images, tomography, CAD shapes, time series, texts)
- Spelling correction
- Geographic databases (post-office problem)
- Searching for similar DNA sequences
- Related pages web search
- Semantic search, concept matching

Applications (2/5): Machine Learning
- kNN classification rule: classify by majority of the k nearest training examples, e.g. recognition of faces, fingerprints, speaker identity, optical characters
- Nearest-neighbor interpolation

Applications (3/5): Data Mining
- Near-duplicate detection
- Plagiarism detection
- Computing co-occurrence similarity (for detecting synonyms, query extension, machine translation...)
Key difference: mostly off-line problems
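The formal task and the kNN classification rule above can be illustrated with a plain linear scan, which is the baseline that any index structure has to beat. This is only a minimal sketch: it assumes a distance function (minimized) rather than the similarity function $\sigma$ (maximized) of the slide, and the names `nearest_neighbor`, `knn_classify` and `euclidean` are illustrative, not taken from the lecture.

```python
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def nearest_neighbor(database, query, dist):
    """Linear scan: return the database object closest to `query` (n distance calls)."""
    return min(database, key=lambda p: dist(p, query))


def knn_classify(training, query, dist, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `training` is a list of (object, label) pairs.
    """
    neighbors = sorted(training, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    points = [(0, 0), (5, 5), (2, 1)]
    print(nearest_neighbor(points, (2, 2), euclidean))
    labelled = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"), ((5, 5), "b"), ((6, 5), "b")]
    print(knn_classify(labelled, (0.5, 0.5), euclidean, k=3))
```

Every query here costs n distance computations; the tree structures discussed in the lecture aim to avoid most of them.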

Applications (4/5): Bipartite Problems
- Recommendation systems (most relevant movie to a set of already watched ones)
- Personalized news aggregation (most relevant news articles to a given user's profile of interests)
- Behavioral targeting (most relevant ad for displaying to a given user)
Key differences:
- Query and database objects have different nature
- Objects are described by features and connections

Applications (5/5): As a Subroutine
- Coding theory (maximum likelihood decoding)
- MPEG compression (searching for similar fragments in the already compressed part)
- Clustering

Variations of the Computation Task
Solution aspects:
- Approximate nearest neighbors
- Dynamic nearest neighbors: moving objects, deletes/inserts, changing similarity function
Related problems:
- Nearest neighbor: nearest museum to my hotel
- Reverse nearest neighbor: all museums for which my hotel is the nearest one
- Range queries: all museums within a given number of km from my hotel
- Closest pair: closest pair of museum and hotel
- Spatial join: pairs of hotels and museums which are at most a given number of km apart
- Multiple nearest neighbors: nearest museums for each of these hotels
- Metric facility location: how to build hotels to minimize the sum of museum-to-nearest-hotel distances

Brief History
- 1908: Voronoi diagram
- 1967: kNN classification rule by Cover and Hart
- 1973: Post-office problem posed by Knuth
- 1997: The paper by Kleinberg, beginning of provable upper/lower bounds
- 2006: Similarity Search book by Zezula, Amato, Dohnal and Batko
- 2008: First International Workshop on Similarity Search. Consider submitting!

Tutorial Outline
Four lectures:
1. Branch-and-bound: various tree-based data structures for general metric space
2. Other use of triangle inequality: walks, matrix methods, specific tricks for Euclidean space
3. Mapping-based techniques: locality-sensitive hashing, random projections
4. Restrictions on input: intrinsic dimension, probabilistic analysis and open problems
Not covered: low-dimensional solutions, experimental results, parallelization, I/O complexity, lower bounds, applications

Chapter II: Branch and Bound Methodology

General Metric Space
Tell me the definition of a metric space.
$M = (U, d)$, where the distance function $d$ satisfies:
- Non-negativity: $\forall s, t \in U: d(s, t) \ge 0$
- Symmetry: $\forall s, t \in U: d(s, t) = d(t, s)$
- Identity: $d(s, t) = 0 \Leftrightarrow s = t$
- Triangle inequality: $\forall r, s, t \in U: d(r, t) \le d(r, s) + d(s, t)$
Basic examples:
- Arbitrary metric space, oracle access to distance function
- k-dimensional Euclidean space with Euclidean, weighted Euclidean, Manhattan or $L_p$ metric
- Strings with Hamming or Levenshtein distance

Metric Spaces: More Examples
- Finite sets with Jaccard metric $d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$
- Correlated dimensions: $\bar{x} M \bar{y}$ distance
- Hausdorff distance for sets
Similarity spaces (no triangle inequality):
- Multidimensional vectors with scalar product similarity
- Bipartite graph, co-citations similarity for vertices in one part
- Social networks with number of joint friends similarity
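To make the metric axioms concrete, the sketch below implements three of the distances named above (Jaccard, Hamming, Levenshtein) and a brute-force checker that tests the four axioms on a finite sample. The checker can only refute the axioms on the given sample, never prove them in general. All function names are illustrative assumptions, and defining the Jaccard distance of two empty sets as 0 is a convention chosen here, not something stated in the slides.

```python
from itertools import product


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for finite sets (0 for two empty sets, by convention)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(s, t))


def levenshtein_distance(s, t):
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        cur = [i]
        for j, y in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def satisfies_metric_axioms(objects, dist, eps=1e-12):
    """Check non-negativity, symmetry, identity and triangle inequality on a sample."""
    for s, t in product(objects, repeat=2):
        d = dist(s, t)
        if d < 0 or abs(d - dist(t, s)) > eps:
            return False
        if (d == 0) != (s == t):
            return False
    return all(dist(r, t) <= dist(r, s) + dist(s, t) + eps
               for r, s, t in product(objects, repeat=3))


if __name__ == "__main__":
    sets = [frozenset(x) for x in ({1}, {1, 2}, {2, 3}, {1, 2, 3}, set())]
    print(satisfies_metric_axioms(sets, jaccard_distance))
    # A negated scalar product is not a metric: it is already negative on (1, 0), (1, 0).
    neg_dot = lambda a, b: -sum(x * y for x, y in zip(a, b))
    print(satisfies_metric_axioms([(1, 0), (0, 1)], neg_dot))
```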

Branch and Bound: Search Hierarchy
Database $S = \{p_1, \ldots, p_n\}$ is represented by a tree:
- Every node corresponds to a subset of $S$
- Root corresponds to $S$ itself
- Children's sets cover parent's set
- Every node contains a description of its subtree providing an easily computable lower bound for $d(q, \cdot)$ in the corresponding subset
[Figure: example hierarchy with root $(p_1, p_2, p_3, p_4, p_5)$, children $(p_1, p_2, p_3)$ and $(p_4, p_5)$, and below them $(p_1, p_2)$, $p_3$, $p_4$, $p_5$]

Branch and Bound: Range Search
Task: find all $p_i$ with $d(p_i, q) \le r$:
1. Make a depth-first traversal of the search hierarchy
2. At every node compute the lower bound for its subtree
3. Prune branches with lower bounds above $r$

B&B: Nearest Neighbor Search
Task: find $\arg\min_{p_i} d(p_i, q)$:
1. Pick a random $p_i$, set $p_{NN} := p_i$, $r_{NN} := d(p_i, q)$
2. Start range search with range $r_{NN}$
3. Whenever we meet $p$ such that $d(p, q) < r_{NN}$, update $p_{NN} := p$, $r_{NN} := d(p, q)$

B&B: Best Bin First
Task: find $\arg\min_{p_i} d(p_i, q)$:
1. Pick a random $p_i$, set $p_{NN} := p_i$, $r_{NN} := d(p_i, q)$
2. Put the root node into the inspection queue
3. Every time: take the node with the smallest lower bound from the inspection queue, compute lower bounds for its children subtrees
4. Insert children with lower bound below $r_{NN}$ into the inspection queue; prune other children branches
5. Whenever we meet $p$ such that $d(p, q) < r_{NN}$, update $p_{NN} := p$, $r_{NN} := d(p, q)$
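The range search and best-bin-first procedures above can be sketched on one concrete search hierarchy. The `Node` class below is an assumption made for illustration: each node keeps a pivot and a covering radius and uses max(0, d(q, pivot) - radius) as its lower bound. The slides leave the node description abstract, so any other valid lower bound could be substituted. The sketch also starts from the root pivot instead of a random object, to keep the output deterministic.

```python
import heapq
from itertools import count


class Node:
    """Search-hierarchy node: a subset of the database plus a ball that covers it."""

    def __init__(self, objects, dist, leaf_size=2):
        self.objects = objects
        self.pivot = objects[0]
        self.radius = max(dist(self.pivot, p) for p in objects)
        self.children = []
        if len(objects) > leaf_size:
            # Split by distance to the pivot; the two halves cover the parent's set.
            ordered = sorted(objects, key=lambda p: dist(self.pivot, p))
            mid = len(ordered) // 2
            self.children = [Node(ordered[:mid], dist, leaf_size),
                             Node(ordered[mid:], dist, leaf_size)]

    def lower_bound(self, q, dist):
        """No object in this subtree can be closer to q than this value."""
        return max(0.0, dist(q, self.pivot) - self.radius)


def range_search(node, q, r, dist):
    """Depth-first traversal that prunes subtrees whose lower bound exceeds r."""
    if node.lower_bound(q, dist) > r:
        return []
    if not node.children:
        return [p for p in node.objects if dist(p, q) <= r]
    return [p for child in node.children for p in range_search(child, q, r, dist)]


def best_first_nn(root, q, dist):
    """Best bin first: always expand the queued node with the smallest lower bound."""
    nn, r_nn = root.pivot, dist(root.pivot, q)
    tie = count()  # tie-breaker so the heap never has to compare Node objects
    queue = [(root.lower_bound(q, dist), next(tie), root)]
    while queue:
        bound, _, node = heapq.heappop(queue)
        if bound >= r_nn:
            break  # every remaining node has an even larger lower bound
        if not node.children:
            for p in node.objects:
                d = dist(p, q)
                if d < r_nn:
                    nn, r_nn = p, d
        else:
            for child in node.children:
                b = child.lower_bound(q, dist)
                if b < r_nn:
                    heapq.heappush(queue, (b, next(tie), child))
    return nn, r_nn


if __name__ == "__main__":
    dist = lambda a, b: abs(a - b)
    root = Node([1, 4, 9, 16, 25, 36, 49], dist)
    print(sorted(range_search(root, 20, 5, dist)))
    print(best_first_nn(root, 20, dist))
```

The early `break` is what distinguishes best-first from depth-first search: once the smallest queued lower bound reaches $r_{NN}$, nothing left in the queue can improve the answer.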

Some Tree-Based Data Structures
Sphere Rectangle Tree, k-d-b tree, Geometric near-neighbor access tree, Excluded middle vantage point forest, mvp-tree, Fixed-height fixed-queries tree, Vantage-point tree, R*-tree, Burkhard-Keller tree, BBD tree, Voronoi tree, Balanced aspect ratio tree, Metric tree, vp^s-tree, M-tree, SS-tree, R-tree, Spatial approximation tree, Multi-vantage point tree, Bisector tree, mb-tree, Generalized hyperplane tree, Hybrid tree, Slim tree, Spill Tree, Fixed queries tree, X-tree, k-d tree, Balltree, Quadtree, Octree, SR-tree, Post-office tree

Chapter III: Vantage-Point Trees and Relatives

Vantage-Point Partitioning
Uhlmann 91, Yianilos 93:
1. Choose some object $p$ in the database (called the pivot)
2. Choose a partitioning radius $r_p$
3. Put all $p_i$ such that $d(p_i, p) \le r_p$ into the inner part, others into the outer part
4. Recursively repeat

Pruning Conditions
For r-range search:
- If $d(q, p) > r_p + r$, prune the inner branch
- If $d(q, p) < r_p - r$, prune the outer branch
- For $r_p - r \le d(q, p) \le r_p + r$ we have to inspect both branches
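The partitioning steps and pruning conditions above translate fairly directly into code. The following is a sketch under stated assumptions, not the construction from the cited papers: the pivot is simply the first object of each subset, the partitioning radius $r_p$ is taken as the median distance to the pivot, and the pivot is stored in the node itself instead of being passed down to the inner part. `VPNode` and `vp_range_search` are illustrative names.

```python
import math


class VPNode:
    """Vantage-point tree node: a pivot, a partitioning radius and two subtrees."""

    def __init__(self, objects, dist):
        self.pivot = objects[0]          # assumption: first object is the pivot
        self.radius = 0.0
        self.inner = self.outer = None
        rest = objects[1:]
        if rest:
            dists = sorted(dist(p, self.pivot) for p in rest)
            self.radius = dists[len(dists) // 2]   # assumption: median as r_p
            inner = [p for p in rest if dist(p, self.pivot) <= self.radius]
            outer = [p for p in rest if dist(p, self.pivot) > self.radius]
            self.inner = VPNode(inner, dist) if inner else None
            self.outer = VPNode(outer, dist) if outer else None


def vp_range_search(node, q, r, dist, out=None):
    """Collect all stored objects within distance r of q."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(q, node.pivot)
    if d <= r:
        out.append(node.pivot)
    if d <= node.radius + r:      # otherwise d > r_p + r: inner branch is pruned
        vp_range_search(node.inner, q, r, dist, out)
    if d >= node.radius - r:      # otherwise d < r_p - r: outer branch is pruned
        vp_range_search(node.outer, q, r, dist, out)
    return out


if __name__ == "__main__":
    points = [(float(x), float(y)) for x in range(6) for y in range(6)]
    tree = VPNode(points, math.dist)
    print(sorted(vp_range_search(tree, (2.3, 3.1), 1.6, math.dist)))
```

Both pruning tests are direct consequences of the triangle inequality, so the same code works for any metric passed in as `dist`, not only for Euclidean points.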

Variations of Vantage-Point Trees
- Burkhard-Keller tree: pivot used to divide the space into m rings (Burkhard & Keller 73)
- MVP-tree: use the same pivot for different nodes in one level (Bozkaya & Ozsoyoglu 97)
- Post-office tree: use $r_p + \delta$ for the inner branch, $r_p - \delta$ for the outer branch (McNutt 72)

Chapter IV: Generalized Hyperplane Trees and Relatives

Generalized Hyperplane Tree
Partitioning technique (Uhlmann 91):
1. Pick two objects (called pivots) $p_1$ and $p_2$
2. Put all objects that are closer to $p_1$ than to $p_2$ into the left branch, others into the right branch
3. Recursively repeat

GH-Tree: Pruning Conditions
For r-range search:
- If $d(q, p_1) > d(q, p_2) + 2r$, prune the left branch
- If $d(q, p_1) < d(q, p_2) - 2r$, prune the right branch
- For $|d(q, p_1) - d(q, p_2)| \le 2r$ we have to inspect both branches
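A generalized hyperplane tree with the pruning conditions above might look as follows. This is a sketch with several assumptions: the two pivots are the first two objects of each subset, ties in distance go to the left branch, and the pruning threshold is $2r$. The factor 2 follows from applying the triangle inequality twice: an object x in the left branch with d(q, x) <= r forces d(q, p1) <= d(q, p2) + 2r. `GHNode` and `gh_range_search` are illustrative names.

```python
import math


class GHNode:
    """Generalized hyperplane tree node with two pivots p1 and p2."""

    def __init__(self, objects, dist):
        self.p1 = objects[0]
        self.p2 = objects[1] if len(objects) > 1 else None
        self.left = self.right = None
        rest = objects[2:]               # empty whenever p2 is None
        left = [x for x in rest if dist(x, self.p1) <= dist(x, self.p2)]
        right = [x for x in rest if dist(x, self.p1) > dist(x, self.p2)]
        if left:
            self.left = GHNode(left, dist)
        if right:
            self.right = GHNode(right, dist)


def gh_range_search(node, q, r, dist, out=None):
    """Collect all stored objects within distance r of q."""
    if out is None:
        out = []
    if node is None:
        return out
    d1 = dist(q, node.p1)
    if d1 <= r:
        out.append(node.p1)
    if node.p2 is None:
        return out
    d2 = dist(q, node.p2)
    if d2 <= r:
        out.append(node.p2)
    if d1 <= d2 + 2 * r:      # otherwise the left branch is pruned
        gh_range_search(node.left, q, r, dist, out)
    if d1 >= d2 - 2 * r:      # otherwise the right branch is pruned
        gh_range_search(node.right, q, r, dist, out)
    return out


if __name__ == "__main__":
    points = [(float(x), float(y)) for x in range(6) for y in range(6)]
    tree = GHNode(points, math.dist)
    print(sorted(gh_range_search(tree, (2.3, 3.1), 1.6, math.dist)))
```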

Bisector Trees
Let's keep the covering radius for $p_1$ and the left branch, and for $p_2$ and the right branch: useful information for stronger pruning conditions.
Variation: the monotonous bisector tree (Noltemeier, Verbarg, Zirkelbach 92) always uses the parent pivot as one of the two children pivots.
Exercise: prove that covering radii decrease monotonically in mb-trees.

Geometric Near-Neighbor Access Tree
Brin 95:
- Use m pivots
- Branch i consists of the objects for which $p_i$ is the closest pivot
- Stores minimal and maximal distances from pivots to all "brother" branches

Chapter V: M-Trees

M-tree: Data Structure
Ciaccia, Patella, Zezula 97:
- All database objects are stored in leaf nodes (buckets of fixed size)
- Every internal node has an associated pivot, a covering radius and a legal range for the number of children
- Usual depth-first or best-first search
- Special algorithms for insertions and deletions à la B-tree
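To see why a covering radius gives stronger pruning, one can compute the lower bounds available at a bisector-tree node. The helper below is a hedged sketch: `bisector_lower_bounds` and its parameters are illustrative names, and both bounds are derived here from the triangle inequality rather than quoted from the slides. For the left branch, the hyperplane test gives (d(q, p1) - d(q, p2)) / 2 and the covering radius gives d(q, p1) - cov1; the larger of the two is used, and a branch can be pruned in r-range search as soon as its bound exceeds r.

```python
def bisector_lower_bounds(d1, d2, cov1, cov2):
    """Lower bounds on the distance from q to any object in the left / right branch.

    d1, d2     : distances from the query q to the pivots p1 and p2
    cov1, cov2 : covering radii, i.e. the largest distance from p1 (resp. p2)
                 to an object stored in the left (resp. right) branch
    """
    left = max(0.0, (d1 - d2) / 2, d1 - cov1)
    right = max(0.0, (d2 - d1) / 2, d2 - cov2)
    return left, right


if __name__ == "__main__":
    # Points on a line: p1 = 0 with left branch {1, 2, 3}, p2 = 10 with right branch {8, 9, 12}.
    q = 6
    left, right = bisector_lower_bounds(abs(q - 0), abs(q - 10), cov1=3, cov2=2)
    print(left, right)
```

In this small example the hyperplane term alone would bound the left branch by 1, while the covering radius raises the bound to 3, which is the true distance from the query to the closest left-branch object.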

M-tree: Insertions
All insertions happen at the leaf nodes:
1. Choose the leaf node using the "minimal expansion of covering radius" principle
2. If the leaf node contains fewer than the maximum legal number of elements, there is room for one more. Insert; update all covering radii
3. Otherwise the leaf node is split into two nodes:
   - Use two pivots and generalized hyperplane partitioning
   - Both pivots are added to the node's parent, which may cause it to be split, and so on

Exercises
1. Prove that the Jaccard distance $d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$ satisfies the triangle inequality
2. Prove that covering radii decrease monotonically in mb-trees
3. Construct a database and a set of potential queries in some multidimensional Euclidean space for which all described data structures require $\Omega(n)$ nearest neighbor search time

Highlights
- Nearest neighbor search is fundamental for information retrieval, data mining, machine learning and recommendation systems
- Balls, generalized hyperplanes and Voronoi cells are used for space partitioning
- Depth-first and best-first strategies are used for search
Thanks for your attention! Questions?

References
- Course homepage: http://simsearch.yury.name/tutorial.html
- Y. Lifshits. The Homepage of Nearest Neighbors and Similarity Search. http://simsearch.yury.name
- P. Zezula, G. Amato, V. Dohnal, M. Batko. Similarity Search: The Metric Space Approach. Springer, 2006. http://www.nmis.isti.cnr.it/amato/similarity-search-book/
- E. Chávez, G. Navarro, R. Baeza-Yates, J. L. Marroquín. Searching in Metric Spaces. ACM Computing Surveys, 2001. http://www.cs.ust.hk/~leichen/courses/com60j/readings/acm-survey/searchinmetric.pdf
- G.R. Hjaltason, H. Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 2003. http://www.cs.utexas.edu/~abhinay/ee8v/project/papers/ft gateway.cfm.pdf