Scalable Nearest Neighbor Algorithms for High Dimensional Data Marius Muja (UBC), David G. Lowe (Google) IEEE 2014


Scalable Nearest Neighbor Algorithms for High Dimensional Data
Marius Muja (UBC), David G. Lowe (Google), IEEE 2014
Presenter: Derrick Blakely, Department of Computer Science, University of Virginia
https://qdata.github.io/deep2read/

Roadmap
1. Background
2. Motivation
3. Scalable Nearest Neighbor Algorithms for High Dimensional Data
4. Results
5. Conclusion and Take-Aways

Nearest Neighbor Search
Input: a set P of points in d dimensions and a query point q
Output: the nearest neighbor p* of q in P

Exact Solutions
- Linear search: O(nd), where n = |P|
- Voronoi diagrams / Delaunay triangulation for d = 2: O(log n) query time
- kd-trees and other partitioning trees
- BSP-trees (binary space partitions)
- R-trees (overlapping boxes)
- Ball trees (partition into hyperspheres)
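The linear-search baseline is simple enough to sketch directly. Below is a minimal NumPy version (the function and variable names are our own, not from the paper):

```python
import numpy as np

def linear_search(P, q):
    """Exact nearest neighbor by brute force: O(n*d) per query."""
    # Squared Euclidean distance from q to every point in P.
    dists = np.sum((P - q) ** 2, axis=1)
    return int(np.argmin(dists))

rng = np.random.default_rng(0)
P = rng.random((1000, 16))   # n = 1000 points, d = 16
q = rng.random(16)
idx = linear_search(P, q)
```

Every approximate method discussed below is ultimately judged against this baseline's answer and its O(nd) cost.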

Exact Solutions: kd-trees
Search time degenerates to that of a linear scan once roughly d > 20 (the curse of dimensionality)
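In low dimensions, an exact kd-tree query is straightforward with an off-the-shelf implementation; this sketch assumes SciPy is available and cross-checks the tree's answer against brute force:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
P = rng.random((1000, 3))    # low-dimensional: kd-trees work well here
q = rng.random(3)

tree = cKDTree(P)            # build the kd-tree over P
dist, idx = tree.query(q, k=1)   # exact nearest neighbor query

# Cross-check against the O(nd) linear scan.
brute = int(np.argmin(np.linalg.norm(P - q, axis=1)))
```

Repeating this experiment with d in the hundreds is an easy way to observe the query time creeping toward that of the linear scan.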

Approximate Nearest Neighbor (ANN) 1. Partitioning Trees (kd-trees, etc.) 2. Locality Sensitive Hashing 3. Nearest Neighbor Graphs

Partitioning Trees
- All are different flavors of using a BST for point location: kd-trees, BSP-trees, R-trees, ball trees, etc.
- Randomly perturb the query point and return the point whose cell the query lands in
- Randomized forests, with trees searched in parallel

Kd-Tree Random Forests
- Build multiple randomized kd-trees and search them in parallel
- The splitting dimension is sampled from the top N dimensions (by variance) of each remaining subset whenever a split is made
- A single priority queue is shared among all trees, ordered by increasing distance to the decision boundary
- Mitigates the tendency of tree search to become linear as d increases
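The bullets above can be sketched in compact form. This is a simplified, single-threaded illustration (trees are drained through the shared queue sequentially rather than searched in parallel, and all names are ours, not FLANN's):

```python
import heapq
import numpy as np

class Node:
    """Internal node (dim, val, left, right) or leaf (idx = list of point indices)."""
    def __init__(self, dim=None, val=None, left=None, right=None, idx=None):
        self.dim, self.val = dim, val
        self.left, self.right, self.idx = left, right, idx

def build_tree(P, indices, rng, top_n=5, leaf_size=8):
    if len(indices) <= leaf_size:
        return Node(idx=indices)
    # Randomize: split on a random dimension among the top_n highest-variance ones.
    var = P[indices].var(axis=0)
    dim = int(rng.choice(np.argsort(var)[-top_n:]))
    val = float(np.median(P[indices, dim]))
    left = [i for i in indices if P[i, dim] < val]
    right = [i for i in indices if P[i, dim] >= val]
    if not left or not right:          # degenerate split: stop here
        return Node(idx=indices)
    return Node(dim, val,
                build_tree(P, left, rng, top_n, leaf_size),
                build_tree(P, right, rng, top_n, leaf_size))

def search_forest(trees, P, q, max_checks=64):
    heap, tie = [], 0                  # one priority queue shared by all trees
    best_d, best_i = np.inf, -1

    def descend(node):
        nonlocal best_d, best_i, tie
        while node.idx is None:
            diff = q[node.dim] - node.val
            near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
            tie += 1
            # Queue key = distance to the splitting boundary, as on the slide.
            heapq.heappush(heap, (abs(diff), tie, far))
            node = near
        for i in node.idx:             # exhaustively check the leaf
            d = float(np.sum((P[i] - q) ** 2))
            if d < best_d:
                best_d, best_i = d, i

    for t in trees:
        descend(t)
    checks = 0
    while heap and checks < max_checks:
        checks += 1
        descend(heapq.heappop(heap)[2])
    return best_i

rng = np.random.default_rng(0)
P = rng.random((200, 10))
trees = [build_tree(P, list(range(200)), rng) for _ in range(4)]
q = rng.random(10)
nn = search_forest(trees, P, q, max_checks=32)
```

The `max_checks` budget is where the speed/accuracy tradeoff lives: a small budget inspects only the most promising unexplored branches across all trees; an unbounded budget drains the queue and recovers the exact answer.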


Locality Sensitive Hashing
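A minimal sketch of the random-projection flavor of LSH: each table hashes a point to a short code of sign bits, so nearby points tend to collide in at least one table. All names and parameter values here are illustrative choices, not from the paper:

```python
import numpy as np

def lsh_tables(P, n_tables=8, n_bits=12, seed=0):
    """Signed random-projection LSH: hash each point to n_bits sign bits per table."""
    rng = np.random.default_rng(seed)
    tables, planes = [], []
    for _ in range(n_tables):
        H = rng.standard_normal((n_bits, P.shape[1]))   # random hyperplanes
        keys = (P @ H.T > 0).astype(np.uint8)           # one sign-bit code per point
        table = {}
        for i, key in enumerate(map(bytes, keys)):
            table.setdefault(key, []).append(i)
        tables.append(table)
        planes.append(H)
    return tables, planes

def lsh_query(P, q, tables, planes):
    """Collect candidates that collide with q in any table; rank them exactly."""
    cand = set()
    for table, H in zip(tables, planes):
        key = bytes((H @ q > 0).astype(np.uint8))
        cand.update(table.get(key, []))
    if not cand:
        return None
    cand = list(cand)
    d = np.sum((P[cand] - q) ** 2, axis=1)
    return cand[int(np.argmin(d))]

rng = np.random.default_rng(1)
P = rng.random((500, 32))
tables, planes = lsh_tables(P)
hit = lsh_query(P, P[0], tables, planes)
```

The recall/speed tradeoff is governed by `n_bits` (longer codes mean fewer, purer collisions) and `n_tables` (more tables mean more chances to catch a true neighbor), which is exactly the "finding good hash functions" difficulty the conclusions return to.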

Nearest Neighbor Graphs
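The graph-based idea can be illustrated with a brute-force k-NN graph and a greedy walk: from any start node, repeatedly hop to whichever neighbor is closer to the query, stopping at a local minimum. This is an illustrative sketch (names ours), not the paper's construction:

```python
import numpy as np

def build_knn_graph(P, k=10):
    """Brute-force k-NN graph: each node links to its k nearest points."""
    D = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(D, np.inf)         # a point is not its own neighbor
    return np.argsort(D, axis=1)[:, :k]

def greedy_search(P, graph, q, start=0):
    """Walk to whichever neighbor is closest to q; stop at a local minimum."""
    cur = start
    cur_d = np.sum((P[cur] - q) ** 2)
    while True:
        nbrs = graph[cur]
        d = np.sum((P[nbrs] - q) ** 2, axis=1)
        j = int(np.argmin(d))
        if d[j] >= cur_d:
            return cur                  # no neighbor improves: approximate NN
        cur, cur_d = int(nbrs[j]), d[j]

rng = np.random.default_rng(2)
P = rng.random((200, 8))
G = build_knn_graph(P, k=10)
q = rng.random(8)
res = greedy_search(P, G, q, start=0)
```

Practical graph methods replace both the O(n^2) graph construction and the single greedy walk with cleverer variants (multiple restarts, backtracking over a candidate queue), but the hop-to-the-closest-neighbor core is the same.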


Motivation
- Lots of different ANN algorithms exist; which ones tend to work best?
- Introduce a new algorithm that tends to work well
- Create an ANN library for C++: FLANN
- Automatic algorithm selection
- Distributing ANN with compute clusters and MapReduce


Priority Search K-Means Tree
- Based on k-means clustering
- Uses the full distance metric for partitioning, rather than one dimension at a time
- Goal: better exploit the distribution/structure of the data points in P

Priority K-means Tree Construction
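The construction step can be sketched as recursive k-means: cluster the points into K groups, then recurse into each group until leaves are small. This is a simplified illustration under our own naming, not FLANN's implementation:

```python
import numpy as np

def kmeans(X, K, rng, iters=10):
    """Plain Lloyd's k-means (I = iters); returns centroids and assignments."""
    C = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(iters):
        a = np.argmin(((X[:, None, :] - C[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(a == k):
                C[k] = X[a == k].mean(axis=0)
    return C, a

def build_kmeans_tree(P, indices, rng, K=4, leaf_size=16):
    """Recursively partition the points into K clusters until leaves are small."""
    if len(indices) <= max(leaf_size, K):
        return {"leaf": indices}
    C, a = kmeans(P[indices], K, rng)
    children = []
    for k in range(K):
        sub = [indices[i] for i in range(len(indices)) if a[i] == k]
        if sub:
            # Each child stores its cluster centroid, used to guide the search.
            children.append((C[k], build_kmeans_tree(P, sub, rng, K, leaf_size)))
    if len(children) < 2:              # degenerate clustering: stop splitting
        return {"leaf": indices}
    return {"children": children}

rng = np.random.default_rng(3)
P = rng.random((300, 6))
tree = build_kmeans_tree(P, list(range(300)), rng)
```

Each level costs one run of k-means over its subset, which is where the O(ndKI) per-level construction cost on the complexity slide comes from; at query time, descending by closest centroid costs O(Kd) per level.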

Priority K-means Tree Querying

Time Complexity
Randomized kd-tree:
- Construction: O(ndK log_K n)
- Query: expected O(d log n)
Priority k-means tree:
- Construction: O(ndKI) per level, O(ndKI log_K n) for the full tree, where I = max iterations for k-means clustering
- Query: O(Kd) to find the closest centroid at each level; traversal O(d log_K n)


Results


Conclusions and Take-Aways
- Different ANN algorithms work better for different datasets
- If the data is naturally distributed into clusters: k-means tree
- Strong correlations between features: kd-trees
- Generally, search-tree-based algorithms are the most scalable
- A lot of potential for LSH, but obtaining good hash functions is difficult
- Large-scale ANN is a big data problem; for good performance, distribute with compute clusters and MapReduce

FLANN: Fast Library for Approximate Nearest Neighbors
- C++ source code
- Contains implementations of several nearest neighbor algorithms
- Included in the OpenCV package
- Very fast, but has bugs, is finicky, and is hard to use
- Annoy is more popular and is recommended by Radim Řehůřek (Gensim creator)

Annoy
- A randomized forest of BSP-trees
- Instead of splitting subsets on a single dimension, it:
  - Samples two points from the subset
  - Chooses the boundary as the hyperplane equidistant from the two points
  - Repeats the above k times (a hyperparameter for speed/precision tradeoffs)
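The two-point split described above is just the perpendicular bisector of a random pair of points; a single split step can be sketched as follows (an illustration with our own names, not Annoy's internals):

```python
import numpy as np

def random_hyperplane_split(P, indices, rng):
    """Annoy-style split: pick two points; the boundary is the hyperplane
    equidistant from them, i.e. their perpendicular bisector."""
    i, j = rng.choice(len(indices), size=2, replace=False)
    a, b = P[indices[i]], P[indices[j]]
    normal = a - b                          # hyperplane normal
    offset = normal @ (a + b) / 2.0         # passes through the midpoint of a and b
    left = [ix for ix in indices if P[ix] @ normal - offset >= 0]
    right = [ix for ix in indices if P[ix] @ normal - offset < 0]
    return normal, offset, left, right

rng = np.random.default_rng(4)
P = rng.random((100, 5))
idxs = list(range(100))
normal, offset, left, right = random_hyperplane_split(P, idxs, rng)
```

Because the split direction is derived from the data rather than a coordinate axis, it adapts to correlated features; recursing on `left` and `right` yields one BSP-tree, and the forest repeats the whole construction with fresh random pairs.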

What ANN Library Should We Use?
- See Gensim creator Radim Řehůřek's comparison
- FLANN is extremely fast (0.20-0.30 ms/query), but has bugs (reported 4 years ago and still unfixed) and is difficult to use
- Annoy is slower (6-7 ms/query) but much more usable; Radim declares it the winner
- LSH implementations (e.g., Yahoo's, NearPy) are very finicky; finding good hash functions is a problem; very low recall (~2 approximate nearest neighbors per query)
- Important note: embedding spaces have clusters and linear substructure
- Trying out Annoy for the time being