Algorithms for Nearest Neighbors

Algorithms for Nearest Neighbors: State-of-the-Art
Yury Lifshits
Steklov Institute of Mathematics at St. Petersburg
Yandex Tech Seminar, April 2007

Outline
1. Problem Statement
   - Applications
   - Data Models
   - Variations of the Computation Task
2. Overview of Algorithmic Techniques
   - Partitioning idea
   - Look-up idea
   - Embedding idea
3. Further Work

Part I
What are nearest neighbors about?
- Short overview of applications
- Variations of the problem

Informal Problem Statement
Preprocess a database of n objects so that, given a query object, one can efficiently determine its nearest neighbors in the database.

First Application (1960s)
Nearest neighbors for classification.
Picture from http://cgm.cs.mcgill.ca/ soss/cs644/projects/perrier/image25.gif

Applications in Web Technologies
- Text classification (YaCa)
- Personalized news aggregation (Yandex Lenta)
- Recommendation systems (MoiKrug, Yandex Market)
- Online advertisement distribution systems (Yandex Direct)
- Long queries in web search (Yandex Search)
- Content-similar pages, near-duplicates (Yandex Search)
- Semantic search

Data Model
A formalization of nearest neighbors consists of:
- A representation format for objects
- A similarity function
Remark 1: Usually there is an original and a reduced representation for every object.
Remark 2: The accuracy of NN-based classification, prediction, or recommendation depends solely on the data model, no matter which specific exact NN algorithm we use.

Variations of Data Model (1/3)
- Vector model. Similarity: scalar product, cosine
- Set model. Similarity: size of intersection
- String model. Similarity: Hamming distance
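
The similarity functions named on this slide can be sketched in a few lines. This is only an illustration: the function names are made up for the example, vectors are assumed to be equal-length sequences of numbers, and strings are assumed to have equal length.

```python
import math

def dot(u, v):
    # Vector model: scalar product of two equal-length vectors.
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    # Vector model: cosine of the angle between two nonzero vectors.
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def intersection_size(a, b):
    # Set model: number of common elements.
    return len(set(a) & set(b))

def hamming(s, t):
    # String model: number of positions where the strings differ.
    if len(s) != len(t):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(s, t))
```

Note that the first three are similarities (larger means closer), while Hamming distance is a dissimilarity (smaller means closer).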

Variations of Data Model (2/3)
- Sparse vector model: query time should be in o(d). Similarity: scalar product
- Small graphs. Similarity: structure/labels matching
- More data models?
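
To see why sparse vectors allow work below d per comparison, here is a minimal sketch that assumes each vector is stored as a dict from coordinate index to nonzero value (this storage format and the name sparse_dot are assumptions made for the example). The cost is proportional to the number of nonzeros in the shorter vector, not to the dimension d.

```python
def sparse_dot(u, v):
    # Iterate over the vector with fewer nonzero entries.
    if len(u) > len(v):
        u, v = v, u
    # Only coordinates that are nonzero in both vectors contribute.
    return sum(value * v[k] for k, value in u.items() if k in v)
```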

Variations of Data Model (3/3)
New 3-step bipartite model. Similarity: number of 3-step paths.
(Figure: bipartite graph between Boys and Girls, with marked vertices v and u.)
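
A small sketch of the 3-step similarity, under stated assumptions: the graph is given as a dict (here called edges) from each vertex on one side to the set of its neighbors on the other side, v lies on the first side, u on the second, and every walk v -> g -> b -> u is counted, including walks that revisit a vertex. Whether the slide's definition excludes such walks is not stated, so treat this as one possible reading.

```python
def three_step_paths(edges, v, u):
    # Build the reverse adjacency: second-side vertex -> first-side neighbors.
    reverse = {}
    for left, rights in edges.items():
        for right in rights:
            reverse.setdefault(right, set()).add(left)
    count = 0
    for g in edges.get(v, ()):          # step 1: v -> g
        for b in reverse.get(g, ()):    # step 2: g -> b
            if u in edges.get(b, ()):   # step 3: b -> u
                count += 1
    return count
```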

Variations of the Computation Task
- Approximate nearest neighbors
- Multiple nearest neighbors
- Nearest assignment
- All over-threshold neighbor pairs
- Nearest neighbors in a dynamically changing database: moving objects, deletes/inserts, changing similarity function

Part II
Overview of Algorithmic Techniques
- Partitioning, look-up, and embedding-based approaches for the vector model
- New rare-point method (joint work with Hinrich Schütze)

Linear Scan
What is the most obvious solution for nearest neighbors?
Answer: compare the query object with every object in the database.
Advantages:
- No preprocessing
- Exact solution
- Works in any data model
Directions for improvement: order of scanning, pruning
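
Linear scan is short enough to write out in full. The sketch below takes the similarity function as a parameter, which is what lets it work in any data model; the names linear_scan and dot are chosen for the example only.

```python
def dot(u, v):
    # Example similarity for the vector model.
    return sum(a * b for a, b in zip(u, v))

def linear_scan(database, query, similarity):
    # Compare the query with every object and keep the most similar one.
    best, best_sim = None, float("-inf")
    for obj in database:
        sim = similarity(query, obj)
        if sim > best_sim:
            best, best_sim = obj, sim
    return best, best_sim
```

Every query costs one similarity evaluation per database object, which is the baseline that the remaining techniques try to beat.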

Voronoi Diagrams
A Voronoi diagram is a mapping that takes every point p in the database to the polygon of points for which p is the nearest neighbor.
Can we generalize one-dimensional binary search?
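
For reference, this is the one-dimensional binary search that the question asks to generalize. The sketch assumes a non-empty, sorted list of numbers and uses the standard bisect module; the function name nearest_1d is made up for the example.

```python
import bisect

def nearest_1d(sorted_points, q):
    # Position where q would be inserted to keep the list sorted.
    i = bisect.bisect_left(sorted_points, q)
    # The nearest neighbor is one of the (at most two) points around it.
    candidates = sorted_points[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda p: abs(p - q))
```

In one dimension the Voronoi cells are intervals, so locating the query's cell takes O(log n) comparisons after sorting.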

KD-Trees
Preprocessing: Build a kd-tree: at every internal node on level l, partition the points by the value of coordinate (l mod d).
Query processing:
- Go down to the leaf corresponding to the query point and compute the distance.
- (Recursively) Go one step up and check whether the distance to the second branch is larger than the distance to the current candidate neighbor. If yes, go up; otherwise, check this second branch.
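
The preprocessing and query steps above can be sketched as follows. This is one common variant, not necessarily the one intended on the slide: it stores a point in every node (splitting at the median) rather than only in leaves, uses Euclidean distance, and represents nodes as plain dicts. All names are chosen for the example.

```python
import math

def build_kdtree(points, depth=0):
    # Split on coordinate (depth mod d) at the median point.
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def kd_nearest(node, q, best=None):
    if node is None:
        return best
    if best is None or math.dist(q, node["point"]) < math.dist(q, best):
        best = node["point"]
    # Signed distance from the query to the splitting hyperplane.
    diff = q[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = kd_nearest(near, q, best)
    # Visit the second branch only if it can contain a closer point.
    if abs(diff) < math.dist(q, best):
        best = kd_nearest(far, q, best)
    return best
```

Usage: `tree = build_kdtree(list_of_tuples)` once, then `kd_nearest(tree, query_tuple)` per query.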

BSP-Trees
Generalization: a BSP-tree allows arbitrary hyperplanes in the tree construction.

VP-Trees
Partitioning condition: is $d(p, x) < r$?
Inner branch: $B(p, r(1 + \varepsilon))$
Outer branch: $\mathbb{R}^d \setminus B(p, r(1 - \delta))$
Search:
- If $d(p, q) < r$, go to the inner branch
- If $d(p, q) > r$, go to the outer branch
and return the minimum of the obtained result and $d(p, q)$.
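
A minimal vantage-point tree sketch follows. Note the substitution: instead of the overlapping inner and outer branches with margins ε and δ described on the slide, it uses disjoint branches split at the median distance and visits the second branch only when the triangle inequality cannot rule it out, which keeps the search exact. The vantage point is simply the first point in the list, and all names are hypothetical.

```python
import math

def build_vptree(points, dist):
    if not points:
        return None
    p, rest = points[0], points[1:]
    if not rest:
        return {"p": p, "r": 0.0, "inner": None, "outer": None}
    # Radius r is the median distance from the vantage point p.
    ds = sorted(dist(p, x) for x in rest)
    r = ds[len(ds) // 2]
    inner = [x for x in rest if dist(p, x) < r]
    outer = [x for x in rest if dist(p, x) >= r]
    return {"p": p, "r": r,
            "inner": build_vptree(inner, dist),
            "outer": build_vptree(outer, dist)}

def vp_nearest(node, q, dist, best=None, best_d=math.inf):
    if node is None:
        return best, best_d
    d = dist(node["p"], q)
    if d < best_d:
        best, best_d = node["p"], d
    # Partitioning condition: is d(p, q) < r?
    if d < node["r"]:
        first, second = node["inner"], node["outer"]
    else:
        first, second = node["outer"], node["inner"]
    best, best_d = vp_nearest(first, q, dist, best, best_d)
    # By the triangle inequality, points on the other side are at
    # distance at least |d - r| from q.
    if abs(d - node["r"]) < best_d:
        best, best_d = vp_nearest(second, q, dist, best, best_d)
    return best, best_d
```

Because only the distance function is used, the same code works in any metric space, for example with `dist=math.dist` for Euclidean points.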

Inverted Index
How can we use the sparseness of vectors in the database?
Preprocessing: For every coordinate, store a list of all points in the database with a nonzero value on it.
Query processing: Retrieve all points that have at least one common nonzero component with the query point; perform a linear scan on them.
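
The two phases can be sketched as below, assuming sparse vectors stored as dicts from coordinate to nonzero value and scalar product as the similarity. The function names and the choice to return a (point id, similarity) pair are assumptions for the example.

```python
from collections import defaultdict

def build_inverted_index(database):
    # For every coordinate, list the ids of points that are nonzero on it.
    index = defaultdict(list)
    for point_id, vec in enumerate(database):
        for coord, value in vec.items():
            if value != 0:
                index[coord].append(point_id)
    return index

def query_inverted_index(index, database, q):
    # Candidates share at least one nonzero coordinate with the query.
    candidates = set()
    for coord, value in q.items():
        if value != 0:
            candidates.update(index.get(coord, ()))
    # Linear scan over the candidates only.
    best_id, best_sim = None, float("-inf")
    for point_id in sorted(candidates):
        sim = sum(v * database[point_id].get(c, 0) for c, v in q.items())
        if sim > best_sim:
            best_id, best_sim = point_id, sim
    return best_id, best_sim
```

Points that share no nonzero coordinate with the query have scalar product zero, so skipping them loses nothing as long as some candidate has positive similarity.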

Locality-Sensitive Hashing
Desired hash family $H$:
- If $\|p - q\| \le R$ then $\Pr_H[h(p) = h(q)] \ge p_1$
- If $\|p - q\| \ge cR$ then $\Pr_H[h(p) = h(q)] \le p_2$
Preprocessing:
- Choose at random several hash functions from $H$
- Build an inverted index for the hash values of the objects in the database
Query processing: Retrieve all objects that have at least one common hash value with the query object; perform a linear scan on them.
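
As a concrete instance, here is a sketch that uses bit sampling, a classical locality-sensitive family for Hamming distance on equal-length binary strings. The slide does not fix a particular family, so this choice, the parameter names num_tables and bits_per_hash, and the seeded generator are all assumptions for the example.

```python
import random
from collections import defaultdict

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def build_lsh(database, num_tables, bits_per_hash, seed=0):
    rng = random.Random(seed)
    length = len(database[0])
    tables = []
    for _ in range(num_tables):
        # One hash function = the string restricted to random positions.
        positions = rng.sample(range(length), bits_per_hash)
        buckets = defaultdict(list)
        for obj_id, s in enumerate(database):
            buckets["".join(s[i] for i in positions)].append(obj_id)
        tables.append((positions, buckets))
    return tables

def query_lsh(tables, database, q):
    # Collect objects sharing at least one hash value with the query.
    candidates = set()
    for positions, buckets in tables:
        key = "".join(q[i] for i in positions)
        candidates.update(buckets.get(key, ()))
    if not candidates:
        return None
    # Linear scan over the candidates only.
    return min(sorted(candidates), key=lambda i: hamming(database[i], q))
```

Strings at small Hamming distance agree on most positions, so they collide on a random sample of positions with higher probability than distant strings. The result is approximate: a true nearest neighbor that collides in no table is missed.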

Rare-Point Method
Cheating: we search only for neighbors that have at least one common rare feature with the query object.
Preprocessing: For every rare feature, store a list of all objects in the database having it.
Query processing: Retrieve all points that have at least one common rare feature with the query object; perform a linear scan on them.
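
A sketch of the idea in the set model, with similarity equal to the size of the intersection. The slide does not define when a feature counts as rare; here a feature is assumed to be rare when it occurs in at most max_count database objects, and that threshold as well as the function names are made up for the example.

```python
from collections import Counter, defaultdict

def build_rare_index(database, max_count):
    # Count in how many objects each feature occurs.
    counts = Counter(f for obj in database for f in obj)
    index = defaultdict(list)
    for obj_id, obj in enumerate(database):
        for f in obj:
            if counts[f] <= max_count:   # keep rare features only
                index[f].append(obj_id)
    return index

def query_rare(index, database, q):
    # Candidates share at least one rare feature with the query.
    candidates = {i for f in q for i in index.get(f, ())}
    if not candidates:
        return None
    # Linear scan over the candidates only.
    return max(sorted(candidates), key=lambda i: len(database[i] & q))
```

Frequent features have long lists, so dropping them shortens the candidate set considerably; the price is that a neighbor sharing only frequent features with the query is never found.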

Kleinberg Embedding-Based Approach
Preprocessing:
- Choose at random a linear transformation $A : \mathbb{R}^d \to \mathbb{R}^l$, $l \ll d$
- Apply $A$ to all points in the database and perform some tricky preprocessing of the images
Query processing: Compute $A(q)$; find its nearest neighbor in $\mathbb{R}^l$.
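
The overall shape of an embedding-based search can be sketched as follows. This is not Kleinberg's algorithm: the matrix entries are assumed to be independent standard Gaussians, and the "tricky preprocessing" of the images is replaced by a plain linear scan in the low-dimensional space. All function names are hypothetical.

```python
import math
import random

def random_projection(d, l, seed=0):
    # Random l x d matrix A with Gaussian entries (an assumption).
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(l)]

def project(A, x):
    # Compute A(x) as a list of l coordinates.
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def build_embedding(database, l, seed=0):
    A = random_projection(len(database[0]), l, seed)
    return A, [project(A, p) for p in database]

def query_embedding(A, images, q):
    # Nearest neighbor of A(q) among the images, by linear scan.
    aq = project(A, q)
    return min(range(len(images)), key=lambda i: math.dist(images[i], aq))
```

The answer is the index of the database point whose image is closest to the image of the query, which approximates the nearest neighbor in the original space only with some probability.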

Part III
Further Work
- Directions for Applied Research
- Directions for Theoretical Research
- Questions to Practitioners

Directions for Applied Research
Attractive goals: NN-based recommendation system, NN-based ads distribution system
- Choose a reasonable data model; add some assumptions about the nature of the database and queries
- Find (theoretically) the best solutions in the resulting formalization
- Perform experimental analysis of the obtained solutions
- Develop a prototype product

Directions for Theoretical Research
- Develop techniques for proving hardness of computational problems with preprocessing
- Find theoretical limits for some specific families of algorithms
- Extend classical NN algorithms to new data models and new task variations
- Develop theoretical analysis of existing heuristics. Average-case complexity is particularly promising
- Find subcases for which we can construct provably efficient solutions
- Compare the NN-based approach with other methods for classification/recognition/prediction problems

Questions to Practitioners
- What is your experience in using nearest neighbors? What algorithm/data model is used?
- Do you face any scalability/accuracy problems? What is the bottleneck subproblem?
- Are you interested in applying the NN approach in any of your future products?
- Give us benchmark data
- Give us names and contacts of potentially interested engineers

Summary
- Nearest neighbors is one of the key algorithmic problems for web technologies
- Key ideas: look-up tables, partitioning techniques, embeddings
- Further work: from algorithms to prototype products, from heuristics to theory, from the canonical problem to new data models and new search tasks
Thanks for your attention! Questions?

References
Contact: http://logic.pdmi.ras.ru/~yura
- B. Hoffmann, Y. Lifshits, and D. Nowotka. Maximal Intersection Queries in Randomized Graph Models. http://logic.pdmi.ras.ru/~yura/en/maxint-draft.pdf
- P.N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. http://www.pnylab.com/pny/papers/vptree/vptree.ps
- J. Zobel and A. Moffat. Inverted files for text search engines. http://www.cs.mu.oz.au/~alistair/abstracts/zm06compsurv.html
- K. Teknomo. Links to nearest neighbors implementations. http://people.revoledu.com/kardi/tutorial/knn/resources.html
- J. Kleinberg. Two Algorithms for Nearest-Neighbor Search in High Dimensions. http://www.ece.tuc.gr/~vsam/csalgo/kleinberg-stoc97-nn.ps
- P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. http://theory.csail.mit.edu/~indyk/nndraft.ps
- A. Andoni and P. Indyk. Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions. http://theory.lcs.mit.edu/~indyk/focs06final.ps
- P. Indyk. Nearest Neighbors Bibliography. http://theory.lcs.mit.edu/~indyk/bib.html