Clustering Billions of Images with Large Scale Nearest Neighbor Search

Clustering Billions of Images with Large Scale Nearest Neighbor Search
Ting Liu, Charles Rosenberg, Henry A. Rowley
IEEE Workshop on Applications of Computer Vision, February 2007
Presented by Dafna Bitton on May 6th, 2008, for CSE 291

Problem Statement
Goal 1: Find approximate nearest neighbors for a repository of over one billion images.
Goal 2: Perform clustering based on the results.

Context of the Task
- There are billions of images on the web.
- Modern image search is text-based, largely because of this sheer scale.
- The scale makes most computer vision tasks infeasible in real time.

Nearest Neighbor Search (NNS): Applications
- A first step for image clustering, object recognition, and classification.
- Useful for organizing images on the web, e.g., by finding near-duplicate images of items such as CD covers.

Outline
- Background
  - Brute-force nearest neighbor search
  - k-d trees
  - Metric trees
  - Spill trees
  - Hybrid spill trees
- Image preprocessing
- Parallel computing framework and data partitioning
- MapReduce
- Using MapReduce for a parallel version of hybrid spill trees
- Results

NNS: Mathematical Framework
- Assume a d-dimensional space S.
- Assume a set of points T ⊆ S.
- Assume a distance measure on S.
- Given a new point p ∈ S, we want to find the point v ∈ T that is most similar (closest) to p.

Brute-force NNS
Given a new point p ∈ S, compute the distance between p and every point v ∈ T. Whichever point in T has the smallest distance is the nearest neighbor. This costs one distance computation per point in T for every query.
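
As a baseline, here is a minimal Python sketch of the brute-force scan (illustrative, not from the paper; Euclidean distance is assumed as the distance measure):

```python
import math

def euclidean(p, q):
    """Distance measure: Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def brute_force_nn(p, T):
    """Scan every v in T; return the one closest to query p."""
    return min(T, key=lambda v: euclidean(p, v))
```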

k-d Trees Axis-parallel partitions of the data Root of the tree represents the entire space Invariant: the union of each level of the tree represents the entire space 8

Example of k-d Trees
[Figure slides: the k-d tree partition being built up step by step.]

Example of k-d Trees (cont.)
Ideal case when searching: the nearest neighbor falls into the same leaf node as the query. [Figure: the query point and its NN in the same cell.]

Example of k-d Trees (cont.)
Unfortunate case when searching: the nearest neighbor falls into a different node than the query, so we must backtrack! [Figure: the query point and its NN in different cells.]

Metric (Ball) Trees
Same idea as k-d trees, except we use hyperspheres to partition the data.

Example of Metric Trees
[Figure slide: the metric tree partition being built up.]

Example of Metric Trees (cont.)
Note: child ownership of the points cannot overlap, but the spheres themselves can. [Figure slide.]

Example of Metric Trees (cont.)
Invariant: for every point x in a sphere, d(center of sphere, x) < radius of sphere. [Figure slide.]

Example of Metric Trees (cont.)
[Figure slide: a query point within the metric tree partition.]

Searching with Metric Trees
- Guided depth-first search (DFS) with pruning.
- Descend the tree to reach the leaf node whose hypersphere contains the query.
- Maintain a candidate NN, x, at distance r from the query.
- If DFS is about to visit a node v, but no member of v can be within distance r of the query, prune this node (do not visit it or any of its children). This happens whenever ||v.center − q|| − v.radius ≥ r.
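
A sketch of this guided DFS with pruning, reusing `euclidean` from the brute-force sketch; the node fields (`is_leaf`, `points`, `children`, `center`, `radius`) are assumed for illustration:

```python
def metric_tree_nn(node, q, best=(float("inf"), None)):
    """best is (r, x): the current candidate NN x at distance r."""
    if node.is_leaf:
        for x in node.points:
            d = euclidean(q, x)
            if d < best[0]:
                best = (d, x)
        return best
    # Guided: visit the child whose sphere center is closer to q first.
    for child in sorted(node.children, key=lambda c: euclidean(q, c.center)):
        # Prune: no member of child can be closer to q than
        # ||child.center - q|| - child.radius.
        if euclidean(q, child.center) - child.radius < best[0]:
            best = metric_tree_nn(child, q, best)
    return best
```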

Spill Trees
Similar to metric trees, except that the children of a single node can share data points.

Metric vs. Spill
- Let N(v) denote the set of points represented by node v.
- Let v.lc and v.rc denote the left and right children of v.
- In metric trees: N(v) = N(v.lc) ∪ N(v.rc) and N(v.lc) ∩ N(v.rc) = ∅.
- In spill trees: N(v) = N(v.lc) ∪ N(v.rc), but N(v.lc) ∩ N(v.rc) need not be empty.

Constructing a Spill Tree
Given a node v, we choose two pivot points v.lpv ∈ N(v) and v.rpv ∈ N(v), ideally such that they are maximally separated. Specifically, ||v.lpv − v.rpv|| = max_{p1, p2 ∈ N(v)} ||p1 − p2||.

Constructing a Spill Tree (cont.)
- Project all the data points onto the vector u = v.rpv − v.lpv.
- Find the midpoint A along u.
- L denotes the (d−1)-dimensional plane orthogonal to u which goes through A; L is known as the decision boundary.

Constructing a Spill Tree (cont.)
- We define two separating planes, LL and LR, both parallel to L and at distance τ from it.
- LL and LR define a stripe, also known as the overlap buffer. (Metric trees have an empty stripe.)
- All data points to the right of LL belong to v.rc.
- All data points to the left of LR belong to v.lc.
- All data points inside the stripe are shared by v.lc and v.rc.
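
A minimal sketch of one such split; all names here are illustrative, and for brevity u is left unnormalized, so tau is measured in units of the raw projection along u:

```python
def spill_split(points, lpv, rpv, tau):
    """Split points into (left, right) child sets, with an overlap
    stripe of half-width tau around the decision boundary L."""
    u = [r - l for l, r in zip(lpv, rpv)]           # u = rpv - lpv
    proj = lambda p: sum(pi * ui for pi, ui in zip(p, u))
    mid = (proj(lpv) + proj(rpv)) / 2.0             # projection of midpoint A
    # Left of LR -> v.lc; right of LL -> v.rc; the stripe goes to both.
    left = [p for p in points if proj(p) < mid + tau]
    right = [p for p in points if proj(p) >= mid - tau]
    return left, right
```

With tau = 0 the two sides are disjoint, which recovers the metric-tree split.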

Spill Tree NN Search
- Use defeatist search, which descends the tree according to the decision boundary L at each node, without backtracking, and outputs the nearest point x in the first (and only) leaf node visited.
- Not guaranteed to find the correct NN.
- A wider stripe means a slower search, but more accurate results.
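
Defeatist search is a single root-to-leaf descent. A sketch, assuming each internal node stores the projection axis `u` and boundary offset `mid` from the split sketch above:

```python
def defeatist_nn(node, q):
    """Descend without backtracking; the answer comes from one leaf."""
    while not node.is_leaf:
        proj = sum(qi * ui for qi, ui in zip(q, node.u))
        node = node.lc if proj < node.mid else node.rc
    return min(node.points, key=lambda x: euclidean(q, x))
```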

Drawbacks of Spill Trees
- The depth of a spill tree varies considerably depending on τ (where 2τ is the overlap buffer size).
- If τ = 0, the spill tree acts as a metric tree.
- If τ ≥ ||v.rpv − v.lpv|| / 2, then N(v.lc) = N(v.rc) = N(v), and construction of the spill tree does not even terminate, giving it infinite depth.
- To address this, we use hybrid spill trees.

Hybrid Spill Trees
- Define a balance threshold ρ < 1, usually set to 70%.
- For each node v, we first split the data points using the overlap buffer.
- If either of its children contains more than a ρ fraction of the total data points in v, we undo the overlapping split, use a conventional metric-tree partition instead, and mark v as a non-overlapping node.
- This ensures that each split reduces the number of data points in a node by a constant factor, maintaining logarithmic depth of the tree.
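
A sketch of the balance check during construction, reusing `spill_split` from above; the constant follows the 70% threshold quoted on the slide:

```python
RHO = 0.7  # balance threshold from the slides

def hybrid_split(points, lpv, rpv, tau):
    """Try an overlap split first; if either side exceeds a RHO
    fraction of the parent, redo it as a plain metric-tree split
    (tau = 0) and mark the node as non-overlapping."""
    left, right = spill_split(points, lpv, rpv, tau)
    if max(len(left), len(right)) > RHO * len(points):
        left, right = spill_split(points, lpv, rpv, 0.0)
        return left, right, False   # False = non-overlapping node
    return left, right, True        # True  = overlapping node
```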

Hybrid Spill Tree Search
- A hybrid of metric-tree DFS and defeatist search.
- Do defeatist search only at overlapping nodes.
- For non-overlapping nodes, we still backtrack as in metric-tree DFS.

The Drawbacks
- All of these algorithms were designed to run on a single machine.
- In our case, the data cannot all fit on a single machine, and disk access is too slow.
- Noise affects the distance metric.
- Curse of dimensionality.
- The authors address these drawbacks using a new variant of spill trees.

Image Preprocessing
- Normalize each image.
- Scale the image to a fixed size of 64x64 pixels (each pixel is 3 bytes).
- Convert the image to the Haar wavelet domain.
- All but the 60 largest-magnitude coefficients are set to 0, and the remaining ones are quantized to ±1.
- So far, the feature vector has 64x64x3 dimensions, which is still fairly large.

Image Preprocessing (cont.)
- Random projection using random unit-length vectors is used to reduce the dimensionality to 100.
- 4 additional features are added: the average of each color channel (3 features) and the aspect ratio w/(w+h).
- The feature vectors now have dimensionality 104.
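
A sketch of this dimensionality-reduction step; the exact projection matrix and feature ordering are assumptions for illustration, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
R = rng.normal(size=(100, 64 * 64 * 3))        # 100 random directions
R /= np.linalg.norm(R, axis=1, keepdims=True)  # make rows unit-length

def make_feature(wavelet_vec, channel_means, w, h):
    """wavelet_vec: flattened 64x64x3 quantized wavelet coefficients;
    channel_means: the 3 per-channel averages; w, h: original size."""
    reduced = R @ wavelet_vec                       # down to 100 dims
    extras = np.append(channel_means, w / (w + h))  # 3 means + aspect ratio
    return np.concatenate([reduced, extras])        # 104 dims total
```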

Parallel Computing Framework
- Main challenge: all feature vectors must be in main memory.
- In our case, one feature vector = 104 floating-point numbers = 416 bytes.
- On a machine with 4 GB of memory, we could fit about 8 million images.
- However, we are dealing with 1 billion images, so we would need at least 100 machines.

How to Partition the Data?
- One option: partition randomly, building a separate spill tree for each partition.
- A more intelligent option: use a metric-tree structure.
- Why metric trees? Their children do not overlap, which saves space.

Metric Trees to Partition the Data
- Take a random sample of all of the data, small enough to fit on one machine (1/M of the data), and build a metric tree for this sample.
- Each leaf node in this "top tree" defines a partition, for which a spill tree can be built on a separate machine.

Building the Top Tree
- The stopping condition for the leaf nodes is an upper bound on the leaf size.
- We need each partition to fit on a single machine, so the upper bound is set to roughly one machine's capacity.
- There is also a lower bound, to prevent partitions from being too small.

MapReduce: Map
A user-defined Map operation is performed on each input key-value pair, generating zero or more key-value pairs. This phase runs in parallel, with the input pairs arbitrarily distributed across machines.

MapReduce (cont.): Shuffle
Each key-value pair generated by the Map phase is distributed to a subset of the machines, based on a user-defined Shuffle operation on its key. Within each machine, the key-value pairs are grouped by key.

MapReduce (cont.): Reduce
A user-defined Reduce operation is applied to all key-value pairs having the same key, producing zero or more output key-value pairs.
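
A toy, single-process model of this Map / Shuffle / Reduce contract (illustrative only; a real MapReduce runs each phase distributed across machines):

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):   # Map: zero or more pairs
            groups[k2].append(v2)           # Shuffle: group by key
    output = []
    for k, values in groups.items():        # Reduce: applied per key
        output.extend(reduce_fn(k, values))
    return output
```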

Generating the Sample Data
- Map: for each input object, output it with probability 1/M.
- Shuffle: all sampled objects are sent to a single machine.
- Reduce: copy all objects to the output.
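
With the toy `map_reduce` above, this step might look as follows (a sketch; the single shared key is what forces all samples onto one machine):

```python
import random

M = 100  # illustrative sampling factor

def sample_map(key, obj):
    if random.random() < 1.0 / M:   # keep each object w.p. 1/M
        yield ("sample", obj)       # one shared key -> one reducer

def sample_reduce(key, objs):
    return objs                     # copy the sample to the output
```

Running `map_reduce(enumerate(images), sample_map, sample_reduce)` over the full image collection would then yield the sample.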

Building the Top Tree
On one machine, build the top tree using the standard metric-tree building procedure, with an upper bound U on the cardinality of the leaf nodes, as well as a lower bound L.

Partitioning of Data and Creation of Leaf Subtrees
- Map: for each object, find which leaf subtree it falls into, and output that subtree's number as the key, along with the object.
- Shuffle: each unique key is mapped to a different machine.
- Reduce: apply the serial hybrid spill tree algorithm to all objects in each partition to create that leaf subtree.

Efficient Queries of Parallel Hybrid Spill Trees
- In the top tree, speculatively send each query object to multiple leaf subtrees when the query is close to a boundary.
- This is a runtime version of the overlap buffer.
- The benefit is that fewer machines are required to hold the leaf subtrees, since no points are duplicated across them.

Finding Neighbors in Each Leaf Subtree
- Map: for each input query, descend the top metric tree. At each node in the top tree, the query may be sent to both children if it falls within the pseudo-overlap buffer. Generate one key-value pair for each leaf subtree that is searched.
- Shuffle: each distinct key is mapped to the machine that holds the corresponding subtree.
- Reduce: standard hybrid spill tree search is run on the queries routed to each subtree, generating the k-NN list for each query.
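
A sketch of the speculative routing in the Map step, assuming (purely for illustration) the same projection-based node layout as in the earlier spill-tree sketches; `buffer` plays the role of the pseudo-overlap buffer:

```python
def route(node, q, buffer):
    """Return the ids of every leaf subtree this query should visit."""
    if node.is_leaf:
        return [node.subtree_id]
    s = sum(qi * ui for qi, ui in zip(q, node.u)) - node.mid
    ids = []
    if s < buffer:        # left of, or close to, the boundary
        ids += route(node.lc, q, buffer)
    if s > -buffer:       # right of, or close to, the boundary
        ids += route(node.rc, q, buffer)
    return ids
```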

Combining the k-NN Lists
- Map: copy each (query, k-NN list) pair to the output.
- Shuffle: the queries are partitioned randomly (by their numerical value).
- Reduce: the k-NN lists for each query are merged, keeping only the k objects closest to the query.
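
The Reduce step's merge is a k-smallest selection. A sketch, where each per-subtree list holds (distance, image_id) pairs:

```python
import heapq
from itertools import chain

def merge_knn(lists, k):
    """Merge per-subtree candidate lists, keeping the k closest."""
    return heapq.nsmallest(k, chain.from_iterable(lists))
```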

Clustering Procedure
1. Compute the k-NN list for each image.
2. Apply a distance threshold to drop images that are too far apart.
3. Drop singleton images from the 1.5 billion image set, leaving around 200 million images.
4. The result is 200 million prototype clusters, which are further combined.
5. A union-find algorithm is then applied on one machine.

Clustering Procedure (MapReduce)
Map: the input is the k-NN list for each image, along with the distance to each of those images.
1. Apply a threshold to the distances, which shortens the neighbor list.
2. The list is treated as a prototype cluster and reordered so that the lowest image number comes first.
3. The generated output consists of this lowest number as the key, with the whole set as the value.
4. Images with no neighbors within the threshold are dropped.

Clustering Procedure (MapReduce, cont.)
- Shuffle: the keys (image numbers) are partitioned randomly (by their numerical value).
- Reduce: within a single set of results, the standard union-find algorithm is used to combine the prototype clusters.
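
A sketch of that union-find merge over prototype clusters (names are illustrative; path compression keeps each find cheap):

```python
from collections import defaultdict

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path compression
        x = parent[x]
    return x

def union(x, y):
    parent[find(x)] = find(y)

def combine(prototype_clusters):
    """Each prototype cluster is a list of image ids; merge all
    prototype clusters that share an image."""
    for group in prototype_clusters:
        for img in group[1:]:
            union(group[0], img)
    out = defaultdict(list)
    for img in parent:
        out[find(img)].append(img)
    return list(out.values())
```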

Experiments
Two main datasets were used:
- A smaller, labeled dataset of 3385 images.
- A larger, unlabeled dataset containing around 1.5 billion images.

Clustering Results
- On the smaller set, we compute the distance between the feature vectors of each pair of images.
- For each setting of the distance threshold, we form clusters by joining all pairs of images that fall within the threshold.
- Each image pair within each cluster is then checked against the manual labeling.

Clustering Results for the Large Set
- The entire processing time for 1.5 billion images was less than 10 hours on 2000 CPUs.
- A significant part of the time was spent on just a few machines, as the sizes of the subtrees varied considerably.
- 50 million clusters were found, containing 200 million duplicated images.

Clustering Results for the Large Set (cont.)
- The most common cluster size is two (often a thumbnail paired with its full-size image).
- The clusters are usually accurate, but sometimes they contain images that are far apart.

Visual Results
[Figure slide.]