VK Multimedia Information Systems

Similar documents
VK Multimedia Information Systems

Geometric data structures:

Branch and Bound. Algorithms for Nearest Neighbor Search: Lecture 1. Yury Lifshits

Locality-Sensitive Hashing Random Projections for NN Search

Nearest Neighbor Search by Branch and Bound

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Fast Indexing Method. Dongliang Xu, 22nd Feb. 2008

VK Multimedia Information Systems

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Image Analysis & Retrieval. CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W Lec 18.

Hot topics and Open problems in Computational Geometry. My (limited) perspective. Class lecture for CSE 546,

Lecture 24: Image Retrieval: Part II. Visual Computing Systems CMU , Fall 2013

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

Algorithms for Nearest Neighbors

Similarity Searching:

Nearest Neighbor with KD Trees

Approximate Nearest Neighbor Search. Deng Cai Zhejiang University

Locality-Sensitive Hashing

CLSH: Cluster-based Locality-Sensitive Hashing

doc. RNDr. Tomáš Skopal, Ph.D. Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague

Algorithms for Nearest Neighbors

Outline. Other Use of Triangle Inequality Algorithms for Nearest Neighbor Search: Lecture 2. Orchard's Algorithm. Chapter VI

Computer Graphics. Bing-Yu Chen National Taiwan University The University of Tokyo

kd-trees Idea: Each level of the tree compares against 1 dimension. Lets us have only two children at each node (instead of 2^d)

Robust Shape Retrieval Using Maximum Likelihood Theory

SIMILARITY SEARCH The Metric Space Approach. Pavel Zezula, Giuseppe Amato, Vlastislav Dohnal, Michal Batko

Quadrant-Based MBR-Tree Indexing Technique for Range Query Over HBase

Pivoting M-tree: A Metric Access Method for Efficient Similarity Search

Indexing by Shape of Image Databases Based on Extended Grid Files

Content-Based Image Retrieval Readings: Chapter 8:

MeshValmet 3.0. Christine Xu,

ShadowDraw Real-Time User Guidance for Freehand Drawing. Harshal Priyadarshi

Classifiers and Detection. D.A. Forsyth

Clustering Billions of Images with Large Scale Nearest Neighbor Search

CSE 6242 A / CX 4242 DVA. March 6, Dimension Reduction. Guest Lecturer: Jaegul Choo

Text classification II CE-324: Modern Information Retrieval Sharif University of Technology

Engineering Efficient Similarity Search Methods

7 Approximate Nearest Neighbors

Digital Image Processing. Prof. P.K. Biswas. Department of Electronics & Electrical Communication Engineering

Nearest Neighbor with KD Trees

An approach to Content-Based Image Retrieval based on the Lucene search engine library (Extended Abstract)

MeshValmet 2.0. Christine Xu,

Foundations of Multidimensional and Metric Data Structures

CS 340 Lec. 4: K-Nearest Neighbors

A locality-aware similar information searching scheme

Nearest neighbors. Focus on tree-based methods. Clément Jamin, GUDHI project, Inria March 2017

Large-scale visual recognition The bag-of-words representation

Introduction to Similarity Search in Multimedia Databases

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

High Dimensional Indexing by Clustering

Task Description: Finding Similar Documents. Document Retrieval. Case Study 2: Document Retrieval

Scalable Nearest Neighbor Algorithms for High Dimensional Data Marius Muja (UBC), David G. Lowe (Google) IEEE 2014

Computer Graphics. Bing-Yu Chen National Taiwan University

Real-time Collaborative Filtering Recommender Systems

A Pivot-based Index Structure for Combination of Feature Vectors

Spatial Index Keyword Search in Multi- Dimensional Database

Homework 4: Clustering, Recommenders, Dim. Reduction, ML and Graph Mining (due November 19 th, 2014, 2:30pm, in class hard-copy please)

Motion illusion, rotating snakes

PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search

2. LITERATURE REVIEW

Chap4: Spatial Storage and Indexing. 4.1 Storage:Disk and Files 4.2 Spatial Indexing 4.3 Trends 4.4 Summary

Feature descriptors. Alain Pagani Prof. Didier Stricker. Computer Vision: Object and People Tracking

Searching non-text information objects

Problem 1: Complexity of Update Rules for Logistic Regression

Challenge... Ex. Maps are 2D. Ex. Images are (width * height)-D (assuming that each pixel is a feature)

Fast and Robust Earth Mover's Distances

Geometric Algorithms. Geometric search: overview. 1D Range Search. 1D Range Search Implementations

Computer Games 2012 Game Development

Metric Learning Applied for Automatic Large Image Classification

CSE547: Machine Learning for Big Data Spring Problem Set 1. Please read the homework submission policies.

Distribution Distance Functions

Training-Free, Generic Object Detection Using Locally Adaptive Regression Kernels

Content-based Image Retrieval (CBIR)

Large Scale Nearest Neighbor Search Theories, Algorithms, and Applications. Junfeng He

Hashing with Graphs. Sanjiv Kumar (Google), and Shih Fu Chang (Columbia) June, 2011

Hard clustering. Each object is assigned to one and only one cluster. Hierarchical clustering is usually hard. Soft (fuzzy) clustering

Data-Intensive Computing with MapReduce

Predictive Indexing for Fast Search

Trademark Matching and Retrieval in Sport Video Databases

Feature descriptors and matching

Nonparametric Clustering of High Dimensional Data

Recall precision graph

VK Multimedia Information Systems

Based on Raymond J. Mooney's slides

Textural Features for Image Database Retrieval

Fast Indexing Strategies for Robust Image Hashes

Algorithms for GIS:! Quadtrees

If you wish to cite this paper, please use the following reference:

Web-Scale Multimedia: Optimizing LSH. Malcolm Slaney, Yury Lifshits, Junfeng He. Y! Research

CSE 6242 A / CS 4803 DVA. Feb 12, Dimension Reduction. Guest Lecturer: Jaegul Choo

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

Chapter ML:XI (continued)

The Boundary Graph Supervised Learning Algorithm for Regression and Classification

Robot Motion Planning

Big Data Analytics. Special Topics for Computer Science CSE CSE April 14

Segmentation and Grouping

Spatial Interpolation & Geostatistics

An Introduction to Content Based Image Retrieval

Similarity Multidimensional Indexing


VK Multimedia Information Systems. Mathias Lux, mlux@itec.uni-klu.ac.at. Tuesdays, 16:00 c.t., room E.1.42. This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0

Indexing: Spatial Indexes, MDS, FastMap, Hashing, Metric Indexes, Inverted Lists

Indexing Visual Information. Text is indexed in inverted lists; search time depends on the number of terms. Visual information is expressed by vectors, combined with a metric capturing the semantics of similarity. An inverted list does not work here; an index of vectors is needed.

Indexing Visual Information. Vectors describe points in a space. The space is n-dimensional, and n might be rather big. A distance (metric) is defined between points, e.g. L1 or L2. The query is also a vector, i.e. a point, and we search for points (vectors) near the query. Idea for an index: index the neighborhood.
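Before any index structure, the baseline for "points near the query" is an exact linear scan. A minimal sketch, assuming the L2 metric; the class and method names are illustrative, not from the lecture material:

```java
// Exact nearest-neighbor search by linear scan: every indexed vector is
// compared to the query with the L2 (Euclidean) metric.
public class LinearScan {

    // L2 distance between two feature vectors of equal dimension
    static double l2(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // returns the index of the vector in 'index' closest to 'query'
    static int nearest(double[][] index, double[] query) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < index.length; i++) {
            double d = l2(index[i], query);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }
}
```

The scan is linear in the collection size per query, which is exactly what the index structures in the following slides try to avoid.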

Spatial Indexes Using equally sized rectangles (Optimal for L1 )

Spatial Indexes Using overlapping rectangles

Spatial Indexes. Common data structures: R-Tree (and variants R*, R+, ...): overlapping rectangles; the search region is a rectangle. Quadtree (Octree): equally sized regions, recursively subdivided into 4 quadrants or 8 octants; search selects quadrants.

R-Tree

Quadtree
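The quadtree subdivision described above can be sketched in a few lines. This is an illustrative point quadtree, not code from the lecture; the capacity constant and all names are my own assumptions:

```java
// Point quadtree sketch: each node covers a square region; when a leaf holds
// more than CAP points it is subdivided into four equally sized quadrants.
public class QuadTree {
    static final int CAP = 4;                 // illustrative leaf capacity
    final double x, y, size;                  // lower-left corner, edge length
    final java.util.List<double[]> pts = new java.util.ArrayList<>();
    QuadTree[] kids;                          // null while this node is a leaf

    QuadTree(double x, double y, double size) {
        this.x = x; this.y = y; this.size = size;
    }

    void insert(double[] p) {
        if (kids == null) {
            pts.add(p);
            if (pts.size() > CAP) split();
        } else {
            child(p).insert(p);
        }
    }

    private void split() {
        double h = size / 2;
        kids = new QuadTree[] {
            new QuadTree(x, y, h),     new QuadTree(x + h, y, h),
            new QuadTree(x, y + h, h), new QuadTree(x + h, y + h, h)
        };
        for (double[] p : pts) child(p).insert(p);  // push points down
        pts.clear();
    }

    private QuadTree child(double[] p) {
        double h = size / 2;
        int i = (p[0] < x + h ? 0 : 1) + (p[1] < y + h ? 0 : 2);
        return kids[i];
    }

    // range query: collect all points inside the axis-aligned query rectangle;
    // quadrants not overlapping the rectangle are never visited
    void search(double qx, double qy, double qw, double qh,
                java.util.List<double[]> out) {
        if (qx > x + size || qx + qw < x || qy > y + size || qy + qh < y) return;
        if (kids == null) {
            for (double[] p : pts)
                if (p[0] >= qx && p[0] <= qx + qw && p[1] >= qy && p[1] <= qy + qh)
                    out.add(p);
        } else {
            for (QuadTree k : kids) k.search(qx, qy, qw, qh, out);
        }
    }
}
```

The "search selects quadrants" step from the slide is the overlap test at the top of search(): whole subtrees are pruned when their region misses the query rectangle.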

Spatial Indexes: Drawbacks. Data structures must minimize false negatives (to maximize recall) and false positives (to keep search time low). Features, distance function & parameters need to be selected at index time. Search combining multiple descriptors is a complicated issue. Works best for spaces with small dimension n; otherwise MDS has to be applied.

Indexing: Spatial Indexes, MDS, FastMap, Hashing, Metric Indexes, Inverted Lists

Multidimensional Scaling (MDS). Reducing the dimensions of a feature space, e.g. from 64 dimensions to 8, without losing too much information about neighborhoods. Applications in multimedia retrieval: indexing based on the reduced coordinates, using spatial indexes, i.e. data structures to find nearest neighbors fast.

Multidimensional Scaling (MDS). Interpolation: FastMap. Linear in the number of objects; used e.g. in IBM QBIC. Iterative: Force Directed Placement. Iterative optimization of an initial placement; cubic runtime.

FastMap. For each dimension d: find the pivots (the two most distant objects); then, for each object which is not a pivot, interpolate its position between the pivots in this dimension.

FastMap: the x-position of p3 is interpolated from d(pivot1, p3) and d(pivot2, p3).

FastMap y-position of p3
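The interpolation step shown in the two figures above is the cosine-law projection from the FastMap paper: the coordinate of p on the pivot line follows from the three pairwise distances. A minimal sketch; the class name is illustrative:

```java
// FastMap pivot projection: given the distances d(pivot1, p), d(pivot2, p)
// and d(pivot1, pivot2), the cosine law yields p's coordinate on the line
// through the two pivots (pivot1 sits at 0, pivot2 at d(pivot1, pivot2)).
public class FastMapStep {

    // x(p) = (d(P1,p)^2 + d(P1,P2)^2 - d(P2,p)^2) / (2 * d(P1,P2))
    static double project(double dP1p, double dP2p, double dP1P2) {
        return (dP1p * dP1p + dP1P2 * dP1P2 - dP2p * dP2p) / (2 * dP1P2);
    }
}
```

For points already on a line the formula is exact: with pivots at 0 and 10, a point at 3 (distances 3 and 7) projects to coordinate 3.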

FastMap: Pivots

How to find optimal pivots? 1. Select one object randomly -> P1. 2. Select the object P2 with maximum distance from P1. 3. If d(P1, P2) < t, set P1 = P2 and go to step 2. Normally no threshold t is used; instead, the loop is simply run x times.
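The fixed-round variant of this heuristic can be sketched as follows; L2 as the metric and all names are illustrative assumptions:

```java
// Pivot-finding heuristic: start from one object, repeatedly jump to the
// object farthest away, and stop after a fixed number of rounds.
public class PivotPick {

    static double l2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // index of the object farthest from objs[from]
    static int farthestFrom(double[][] objs, int from) {
        int best = from;
        double max = -1;
        for (int i = 0; i < objs.length; i++) {
            double d = l2(objs[from], objs[i]);
            if (d > max) { max = d; best = i; }
        }
        return best;
    }

    // returns the indices of the two pivots after 'rounds' farthest-point jumps
    static int[] pick(double[][] objs, int start, int rounds) {
        int p1 = start, p2 = start;
        for (int r = 0; r < rounds; r++) {
            p2 = farthestFrom(objs, p1);
            if (r < rounds - 1) p1 = p2;  // P1 := P2, search again from there
        }
        return new int[]{p1, p2};
    }
}
```

Each round costs one pass over the objects, so the whole heuristic stays linear per dimension, which is what keeps FastMap linear overall.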

Force Directed Placement. 1. All objects are assigned initial coordinates. 2. For each object o: movement vector v = 0; for each object p, calculate the repulsion & attraction forces between o & p, compute a movement vector v(o, p) depending on the forces, and set v = v + v(o, p); then move o by v. 3. If the overall movement is still high, go to step 2.

FDP: Parameters. Gravity as overall attraction: prevents uncontrolled spread. Overall repulsion: prevents objects from coming too close. Minimum distance: used if objects land on the same coordinates. Spring parameters: repulsion is stronger close up, attraction stronger far away.
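One iteration of the loop above can be sketched with a single linear spring per pair (omitting the gravity and repulsion parameters from the slide, which would be added the same way). Everything here is an illustrative assumption, not the lecture's implementation:

```java
// One force-directed placement iteration in 2D: each pair of objects exerts
// a spring force proportional to the difference between the desired
// (original-space) distance and the current distance in the layout.
public class FdpStep {

    // pos: current 2D coordinates (modified in place)
    // target: desired pairwise distances from the original space
    // step: damping factor; returns the overall movement of this iteration
    static double iterate(double[][] pos, double[][] target, double step) {
        int n = pos.length;
        double total = 0;
        for (int o = 0; o < n; o++) {
            double vx = 0, vy = 0;                       // movement vector v = 0
            for (int p = 0; p < n; p++) {
                if (p == o) continue;
                double dx = pos[p][0] - pos[o][0];
                double dy = pos[p][1] - pos[o][1];
                double cur = Math.max(1e-9, Math.hypot(dx, dy)); // min-distance guard
                double f = (cur - target[o][p]) / cur;   // >0 attracts, <0 repulses
                vx += f * dx * step;
                vy += f * dy * step;
            }
            pos[o][0] += vx;                             // move o by v
            pos[o][1] += vy;
            total += Math.hypot(vx, vy);
        }
        return total;                                    // stop when this gets small
    }
}
```

The returned total movement is the termination criterion of step 3: iterate until it drops below a chosen threshold.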

FDP

Demo: Emir

Indexing: Spatial Indexes, MDS, FastMap, Hashing, Metric Indexes, Inverted Lists

Locality Sensitive Hashing (LSH). An algorithm to determine the approximate near(est) neighbor. Given: a set P of points in R^d. Nearest neighbor: a query q returns the point p ∈ P minimizing ‖p − q‖. r-near neighbor: a query q returns a point p ∈ P so that ‖p − q‖ ≤ r. src. http://people.csail.mit.edu/indyk/mmds.pdf

LSH - Idea. Construct hash functions g: R^d → U so that: if ‖p − q‖ ≤ r then Pr[g(p) = g(q)] is not so small; if ‖p − q‖ > cr then Pr[g(p) = g(q)] is small.

LSH - Process. A family H of hash functions h: R^d → U is called (P1, P2, r, cr)-sensitive if: if ‖p − q‖ ≤ r then Pr[h(p) = h(q)] > P1; if ‖p − q‖ > cr then Pr[h(p) = h(q)] < P2. LSH uses functions g(p) = <h1(p), ..., hk(p)>. Preprocessing: select functions g1, ..., gL and hash every p ∈ P to the buckets g1(p), ..., gL(p). Query: retrieve points from the buckets g1(q), ..., gL(q).

LSH. LSH solves c-approximate NN with: number of hash functions L = n^ρ, where ρ = log(1/P1) / log(1/P2). E.g., for the Hamming distance we have ρ = 1/c. Constant success probability per query q.

LSH - Random Projection. Approximates the cosine distance; one bit per hash function. For two feature vectors, the number of matching bits is proportional to the probability of the vectors matching.

LSH Random Projection. Hash function h(v) = signum(v · r), where r is the normal vector of a random hyperplane. h(v) is in {-1, 1}, so h(v) produces 1 bit; k hash functions produce a k-bit hash value.

LSH Random Projection. Once: generate k hyperplanes (uniformly distributed directions) with d dimensions each. At index time: compute the hash for each feature. At search time: filter the index based on the hashes; at least m bits must match.
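The three steps above can be sketched as follows. Drawing each hyperplane entry from a Gaussian gives uniformly distributed hyperplane directions; the class and field names are illustrative, not the lecture's code:

```java
import java.util.Random;

// Random-projection LSH: k random hyperplanes are drawn once; bit i of a
// vector's k-bit hash is set when its dot product with hyperplane i is > 0.
public class RandomProjectionLsh {
    final double[][] planes;   // k hyperplane normals, d dimensions each

    RandomProjectionLsh(int k, int d, long seed) {
        Random rnd = new Random(seed);
        planes = new double[k][d];
        for (int i = 0; i < k; i++)
            for (int j = 0; j < d; j++)
                planes[i][j] = rnd.nextGaussian(); // uniform over directions
    }

    // index time: one bit per hash function, packed into an int (k <= 32)
    int hash(double[] v) {
        int bits = 0;
        for (int i = 0; i < planes.length; i++) {
            double dot = 0;
            for (int j = 0; j < v.length; j++) dot += planes[i][j] * v[j];
            if (dot > 0) bits |= 1 << i;
        }
        return bits;
    }

    // search time: number of matching bits between two k-bit hashes
    static int matchingBits(int h1, int h2, int k) {
        return k - Integer.bitCount(h1 ^ h2);
    }
}
```

Since only the sign of the dot product matters, scaling a vector does not change its hash, which is why this scheme approximates the cosine distance rather than L2.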

LSH - Stable Distribution. Approximates other metrics with h(v) = ⌊(a · v + b) / r⌋, where the entries of a are drawn from a stable distribution and b is a real number drawn uniformly from [0, r). Choosing a Gaussian distribution for a yields L2.

LSH Stable Distribution. To draw Gaussian entries, employ the Box-Muller transformation: with U1, U2 drawn uniformly from [0, 1], Z0 = sqrt(-2 ln U1) cos(2π U2) and Z1 = sqrt(-2 ln U1) sin(2π U2) are independent random variables with standard normal distribution.

LSH Stable Distribution

    private static double drawNumber() {
        // Box-Muller: one standard normally distributed value from two uniform draws
        return Math.sqrt(-2 * Math.log(Math.random())) * Math.cos(2 * Math.PI * Math.random());
    }

    public static void generateHashFunctions() throws IOException {
        ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(name));
        oos.writeInt(dimensions);
        oos.writeInt(numFunctionBundles);
        // the offsets b, drawn uniformly from [0, binLength)
        for (int c = 0; c < numFunctionBundles; c++) {
            oos.writeDouble(Math.random() * binLength);
        }
        // the projection vectors a, with normally distributed entries
        for (int c = 0; c < numFunctionBundles; c++) {
            for (int j = 0; j < dimensions; j++)
                oos.writeDouble(drawNumber());
        }
        oos.close();
    }

LSH - Stable Distribution. Generate hashes from a histogram:

    public static int[] generateHashes(double[] histogram) {
        int[] result = new int[numFunctionBundles];
        for (int k = 0; k < numFunctionBundles; k++) {
            double product = 0d;
            // dot product of the feature vector with projection vector a_k
            for (int i = 0; i < histogram.length; i++) {
                product += histogram[i] * hashA[k][i];
            }
            // h_k(v) = floor((a_k * v + b_k) / binLength)
            result[k] = (int) Math.floor((product + hashB[k]) / binLength);
        }
        return result;
    }

LSH - Stable Distribution. For histograms of dimension d: if m of k hash functions have to match, m*k has to be greater than d. The bin length r defines the length of the hashes. The elements of a & v are bounded, therefore the hash values are bounded as well.
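The search-time filter for these integer hashes, counting how many of the k hash values agree and requiring at least m matches, can be sketched as (names are illustrative):

```java
// m-of-k filter for stable-distribution LSH: a candidate passes when at
// least m of its k integer hash values equal the query's hash values.
public class HashFilter {

    static boolean matches(int[] queryHashes, int[] candidateHashes, int m) {
        int hits = 0;
        for (int k = 0; k < queryHashes.length; k++)
            if (queryHashes[k] == candidateHashes[k]) hits++;
        return hits >= m;
    }
}
```

Only candidates passing this filter need an exact distance computation, which is where the time-vs-space trade-off mentioned below comes from.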

LSH - Stable Distribution. CEDD with r = 1 on the Wang (SIMPLIcity) data set: min = -203, max = 167. CEDD with r = 0.1 on the same data set: min = -1746, max = 2130. A smaller r leads to a lot more bins, necessary for the trade-off between time and space.

Indexing: Spatial Indexes, MDS, FastMap, Locality Sensitive Hashing, Metric Indexes, Inverted Lists

Metric Indexes. A metric index is a tree of nodes, each node containing a fixed maximum number of entries; each entry is constituted by a routing entry D. src. Berretti, S.; del Bimbo, A. & Vicario, E.: Efficient Matching and Indexing of Graph Models in Content-Based Retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2001, 23, 1089-1105

Metric Indexes. D is the root of a sub-index containing the entries in the covering region of D. It also defines a radius rD, being the maximum distance from D to any entry in its covering region.

Metric Index: Construction. Top-down: indexing an entire archive at once; all documents to index are known; no iterative additions. Bottom-up: indexing on insertion; documents are indexed as they are added to the collection; optimizations (e.g. node splitting) have to be done on the fly.

Metric Index: Searching

Indexing: Spatial Indexes, MDS, FastMap, Locality Sensitive Hashing, Metric Indexes, Inverted Lists

Metric Spaces. M = (D, d): a data domain D and a total (distance) function d: D × D → R (the metric function or metric). The metric postulates: Non-negativity: ∀x, y ∈ D: d(x, y) ≥ 0. Symmetry: ∀x, y ∈ D: d(x, y) = d(y, x). Identity: ∀x, y ∈ D: d(x, y) = 0 ⇔ x = y. Triangle inequality: ∀x, y, z ∈ D: d(x, z) ≤ d(x, y) + d(y, z). src. G. Amato & P. Savino, Approximate Similarity Search in Metric Spaces Using Inverted Files, Infoscale 2008

Similarity Search in Metric Spaces. Objects close to one another see the space in a similar way. Choose a set of reference objects RO; the orderings of RO according to their distances from two similar data objects are similar as well. Represent every data object o as the ordering of RO as seen from o, and measure the similarity between two data objects by measuring the similarity between the corresponding orderings.

Similarity Search in Metric Spaces O1 := <5, 3, 4, 1, 2> O2 := <1, 5, 3, 4, 2> O3 :=

Similarity Search in Metric Spaces. Spearman Footrule Distance: SFD(S1, S2) = Σ_{ro ∈ RO} |S1(ro) − S2(ro)|
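The ordering representation and the Spearman Footrule Distance above can be sketched together. L2 as the underlying metric and all names are illustrative assumptions:

```java
import java.util.Arrays;
import java.util.Comparator;

// Permutation-based similarity: each object is represented by the ranks of
// the reference objects ordered by distance; two objects are compared via
// the Spearman Footrule Distance between their rank vectors.
public class PermutationIndex {

    static double l2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // rank[i] = position of reference object i in the ordering as seen from o
    static int[] ordering(double[] o, double[][] refs) {
        Integer[] idx = new Integer[refs.length];
        for (int i = 0; i < refs.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble(i -> l2(o, refs[i])));
        int[] rank = new int[refs.length];
        for (int pos = 0; pos < idx.length; pos++) rank[idx[pos]] = pos;
        return rank;
    }

    // Spearman Footrule Distance: sum of rank differences over all of RO
    static int sfd(int[] r1, int[] r2) {
        int sum = 0;
        for (int i = 0; i < r1.length; i++) sum += Math.abs(r1[i] - r2[i]);
        return sum;
    }
}
```

With reference objects at 0, 5 and 10 on a line, an object at 1 and one at 9 see RO in reversed order, so their footrule distance is maximal for three references.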

Similarity Search in Metric Spaces. [Slides: G. Amato]

Thanks