Similarity Estimation Techniques from Rounding Algorithms. Moses Charikar Princeton University


1 Similarity Estimation Techniques from Rounding Algorithms. Moses Charikar, Princeton University

2 Compact sketches for estimating similarity
- Collection of objects, e.g. mathematical representations of documents or images.
- Implicit similarity/distance function.
- Want to estimate similarity without looking at entire objects.
- Compute compact sketches of objects so that similarity/distance can be estimated from them.

3 Similarity Preserving Hashing
- Similarity function $\mathrm{sim}(x,y)$.
- Family of hash functions $F$ with a probability distribution such that
  $\Pr_{h \in F}[h(x) = h(y)] = \mathrm{sim}(x,y)$

4 Applications
- Compact representation scheme for estimating similarity:
  $x \to (h_1(x), h_2(x), \ldots, h_k(x))$
  $y \to (h_1(y), h_2(y), \ldots, h_k(y))$
- Approximate nearest neighbor search [Indyk,Motwani] [Kushilevitz,Ostrovsky,Rabani]

5 Estimating Set Similarity [Broder,Manasse,Glassman,Zweig] [Broder,C,Frieze,Mitzenmacher]
- Collection of subsets $S_1, S_2, \ldots$
- $\text{similarity} = \dfrac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$

6 Minwise Independent Permutations
- Apply a random permutation $\sigma$ to both sets: $S_1 \to \min(\sigma(S_1))$, $S_2 \to \min(\sigma(S_2))$.
- $\Pr[\min(\sigma(S_1)) = \min(\sigma(S_2))] = \dfrac{|S_1 \cap S_2|}{|S_1 \cup S_2|}$
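The collision probability above suggests a simple estimator: sketch each set by its minimum under $k$ permutations and count the coordinates on which two sketches agree. The following is a minimal sketch of that idea, assuming fully random permutations drawn with Python's `random` module (which are trivially min-wise independent but not space-efficient). The function names, the example sets, and the choice $k = 2000$ are illustrative assumptions, not taken from the slides.

```python
import random

def jaccard(a, b):
    """Exact set similarity |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def make_permutations(universe, k, seed=0):
    """Draw k random permutations of the universe, each stored as element -> rank.
    Assumption: fully random permutations stand in for a min-wise independent family."""
    rng = random.Random(seed)
    items = sorted(universe)
    perms = []
    for _ in range(k):
        shuffled = items[:]
        rng.shuffle(shuffled)
        perms.append({x: rank for rank, x in enumerate(shuffled)})
    return perms

def sketch(s, perms):
    """Sketch of s: the minimum rank of its elements under each permutation."""
    return [min(perm[x] for x in s) for perm in perms]

def estimate_similarity(sk_a, sk_b):
    """Fraction of sketch coordinates on which the two sketches agree."""
    matches = sum(1 for a, b in zip(sk_a, sk_b) if a == b)
    return matches / len(sk_a)

s1 = set(range(0, 60))
s2 = set(range(30, 90))
perms = make_permutations(s1 | s2, k=2000, seed=1)
exact = jaccard(s1, s2)  # 30 shared elements out of 90 in the union
approx = estimate_similarity(sketch(s1, perms), sketch(s2, perms))
```

Each coordinate agrees with probability equal to the set similarity, so `approx` concentrates around `exact` as $k$ grows.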

7 Related Work
- Streaming algorithms: compute f(data) in one pass using small space; implicitly construct a sketch of the data seen so far.
- Synopsis data structures [Gibbons,Matias]
- Compact distance oracles, distance labels.
- Hash functions with similar properties: [Linial,Sassoon] [Indyk,Motwani,Raghavan,Vempala] [Feige,Krauthgamer]

8 Results
- Necessary conditions for existence of similarity preserving hashing (SPH).
- SPH schemes from rounding algorithms:
  - Hash function for vectors based on random hyperplane rounding.
  - Hash function for estimating Earth Mover Distance based on rounding schemes for classification with pairwise relationships.

9 Existence of SPH schemes
- $\mathrm{sim}(x,y)$ admits an SPH scheme if there exists a family of hash functions $F$ such that
  $\Pr_{h \in F}[h(x) = h(y)] = \mathrm{sim}(x,y)$

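10 Theorem: If $\mathrm{sim}(x,y)$ admits an SPH scheme, then $1 - \mathrm{sim}(x,y)$ satisfies the triangle inequality.
Proof:
- $1 - \mathrm{sim}(x,y) = \Pr_{h \in F}[h(x) \neq h(y)]$
- $\Delta_h(x,y)$: indicator variable for $h(x) \neq h(y)$.
- For every $h$: $\Delta_h(x,y) + \Delta_h(y,z) \geq \Delta_h(x,z)$, since $h(x) \neq h(z)$ forces $h(y)$ to differ from at least one of them.
- $1 - \mathrm{sim}(x,y) = E_{h \in F}[\Delta_h(x,y)]$, so taking expectations of the pointwise inequality gives the triangle inequality.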
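The theorem can be used as a quick test for ruling out SPH schemes: if $1 - \mathrm{sim}$ violates the triangle inequality on any triple, no such hash family exists. The check below is an illustrative sketch. The Dice coefficient $2|A \cap B| / (|A| + |B|)$ is not mentioned on the slides and is used here only as an assumed example of a measure that fails the test, while the set similarity from slide 5 passes on the same triple.

```python
def jaccard(a, b):
    """Set similarity |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def dice(a, b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|); an assumed example, not from the slides."""
    return 2 * len(a & b) / (len(a) + len(b))

def violates_triangle(sim, x, y, z):
    """True if d = 1 - sim fails d(x, z) <= d(x, y) + d(y, z) for this triple."""
    def d(s, t):
        return 1 - sim(s, t)
    return d(x, z) > d(x, y) + d(y, z) + 1e-12

A, B, C = {"a"}, {"b"}, {"a", "b"}

# Dice: d(A, B) = 1, but d(A, C) + d(C, B) = 1/3 + 1/3.
dice_fails = violates_triangle(dice, A, C, B)

# Jaccard: d(A, B) = 1 and d(A, C) + d(C, B) = 1/2 + 1/2, so the inequality holds.
jaccard_fails = violates_triangle(jaccard, A, C, B)
```

Passing on one triple proves nothing by itself; a single failing triple is enough to rule a measure out.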

11 Stronger Condition
- Theorem: If $\mathrm{sim}(x,y)$ admits an SPH scheme, then $(1 + \mathrm{sim}(x,y))/2$ has an SPH scheme with hash functions mapping objects to $\{0,1\}$.
- Theorem: If $\mathrm{sim}(x,y)$ admits an SPH scheme, then $1 - \mathrm{sim}(x,y)$ is isometrically embeddable in the Hamming cube.

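12 Random Hyperplane Rounding based SPH [Goemans,Williamson]
- Collection of vectors, with $\mathrm{sim}(u,v) = 1 - \dfrac{\theta(u,v)}{\pi}$, where $\theta(u,v)$ is the angle between $u$ and $v$.
- Pick a random hyperplane through the origin (normal $r$):
  $h_r(u) = 1$ if $r \cdot u \geq 0$, and $h_r(u) = 0$ if $r \cdot u < 0$.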
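The hyperplane hash can be sketched in a few lines. This example assumes the random normal $r$ is drawn with independent standard Gaussian coordinates, which gives a uniformly random direction; the function names, the test vectors, and the sketch length of 4000 bits are illustrative assumptions rather than details from the slides.

```python
import math
import random

def random_normals(dim, k, seed=0):
    """k random hyperplane normals; Gaussian coordinates give a uniform random direction."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

def sketch(u, normals):
    """One bit per hyperplane: 1 if r . u >= 0, else 0."""
    return [1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0 for r in normals]

def estimate_similarity(sk_u, sk_v):
    """Fraction of sketch bits on which the two vectors agree."""
    return sum(1 for a, b in zip(sk_u, sk_v) if a == b) / len(sk_u)

def angle_similarity(u, v):
    """Exact value of 1 - theta(u, v) / pi."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return 1 - math.acos(cos) / math.pi

u = [1.0, 0.0, 0.0]
v = [1.0, 1.0, 0.0]  # 45 degrees away from u
normals = random_normals(dim=3, k=4000, seed=7)
exact = angle_similarity(u, v)  # 1 - (pi/4)/pi
approx = estimate_similarity(sketch(u, normals), sketch(v, normals))
```

A hyperplane separates $u$ and $v$ with probability $\theta(u,v)/\pi$, so each bit agrees with probability $\mathrm{sim}(u,v)$ and the bit-agreement rate estimates it.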

13 Earth Mover Distance (EMD)
[Figure: two distributions P and Q, with EMD(P,Q) as the distance between them]

14 Earth Mover Distance
- Set of points $L = \{l_1, l_2, \ldots, l_n\}$
- Distance function $d(i,j)$ (assume metric).
- Distribution $P(L)$: non-negative weights $(p_1, p_2, \ldots, p_n)$.
- Earth Mover Distance (EMD): distance between distributions $P$ and $Q$.
- Proposed as a metric in graphics and vision for distance between images. [Rubner,Tomasi,Guibas]

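15 EMD as a linear program
$\min \sum_{i,j} f_{i,j} \, d(i,j)$
subject to
$\forall i: \ \sum_{j} f_{i,j} = p_i$
$\forall j: \ \sum_{i} f_{i,j} = q_j$
$\forall i,j: \ f_{i,j} \geq 0$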

16 Relaxation of SPH
- Estimate a distance measure, not a similarity measure in $[0,1]$.
- Allow hash functions to map objects to points in a metric space and measure $E[d(h(P), h(Q))]$. (SPH: $d(x,y) = 1$ if $x \neq y$.)
- Estimator will approximate EMD.

17 Classification with pairwise relationships [Kleinberg,Tardos]
[Figure labels: assignment cost, separation cost, edge weight $w_e$]

18 Classification with pairwise relationships
- Collection of objects $V$
- Labels $L = \{l_1, l_2, \ldots, l_n\}$
- Assignment of labels $h : V \to L$
- Cost of assigning label to $u$: $c(u, h(u))$
- Graph of related objects; for edge $e = (u,v)$, cost paid: $w_e \cdot d(h(u), h(v))$
- Find assignment of labels to minimize cost.

19 LP Relaxation and Rounding [Kleinberg,Tardos] [Chekuri,Khanna,Naor,Zosin]
- Separation cost between distributions $P$ and $Q$ is measured by $\mathrm{EMD}(P,Q)$.
- Rounding algorithm guarantees:
  $\Pr[h(P) = l_i] = p_i$
  $E[d(h(P), h(Q))] \leq O(\log n \log\log n) \cdot \mathrm{EMD}(P,Q)$

20 Rounding details
- Probabilistically approximate the metric on $L$ by a tree metric (HST).
- Expected distortion $O(\log n \log\log n)$.
- EMD on a tree metric has a nice form:
  - $T$: subtree
  - $P(T)$: sum of probabilities for leaves in $T$
  - $l_T$: length of edge leading up from $T$
  - $\mathrm{EMD}(P,Q) = \sum_T l_T \, |P(T) - Q(T)|$
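The closed form for EMD on a tree is straightforward to evaluate: accumulate, for every subtree, the difference in probability mass between the two distributions, then weight it by the length of the edge above that subtree. The sketch below assumes a tree given as parent pointers with an edge length per non-root node; this representation, the function name `tree_emd`, and the small example tree are hypothetical and not taken from the slides.

```python
def tree_emd(parent, edge_len, p, q):
    """EMD on a tree metric: sum over non-root nodes T of edge_len[T] * |P(T) - Q(T)|.

    parent:   node -> parent node (the root maps to None)
    edge_len: node -> length of the edge from that node up to its parent
    p, q:     leaf -> probability
    """
    diff = {node: 0.0 for node in parent}
    for leaf in set(p) | set(q):
        delta = p.get(leaf, 0.0) - q.get(leaf, 0.0)
        node = leaf
        while node is not None:  # the leaf contributes to every subtree containing it
            diff[node] += delta
            node = parent[node]
    return sum(edge_len[node] * abs(diff[node])
               for node in parent if parent[node] is not None)

# Root r with children a and b (edges of length 2); leaves a1, a2 under a and b1 under b (length 1).
parent = {"r": None, "a": "r", "b": "r", "a1": "a", "a2": "a", "b1": "b"}
edge_len = {"a": 2.0, "b": 2.0, "a1": 1.0, "a2": 1.0, "b1": 1.0}
P = {"a1": 1.0}
Q = {"a2": 0.5, "b1": 0.5}
emd = tree_emd(parent, edge_len, P, Q)
```

In this example the direct transport plan moves mass 0.5 from `a1` to `a2` (tree distance 2) and mass 0.5 from `a1` to `b1` (tree distance 6), for a total cost of 4, which matches the subtree formula.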

21 Theorem: The rounding scheme gives a hashing scheme such that
$\mathrm{EMD}(P,Q) \leq E[d(h(P), h(Q))] \leq O(\log n \log\log n) \cdot \mathrm{EMD}(P,Q)$
Proof:
- $y_{i,j}$: probability that $h(P) = l_i$ and $h(Q) = l_j$.
- The $y_{i,j}$ give a feasible solution to the LP for EMD.
- Cost of this solution $= E[d(h(P), h(Q))]$.
- Hence $\mathrm{EMD}(P,Q) \leq E[d(h(P), h(Q))]$. The upper bound is the guarantee of the rounding algorithm.

22 SPH for weighted sets
- Weighted set: $(p_1, p_2, \ldots, p_n)$, weights in $[0,1]$.
- The Kleinberg-Tardos rounding scheme for the uniform metric can be thought of as a hashing scheme for weighted sets with
  $\mathrm{sim}(P,Q) = \dfrac{\sum_i \min(p_i, q_i)}{\sum_i \max(p_i, q_i)}$
- Generalization of minwise independent permutations.
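One way to read this slide as a hashing scheme is sketched below; the slides do not spell out the construction, so the details are assumptions. Shared randomness is a sequence of (index, threshold) pairs, and a weighted set is hashed to the first pair whose threshold falls below the weight at that index. Two weighted sets collide exactly when the first pair accepted by either one is accepted by both, which happens with probability $\sum_i \min(p_i, q_i) / \sum_i \max(p_i, q_i)$. The names `make_rounds` and `hash_weighted_set`, the cap of 64 rounds, and the example weights are hypothetical.

```python
import random

def weighted_similarity(p, q):
    """sum_i min(p_i, q_i) / sum_i max(p_i, q_i)."""
    return sum(min(a, b) for a, b in zip(p, q)) / sum(max(a, b) for a, b in zip(p, q))

def make_rounds(n, rounds, seed=0):
    """Shared randomness: a sequence of (index, threshold) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(n), rng.random()) for _ in range(rounds)]

def hash_weighted_set(p, rounds):
    """First (index, threshold) pair with threshold < p[index]; None if no pair is accepted.
    Assumption: the hash value is the accepted pair itself, not only the index."""
    for i, alpha in rounds:
        if alpha < p[i]:
            return (i, alpha)
    return None

p = [0.9, 0.5, 0.0, 0.2]
q = [0.6, 0.5, 0.3, 0.0]
exact = weighted_similarity(p, q)  # 1.1 / 1.9

trials = 3000
hits = 0
for t in range(trials):
    rounds = make_rounds(len(p), 64, seed=t)  # one independent hash function per trial
    hp, hq = hash_weighted_set(p, rounds), hash_weighted_set(q, rounds)
    hits += hp is not None and hp == hq
estimate = hits / trials
```

With 0/1 weights the similarity reduces to the set similarity $|S_1 \cap S_2| / |S_1 \cup S_2|$, which is the sense in which this generalizes minwise independent permutations.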

23 Conclusions and Future Work
- Interesting connection between rounding procedures for approximation algorithms and hash functions for estimating similarity.
- Better estimators for Earth Mover Distance.
- Ignored variance of estimators: related to dimensionality reduction in $L_1$.
- Study compact representation schemes in general.
