Distribution Distance Functions


COMP 875, November 10, 2009. Matthew O'Meara

Question: How similar are these?

Outline: 1. Motivation (Protein Score Function, Object Retrieval, Kernel Machines).

Parametrization of H-bond Geometry. H-bonds have 4 degrees of freedom. (Figure: H-bonds in the ubiquitin protein; H-bond geometry.)

Example H-bond Classification Decision. Do H-bonds have different geometry in sheets and helices? (Figure: each point corresponds to a hydrogen bond; left, H-bonds in beta sheets; right, H-bonds in helices. AHDist is the bond length and cosBAH is the cosine of the angle at the acceptor.)


Image Retrieval. Problem: given a new image, return similar images from a database, e.g., to browse through an image library (Rubner 2000). This requires a similarity measure.

Represent Images as Distributions. Image features have relationships: inherent qualities (e.g., color, texture, edges) and spatial qualities (e.g., where in the picture). (Figure: visualizing shape context; Grauman and Darrell 2004.)

Music Retrieval. Retrieve songs similar to a given song from a database. Music descriptors: Mel-frequency cepstral coefficients (the spectrum of the spectrum, scaled for human hearing), Pandora music genome descriptors, etc. Comparing music similarity (Typke 2003). Pandora.com internet radio builds playlists from example songs.


ML Algorithms Require Kernels. Many machine learning algorithms use kernels: unsupervised learning (clustering, nearest neighbor, etc.) and supervised learning (support vector machines, classification, etc.).

Bags of Features: Example. Images can be represented as histograms over texture features: extract features, learn a visual vocabulary, then quantize the features using the visual vocabulary. (Julesz 1981; Cula & Dana 2001; Leung & Malik 2001; Mori, Belongie & Malik 2001; Schmid 2001; Varma & Zisserman 2002, 2003; Lazebnik, Schmid & Ponce 2003; slide from Lazebnik 2009.)


Bin-by-Bin Comparison. Compare each feature (bin) separately. Examples: the χ² goodness-of-fit test and the Kullback-Leibler divergence.

χ² Test for Goodness of Fit. Test the null hypothesis that there is no significant deviation from the expected results. Let O and E be the observed and expected distributions with n − 1 degrees of freedom, and let
$$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}.$$
Compare with the χ² distribution to get the goodness of fit. The statistic can be made symmetric.
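A minimal numpy sketch of the statistic and one common symmetric variant (the function names are illustrative, and all expected counts are assumed nonzero):

```python
import numpy as np

def chi2_statistic(observed, expected):
    """Chi-squared goodness-of-fit statistic; assumes expected[i] > 0."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return np.sum((observed - expected) ** 2 / expected)

def chi2_symmetric(p, q):
    """Symmetric variant: measure each histogram against the bin average."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = (p + q) / 2.0
    return np.sum((p - m) ** 2 / m)  # equals (1/2) * sum (p - q)^2 / (p + q)
```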

Kullback-Leibler Divergence. Let P and Q be distributions; the support of P has to be a subset of the support of Q. The KL divergence is the expected number of extra bits needed to encode samples from P using a code optimized for Q:
$$D_{KL}(P \,\|\, Q) = \sum_i P_i \log \frac{P_i}{Q_i}.$$
It can be made symmetric.
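A small numpy sketch (helper names are illustrative; inputs are assumed to be nonnegative histograms satisfying the support condition above):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) in bits; assumes supp(P) is a subset of supp(Q)."""
    p = np.asarray(p, dtype=float); p = p / p.sum()   # normalize to a distribution
    q = np.asarray(q, dtype=float); q = q / q.sum()
    mask = p > 0                                      # 0 * log 0 = 0 by convention
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def symmetric_kl(p, q):
    """One common symmetrization: D_KL(P||Q) + D_KL(Q||P)."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```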

Bin-by-Bin Comparison: Usefulness. The good: simple concepts; fast to evaluate, O(n); good at assessing equivalence; lots of variants, well studied. The bad: sensitive to the variance of the signal, and all far things are very far (once histograms stop overlapping, the distance saturates).

Bin-by-Bin vs. Cross-Bin Metrics. (Upper) A bin-by-bin comparison wrongly classifies the left pair as being less similar than the right pair. (Lower) A cross-bin comparison correctly classifies the left pair as being more similar than the right pair (Rubner 1997).

Quantization Error. Quantization error is the error due to splitting data into categories when the data are not separable. Example: creating a cluster codebook from non-separable data.

Quantization Error Example. Lighting and deformation exacerbate quantization error. (Figure: image pairs with deformation and lighting changes; Ling 2007.)


Earth Mover's Distance Example. (Figure: minimal flow from the white signature to the black signature; Rubner 2000.)

Earth Mover's Distance Definition. Definition (Rubner 1997): represent each object as a signature S with centers {m_i} and weights {w_i}. The Earth Mover's Distance has the form
$$EMD(S_1, S_2) = \frac{\sum_{i,j} f_{ij} \, d(m_i, m_j)}{\sum_{i,j} f_{ij}},$$
where the ground distances d(m_i, m_j) are given and the flows f_{ij} are solved for by an optimization problem.

Earth Mover's Distance Flow Constraints. In
$$EMD(S_1, S_2) = \frac{\sum_{i,j} f_{ij} \, d(m_i, m_j)}{\sum_{i,j} f_{ij}},$$
the flows are constrained as follows. All flow must be positive: $f_{ij} \ge 0$. The flow to or from each center is at most the weight there: $\sum_i f_{ij} \le w_j$ and $\sum_j f_{ij} \le w_i$. The total flow is the weight of the lighter of S_1 and S_2: $\sum_{i,j} f_{ij} = \min\big(\sum_i w_i, \sum_j w_j\big)$.
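As a concrete illustration, this optimization can be handed to a generic LP solver. A sketch using scipy.optimize.linprog (the function name emd and the choice of Euclidean ground distance are illustrative; m1 and m2 are arrays of centers with shapes (n, d) and (m, d)):

```python
import numpy as np
from scipy.optimize import linprog

def emd(w1, m1, w2, m2):
    """Earth Mover's Distance between signatures (m1, w1) and (m2, w2),
    solved as the transportation LP defined above."""
    w1, w2 = np.asarray(w1, dtype=float), np.asarray(w2, dtype=float)
    m1, m2 = np.asarray(m1, dtype=float), np.asarray(m2, dtype=float)
    n, m = len(w1), len(w2)
    # Ground distances d(m_i, m_j): Euclidean distance between centers.
    D = np.linalg.norm(m1[:, None, :] - m2[None, :, :], axis=2)
    c = D.ravel()                          # minimize sum_ij f_ij * d(m_i, m_j)
    # Row sums bounded by w1, column sums bounded by w2.
    A_ub = np.zeros((n + m, n * m))
    for i in range(n):
        A_ub[i, i * m:(i + 1) * m] = 1.0   # sum_j f_ij <= w1_i
    for j in range(m):
        A_ub[n + j, j::m] = 1.0            # sum_i f_ij <= w2_j
    b_ub = np.concatenate([w1, w2])
    # Total flow equals the lighter signature's total weight.
    A_eq = np.ones((1, n * m))
    b_eq = [min(w1.sum(), w2.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun / b_eq[0]               # normalize by the total flow
```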

EMD as a Flow Problem. EMD can be represented as a minimum-cost flow problem on a graph (Ling 2007).

EMD History. Rediscovered many times: (1942) Kantorovich proposes the mass transport problem. (1972) Mallows: given two marginal distributions, find a particular minimal joint distribution (Levina and Bickel showed the equivalence in 2001). (1996) Rubner et al. propose EMD for image retrieval.

EMD Usefulness. The good: simple concept; results often seem natural; robust to noise. The bad: the LP-based solution has O(n³ log n) running time (Orlin 1988).


Diffusion Distance. Measure diffusion between histograms h_1(x) and h_2(x). The diffusion equation for the temperature field T is
$$\frac{\partial T}{\partial t} = \frac{\partial^2 T}{\partial x^2}.$$
Set the initial and limiting conditions to
$$T(x, 0) = h_1(x) - h_2(x), \qquad T(x, \infty) = 0.$$
The solution is convolution with Gaussian filters: $T(x, t) = T(x, 0) * \phi(x, t)$. Define the diffusion distance between h_1(x) and h_2(x) to be
$$K(h_1, h_2) = \int_0^{\infty} \lVert T(x, t) \rVert \, dt.$$
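A discrete sketch of this distance, in the spirit of the pyramid approximation of Ling and Okada (2006): repeatedly smooth and downsample the difference histogram and accumulate its L1 norm (the scale schedule, σ, and the number of levels are illustrative choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def diffusion_distance(h1, h2, n_scales=8, sigma=1.0):
    """Approximate diffusion distance between two 1-D histograms."""
    d = np.asarray(h1, dtype=float) - np.asarray(h2, dtype=float)
    total = np.abs(d).sum()                  # t = 0 term
    for _ in range(n_scales):
        d = gaussian_filter1d(d, sigma)      # one diffusion (smoothing) step
        d = d[::2]                           # downsample, pyramid style
        total += np.abs(d).sum()
    return total
```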

Examples. (Figure: the columns are diffusion processes for difference histograms; notice the left column diffuses faster. Ling 2006.)


EMD Approximation by Embedding (Indyk and Thaper 2003). Ideas: quantize with a multilevel histogram; randomize the bin offsets, so that the probability that points p and q land in the same bin is governed by d(p, q); find nearest neighbors via locality-sensitive hashing.

Example Multilevel Histogram. To compare point sets P and Q with minimum separation 1, compute a randomly offset multilevel histogram and set
$$K(P, Q) = \sum_{i,l} 2^{l} \, \lvert v_{il}(P) - v_{il}(Q) \rvert,$$
where v_{il} gives the number of points in bin i of level l. (Figure: multilevel histogram for orange and blue point sets.)
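A sketch of this embedding for 2-D point sets (the function names and level count are illustrative; both sets must share the same random offset):

```python
import numpy as np

def multilevel_embedding(points, levels, offset):
    """Map a 2-D point set to counts {(level, bin): n_points}.
    Bin side length doubles at each level; bins share a random offset."""
    points = np.asarray(points, dtype=float)
    counts = {}
    for l in range(levels):
        side = 2.0 ** l
        bins = np.floor((points + offset) / side).astype(int)
        for b in map(tuple, bins):
            counts[(l, b)] = counts.get((l, b), 0) + 1
    return counts

def embedded_distance(cp, cq):
    """Weighted L1 distance between embeddings: sum of 2^l |v_P - v_Q|."""
    return sum(2.0 ** l * abs(cp.get((l, b), 0) - cq.get((l, b), 0))
               for (l, b) in set(cp) | set(cq))

# Usage: the same random offset must be used for every set being compared.
rng = np.random.default_rng(0)
offset = rng.uniform(0.0, 1.0, size=2)
P = rng.uniform(0.0, 16.0, size=(50, 2))
Q = rng.uniform(0.0, 16.0, size=(50, 2))
print(embedded_distance(multilevel_embedding(P, 5, offset),
                        multilevel_embedding(Q, 5, offset)))
```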

How to Use LSH for Nearest Neighbor Search. Preparation: define a randomly offset multilevel histogram V = {v_il}; pick LSH parameters, a range r ∈ ℝ, and k random lines defined by a_1, ..., a_k and b_1, ..., b_k ∈ [0, r]. Preprocessing, for each image in the database: compute the coordinates x = {x_il} and the hash values $h_{a_j, b_j}(x) = \lfloor (a_j \cdot x + b_j) / r \rfloor$. For each test image: compute its hashes and look up the images with the same k hash values.
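A compact sketch of such a hash table (the class name, the Cauchy projections for L1 distances, and the parameter defaults are illustrative assumptions):

```python
import numpy as np

class StableLSH:
    """Hash table keyed by the k values floor((a_j . x + b_j) / r)."""
    def __init__(self, dim, k=8, r=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_cauchy((k, dim))   # Cauchy suits L1 distances
        self.b = rng.uniform(0.0, r, size=k)
        self.r = r
        self.table = {}

    def _key(self, x):
        return tuple(np.floor((self.A @ x + self.b) / self.r).astype(int))

    def add(self, x, label):
        """Index one database vector (e.g., a flattened embedding)."""
        self.table.setdefault(self._key(x), []).append(label)

    def query(self, x):
        """Return labels whose vectors share all k hash values with x."""
        return self.table.get(self._key(x), [])
```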


Wavelet Approach. EMD via embedding looks like a Haar wavelet encoding. What about other wavelet bases? (Figure: left, the first 3 Haar wavelets, www.wavelet.org; right, the Daubechies-4 father wavelet, www.wikimediacommons.org.)

Approximation Using the Wavelet Domain. Shirdhonkar and Jacobs achieve an O(n) approximation using the wavelet domain. The dual Kantorovich-Rubinstein transshipment problem has a wavelet-domain representation with an explicit solution,
$$\hat{d}_{emd}(p) = \sum_{\lambda} 2^{-j(1 + n/2)} \lvert p_\lambda \rvert,$$
where p is the difference histogram, the p_λ are its wavelet coefficients with shift λ at scale j, and n is the dimension of the histogram domain.
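A sketch of this computation for 2-D histograms using PyWavelets (the weighting follows the 2^{-j(1+n/2)} form with n = 2, but the scale-index convention and normalization are simplified assumptions, and the coarsest approximation band is ignored, which assumes equal total mass):

```python
import numpy as np
import pywt

def wavelet_emd(h1, h2, wavelet="haar", levels=4):
    """Wavelet-domain EMD approximation between two 2-D histograms."""
    p = np.asarray(h1, dtype=float) - np.asarray(h2, dtype=float)
    coeffs = pywt.wavedec2(p, wavelet, level=levels)
    n = 2                                  # dimension of the histogram domain
    dist = 0.0
    # coeffs[0] is the coarsest approximation (ignored here);
    # coeffs[1:] are detail bands ordered from coarse to fine.
    for k, details in enumerate(coeffs[1:]):
        j = k + 1                          # finer scales get a larger index j,
        w = 2.0 ** (-j * (1.0 + n / 2.0))  # so moving mass far costs more
        dist += w * sum(np.abs(d).sum() for d in details)
    return dist
```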

Comparison of Linear Approximations. Shirdhonkar and Jacobs compared their wavelet EMD to exact EMD over 100 random 16 × 16 histograms.

Comparison of Linear Approximations. Shirdhonkar and Jacobs also compared their wavelet EMD to EMD over the SIMPLIcity database: 10 image classes with 100 images each; each image is a 16 × 16 × 16 histogram in LAB color space with Euclidean ground distance.

Method       | Bounds ratio | Normalized RMS error | Preproc. time (s) | Compare time (ms)
EMD          | -            | -                    | 0.92              | 63
Wavelet EMD  | 7.03         | 18%                  | 2.35              | 0.11
Indyk-Thaper | 11.00        | 43%                  | 0.51              | 22

Thanks! Questions?