Fundamentals of Media Processing Shin'ichi Satoh Kazuya Kodama Hiroshi Mo Duy-Dinh Le
Today's topics: Nonparametric Methods (Parzen Window, k-Nearest Neighbor Estimation); Clustering Techniques (k-means, Agglomerative Hierarchical Clustering)
Bayesian decision theory A posteriori probability (posterior): the probability of the state of nature given that the feature value has been observed, e.g., P(ω|x). Likelihood: the likelihood of the state of nature with respect to the feature value, e.g., p(x|ω). Bayes formula: P(ω|x) = p(x|ω)P(ω)/p(x)
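A tiny numeric illustration of Bayes' formula (a sketch; the two classes, likelihood values, and priors are made-up numbers, not from the lecture):

```python
# Minimal numeric illustration of Bayes' formula (all values made up).
# P(w_i | x) = p(x | w_i) P(w_i) / p(x), with p(x) = sum_j p(x | w_j) P(w_j).

likelihoods = {"w1": 0.6, "w2": 0.1}   # p(x | w_i) at some observed x
priors = {"w1": 0.3, "w2": 0.7}        # P(w_i)

evidence = sum(likelihoods[w] * priors[w] for w in likelihoods)  # p(x)
posteriors = {w: likelihoods[w] * priors[w] / evidence for w in likelihoods}

print(posteriors)  # {'w1': 0.72, 'w2': 0.28}: decide w1 even though its prior is smaller
```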
Normal distribution
Covariance matrix and its algebraic/geometric interpretation What is the quadratic form? (figure)
Classification Using PCA Σ = E{XX^T}; Σu_i = λ_i u_i; U = [u_1, u_2, ..., u_m]. A sample x ∈ R^d is projected to y = U^T x ∈ R^m and reconstructed as x̂ = Uy. Detection of faces based on distance from face space; recognition of faces based on distance within face space. (J. M. Rehg, 2002)
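Not from the lecture: a minimal NumPy sketch of the eigen-decomposition, projection, and reconstruction steps above. Random vectors stand in for face images, and n, d, m are arbitrary choices:

```python
import numpy as np

# Toy stand-in for face images: n samples of dimension d (random data, not faces).
rng = np.random.default_rng(0)
n, d, m = 200, 50, 10
X = rng.normal(size=(n, d))
X = X - X.mean(axis=0)                # center the data

# Eigenvectors of the sample covariance Sigma = E{x x^T}.
Sigma = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalue order
U = eigvecs[:, ::-1][:, :m]                # top-m eigenvectors, U = [u_1 ... u_m]

x = X[0]
y = U.T @ x                           # projection into face space, y = U^T x
x_hat = U @ y                         # reconstruction, x_hat = U y

dffs = np.linalg.norm(x - x_hat)      # distance from face space (detection)
difs = np.linalg.norm(y)              # distance within face space (recognition)
print(dffs, difs)
```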
Nonparametric Methods So far we have studied "parametric" methods: probability distribution functions (or, equivalently, decision boundaries) can be represented in parametric form. Normal density case: mean and variance (or covariance matrix). PCA case: a low-dimensional subspace and its span. These methods assume that the underlying probability distribution of the actual observations is known and takes a parametric form. However, in many cases this assumption is suspect.
Nonparametric Methods A simple approach is to construct a histogram: given sample data, we can build a histogram with a certain bin size (division of each axis) and treat the normalized histogram as a probability density function.
Nonparametric Methods The optimal number of bins M (or bin size) is the issue. If the bin width is small (i.e., M is large), the estimated density is very spiky (i.e., noisy). If the bin width is large (i.e., M is small), the true structure of the density is smoothed out. In practice, we need to find a value of M that compromises between these two issues. Also, how do we extend this to the multidimensional case?
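As a concrete illustration, here is a minimal 1-D sketch of the histogram approach (the sample data and the bin count M = 20 are arbitrary choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=1000)   # samples from an "unknown" density

M = 20                                          # number of bins: the parameter to tune
counts, edges = np.histogram(x, bins=M)
widths = np.diff(edges)
density = counts / (counts.sum() * widths)      # normalize so the estimate integrates to 1

# density[i] is the estimated p(x) on bin i; try small/large M to see spiky vs. over-smoothed.
print(edges[:3], density[:3])
```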
Nonparametric Density Estimation The probability that a given vector x, drawn from the unknown density p(x), will fall inside some region R of the input space is given by: P = ∫_R p(x′) dx′. If we have n data points {x_1, x_2, ..., x_n} drawn independently from p(x), the probability that k of them will fall in R is given by the binomial law: P_k = C(n, k) P^k (1 − P)^(n−k)
Nonparametric Density Estimation The expected value of k is: E[k] = nP. The expected fraction of points falling in R is: E[k/n] = P. The variance is given by: Var[k/n] = E[(k/n − P)^2] = P(1 − P)/n
Nonparametric Density Estimation The distribution is sharply peaked as n → ∞, thus: P ≈ k/n (Approximation 1)
Nonparametric Density Estimation If we assume that p(x) is continuous and does not vary significantly over the region R, we can approximate P by: P = ∫_R p(x′) dx′ ≈ p(x)·V (Approximation 2), where V is the volume enclosed by R.
Nonparametric Density Estimation Combining these two approximations we have: p(x) ≈ (k/n)/V. The above approximation is based on contradictory assumptions: R is relatively large (i.e., it contains many samples, so that P_k is sharply peaked) — Approximation 1; R is relatively small, so that p(x) is approximately constant inside the integration region — Approximation 2. We need to choose an optimum R in practice...
Nonparametric Density Estimation Suppose we form a sequence of regions R_1, R_2, ... containing x: R_1 contains k_1 samples, R_2 contains k_2 samples, etc., and R_n has volume V_n and contains k_n samples. The n-th estimate p_n(x) of p(x) is given by: p_n(x) = (k_n/n)/V_n
Nonparametric Density Estimation The following conditions must be satisfied in order for p_n(x) to converge to p(x): lim(n→∞) V_n = 0 (Approximation 2); lim(n→∞) k_n = ∞ (Approximation 1); lim(n→∞) k_n/n = 0 (to allow p_n(x) to converge)
Nonparametric Density Estimation How do we choose optimum values for V_n and k_n in p_n(x) = (k_n/n)/V_n? Two leading approaches: (1) Fix the volume V_n and determine k_n from the data (kernel-based density estimation methods), e.g., V_n = 1/√n. (2) Fix the value of k_n and determine the corresponding volume V_n from the data (k-nearest-neighbor method), e.g., k_n = √n.
Parzen Windows Problem: given a vector x, estimate p(x). Assume R to be a hypercube with sides of length h_n, centered at the point x, so that V_n = h_n^d and p_n(x) = (k_n/n)/V_n. To find an expression for k_n (i.e., the number of points in the hypercube), let us define a kernel function: φ(u) = 1 if |u_j| ≤ 1/2 for j = 1,...,d, and φ(u) = 0 otherwise.
Parzen Windows The total number of points x_i falling inside the hypercube centered at x is: k_n = Σ_{i=1..n} φ((x − x_i)/h_n). The estimate then becomes: p_n(x) = (k_n/n)/V_n = (1/n) Σ_{i=1..n} (1/V_n) φ((x − x_i)/h_n), where φ((x − x_i)/h_n) equals 1 if x_i falls within the hypercube. This is the Parzen-window estimate.
Parzen Windows The density estimate is a superposition of kernel functions centered at the samples x_i: p_n(x) = (1/n) Σ_{i=1..n} (1/V_n) φ((x − x_i)/h_n). φ(u) interpolates the density between samples: each sample x_i contributes to the estimate according to its distance from x.
Parzen Windows The kernel function φ(u) can have a more general form (i.e., not just a hypercube). In order for p_n(x) to be a legitimate estimate, φ must be a valid density itself: φ(u) ≥ 0 and ∫ φ(u) du = 1
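A minimal 1-D Parzen-window sketch. It uses a Gaussian kernel rather than the hypercube (any valid density works, as the slide notes); the sample data, window widths, and the function name parzen_estimate are illustrative assumptions:

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Parzen-window estimate p_n(x) = (1/n) sum_i (1/h) phi((x - x_i)/h),
    with a Gaussian kernel phi (a valid density: nonnegative, integrates to 1)."""
    u = (x - samples) / h
    phi = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    return phi.sum() / (len(samples) * h)

rng = np.random.default_rng(2)
samples = rng.normal(size=500)                   # standard normal samples

for h in (0.05, 0.3, 2.0):                       # too small / reasonable / too large
    print(h, parzen_estimate(0.0, samples, h))   # estimates of p(0); true value ~0.399
```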
Parzen Windows The parameter h_n acts as a smoothing parameter that needs to be optimized. When h_n is too large, the estimated density is over-smoothed (i.e., a superposition of broad kernel functions). When h_n is too small, the estimate represents the properties of the data rather than the true density (i.e., a superposition of narrow kernel functions).
Parzen Windows Figure: the kernel φ(u) for different h_n values.
Parzen Windows Figure: example p_n(x) estimates assuming 5 samples.
Parzen Windows Figure: example where both p(x) and φ(u) are Gaussian, with h_n = h_1/√n; the estimates p_n(x) are shown for increasing n.
Parzen Windows Figure: example where p(x) consists of a uniform and a triangular density and φ(u) is Gaussian, with h_n = h_1/√n; the estimates p_n(x) are shown.
k-Nearest Neighbor Estimate Fix k_n and allow V_n to vary: consider a hypersphere around x and allow its radius to grow until it contains k_n data points; V_n is then the volume of that hypersphere, and p_n(x) = (k_n/n)/V_n. The size of the hypersphere depends on the local density.
k-Nearest Neighbor Estimate The parameter k_n acts as a smoothing parameter and needs to be optimized.
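A minimal 1-D k-NN density sketch, fixing k and letting the "volume" (here, an interval length) grow until it encloses the k nearest samples. The data, seed, and function name knn_density are illustrative assumptions:

```python
import numpy as np

def knn_density(x, samples, k):
    """k-NN estimate p_n(x) = (k/n) / V, where V is the length of the smallest
    interval centered at x containing the k nearest samples (1-D 'hypersphere')."""
    r = np.sort(np.abs(samples - x))[k - 1]   # radius to the k-th nearest sample
    V = 2 * r                                 # interval [x - r, x + r]
    return (k / len(samples)) / V

rng = np.random.default_rng(3)
samples = rng.normal(size=1000)
k = int(np.sqrt(len(samples)))                # e.g., k_n = sqrt(n), as in the slides
print(knn_density(0.0, samples, k))           # should be near the true p(0) ~ 0.399
```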
k-Nearest Neighbor Estimate Figures: comparison of Parzen-window and k-nearest-neighbor estimates, with k_n = k_1√n.
k-Nearest Neighbor Classifier Goal: estimate the posteriors P(ω_i|x). Suppose that we have c classes and that class ω_i contains n_i points, with n_1 + n_2 + ... + n_c = n. Given a point x, we find its k nearest neighbors. Suppose that k_i of these k points belong to class ω_i; then: p_n(x|ω_i) = (k_i/n_i)/V
k-Nearest Neighbor Classifier The prior probabilities can be computed as: P(ω_i) = n_i/n. Using the Bayes rule, the posterior probabilities can be computed as follows: P(ω_i|x) = p_n(x|ω_i)P(ω_i)/p_n(x) = k_i/k, where p_n(x) = (k/n)/V
k-Nearest Neighbor Classifier k-nearest-neighbor classification rule: given a data point x, find a hypersphere around it that contains k points and assign x to the class having the largest number of representatives inside the hypersphere: P(ω_i|x) = p(x|ω_i)P(ω_i)/p(x) = k_i/k. When k = 1, we get the nearest-neighbor rule.
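A minimal sketch of this rule in Python (not from the lecture; the two-class toy data, the function name knn_classify, and k = 5 are illustrative assumptions):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k):
    """Assign x to the class with the most representatives among its k nearest
    training points; P(w_i | x) is estimated by k_i / k."""
    d = np.linalg.norm(X_train - x, axis=1)        # Euclidean distances to all points
    nearest = np.argsort(d)[:k]                    # indices of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

rng = np.random.default_rng(4)
X0 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # class 0
X1 = rng.normal(loc=[3, 3], scale=1.0, size=(50, 2))   # class 1
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 50 + [1] * 50)

print(knn_classify(np.array([2.5, 2.5]), X_train, y_train, k=5))   # most likely 1
```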
k-Nearest Neighbor Classifier The decision boundary is piecewise linear. Each line segment corresponds to the perpendicular bisector of two points belonging to different classes.
k-Nearest Neighbor Classifier Let P* be the minimum possible error, which is given by the minimum-error-rate classifier. Let P be the error given by the nearest-neighbor rule. Given an unlimited number of training data, it can be shown that: P* ≤ P ≤ P*(2 − (c/(c−1))·P*) ≤ 2P*
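For example, with c = 2 classes and P* = 0.05, the upper bound evaluates to 0.05 × (2 − 2 × 0.05) = 0.095, slightly below the loose bound 2P* = 0.1: the nearest-neighbor rule is at most about twice as bad as the optimal classifier.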
Clustering So far we have assumed that class labels are given for the training samples. Sometimes it is very costly to provide class labels. What can we do if we don't know the class labels? Unsupervised methods, or smart preprocessing methods. Clustering discovers distinct subclasses in the observed data distribution.
Algorithm: k-means 1. Determine the number of clusters: k 2. (Randomly) guess k cluster-center locations 3. Each data point finds out which center it's closest to 4. Each center finds the centroid of the points it owns 5. Terminate if the assignment of the N data points does not change 6. Repeat from 3 otherwise (a code sketch follows the step figures below)
K-means Clustering: Steps 1-5 Algorithm: k-means; distance metric: Euclidean distance. Figures (axes: expression in condition 1 vs. expression in condition 2): the centers k_1, k_2, k_3 are placed; each point is assigned to its nearest center; each center moves to the centroid of its points; the assignment and update steps repeat until convergence.
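As referenced above, a minimal NumPy sketch of the k-means loop (the toy 2-D data, seed, and function name kmeans are illustrative assumptions, not from the lecture):

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    """Plain k-means: assign each point to its nearest center (Euclidean distance),
    move each center to the centroid of its points, stop when assignments settle."""
    centers = X[rng.choice(len(X), size=k, replace=False)]   # step 2: random guess
    assign = np.full(len(X), -1)
    for _ in range(max_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assign = d.argmin(axis=1)                        # step 3: nearest center
        if np.array_equal(new_assign, assign):               # step 5: no change -> stop
            break
        assign = new_assign
        for j in range(k):                                   # step 4: recompute centroids
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(40, 2))
               for c in ([0, 0], [3, 0], [1.5, 3])])         # three toy blobs
centers, assign = kmeans(X, k=3, rng=rng)
print(centers)
```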
Hierarchical Clustering Algorithm (Agglomerative Hierarchical Clustering) 1. Initialize c (the desired number of clusters), ĉ = n, and D_i = {x_i} for i = 1,...,n 2. ĉ = ĉ − 1 3. Find the nearest clusters, say, D_i and D_j 4. Merge D_i and D_j 5. Repeat from 2 until c = ĉ 6. Return the c clusters
Hierarchical Clustering Dendrogram (figure)
The Nearest-Neighbor Algorithm If the minimum distance between elements of two clusters is used, the method is called the nearest-neighbor cluster algorithm. If it is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the single-linkage algorithm.
The Farthest-Neighbor Algorithm If the maximum distance between elements of two clusters is used, the method is called the farthest-neighbor cluster algorithm. If it is terminated when the distance between the nearest clusters exceeds an arbitrary threshold, it is called the complete-linkage algorithm.
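A minimal sketch of the agglomerative loop from the algorithm above, with the cluster-to-cluster distance selectable: min gives the nearest-neighbor (single-linkage) variant, max the farthest-neighbor (complete-linkage) variant. The toy data, the function name agglomerative, and the target count c = 2 are illustrative assumptions:

```python
import numpy as np
from itertools import combinations

def agglomerative(X, c, linkage="min"):
    """Start with n singleton clusters; repeatedly merge the two nearest clusters
    until c remain. Cluster distance is the min (single-linkage) or max
    (complete-linkage) pairwise Euclidean distance between their members."""
    reduce = min if linkage == "min" else max
    clusters = [[i] for i in range(len(X))]          # D_i = {x_i}
    while len(clusters) > c:                         # each pass: c_hat = c_hat - 1
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            dist = reduce(np.linalg.norm(X[i] - X[j])
                          for i in clusters[a] for j in clusters[b])
            if best is None or dist < best[0]:
                best = (dist, a, b)                  # nearest pair of clusters
        _, a, b = best
        clusters[a] += clusters[b]                   # merge D_a and D_b
        del clusters[b]
    return clusters

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(10, 2)) for c in ([0, 0], [2, 2])])
print(agglomerative(X, c=2, linkage="min"))          # sample indices grouped into 2 clusters
```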