
Machine Learning Algorithms (IFT6266 A7), Prof. Douglas Eck, Université de Montréal. These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop.

A note: Methods. We (perhaps unwisely) skipped Bishop 2.1--2.4 until just before our graphical models lectures. Give it a look.

Nonparametric Methods. Parametric methods fit a small number of parameters to a data set and must make assumptions about the underlying distribution of the data. If those assumptions are wrong, the model is flawed, e.g. trying to fit a Gaussian model to multimodal data. Nonparametric methods fit a large number of parameters to a data set; the number of parameters scales with the number of data points, in general one parameter per data point. The main advantage is that we need to make only very weak assumptions about the underlying distribution of the data.

Histograms. Partition $x$ into bins of width $\Delta_i$ and let $n_i$ be the number of the $N$ observations falling in bin $i$. The density estimate is $p_i = \frac{n_i}{N\Delta_i}$, which satisfies $\int p(x)\,\mathrm{d}x = 1$. [Figure: histogram density estimates of the same data for different bin widths.]
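
As a concrete illustration of the formula above, here is a minimal sketch of the histogram density estimate in NumPy, assuming 1-D data and equally spaced bins (the function name and bin count are illustrative choices, not from the slides):

```python
import numpy as np

def histogram_density(x, num_bins=20):
    """Histogram density estimate p_i = n_i / (N * Delta_i) for 1-D data."""
    x = np.asarray(x)
    N = len(x)
    edges = np.linspace(x.min(), x.max(), num_bins + 1)
    counts, _ = np.histogram(x, bins=edges)   # n_i: points per bin
    widths = np.diff(edges)                   # Delta_i: bin widths
    return edges, counts / (N * widths)       # p_i

# The estimate integrates to one: sum_i p_i * Delta_i = 1.
x = np.random.default_rng(0).normal(size=1000)
edges, p = histogram_density(x)
print(np.sum(p * np.diff(edges)))             # ~1.0
```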

Histograms. Discontinuities at bin boundaries. Curse of dimensionality: if we divide each variable in a $D$-dimensional space into $M$ bins, we get $M^D$ bins (e.g. $M = 10$ bins per dimension with $D = 10$ variables already gives $10^{10}$ bins). Locality is important (the density is defined via evaluation of points in a local neighborhood). Smoothing is controlled by the bin width (equivalently, the bin count); we want neither too much nor too little smoothing. This relates to model complexity and regularization in parametric modeling.

Kernel density estimators. Assume a Euclidean space and observations drawn from an unknown density $p(\mathbf{x})$. Consider a small region $\mathcal{R}$ containing $\mathbf{x}$. The probability mass of the region is $P = \int_{\mathcal{R}} p(\mathbf{x})\,\mathrm{d}\mathbf{x}$. The number $K$ of the $N$ points that fall inside $\mathcal{R}$ is binomially distributed: $\mathrm{Bin}(K \mid N, P) = \frac{N!}{K!(N-K)!} P^K (1-P)^{N-K}$. From Chapter 2.1 on binary variables: $p(x = 1 \mid \mu) = \mu$, $\mathrm{Bin}(m \mid N, \mu) = \binom{N}{m} \mu^m (1-\mu)^{N-m}$ with $\binom{N}{m} = \frac{N!}{(N-m)!\,m!}$, $\mathbb{E}[m] = \sum_{m=0}^{N} m\,\mathrm{Bin}(m \mid N, \mu) = N\mu$, and $\mathrm{var}[m] = \sum_{m=0}^{N} (m - \mathbb{E}[m])^2\,\mathrm{Bin}(m \mid N, \mu) = N\mu(1-\mu)$.

Kernel density estimators. Thus we see that the mean fraction of points falling into the region is $\mathbb{E}[K/N] = P$, and the variance is $\mathrm{var}[K/N] = P(1-P)/N$. For large $N$ the distribution is sharply peaked around the mean, so $K \simeq NP$. If the region $\mathcal{R}$ is sufficiently small that the density $p(\mathbf{x})$ is roughly constant over it, then $P \simeq p(\mathbf{x})V$, where $V$ is the volume of $\mathcal{R}$. Combining these gives $p(\mathbf{x}) = \frac{K}{NV}$. This rests on contradictory assumptions: $\mathcal{R}$ must be small enough that the density is constant over the region, yet large enough that the number $K$ of points falling inside it gives a sharply peaked distribution. If we fix $K$ and determine $V$, we get K nearest neighbours; if we fix $V$ and determine $K$, we get the kernel density estimator.
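
A quick numerical check of the relation $p(\mathbf{x}) \simeq K/(NV)$, assuming 1-D standard-normal data and an arbitrarily chosen small interval as the region $\mathcal{R}$ (all names and constants here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
data = rng.normal(size=N)                     # samples from the "unknown" density p(x)

x, half_width = 0.5, 0.05                     # region R = [x - 0.05, x + 0.05]
V = 2 * half_width                            # volume (length) of R
K = np.sum(np.abs(data - x) <= half_width)    # points falling inside R

estimate = K / (N * V)                        # p(x) ~= K / (N V)
truth = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
print(f"K/(NV) = {estimate:.4f}, true density = {truth:.4f}")
```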

Kernel density estimators. We wish to determine the density around $\mathbf{x}$ from a region $\mathcal{R}$ centered on $\mathbf{x}$. We count points with the kernel $k(\mathbf{u}) = 1$ if $|u_i| \le 1/2$ for all $i = 1, \dots, D$, and $k(\mathbf{u}) = 0$ otherwise; $k(\mathbf{u})$ is a kernel function, here called a Parzen window. The total number of data points inside the cube of side $h$ centered on $\mathbf{x}$ is $K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$. Substituting into $p(\mathbf{x}) = \frac{K}{NV}$, where the volume of a hypercube of side $h$ in $D$ dimensions is $V = h^D$, gives $p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{h^D}\, k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$.
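
A minimal sketch of the hypercube Parzen-window estimator defined above, assuming a single query point and a hand-picked bandwidth h (names and values are illustrative, not from the slides):

```python
import numpy as np

def parzen_hypercube(x, data, h):
    """p(x) = (1/N) sum_n (1/h^D) k((x - x_n)/h), with k(u) = 1 iff all |u_i| <= 1/2."""
    N, D = data.shape
    u = (x - data) / h
    K = np.sum(np.all(np.abs(u) <= 0.5, axis=1))   # points inside the cube of side h
    return K / (N * h**D)

data = np.random.default_rng(0).normal(size=(1000, 2))
print(parzen_hypercube(np.zeros(2), data, h=0.5))
```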

Parzen estimator. The model suffers from discontinuities at the hypercube boundaries. Substitute a Gaussian kernel (where $h$ is the standard deviation of the Gaussian components): $p(\mathbf{x}) = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\!\left\{ -\frac{\|\mathbf{x} - \mathbf{x}_n\|^2}{2h^2} \right\}$. As expected, $h$ acts to smooth the estimate, with a tradeoff between noise sensitivity and oversmoothing. Any kernel can be used provided $k(\mathbf{u}) \ge 0$ and $\int k(\mathbf{u})\,\mathrm{d}\mathbf{u} = 1$. There is no computation for the training phase, but we must store the entire data set to evaluate the density. [Figure: Gaussian kernel density estimates for different values of h.]
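
The Gaussian-kernel version can be sketched the same way; again the bandwidth h below is an arbitrary illustrative choice:

```python
import numpy as np

def gaussian_parzen(x, data, h):
    """p(x) = (1/N) sum_n (2 pi h^2)^(-D/2) exp(-||x - x_n||^2 / (2 h^2))."""
    N, D = data.shape
    sq_dist = np.sum((x - data) ** 2, axis=1)
    return np.mean(np.exp(-sq_dist / (2 * h**2))) / (2 * np.pi * h**2) ** (D / 2)

data = np.random.default_rng(0).normal(size=(1000, 2))
print(gaussian_parzen(np.zeros(2), data, h=0.3))
```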

Nearest neighbor methods. One weakness of kernel density estimation is that the kernel width $h$ is fixed for all kernels. Instead, fix $K$ and determine $V$: consider a small sphere centered at $\mathbf{x}$ and allow the sphere to grow until it contains precisely $K$ points. The estimate of the density is given by the same formula, $p(\mathbf{x}) = \frac{K}{NV}$. Here $K$ governs the degree of smoothing. Compare the Parzen estimator (left) to K nearest neighbors (right). [Figure: kernel density estimates and K-nearest-neighbor estimates for several settings of h and K.]
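
A sketch of the corresponding K-nearest-neighbor density estimate: fix K, take V to be the volume of the smallest ball around x containing K points, and apply p(x) = K/(NV). The brute-force distance computation and the choice K = 30 are illustrative assumptions:

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, K):
    N, D = data.shape
    r = np.sort(np.linalg.norm(data - x, axis=1))[K - 1]    # radius enclosing K points
    V = (pi ** (D / 2) / gamma(D / 2 + 1)) * r**D            # volume of a D-ball of radius r
    return K / (N * V)

data = np.random.default_rng(0).normal(size=(2000, 2))
print(knn_density(np.zeros(2), data, K=30))
```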

Classification using KNN. We can do classification by applying KNN density estimation to each class and applying Bayes' theorem. To classify a new point $\mathbf{x}$, draw a sphere containing precisely $K$ points; suppose the sphere has volume $V$ and contains $K_k$ points from class $\mathcal{C}_k$, with $N_k$ points in class $\mathcal{C}_k$ overall. Then the $p(\mathbf{x}) = K/(NV)$ model provides the class-conditional density estimate $p(\mathbf{x} \mid \mathcal{C}_k) = \frac{K_k}{N_k V}$. We also obtain the unconditional density $p(\mathbf{x}) = \frac{K}{NV}$ and the class priors $p(\mathcal{C}_k) = \frac{N_k}{N}$. Applying Bayes' theorem yields $p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_k)\, p(\mathcal{C}_k)}{p(\mathbf{x})} = \frac{K_k}{K}$.
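
A minimal sketch of the resulting classifier: find the K nearest training points to x and report the posterior estimates $K_k/K$. The toy two-class data and all names below are illustrative, not from the slides:

```python
import numpy as np

def knn_classify(x, data, labels, K):
    nearest = np.argsort(np.linalg.norm(data - x, axis=1))[:K]   # K nearest training points
    classes, counts = np.unique(labels[nearest], return_counts=True)
    posteriors = counts / K                                      # K_k / K = p(C_k | x)
    return classes[np.argmax(posteriors)], dict(zip(classes, posteriors))

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
                  rng.normal(+1.0, 1.0, size=(50, 2))])
labels = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([0.2, 0.4]), data, labels, K=5))
```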

Classification using KNN. To minimize the misclassification rate, always choose the class with the largest $K_k / K$. For $K = 1$ this yields a decision boundary composed of hyperplanes that form the perpendicular bisectors of pairs of points from different classes. [Figure: decision boundaries in $(x_1, x_2)$ space; left: K = 3, right: K = 1.]

KNN Example. $K$ acts as a regularizer. Tree-based search can be used to find approximate nearest neighbors, as sketched below. [Figure: oil dataset; left: K = 1, middle: K = 3, right: K = 31.]
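
One way to realize the tree-based search mentioned above is a k-d tree; the sketch below uses SciPy's cKDTree, whose eps parameter permits approximate nearest-neighbor queries (the library choice and parameter values are assumptions, not from the slides):

```python
import numpy as np
from scipy.spatial import cKDTree

data = np.random.default_rng(0).normal(size=(10_000, 2))
tree = cKDTree(data)                                   # build the spatial index once
dists, idx = tree.query(np.zeros(2), k=31, eps=0.1)    # 31 (approximate) nearest neighbors
print(idx[:5], dists[:5])
```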