Machine Learning Lecture 3: Probability Density Estimation II
Course Outline

Fundamentals (2 weeks): Bayes Decision Theory; Probability Density Estimation; Probability Density Estimation II.
Discriminative Approaches (5 weeks): Linear Discriminant Functions; Support Vector Machines; Ensemble Methods & Boosting; Randomized Trees, Forests & Ferns.
Generative Models (4 weeks): Bayesian Networks; Markov Random Fields.

Bastian Leibe, RWTH Aachen, leibe@vision.rwth-aachen.de
Many slides adapted from B. Schiele

Topics of This Lecture

- Recap: Gaussian (or Normal) Distribution
- Recap: Parametric Methods: Maximum Likelihood approach; Bayesian Learning
- Non-Parametric Methods: Histograms; Kernel density estimation; K-Nearest Neighbors; k-NN for Classification; Bias-Variance tradeoff
- Mixture distributions: Mixture of Gaussians (MoG); Maximum Likelihood estimation attempt

Recap: Gaussian (or Normal) Distribution

One-dimensional case, with mean \mu and variance \sigma^2:

  N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}

Multi-dimensional case, with mean \mu and covariance \Sigma:

  N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}

Recap: Maximum Likelihood Approach

Computation of the likelihood:
- Single data point: p(x_n \mid \theta)
- Assumption: all data points X = \{x_1, \dots, x_N\} are independent, so

  L(\theta) = p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)

- Negative log-likelihood:

  E(\theta) = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n \mid \theta)

Estimation of the parameters \theta (learning): maximize the likelihood, i.e. minimize the negative log-likelihood. Take the derivative and set it to zero:

  \frac{\partial E(\theta)}{\partial \theta} = -\sum_{n=1}^{N} \frac{1}{p(x_n \mid \theta)} \frac{\partial p(x_n \mid \theta)}{\partial \theta} \overset{!}{=} 0

Recap: Bayesian Learning Approach

Bayesian view: consider the parameter vector \theta as a random variable. When estimating the parameters, what we compute is

  p(x \mid X) = \int p(x, \theta \mid X) \, d\theta,  with  p(x, \theta \mid X) = p(x \mid \theta, X) \, p(\theta \mid X)

Assumption: given \theta, x doesn't depend on X anymore; it is entirely determined by the parameter \theta (i.e. by the parametric form of the pdf), so p(x \mid \theta, X) = p(x \mid \theta). Therefore

  p(x \mid X) = \int p(x \mid \theta) \, p(\theta \mid X) \, d\theta

Slide adapted from Bernt Schiele
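To make the ML recap concrete, here is a minimal NumPy sketch (an illustration added to this transcription, not code from the lecture; the function names are mine). Setting the derivative of E(\theta) to zero for a Gaussian yields the closed-form sample mean and sample covariance:

```python
import numpy as np

def fit_gaussian_ml(X):
    """Closed-form ML estimates for a multivariate Gaussian (sketch)."""
    mu = X.mean(axis=0)                      # ML estimate of the mean
    diff = X - mu
    sigma = diff.T @ diff / X.shape[0]       # ML estimate of the covariance (biased)
    return mu, sigma

def gaussian_log_likelihood(X, mu, sigma):
    """Log-likelihood sum_n ln N(x_n | mu, Sigma) of i.i.d. data."""
    D = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(sigma)
    quad = np.einsum('nd,de,ne->n', diff, inv, diff)   # (x-mu)^T Sigma^-1 (x-mu)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.linalg.slogdet(sigma)[1])
    return np.sum(log_norm - 0.5 * quad)

# Example: recover parameters from samples
rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[2.0, 0.3], [0.3, 0.5]], size=1000)
mu_hat, sigma_hat = fit_gaussian_ml(X)
print(mu_hat, gaussian_log_likelihood(X, mu_hat, sigma_hat))
```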
Bayesian Learning Approach: Discussion

Inserting Bayes' theorem for p(\theta \mid X) gives

  p(x \mid X) = \int p(x \mid \theta) \, \frac{L(\theta) \, p(\theta)}{\int L(\theta) \, p(\theta) \, d\theta} \, d\theta

where
- p(x \mid \theta) is the estimate for x based on the parametric form \theta,
- L(\theta) is the likelihood of the parametric form \theta given the data set X,
- p(\theta) is the prior for the parameters \theta,
- the denominator is a normalization: integrate over all possible values of \theta.

If we now plug in a (suitable) prior p(\theta), we can estimate p(x \mid X) from the data set X.

Topics of This Lecture (agenda slide repeated; next up: Non-Parametric Methods)

Non-Parametric Methods

Non-parametric representations: often the functional form of the distribution is unknown. Basic idea: estimate the probability density directly from the data, e.g. with
- Histograms
- Kernel density estimation (Parzen window / Gaussian kernels)
- k-Nearest-Neighbor

Histograms

Partition the data space into distinct bins with widths \Delta_i and count the number of observations, n_i, in each bin. The density estimate inside bin i is then

  p(x) \approx \frac{n_i}{N \Delta_i}

Often, the same width is used for all bins, \Delta_i = \Delta. This can be done, in principle, for any dimensionality D, but the required number of bins grows exponentially with D!

The bin width \Delta acts as a smoothing factor. (Figure: histogram estimates for three bin widths - not smooth enough / about OK / too smooth.)

Summary: Histograms

Properties:
- Very general: in the limit (N \to \infty), every probability density can be represented.
- No need to store the data points once the histogram is computed.
- Rather brute-force.

Problems:
- High-dimensional feature spaces: a D-dimensional space with M bins per dimension requires M^D bins! This demands an exponentially growing number of data points ("curse of dimensionality").
- Discontinuities at bin edges.
- Bin size? Too large: too much smoothing; too small: too much noise.
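The histogram estimator is short enough to state directly in code. A minimal NumPy sketch (my illustration, not lecture code; `histogram_density` is a hypothetical helper) of p(x) ≈ n_i / (N \Delta_i):

```python
import numpy as np

def histogram_density(X, edges):
    """1D histogram density estimate: p(x) ~ n_i / (N * Delta_i)."""
    counts, edges = np.histogram(X, bins=edges)
    widths = np.diff(edges)
    density = counts / (len(X) * widths)     # n_i / (N * Delta_i) per bin

    def p(x):
        # locate the bin containing x (edge cases clipped to the outer bins)
        i = np.clip(np.searchsorted(edges, x, side='right') - 1, 0, len(density) - 1)
        return density[i]

    return density, p

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(0, 1, 500), rng.normal(4, 0.5, 500)])
density, p = histogram_density(X, np.linspace(-4, 7, 30))
print(p(0.0), p(4.0))     # higher density near the two modes
```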
Statistically Better-Founded Approach

Suppose data point x comes from the pdf p(x). The probability that x falls into a small region R is

  P = \int_R p(y) \, dy

- If R is sufficiently small, p(x) is roughly constant over R, so with V the volume of R:

  P = \int_R p(y) \, dy \approx p(x) \, V

- If the number N of samples is sufficiently large, we can estimate P as

  P \approx \frac{K}{N}  \quad\Rightarrow\quad  p(x) \approx \frac{K}{N V}

This suggests two strategies:
- Fix V, determine K: Kernel Methods (example: determine the number K of data points inside a fixed window).
- Fix K, determine V: K-Nearest Neighbor.

Kernel Methods: Parzen Window

Hypercube of dimension D with edge length h. Kernel function:

  k(u) = \begin{cases} 1, & |u_i| \le \tfrac{1}{2}, \; i = 1, \dots, D \\ 0, & \text{else} \end{cases}

Then

  K = \sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right),  \quad  V = h^D \text{ (the volume of the hypercube)}

and the probability density estimate is

  p(x) \approx \frac{K}{N V} = \frac{1}{N h^D} \sum_{n=1}^{N} k\!\left(\frac{x - x_n}{h}\right)

Interpretations:
1. We place a kernel window k at location x and count how many data points fall inside it.
2. We place a kernel window k around each data point x_n and sum up their influences at location x. Direct visualization of the density.

Still, we have artificial discontinuities at the cube boundaries. We can obtain a smoother density model if we choose a smoother kernel function, e.g. a Gaussian.

Kernel Methods: Gaussian Kernel

Kernel function:

  k(u) = \frac{1}{(2\pi h^2)^{1/2}} \exp\left\{ -\frac{u^2}{2h^2} \right\}

with K = \sum_{n=1}^{N} k(x - x_n) and V = \int k(u) \, du = 1. Probability density estimate:

  p(x) \approx \frac{K}{N V} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi)^{D/2} h^D} \exp\left\{ -\frac{\|x - x_n\|^2}{2h^2} \right\}

(Figure: Gaussian kernel estimates for three bandwidths - not smooth enough / about OK / too smooth.) The bandwidth h acts as a smoother.
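Interpretation 2 above (summing kernel influences centred on each data point) translates directly into code. A minimal sketch of the Gaussian kernel density estimate, assuming NumPy and a fixed isotropic bandwidth h (illustration added to this transcription; function name is mine):

```python
import numpy as np

def gaussian_kde(X, h):
    """Kernel density estimate with an isotropic Gaussian kernel of bandwidth h:
    p(x) = 1/N * sum_n exp(-||x - x_n||^2 / (2 h^2)) / ((2 pi)^(D/2) h^D)."""
    N, D = X.shape
    norm = (2 * np.pi) ** (D / 2) * h ** D

    def p(x_query):
        # squared distances between each query point and each training point
        d2 = ((x_query[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2 * h ** 2)).sum(axis=1) / (N * norm)

    return p

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(500, 2))
p = gaussian_kde(X, h=0.3)
print(p(np.array([[0.0, 0.0], [3.0, 3.0]])))   # high density at the mode, low in the tail
```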
Kernel Methods: In General

Any kernel such that \int k(u) \, du = 1 can be used. Then, with

  K = \sum_{n=1}^{N} k(x - x_n),

we get the probability density estimate

  p(x) \approx \frac{K}{N V} = \frac{1}{N} \sum_{n=1}^{N} k(x - x_n)

K-Nearest Neighbor

The complementary strategy: fix K and estimate V from the data. Consider a hypersphere centred on x and let it grow to a volume V^\star that includes K of the given N data points. Then

  p(x) \approx \frac{K}{N V^\star}

Side note: strictly speaking, the model produced by K-NN is not a true density model, because the integral over all space diverges. E.g. consider K = 1 and a query exactly on a data point, x = x_j: then V^\star \to 0 and the estimate blows up.

(Figure: k-NN estimates for three values of K - not smooth enough / about OK / too smooth.) K acts as a smoother.

Summary: Kernel and k-NN Density Estimation

Properties:
- Very general: in the limit (N \to \infty), every probability density can be represented.
- No computation involved in the training phase; simply storage of the training set.

Problems:
- Requires storing and computing with the entire dataset; computational cost is linear in the number of data points. This can be improved, at the expense of some computation during training, by constructing efficient tree-based search structures.
- Kernel size / K in K-NN? Too large: too much smoothing; too small: too much noise.

K-Nearest Neighbor Classification

Bayesian classification:

  p(C_j \mid x) = \frac{p(x \mid C_j) \, p(C_j)}{p(x)}

Here we have the k-NN estimates

  p(x) \approx \frac{K}{N V},  \quad  p(x \mid C_j) \approx \frac{K_j}{N_j V},  \quad  p(C_j) \approx \frac{N_j}{N}

which give the k-Nearest Neighbor classification rule:

  p(C_j \mid x) \approx \frac{K_j}{N_j V} \cdot \frac{N_j}{N} \cdot \frac{N V}{K} = \frac{K_j}{K}
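For comparison with the kernel estimate, here is a brute-force sketch of the k-NN density estimate (my illustration, not lecture code; it assumes the standard volume of a D-ball, V = c_D r^D with c_D = \pi^{D/2} / \Gamma(D/2 + 1), for the hypersphere around x):

```python
from math import gamma, pi
import numpy as np

def knn_density(X, K):
    """k-NN density estimate: grow a hypersphere around x until it contains
    K training points, then p(x) ~ K / (N * V)."""
    N, D = X.shape
    unit_ball = pi ** (D / 2) / gamma(D / 2 + 1)   # volume of the unit D-ball

    def p(x_query):
        d = np.linalg.norm(x_query[:, None, :] - X[None, :, :], axis=-1)
        r = np.sort(d, axis=1)[:, K - 1]           # radius to the K-th nearest neighbor
        V = unit_ball * r ** D
        return K / (N * V)

    return p

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(1000, 2))
p = knn_density(X, K=10)
print(p(np.array([[0.0, 0.0], [3.0, 3.0]])))
```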
K-Nearest Neighbors for Classification

Results on an example data set (figure: decision regions for K = 3 and K = 25): K acts as a smoothing parameter.

Theoretical guarantee: for N \to \infty, the error rate of the 1-NN classifier is never more than twice the optimal error (obtained from the true conditional class distributions).

Bias-Variance Tradeoff

In probability density estimation:
- Histograms: bin size? \Delta too large: too smooth; \Delta too small: not smooth enough.
- Kernel methods: kernel size? h too large: too smooth; h too small: not smooth enough.
- K-Nearest Neighbor: K? K too large: too smooth; K too small: not smooth enough.

Too much smoothing means too much bias; too little smoothing means too much variance. This is a general problem of many probability density estimation methods, including parametric methods and mixture models.

Discussion

The methods discussed so far are all simple and easy to apply, and they are used in many practical applications. However:
- Histograms scale poorly with increasing dimensionality: only suitable for relatively low-dimensional data.
- Both k-NN and kernel density estimation require the entire data set to be stored: too expensive if the data set is large.
- Simple parametric models are very restricted in the forms of distributions they can represent: only suitable if the data has the same general form.

We need density models that are efficient and flexible!

Topics of This Lecture (agenda slide repeated; next up: Mixture distributions)

Mixture Distributions

A single parametric distribution is often not sufficient, e.g. for multimodal data. (Figure: a single Gaussian vs. a mixture of two Gaussians fit to bimodal data.)
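The classification rule p(C_j | x) ≈ K_j / K amounts to a majority vote among the K nearest training points. A small NumPy sketch (illustrative, not lecture code; names are mine):

```python
import numpy as np

def knn_classify(X_train, y_train, X_query, K):
    """K-NN classification: majority vote among the K nearest neighbors,
    i.e. pick the class j maximizing K_j / K."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=-1)
    nearest = np.argsort(d, axis=1)[:, :K]          # indices of K nearest neighbors
    votes = y_train[nearest]                        # their class labels, shape (Q, K)
    return np.array([np.bincount(v).argmax() for v in votes])

rng = np.random.default_rng(0)
X0 = rng.normal([-1, 0], 0.8, size=(100, 2))        # class 0 samples
X1 = rng.normal([+1, 0], 0.8, size=(100, 2))        # class 1 samples
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 100 + [1] * 100)
print(knn_classify(X_train, y_train, np.array([[-1.0, 0.0], [1.0, 0.0]]), K=25))
```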
Mixture of Gaussians (MoG)

Sum of M individual normal distributions:

  p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j) \, p(j)

In the limit, every smooth distribution can be approximated this way (if M is large enough). (Figure: a mixture density f(x) composed of several Gaussian components.)

Notes (one-dimensional case):
- Likelihood of measurement x given mixture component j:

  p(x \mid \theta_j) = N(x \mid \mu_j, \sigma_j^2) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left\{ -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right\}

- Prior of component j: p(j) = \pi_j, with 0 \le \pi_j \le 1 and \sum_{j=1}^{M} \pi_j = 1.
- The mixture density integrates to 1: \int p(x) \, dx = 1.
- The mixture parameters are \theta = (\pi_1, \mu_1, \sigma_1, \dots, \pi_M, \mu_M, \sigma_M).

Slide adapted from Bernt Schiele

Generative model: first pick a component j with probability p(j) = \pi_j (the weight of the mixture component), then draw x from that mixture component p(x \mid \theta_j); the resulting mixture density is p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j) \, p(j).

Mixture of Multivariate Gaussians

  p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j) \, p(j),  \quad  p(x \mid \theta_j) = \frac{1}{(2\pi)^{D/2} |\Sigma_j|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) \right\}

- Mixture weights / mixture coefficients: p(j) = \pi_j, with 0 \le \pi_j \le 1 and \sum_{j=1}^{M} \pi_j = 1 (e.g. for M = 3: p(x \mid \theta) = \sum_{j=1}^{3} \pi_j \, p(x \mid \theta_j)).
- Parameters: \theta = (\pi_1, \mu_1, \Sigma_1, \dots, \pi_M, \mu_M, \Sigma_M).
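Evaluating a one-dimensional MoG is a short computation over components. A minimal sketch (my own, with hypothetical names) of p(x) = \sum_j \pi_j N(x \mid \mu_j, \sigma_j^2):

```python
import numpy as np

def mog_density(x, pis, mus, sigmas):
    """Evaluate a 1D Mixture of Gaussians: p(x) = sum_j pi_j N(x | mu_j, sigma_j^2)."""
    x = np.asarray(x)[..., None]                   # broadcast queries against components
    comps = np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2)) / np.sqrt(2 * np.pi * sigmas ** 2)
    return (pis * comps).sum(axis=-1)

# A bimodal density that no single Gaussian could represent
pis = np.array([0.4, 0.6])
mus = np.array([-2.0, 3.0])
sigmas = np.array([0.5, 1.0])
print(mog_density([-2.0, 0.5, 3.0], pis, mus, sigmas))
```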
Mixture of Gaussians: 1st Estimation Attempt

Maximum Likelihood: minimize

  E = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n \mid \theta)

Let's first look at \mu_j. We can already see that this will be difficult, since the logarithm acts on a sum:

  \ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k) \right\}

Taking the derivative with respect to \mu_j, using \frac{\partial}{\partial \mu_j} N(x_n \mid \mu_j, \Sigma_j) = \Sigma_j^{-1} (x_n - \mu_j) \, N(x_n \mid \mu_j, \Sigma_j), and setting it to zero:

  \frac{\partial E}{\partial \mu_j} = -\sum_{n=1}^{N} \frac{\pi_j N(x_n \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k)} \, \Sigma_j^{-1} (x_n - \mu_j) \overset{!}{=} 0

We thus obtain

  \mu_j = \frac{\sum_{n=1}^{N} \gamma_j(x_n) \, x_n}{\sum_{n=1}^{N} \gamma_j(x_n)},  \quad  \gamma_j(x_n) = \frac{\pi_j N(x_n \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k N(x_n \mid \mu_k, \Sigma_k)}

where \gamma_j(x_n) is the "responsibility" of component j for x_n. This will cause problems!

But: \gamma_j(x_n) is itself a function f(\pi_1, \mu_1, \Sigma_1, \dots, \pi_M, \mu_M, \Sigma_M) of all parameters, i.e. there is no direct analytical solution. The gradient is a complex function with non-linear mutual dependencies: the optimization of one Gaussian depends on all other Gaussians! It is possible to apply iterative numerical optimization here, but in the following, we will see a simpler method.

Mixture of Gaussians: Other Strategy

Observed data: the samples x_n. Unobserved data: which component j each sample came from. This is encoded as a hidden variable with assignments h(j = 1 \mid x_n), h(j = 2 \mid x_n), etc. (1 if x_n came from component j, 0 otherwise).

- Assuming we knew the values of the hidden variable, we could do ML estimation for each Gaussian separately, e.g. for two components:

  \mu_1 = \frac{\sum_{n=1}^{N} h(j = 1 \mid x_n) \, x_n}{\sum_{n=1}^{N} h(j = 1 \mid x_n)},  \quad  \mu_2 = \frac{\sum_{n=1}^{N} h(j = 2 \mid x_n) \, x_n}{\sum_{n=1}^{N} h(j = 2 \mid x_n)}

- Assuming we knew the mixture components p(j = 1 \mid x) and p(j = 2 \mid x), we could assign each sample by the Bayes decision rule: decide j = 1 if p(j = 1 \mid x_n) > p(j = 2 \mid x_n).
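To see why this circular dependency is still workable, here is a sketch of the alternating idea (my illustration, a simplified preview of the EM algorithm the next lecture develops): compute responsibilities from the current parameters, then re-estimate the means from the responsibilities. For simplicity it keeps \pi_j and \sigma_j fixed, which a full implementation would also update:

```python
import numpy as np

def responsibilities(X, pis, mus, sigmas):
    """gamma_j(x_n) = pi_j N(x_n | mu_j, sigma_j^2) / sum_k pi_k N(x_n | mu_k, sigma_k^2)."""
    comps = pis * np.exp(-(X[:, None] - mus) ** 2 / (2 * sigmas ** 2)) \
            / np.sqrt(2 * np.pi * sigmas ** 2)     # shape (N, M)
    return comps / comps.sum(axis=1, keepdims=True)

def update_means(X, gamma):
    """mu_j = sum_n gamma_j(x_n) x_n / sum_n gamma_j(x_n)."""
    return (gamma * X[:, None]).sum(axis=0) / gamma.sum(axis=0)

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 300)])
pis, mus, sigmas = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(20):                                # alternate: responsibilities <-> means
    gamma = responsibilities(X, pis, mus, sigmas)
    mus = update_means(X, gamma)
print(mus)                                         # approaches the true means -2 and 3
```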
Mixture of Gaussians: Other Strategy (cont'd)

Chicken-and-egg problem: which comes first, the assignments or the component parameters? We don't know either of them! In order to break the loop, we need an initial estimate for the assignments j, e.g. by clustering. -> Next lecture.

References and Further Reading

More information in Bishop's book:
- Gaussian distribution and ML: Ch. 1.2.4 and 2.3
- Bayesian Learning: Ch. 1.2.3 and 2.3.6
- Nonparametric methods: Ch. 2.5

Christopher M. Bishop
Pattern Recognition and Machine Learning
Springer, 2006

Additional information can be found in Duda & Hart:
- ML estimation: Ch. 3.2
- Bayesian Learning: Ch. 3.3-3.5
- Nonparametric methods: Ch. 4

R.O. Duda, P.E. Hart, D.G. Stork
Pattern Classification
2nd Ed., Wiley-Interscience