Machine Learning Lecture 3

1 Machine Learning Lecture 3: Probability Density Estimation II. Bastian Leibe, RWTH Aachen, leibe@vision.rwth-aachen.de. Many slides adapted from B. Schiele.

2 Course Outline Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Discriminative Approaches (5 weeks) Linear Discriminant Functions Support Vector Machines Ensemble Methods & Boosting Randomized Trees, Forests & Ferns Generative Models (4 weeks) Bayesian Networks Markov Random Fields 2

3 Topics of This Lecture Recap: Parametric Methods Maximum Likelihood approach Bayesian Learning Non-Parametric Methods Histograms Kernel density estimation K-Nearest Neighbors k-nn for Classification Bias-Variance tradeoff Mixture distributions Mixture of Gaussians (MoG) Maximum Likelihood estimation attempt 3

4 Recap: Gaussian (or Normal) Distribution
One-dimensional case: mean $\mu$, variance $\sigma^2$:
$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}$
Multi-dimensional case: mean $\boldsymbol{\mu}$, covariance $\boldsymbol{\Sigma}$:
$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^\top \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right\}$
Image source: C.M. Bishop, 2006

5 Recap: Maximum Likelihood Approach
Computation of the likelihood: single data point $p(x_n \mid \theta)$. Assumption: all data points $X = \{x_1, \ldots, x_N\}$ are independent, so
$L(\theta) = p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)$
Negative log-likelihood:
$E(\theta) = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n \mid \theta)$
Estimation of the parameters $\theta$ (learning): maximize the likelihood (= minimize the negative log-likelihood); take the derivative and set it to zero:
$\frac{\partial E(\theta)}{\partial \theta} = -\sum_{n=1}^{N} \frac{\partial p(x_n \mid \theta) / \partial \theta}{p(x_n \mid \theta)} \overset{!}{=} 0$
Slide credit: Bernt Schiele
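
As a small illustration of the ML recipe above (my own sketch, not from the slides; all names and the toy data are chosen freely), the following Python snippet evaluates the negative log-likelihood of a 1-D Gaussian and computes the closed-form ML estimates obtained by setting the derivative to zero:

```python
import numpy as np

def neg_log_likelihood(mu, sigma2, x):
    # E(theta) = -sum_n ln N(x_n | mu, sigma^2)
    return 0.5 * np.sum(np.log(2.0 * np.pi * sigma2) + (x - mu) ** 2 / sigma2)

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)

# Setting dE/dmu = 0 and dE/dsigma2 = 0 gives the closed-form ML estimates:
mu_ml = x.mean()                        # (1/N) sum_n x_n
sigma2_ml = np.mean((x - mu_ml) ** 2)   # (1/N) sum_n (x_n - mu_ml)^2

print(mu_ml, sigma2_ml, neg_log_likelihood(mu_ml, sigma2_ml, x))
```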

6 Recap: Bayesian Learning Approach
Bayesian view: consider the parameter vector $\theta$ as a random variable. When estimating the parameters, what we compute is
$p(x \mid X) = \int p(x, \theta \mid X)\, d\theta$ with $p(x, \theta \mid X) = p(x \mid \theta, X)\, p(\theta \mid X)$
Assumption: given $\theta$, $p(x \mid \theta, X) = p(x \mid \theta)$ doesn't depend on $X$ anymore; it is entirely determined by the parameter $\theta$ (i.e. by the parametric form of the pdf). Therefore
$p(x \mid X) = \int p(x \mid \theta)\, p(\theta \mid X)\, d\theta$
Slide adapted from Bernt Schiele

7 Bayesian Learning Approach: Discussion
$p(x \mid X) = \int p(x \mid \theta)\, \frac{L(\theta)\, p(\theta)}{\int L(\theta)\, p(\theta)\, d\theta}\, d\theta$
Here $p(x \mid \theta)$ is the estimate for $x$ based on the parametric form $\theta$, $L(\theta)$ is the likelihood of the parametric form $\theta$ given the data set $X$, $p(\theta)$ is the prior for the parameters $\theta$, and the denominator is the normalization (integrate over all possible values of $\theta$). If we now plug in a (suitable) prior $p(\theta)$, we can estimate $p(x \mid X)$ from the data set $X$.
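
To make the formula concrete, here is a rough numerical illustration (my own sketch, not from the slides) for a 1-D Gaussian with known standard deviation and unknown mean: the posterior over the mean and the predictive density p(x|X) are approximated on a grid; the prior, the grid, and all parameter values are arbitrary choices:

```python
import numpy as np

# Grid-based illustration of p(x|X) = int p(x|mu) L(mu) p(mu) dmu / int L(mu) p(mu) dmu
# for a 1-D Gaussian with known sigma and unknown mean mu.
rng = np.random.default_rng(10)
X = rng.normal(loc=1.0, scale=1.0, size=20)      # observed data
sigma = 1.0

mu_grid = np.linspace(-5, 5, 1001)
d_mu = mu_grid[1] - mu_grid[0]

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

log_L = np.array([np.sum(np.log(gauss(X, mu, sigma))) for mu in mu_grid])   # ln L(mu)
prior = gauss(mu_grid, 0.0, 2.0)                                            # p(mu), an arbitrary choice
posterior = np.exp(log_L - log_L.max()) * prior
posterior /= posterior.sum() * d_mu                                         # p(mu|X), normalized on the grid

x_new = 1.5
p_x_given_X = np.sum(gauss(x_new, mu_grid, sigma) * posterior) * d_mu       # int p(x|mu) p(mu|X) dmu
print(p_x_given_X)
```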

8 Topics of This Lecture Recap: Bayes Decision Theory Parametric Methods Recap: Maximum Likelihood approach Bayesian Learning Non-Parametric Methods Histograms Kernel density estimation K-Nearest Neighbors k-nn for Classification Bias-Variance tradeoff Mixture distributions Mixture of Gaussians (MoG) Maximum Likelihood estimation attempt 8

9 Non-Parametric Methods
Non-parametric representations: often the functional form of the distribution is unknown, so we estimate the probability density directly from the data. Approaches: histograms, kernel density estimation (Parzen window / Gaussian kernels), k-nearest-neighbor. Slide credit: Bernt Schiele

10 Histograms
Basic idea: partition the data space into distinct bins with widths $\Delta_i$ and count the number of observations, $n_i$, in each bin. Often, the same width is used for all bins, $\Delta_i = \Delta$. This can be done, in principle, for any dimensionality $D$, but the required number of bins grows exponentially with $D$! (Figure: histogram of $N = 10$ observations.) Image source: C.M. Bishop, 2006

11 Histograms
The bin width $\Delta$ acts as a smoothing factor. (Figure: three histograms of the same data: bin width too small, not smooth enough; intermediate, about OK; too large, too smooth.) Image source: C.M. Bishop, 2006

12 Summary: Histograms
Properties: very general; in the limit ($N \to \infty$), every probability density can be represented. No need to store the data points once the histogram is computed. Rather brute-force.
Problems: in high-dimensional feature spaces, a $D$-dimensional space with $M$ bins per dimension will require $M^D$ bins! This requires an exponentially growing number of data points (curse of dimensionality). Discontinuities at bin edges. Bin size? Too large: too much smoothing; too small: too much noise. Slide credit: Bernt Schiele
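
For concreteness, a minimal sketch of a histogram density estimate, assuming equal-width bins so that each bin's density is its count divided by N times the bin width (an illustration only; function and variable names are my own):

```python
import numpy as np

def histogram_density(x, num_bins=20):
    """Histogram density estimate: p_i = n_i / (N * delta) for equal-width bins."""
    counts, edges = np.histogram(x, bins=num_bins)
    delta = edges[1] - edges[0]
    p = counts / (len(x) * delta)      # piecewise-constant density that integrates to 1
    return p, edges

x = np.random.default_rng(1).normal(size=500)
p, edges = histogram_density(x, num_bins=15)
print(p.sum() * (edges[1] - edges[0]))  # ~1.0
```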

13 Statistically Better-Founded Approach
Data point $x$ comes from pdf $p(x)$. Probability that $x$ falls into a small region $R$:
$P = \int_R p(y)\, dy$
If $R$ is sufficiently small, $p(x)$ is roughly constant. Let $V$ be the volume of $R$:
$P = \int_R p(y)\, dy \approx p(x)\, V$
If the number $N$ of samples is sufficiently large, we can estimate $P$ as
$P = \frac{K}{N} \;\Rightarrow\; p(x) \approx \frac{K}{NV}$
Slide credit: Bernt Schiele

14 Statistically Better-Founded Approach
$p(x) \approx \frac{K}{NV}$: either fix $V$ and determine $K$ (kernel methods), or fix $K$ and determine $V$ (K-nearest neighbor). Kernel methods example: determine the number $K$ of data points inside a fixed window. Slide credit: Bernt Schiele

15 Kernel Methods: Parzen Window
Kernel function: hypercube of dimension $D$ with edge length $h$:
$k(\mathbf{u}) = \begin{cases} 1, & |u_i| \le \frac{1}{2},\; i = 1, \ldots, D \\ 0, & \text{else} \end{cases}$
$K = \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right), \qquad V = \int k(\mathbf{u})\, d\mathbf{u} = h^D$
Probability density estimate:
$p(\mathbf{x}) \approx \frac{K}{NV} = \frac{1}{N h^D} \sum_{n=1}^{N} k\!\left(\frac{\mathbf{x} - \mathbf{x}_n}{h}\right)$
Slide credit: Bernt Schiele
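
The Parzen-window estimate on this slide can be written down directly; below is an unoptimized sketch (names chosen freely, not code from the course):

```python
import numpy as np

def parzen_window_density(x_query, X, h):
    """p(x) ~= 1/(N h^D) * sum_n k((x - x_n)/h), with k(u) = 1 iff |u_i| <= 1/2 for all i."""
    N, D = X.shape
    u = (x_query - X) / h                       # shape (N, D)
    inside = np.all(np.abs(u) <= 0.5, axis=1)   # which data points fall into the hypercube
    return inside.sum() / (N * h ** D)

X = np.random.default_rng(2).normal(size=(1000, 2))
print(parzen_window_density(np.zeros(2), X, h=0.5))
```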

16 Kernel Methods: Parzen Window
Interpretations: 1. We place a kernel window $k$ at location $\mathbf{x}$ and count how many data points fall inside it. 2. We place a kernel window $k$ around each data point $\mathbf{x}_n$ and sum up their influences at location $\mathbf{x}$: a direct visualization of the density. Still, we have artificial discontinuities at the cube boundaries. We can obtain a smoother density model if we choose a smoother kernel function, e.g. a Gaussian.

17 Kernel Methods: Gaussian Kernel
Kernel function:
$k(\mathbf{u}) = \frac{1}{(2\pi h^2)^{1/2}} \exp\left\{ -\frac{u^2}{2h^2} \right\}$
$K = \sum_{n=1}^{N} k(\mathbf{x} - \mathbf{x}_n), \qquad V = \int k(\mathbf{u})\, d\mathbf{u} = 1$
Probability density estimate:
$p(\mathbf{x}) \approx \frac{K}{NV} = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{(2\pi h^2)^{D/2}} \exp\left\{ -\frac{\|\mathbf{x} - \mathbf{x}_n\|^2}{2h^2} \right\}$
Slide credit: Bernt Schiele
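
The corresponding sketch for the Gaussian kernel estimate, assuming a single shared bandwidth h for all dimensions (again an illustration with freely chosen names):

```python
import numpy as np

def gaussian_kde(x_query, X, h):
    """p(x) ~= (1/N) sum_n N(x | x_n, h^2 I): one Gaussian of width h per data point."""
    N, D = X.shape
    sq_dist = np.sum((x_query - X) ** 2, axis=1)          # ||x - x_n||^2
    norm = (2.0 * np.pi * h ** 2) ** (D / 2.0)
    return np.mean(np.exp(-sq_dist / (2.0 * h ** 2)) / norm)

X = np.random.default_rng(3).normal(size=(1000, 2))
print(gaussian_kde(np.zeros(2), X, h=0.3))
```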

18 Gauss Kernel: Examples
(Figure: Gaussian kernel density estimates for three bandwidths: too small, not smooth enough; intermediate, about OK; too large, too smooth.) $h$ acts as a smoother. Image source: C.M. Bishop, 2006

19 Kernel Methods
In general, any kernel function $k(\mathbf{u})$ such that $k(\mathbf{u}) \ge 0$ and $\int k(\mathbf{u})\, d\mathbf{u} = 1$ can be used. Then
$K = \sum_{n=1}^{N} k(\mathbf{x} - \mathbf{x}_n)$
and we get the probability density estimate
$p(\mathbf{x}) \approx \frac{K}{NV} = \frac{1}{N} \sum_{n=1}^{N} k(\mathbf{x} - \mathbf{x}_n)$
Slide adapted from Bernt Schiele

20 Statistically Better-Founded Approach
$p(x) \approx \frac{K}{NV}$: either fix $V$ and determine $K$ (kernel methods), or fix $K$ and determine $V$ (K-nearest neighbor). K-nearest neighbor: increase the volume $V$ until the $K$ nearest data points are found. Slide credit: Bernt Schiele

21 K-Nearest Neighbor
Nearest-neighbor density estimation: fix $K$, estimate $V$ from the data. Consider a hypersphere centred on $\mathbf{x}$ and let it grow to a volume $V^\star$ that includes $K$ of the given $N$ data points. Then
$p(\mathbf{x}) \approx \frac{K}{N V^\star}$
(Figure: example with $K = 3$.)
Side note: strictly speaking, the model produced by K-NN is not a true density model, because the integral over all space diverges. E.g. consider $K = 1$ and a sample exactly on a data point, $\mathbf{x} = \mathbf{x}_j$.
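
A sketch of the K-NN density estimate: fix K, find the distance to the K-th nearest neighbor, and take V* as the volume of the corresponding hypersphere (the D-ball volume formula and all names are my additions):

```python
import numpy as np
from math import pi, gamma

def knn_density(x_query, X, K):
    """p(x) ~= K / (N * V*), where V* is the volume of the hypersphere reaching the K-th neighbor."""
    N, D = X.shape
    dist = np.sqrt(np.sum((x_query - X) ** 2, axis=1))
    r = np.sort(dist)[K - 1]                              # radius to the K-th nearest neighbor
    volume = pi ** (D / 2.0) / gamma(D / 2.0 + 1.0) * r ** D
    return K / (N * volume)

X = np.random.default_rng(4).normal(size=(1000, 2))
print(knn_density(np.zeros(2), X, K=10))
```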

22 k-Nearest Neighbor: Examples
(Figure: K-NN density estimates for three values of $K$: too small, not smooth enough; intermediate, about OK; too large, too smooth.) $K$ acts as a smoother. Image source: C.M. Bishop, 2006

23 Summary: Kernel and k-NN Density Estimation
Properties: very general; in the limit ($N \to \infty$), every probability density can be represented. No computation involved in the training phase, simply storage of the training set.
Problems: requires storing and computing with the entire dataset; computational cost linear in the number of data points. This can be improved, at the expense of some computation during training, by constructing efficient tree-based search structures. Kernel size / $K$ in K-NN? Too large: too much smoothing; too small: too much noise.

24 K-Nearest Neighbor Classification
Bayesian classification:
$p(\mathcal{C}_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \mathcal{C}_j)\, p(\mathcal{C}_j)}{p(\mathbf{x})}$
Here we have
$p(\mathbf{x}) \approx \frac{K}{NV}, \qquad p(\mathbf{x} \mid \mathcal{C}_j) \approx \frac{K_j}{N_j V}, \qquad p(\mathcal{C}_j) \approx \frac{N_j}{N}$
$\Rightarrow\; p(\mathcal{C}_j \mid \mathbf{x}) \approx \frac{K_j}{N_j V} \cdot \frac{N_j}{N} \cdot \frac{NV}{K} = \frac{K_j}{K}$
k-nearest neighbor classification. Slide credit: Bernt Schiele
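
A minimal sketch of the resulting classification rule: among the K nearest training points, count how many belong to each class and pick the class with the largest K_j (illustrative code with made-up data):

```python
import numpy as np

def knn_classify(x_query, X, y, K):
    """Assign the class with the largest K_j among the K nearest neighbors (p(C_j|x) ~= K_j / K)."""
    dist = np.sum((x_query - X) ** 2, axis=1)
    nearest = np.argsort(dist)[:K]
    labels, counts = np.unique(y[nearest], return_counts=True)
    return labels[np.argmax(counts)], counts / K          # predicted class and p(C_j|x) estimates

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(+1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_classify(np.array([0.8, 0.8]), X, y, K=5))
```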

25 K-Nearest Neighbors for Classification
(Figure: classification results for $K = 3$ and $K = 1$.) Image source: C.M. Bishop, 2006

26 K-Nearest Neighbors for Classification
Results on an example data set: $K$ acts as a smoothing parameter.
Theoretical guarantee: for $N \to \infty$, the error rate of the 1-NN classifier is never more than twice the optimal error (obtained from the true conditional class distributions). Image source: C.M. Bishop, 2006

27 Bias-Variance Tradeoff
Probability density estimation:
Histograms: bin size? $\Delta$ too large: too smooth; $\Delta$ too small: not smooth enough.
Kernel methods: kernel size? $h$ too large: too smooth; $h$ too small: not smooth enough.
K-nearest neighbor: $K$? $K$ too large: too smooth; $K$ too small: not smooth enough.
Too smooth corresponds to too much bias; not smooth enough corresponds to too much variance. This is a general problem of many probability density estimation methods, including parametric methods and mixture models. Slide credit: Bernt Schiele

28 Discussion The methods discussed so far are all simple and easy to apply. They are used in many practical applications. However Histograms scale poorly with increasing dimensionality. Only suitable for relatively low-dimensional data. Both k-nn and kernel density estimation require the entire data set to be stored. Too expensive if the data set is large. Simple parametric models are very restricted in what forms of distributions they can represent. Only suitable if the data has the same general form. We need density models that are efficient and flexible! 28

29 Topics of This Lecture Recap: Bayes Decision Theory Parametric Methods Recap: Maximum Likelihood approach Bayesian Learning Non-Parametric Methods Histograms Kernel density estimation K-Nearest Neighbors k-nn for Classification Bias-Variance tradeoff Mixture distributions Mixture of Gaussians (MoG) Maximum Likelihood estimation attempt 29

30 Mixture Distributions
A single parametric distribution is often not sufficient, e.g. for multimodal data. (Figure: single Gaussian vs. mixture of two Gaussians.) Image source: C.M. Bishop, 2006

31 Mixture of Gaussians (MoG)
Sum of $M$ individual normal distributions:
$p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j)\, p(j)$
In the limit, every smooth distribution can be approximated this way (if $M$ is large enough). Slide credit: Bernt Schiele

32 Mixture of Gaussians: Notes
$p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j)\, p(j)$
Likelihood of measurement $x$ given mixture component $j$:
$p(x \mid \theta_j) = \mathcal{N}(x \mid \mu_j, \sigma_j^2) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left\{ -\frac{(x - \mu_j)^2}{2\sigma_j^2} \right\}$
Prior of component $j$: $p(j) = \pi_j$, with $0 \le \pi_j \le 1$ and $\sum_{j=1}^{M} \pi_j = 1$.
The mixture density integrates to 1: $\int p(x)\, dx = 1$.
The mixture parameters are $\theta = (\pi_1, \mu_1, \sigma_1, \ldots, \pi_M, \mu_M, \sigma_M)$.
Slide adapted from Bernt Schiele

33 Mixture of Gaussians (MoG): Generative Model
Generative model: first pick a component $j$ with probability $p(j) = \pi_j$ (weight of mixture component), then draw $x$ from the mixture component $p(x \mid \theta_j)$; the resulting mixture density is
$p(x \mid \theta) = \sum_{j=1}^{M} p(x \mid \theta_j)\, p(j)$
Slide credit: Bernt Schiele

34 Mixture of Multivariate Gaussians 34 Image source: C.M. Bishop, 2006

35 Mixture of Multivariate Gaussians
$p(\mathbf{x} \mid \theta) = \sum_{j=1}^{M} p(\mathbf{x} \mid \theta_j)\, p(j)$
$p(\mathbf{x} \mid \theta_j) = \frac{1}{(2\pi)^{D/2} |\boldsymbol{\Sigma}_j|^{1/2}} \exp\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_j)^\top \boldsymbol{\Sigma}_j^{-1} (\mathbf{x} - \boldsymbol{\mu}_j) \right\}$
Mixture weights / mixture coefficients: $p(j) = \pi_j$ with $0 \le \pi_j \le 1$ and $\sum_{j=1}^{M} \pi_j = 1$.
Parameters: $\theta = (\pi_1, \boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1, \ldots, \pi_M, \boldsymbol{\mu}_M, \boldsymbol{\Sigma}_M)$
Slide credit: Bernt Schiele. Image source: C.M. Bishop, 2006
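
A sketch of evaluating such a mixture density, with arbitrarily chosen weights, means, and covariances (helper names are my own):

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma)."""
    D = len(mu)
    diff = x - mu
    norm = (2.0 * np.pi) ** (D / 2.0) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

def mog_density(x, pis, mus, Sigmas):
    """p(x|theta) = sum_j pi_j * N(x | mu_j, Sigma_j)."""
    return sum(pi * gaussian_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))

pis = [0.5, 0.5]
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(mog_density(np.array([1.0, 1.0]), pis, mus, Sigmas))
```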

36 Mixture of Multivariate Gaussians: Generative Model
$p(j) = \pi_j$; mixture components $p(\mathbf{x} \mid \theta_1)$, $p(\mathbf{x} \mid \theta_2)$, $p(\mathbf{x} \mid \theta_3)$;
$p(\mathbf{x} \mid \theta) = \sum_{j=1}^{3} \pi_j\, p(\mathbf{x} \mid \theta_j)$
Slide credit: Bernt Schiele. Image source: C.M. Bishop, 2006

37 Mixture of Gaussians: 1st Estimation Attempt
Maximum likelihood: minimize
$E = -\ln L(\theta) = -\sum_{n=1}^{N} \ln p(x_n \mid \theta)$
Let's first look at $\mu_j$: set $\frac{\partial E}{\partial \mu_j} = 0$.
We can already see that this will be difficult, since
$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \right\}$
The sum over components inside the logarithm will cause problems!
Slide adapted from Bernt Schiele
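
To see the structure of this objective, a small sketch that evaluates the mixture log-likelihood for a 1-D two-component model; note the sum over components sitting inside the logarithm (data and parameters are arbitrary, names are my own):

```python
import numpy as np

def mog_log_likelihood(x, pis, mus, sigma2s):
    """ln p(X|theta) = sum_n ln( sum_k pi_k N(x_n | mu_k, sigma_k^2) ) -- sum inside the log."""
    comps = np.stack([pi * np.exp(-(x - mu) ** 2 / (2 * s2)) / np.sqrt(2 * np.pi * s2)
                      for pi, mu, s2 in zip(pis, mus, sigma2s)])
    return np.sum(np.log(comps.sum(axis=0)))

x = np.random.default_rng(9).normal(size=200)
print(mog_log_likelihood(x, pis=[0.5, 0.5], mus=[-1.0, 1.0], sigma2s=[1.0, 2.0]))
```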

38 Mixture of Gaussians: 1st Estimation Attempt
Taking the derivative with respect to $\mu_j$ and setting it to zero:
$\frac{\partial E}{\partial \mu_j} = -\sum_{n=1}^{N} \frac{\pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}\, \Sigma_j^{-1} (x_n - \mu_j) \overset{!}{=} 0$
with
$\gamma_j(x_n) = \frac{\pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}{\sum_{k=1}^{K} \pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}$
the "responsibility" of component $j$ for $x_n$. We thus obtain
$\mu_j = \frac{\sum_{n=1}^{N} \gamma_j(x_n)\, x_n}{\sum_{n=1}^{N} \gamma_j(x_n)}$

39 Mixture of Gaussians: 1st Estimation Attempt
But
$\mu_j = \frac{\sum_{n=1}^{N} \gamma_j(x_n)\, x_n}{\sum_{n=1}^{N} \gamma_j(x_n)}$ with $\gamma_j(x_n) = f(\pi_1, \mu_1, \Sigma_1, \ldots, \pi_M, \mu_M, \Sigma_M)$,
i.e. there is no direct analytical solution: the gradient is a complex function with non-linear mutual dependencies, and the optimization of one Gaussian depends on all other Gaussians! It is possible to apply iterative numerical optimization here, but in the following, we will see a simpler method.
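
To make the mutual dependency concrete, a sketch of computing the responsibilities and the resulting mean update for a 1-D mixture; computing gamma already requires guesses for all parameters, which is exactly the circular dependency (names and data are my own):

```python
import numpy as np

def normal_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

def responsibilities(x, pis, mus, sigma2s):
    """gamma_j(x_n) = pi_j N(x_n|mu_j, sigma_j^2) / sum_k pi_k N(x_n|mu_k, sigma_k^2)."""
    num = np.stack([pi * normal_pdf(x, mu, s2) for pi, mu, s2 in zip(pis, mus, sigma2s)])
    return num / num.sum(axis=0)          # shape (M, N)

def update_means(x, gamma):
    """mu_j = sum_n gamma_j(x_n) x_n / sum_n gamma_j(x_n)."""
    return (gamma * x).sum(axis=1) / gamma.sum(axis=1)

x = np.concatenate([np.random.default_rng(6).normal(0, 1, 100),
                    np.random.default_rng(7).normal(4, 1, 100)])
gamma = responsibilities(x, pis=[0.5, 0.5], mus=[-1.0, 5.0], sigma2s=[1.0, 1.0])
print(update_means(x, gamma))             # depends on the guessed mus: chicken and egg
```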

40 Mixture of Gaussians: Other Strategy
Other strategy: the observed data are the samples $x_n$; the unobserved data are the mixture components they were generated from. Unobserved = "hidden variable": for each sample, binary assignments $h(j = 1 \mid x_n)$ and $h(j = 2 \mid x_n)$ indicate which component it belongs to. Slide credit: Bernt Schiele

41 Mixture of Gaussians: Other Strategy
Assuming we knew the values of the hidden variable $h(j \mid x_n)$ (assumed known), we could do ML for each Gaussian separately:
$\mu_1 = \frac{\sum_{n=1}^{N} h(j = 1 \mid x_n)\, x_n}{\sum_{n=1}^{N} h(j = 1 \mid x_n)}, \qquad \mu_2 = \frac{\sum_{n=1}^{N} h(j = 2 \mid x_n)\, x_n}{\sum_{n=1}^{N} h(j = 2 \mid x_n)}$
Slide credit: Bernt Schiele
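
Conversely, a sketch of the hard-assignment case on this slide: if h(j|x_n) were known, each mean would be a simple weighted average (toy data and names chosen freely):

```python
import numpy as np

rng = np.random.default_rng(8)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(4, 1, 100)])
h1 = np.array([1.0] * 100 + [0.0] * 100)   # assumed-known assignment to Gaussian #1
h2 = 1.0 - h1                              # assignment to Gaussian #2

mu_1 = (h1 * x).sum() / h1.sum()           # mu_1 = sum_n h(j=1|x_n) x_n / sum_n h(j=1|x_n)
mu_2 = (h2 * x).sum() / h2.sum()
print(mu_1, mu_2)
```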

42 Mixture of Gaussians: Other Strategy
Assuming we knew the mixture components (assumed known: $p(j = 1 \mid x)$ and $p(j = 2 \mid x)$), we could assign each sample to a component with the Bayes decision rule: decide $j = 1$ if $p(j = 1 \mid x_n) > p(j = 2 \mid x_n)$. Slide credit: Bernt Schiele

43 Mixture of Gaussians: Other Strategy
Chicken-and-egg problem: what comes first? We know neither the component assignments nor the component parameters! In order to break the loop, we need an estimate for $j$, e.g. by clustering (next lecture). Slide credit: Bernt Schiele

44 References and Further Reading
More information can be found in Bishop's book (Gaussian distribution and ML, Bayesian learning, nonparametric methods). Additional information can be found in Duda & Hart (ML estimation: Ch. 3.2, Bayesian learning, nonparametric methods).
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd Ed., Wiley-Interscience, 2001.
