Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING CHRISTOPHER M. BISHOP

Size: px

Start display at page:

Download "Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING CHRISTOPHER M. BISHOP"

Magnus Boyd
5 years ago
Views:

1 Slides modified from: PATTERN RECOGNITION AND MACHINE LEARNING CHRISTOPHER M. BISHOP

2 Linear regression

3 Linear Basis FuncDon Models (1) Example: Polynomial Curve FiLng

4 Linear Basis FuncDon Models (2) Generally Where φ j (x) are known as basis func*ons. Typically, φ 0 (x) = 1, so that w 0 acts as a bias. In the simplest case, we use linear basis funcdons : φ d (x) = x d.

5 Linear Basis FuncDon Models (3) Polynomial basis funcdons: These are global; a small change in x affect all basis funcdons.

6 Linear Basis FuncDon Models (4) Gaussian basis funcdons: These are local; a small change in x only affect nearby basis funcdons. µ j and s control locadon and scale (width).

7 Linear Basis FuncDon Models (5) Sigmoidal basis funcdons: where Also these are local; a small change in x only affect nearby basis funcdons. µ j and s control locadon and scale (slope).

8 Curve FiLng Re-visited

9 Maximum Likelihood and Least Squares (1) Assume observadons from a determinisdc funcdon with added Gaussian noise: where which is the same as saying, Given observed inputs,, and targets,, we obtain the likelihood funcdon

10 Maximum Likelihood and Least Squares (2) Taking the logarithm, we get where is the sum-of-squares error.

11 Sum-of-Squares Error FuncDon

12 Maximum Likelihood and Least Squares (3) CompuDng the gradient and selng it to zero yields 0= Solving for w, we get ( N N ) t n φ(x n ) T w T φ(x n )φ(x n ) T n=1 ( ) n=1 The Moore-Penrose pseudo-inverse,. where

13 Maximum Likelihood and Least Squares (4) Maximizing with respect to the bias, w 0, alone, we see that We can also maximize with respect to β, giving

14 Geometry of Least Squares Consider N-dimensional M-dimensional S is spanned by. w ML minimizes the distance between t and its orthogonal projecdon on S, i.e. y.

15 SequenDal Learning Data items considered one at a Dme (a.k.a. online learning); use stochasdc (sequendal) gradient descent: This is known as the least-mean-squares (LMS) algorithm. Issue: how to choose η?

sum-of-squares error funcdon and a quadradc regularizer,

16 Regularized Least Squares (1) Consider the error funcdon: Data term + RegularizaDon term With the sum-of-squares error funcdon and a quadradc regularizer, we get which is minimized by λ is called the regularizadon coefficient.

17 Regularized Least Squares (2) With a more general regularizer, we have Lasso QuadraDc

18 Regularized Least Squares (3) Lasso tends to generate sparser soludons than a quadradc regularizer.

19 The Loss FuncDon Loss funcdon: Choose esdmate y(x) of value of t for each x in order to minimize total loss E[L].

20 The Squared Loss FuncDon Define E[t x] = Z tp(t x)dt

21 Curve FiLng Re-visited

22 The Bias-Variance DecomposiDon (1) Recall the expected squared loss, where The second term of E[L] corresponds to the noise inherent in the random variable t. What about the first term?

23 The Bias-Variance DecomposiDon (2) Suppose we were given muldple data sets, each of size N. Any pardcular data set, D, will give a pardcular funcdon y(x;d). We then have

24 The Bias-Variance DecomposiDon (3) Taking the expectadon over D yields

25 The Bias-Variance DecomposiDon (4) Thus we can write where

26 The Bias-Variance DecomposiDon (5) Example: 25 data sets from the sinusoidal, varying the degree of regularizadon, λ.

27 The Bias-Variance DecomposiDon (6) Example: 25 data sets from the sinusoidal, varying the degree of regularizadon, λ.

28 The Bias-Variance DecomposiDon (7) Example: 25 data sets from the sinusoidal, varying the degree of regularizadon, λ.

29 The Bias-Variance Trade-off From these plots, we note that an over-regularized model (large λ) will have a high bias, while an underregularized model (small λ) will have a high variance.

FMA901F: Machine Learning Lecture 3: Linear Models for Regression. Cristian Sminchisescu

FMA901F: Machine Learning Lecture 3: Linear Models for Regression Cristian Sminchisescu Machine Learning: Frequentist vs. Bayesian In the frequentist setting, we seek a fixed parameter (vector), with value(s)