

Machine learning, pattern recognition and statistical data modelling
Lecture 3. Linear Methods (part 1)
Coryn Bailer-Jones

Last time...
- curse of dimensionality
  - local methods quickly become nonlocal as p increases
- density estimation
  - non-parametric: histograms, kernel method, k-nn
  - trade-off between number of neighbours and volume size
  - parametric: Gaussian; fitting via maximum likelihood
- Principal Components Analysis
  - Principal Components are the eigenvectors of the covariance matrix
  - are orthonormal
  - ordered set describing directions of maximum variance
  - reduced reconstruction: data compression
  - a linear transformation (coordinate rotation)

This week...
- linear regression
- subset selection
- ridge regression, lasso
- regularization
- SVD
- Hastie et al. ch. 3

Preprocessing the data and feature selection
- preprocessing
  - check and if appropriate remove outliers, errors etc.
  - remove variables which are obviously irrelevant or entirely noise
  - best subset selection (exhaustive for small p)
  - forward/backward subset selection ("greedy" search)
- variable transformation
  - transform to zero mean and unit standard deviation
  - if data are correlated, may want to "whiten" or "sphere" the data
  - take logs if there is a range of scales
  - relevant to both input and output variables
- use domain knowledge to define (new) features
- look at covariance in data
  - highly correlated or redundant features may still be useful
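The two variable transformations listed above (zero mean and unit standard deviation, and whitening/sphering) can be sketched in a few lines. This is a minimal illustration using NumPy on synthetic data, not the lecture's own R scripts; the function names `standardize` and `whiten` and the toy covariance matrix are assumptions made for the example.

```python
import numpy as np

def standardize(X):
    """Shift each column to zero mean and scale it to unit standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=1)          # sample standard deviation
    return (X - mean) / std

def whiten(X):
    """Sphere the data: rotate onto the principal components and rescale
    each one so that the sample covariance becomes the identity matrix."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # covariance is symmetric
    return (Xc @ eigvecs) / np.sqrt(eigvals)

rng = np.random.default_rng(0)
# Synthetic, correlated two-dimensional data (illustrative values only)
X = rng.multivariate_normal(mean=[5.0, -2.0],
                            cov=[[4.0, 1.8], [1.8, 1.0]],
                            size=500)

Z = standardize(X)   # unit variances, but the correlation is still present
W = whiten(X)        # unit variances and zero correlation

print(np.cov(Z, rowvar=False))
print(np.cov(W, rowvar=False))
```

Standardizing leaves the off-diagonal covariance in place, whereas whitening removes it as well, which is the distinction the slide draws for correlated data.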

Feature selection
- class centres are at (-1,-1) and (1,1)
- The data are shown twice in each case with an axis change. In the rotated system the s.d. is reduced by a factor of sqrt(2). Just as good separation with the new x-axis only.
- [Averaging n i.i.d. random variables reduces the standard deviation by a factor of sqrt(n)]
- Guyon & Elisseeff (2003)

Feature selection
- now give high covariance parallel to line joining class centres
- Guyon & Elisseeff (2003)

Sphering ("whitening") the data

Linear regression and linear least squares
- Determine parameters by minimizing the sum-of-squares error on all N training data: $RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 = (y - X\beta)^T (y - X\beta)$
- This is quadratic in $\beta$ so always has a minimum. Differentiate w.r.t. $\beta$: $\partial RSS / \partial \beta = -2 X^T (y - X\beta)$
- If $X^T X$ (the "information matrix") is non-singular then the unique solution is $\hat{\beta} = (X^T X)^{-1} X^T y$
- each row of $X$ includes a 1 for the intercept
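The least-squares solution described above can be checked numerically. The sketch below assumes the standard notation (design matrix `X` with a leading column of ones for the intercept, response `y`, solution of the normal equations X^T X beta = X^T y) and uses made-up synthetic data; it is an illustration, not the R script referred to in the lecture.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 3

# Synthetic inputs and made-up "true" parameters (intercept first)
X_raw = rng.normal(size=(N, p))
beta_true = np.array([0.5, 2.0, -1.0, 0.3])

# Prepend a column of ones so the intercept is fitted like any other parameter
X = np.column_stack([np.ones(N), X_raw])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Normal equations: solve (X^T X) beta = X^T y without forming the inverse
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Cross-check with NumPy's SVD-based least-squares routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the minimum the gradient -2 X^T (y - X beta) vanishes
gradient = -2.0 * X.T @ (y - X @ beta_hat)

print(beta_hat)
print(np.abs(gradient).max())
```

Solving the linear system is generally preferable to computing the inverse of X^T X explicitly, and `np.linalg.lstsq` still returns a (minimum-norm) answer when X^T X is singular or nearly so.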

Linear regression
- See R scripts on web page
- taken from section 1.3 of Venables & Ripley

Model testing: cross validation
- Solve for model parameters using a training data set, but must evaluate performance on an independent set to avoid overtraining
- To determine generalization performance and (in particular) to optimize free parameters we can split/sample available data:
  - train/test sets
  - N-fold cross-validation (N models)
  - train, test and evaluation sets
  - bootstrapping

Linear regression in high dimensions
- Gauss-Markov theorem
  - Least squares estimate of the parameters has the smallest variance among all linear unbiased estimates
  - what are bias and variance?
- may well exist biased estimators which have lower MSE than least squares
- trade a small increase in bias for a large decrease in variance
- achieve this via shrinkage (regularization)

Ridge regression
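The N-fold cross-validation scheme listed above (one model per fold, each evaluated on data it was not trained on) might look as follows for a least-squares fit. This is a sketch on synthetic data; the helper names `fit_least_squares` and `k_fold_mse` are hypothetical, and the number of folds is called `k` here to avoid confusion with the number of training data N.

```python
import numpy as np

def fit_least_squares(X, y):
    """Least-squares parameters for design matrix X and response y."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def k_fold_mse(X, y, k=5, seed=0):
    """Mean squared test error of each of the k models."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(y))      # shuffle before splitting
    folds = np.array_split(indices, k)     # k disjoint test sets
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = fit_least_squares(X[train_idx], y[train_idx])
        residuals = y[test_idx] - X[test_idx] @ beta
        errors.append(np.mean(residuals ** 2))
    return np.array(errors)

rng = np.random.default_rng(2)
N = 100
# Synthetic data with an intercept column; noise variance is 0.25
X = np.column_stack([np.ones(N), rng.normal(size=(N, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=N)

fold_errors = k_fold_mse(X, y, k=5)
print(fold_errors, fold_errors.mean())
```

The average of the fold errors estimates the generalization error. If a free parameter is being tuned, the same loop can be repeated for each candidate value and the one with the lowest average error kept.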

Ridge regression

Singular Value Decomposition (SVD)

Effective number of degrees of freedom in ridge regression
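The equations on these slides did not survive extraction. As a reminder of the standard results they most likely refer to, the block below follows the treatment in Hastie et al. ch. 3. It assumes centred inputs (so the intercept is not penalized), an $N \times p$ input matrix $X$ with full column rank, response vector $y$, and regularization parameter $\lambda \ge 0$; the notation is an assumption, not recovered from the slides.

```latex
\begin{align}
\hat{\beta}^{\mathrm{ridge}}
  &= \arg\min_{\beta} \left\{ (y - X\beta)^T (y - X\beta) + \lambda\, \beta^T \beta \right\}
   = (X^T X + \lambda I)^{-1} X^T y, \\
X &= U D V^T, \qquad D = \mathrm{diag}(d_1, \dots, d_p), \quad d_1 \ge \dots \ge d_p > 0, \\
X \hat{\beta}^{\mathrm{ls}} &= U U^T y, \\
X \hat{\beta}^{\mathrm{ridge}} &= \sum_{j=1}^{p} u_j \, \frac{d_j^2}{d_j^2 + \lambda} \, u_j^T y, \\
\mathrm{df}(\lambda) &= \mathrm{tr}\!\left[ X (X^T X + \lambda I)^{-1} X^T \right]
   = \sum_{j=1}^{p} \frac{d_j^2}{d_j^2 + \lambda}.
\end{align}
```

Here $u_j$ are the columns of $U$. For $\lambda = 0$ the ridge fit reduces to least squares and $\mathrm{df}(\lambda) = p$; as $\lambda \to \infty$, $\mathrm{df}(\lambda) \to 0$.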


A multidimensional data set
- Correlation between level of prostate specific antigen (PSA) and several clinical measures in a set of 97 men about to receive surgery for prostate cancer

Prostate cancer data (scatter plot)
- Objective: predict log(psa) based on 8 parameters: logarithm of volume of tumors (lcavol), logarithm of weight of prostate gland (lweight), age, etc.
- variables gleason and svi are categorical
- Hastie, Tibshirani, Friedman (2001)

Ridge regression: vs.
- Hastie, Tibshirani, Friedman (2001)

Properties of ridge regression
- it projects the data onto its principal components and downweights ("shrinks") low variance components more than high variance ones.
- implicit assumption is, therefore, that the response variable tends to vary more in the directions of highest variance of the inputs.
- has one free parameter ($\lambda$) which controls the extent of shrinkage ("regularization"). This could be set by cross validation
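The shrinkage behaviour described above can be made concrete by computing ridge regression through the SVD of the input matrix. The sketch below uses synthetic data (not the prostate cancer set) and a hypothetical helper `ridge_via_svd`; it assumes centred inputs and response, and the per-component shrinkage factor d_j^2 / (d_j^2 + lambda), where d_j are the singular values.

```python
import numpy as np

def ridge_via_svd(X, y, lam):
    """Ridge coefficients, per-component shrinkage factors and effective
    degrees of freedom, for centred inputs X and centred response y."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)   # d is sorted descending
    shrink = d**2 / (d**2 + lam)                       # one factor per principal component
    beta = Vt.T @ ((d / (d**2 + lam)) * (U.T @ y))     # V (D^2 + lam I)^-1 D U^T y
    return beta, shrink, shrink.sum()

rng = np.random.default_rng(3)
N = 100
# Synthetic inputs whose columns have very different variances
X = rng.normal(size=(N, 4)) * np.array([3.0, 1.0, 0.5, 0.1])
X = X - X.mean(axis=0)
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + rng.normal(scale=0.5, size=N)
y = y - y.mean()

lam = 10.0
beta, shrink, df = ridge_via_svd(X, y, lam)

# Cross-check against the direct formula (X^T X + lam I)^-1 X^T y
beta_direct = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# With lam = 0 there is no shrinkage and df equals the number of inputs
beta0, shrink0, df0 = ridge_via_svd(X, y, 0.0)

print(shrink)   # close to 1 for high-variance components, smaller for low-variance ones
print(df, df0)
```

Because the singular values come back in descending order, the shrinkage factors decrease along the array: the low-variance components are downweighted most, as the slide states.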
