Last time... Bias-Variance decomposition. This week

Dale McCarthy
5 years ago

Machine learning, pattern recognition and statistical data modelling
Lecture 4. Going nonlinear: basis expansions and splines
Coryn Bailer-Jones

Last time...
- linear regression methods for high dimensional data: ridge regression, lasso, partial least squares, PCA regression
- common feature: use regularization to reduce effective number of parameters (dimensionality)
- linear classification models
  - least squares fit of boundary
  - LDA: effectively assumes equal, Gaussian class covariances
  - QDA: Gaussian covariances
- logistic regression as a way of directly modelling probabilities: Space Shuttle Challenger example
Any questions on this?

This week
- Bias-variance decomposition and trade-off
- Going non-linear: basis expansions
- Splines
  - regression splines
  - smoothing splines: regularization via control of smoothness ("complexity")
  - effective number of degrees of freedom
- Multidimensional splines

Bias-Variance decomposition
- Bias measures the degree to which our estimates typically differ from the truth
- Variance is the extent to which our estimates vary or scatter (e.g. as a result of using slightly different data, small changes in the parameters etc.)
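
The verbal definitions of bias and variance correspond to the standard decomposition of the expected squared prediction error at a point x_0. The following is a sketch under assumptions not stated on the slides: an additive-noise model y = f(x) + epsilon with E[epsilon] = 0 and Var[epsilon] = sigma^2, and an estimate \hat f fitted to a random training set. The symbols are chosen here for illustration.

```latex
\mathrm{E}\!\left[\big(y - \hat f(x_0)\big)^2\right]
  = \underbrace{\sigma^2}_{\text{irreducible noise}}
  + \underbrace{\big(\mathrm{E}[\hat f(x_0)] - f(x_0)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathrm{E}\!\left[\big(\hat f(x_0) - \mathrm{E}[\hat f(x_0)]\big)^2\right]}_{\text{variance}}
```

The expectations are taken over training sets and noise. Only the last two terms depend on the model, which is why reducing one at the expense of the other is a trade-off.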

Bias-variance trade-off (bias and variance)
- local polynomial regression with a variable kernel (fitted in R; see first lecture)
- black line: true function; black points: data set
- blue line: h = 0.02, complex, rough: low bias, high variance
- red line: h = 0.5, simple, smooth: high bias, low variance

Basis expansions
- linear model
- quadratic terms
- higher order terms (see R example)
- other transformations, e.g. split range with an indicator function
As higher order terms are added we need to control the complexity:
- restrict class of functions a priori, e.g. additive models
- adaptive selection of function, e.g. variable selection, only include significant basis functions (CART, MARS, boosting)
- regularization: typically an integral part of nonlinear modelling

Splines (one-dimensional)
- the idea behind splines is that the region to fit is split into subregions which are then fit separately
- data drawn from a nonlinear function (blue line) with Gaussian noise
- region to fit is split into three by the knots
- piecewise fits: 3 constants; 3 linear fits; 3 linear fits forced to be continuous (2 restrictions, hence 2 fewer free parameters)
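
The basis-expansion items (linear, quadratic, higher order, indicator functions) can be written as one model that stays linear in its parameters. The slide formulas were not recovered, so the notation below (h_m, beta_m, L_m, U_m) is the usual textbook notation and is an assumption here.

```latex
f(x) = \sum_{m=1}^{M} \beta_m \, h_m(x),
\qquad
h_m(x) \in \Big\{\, x_j,\;\; x_j^2,\;\; x_j x_k,\;\; I(L_m \le x_k < U_m),\;\; \dots \Big\}
```

Because f is linear in the beta_m, the parameters can still be estimated by least squares once the transformed inputs h_m(x) have been computed.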


Piecewise linear fits

Splines (one-dimensional)
Piecewise polynomial fits can be:
- independent
- continuous
- continuous first derivatives
- continuous second derivatives (cubic spline)
- continuous third derivatives: this is a global cubic polynomial

Piecewise cubic polynomial fits
- subregions are fit separately but not independently

The cubic spline has six basis functions (6D linear space):
(3 regions) x (4 parameters per region) - (2 knots) x (3 constraints per knot) = 6
Generally: no. of basis functions = 4 x (N+1) - 3N = N + 4, where N is the no. of knots
An order-M spline is a piecewise polynomial of order M and has continuous derivatives up to order M-2. A cubic spline has M = 4.
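
The counting argument above can be checked with a small sketch. It uses the truncated power basis, one common way to write down a spline basis, which the slides do not name. The function names and knot values are illustrative only, and the code assumes order >= 2.

```python
def truncated_power_basis(x, knots, order=4):
    """Evaluate an order-`order` truncated power basis at a scalar x.

    Returns the global polynomial terms x**0 .. x**(order-1), followed by
    one truncated term max(x - knot, 0)**(order-1) per knot.
    Assumes order >= 2.
    """
    degree = order - 1
    polynomial_terms = [x ** j for j in range(order)]
    knot_terms = [max(x - k, 0.0) ** degree for k in knots]
    return polynomial_terms + knot_terms


def n_basis_functions(n_knots, order=4):
    """(parameters per region) x (regions) - (constraints per knot) x (knots)."""
    regions = n_knots + 1
    constraints_per_knot = order - 1  # continuity of the value and derivatives up to order-2
    return order * regions - constraints_per_knot * n_knots


knots = [1.0, 2.0]  # two knots, hence three regions (illustrative values)
cubic = truncated_power_basis(1.5, knots)             # order 4: cubic spline
linear = truncated_power_basis(1.5, knots, order=2)   # order 2: continuous piecewise linear
print(len(cubic), n_basis_functions(2))               # 6 6
print(len(linear), n_basis_functions(2, order=2))     # 4 4
```

With two knots this reproduces both counts from the slides: 6 functions for the cubic spline and 4 for the continuous piecewise-linear fit (6 free parameters minus 2 restrictions).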

Cubic splines
- regression spline
  - fixed knots
- natural cubic splines
  - polynomial fits near boundaries often very wild
  - force extrapolation to be linear: frees up 4 d.o.f (i.e. adds two constraints at the two boundaries)
  - increases variance at the boundaries
- must select knots
  - regularization controlled by number of knots and degree of polynomials
  - not very satisfactory...

Smoothing splines
- See R scripts for an example using smooth.spline{stats}
- Avoid knot selection by using all data points as knots
- Avoid overfitting via regularization
- Minimise a penalized sum-of-squares
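
The penalized sum-of-squares itself did not survive extraction. The standard textbook form of the smoothing-spline criterion is given below as a reconstruction, not necessarily the exact expression on the slide; lambda denotes the smoothing parameter.

```latex
\mathrm{RSS}(f, \lambda)
  = \sum_{i=1}^{N} \big\{ y_i - f(x_i) \big\}^2
  + \lambda \int \big\{ f''(t) \big\}^2 \, dt
```

The first term measures closeness to the data and the second penalizes curvature. As lambda tends to 0 any function interpolating the data minimizes it, and as lambda tends to infinity the solution is the least squares straight line.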

- ...it's a generalization of ridge regression (covered last time)
- What is the df of a model linear in the data (x)? Check empirically using R...
- How are df and λ related?

Splines in R
- See R scripts on the web

Regularization and bias-variance with smoothing splines
Properties of the smoother matrix:
- it is an N x N symmetric matrix of rank N
- positive semi-definite, i.e. a^T S a >= 0 for every vector a, where S is the smoother matrix

Bias-variance decomposition
- df too low = too much smoothing: high bias, low variance, function underfit
- df too high = too little smoothing: low bias, high variance, function overfit
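
The question about the df of a model linear in x can be checked numerically. For a linear smoother the effective degrees of freedom is the trace of the matrix that maps the observed y to the fitted values. The sketch below assumes an ordinary least squares straight-line fit with intercept, for which that matrix is the hat matrix with diagonal h_ii = 1/n + (x_i - mean(x))^2 / S_xx. The slides suggest doing this check in R; the function name and data here are illustrative.

```python
def hat_matrix_diagonal(x):
    """Diagonal of the hat matrix for a straight-line fit y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]


x = [0.0, 0.5, 1.0, 2.0, 3.5, 5.0]  # illustrative inputs
leverages = hat_matrix_diagonal(x)
df = sum(leverages)                  # trace of the hat matrix
print(round(df, 10))                 # 2.0
```

The trace is 2 whatever the x values are, matching the two fitted parameters (intercept and slope). A smoothing spline moves continuously between this limit and df = N as the penalty is relaxed.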

Generalized Cross Validation (GCV)
- from the minimum of the prediction error we estimate the optimal value of the smoothing parameter
- generally we don't know the true function so cannot plot the true error
- can estimate it via cross validation (CV) over K data sets
- what is an appropriate value for K? (5-10 is typical)
- K=N is leave-one-out CV, but estimates have high variance because error is estimated on just one data point
- Generalized CV (GCV) is an approximation to leave-one-out CV (LOOCV)
  - i.e. only have to calculate the fit once, not N times
  - GCV replaces the diagonal elements of the smoother matrix by their mean. The trace (= df) is often easier to compute

Spline packages in R
- splint{fields}: regression spline through all points (i.e. exact fit), equivalent to df = no. of data points
- smooth.spline{stats}: smoothing spline with regularization controlled via no. of knots, nknots, and smoothing degree via either spar or df. Set spar=0 to turn off smoothing
- sreg{fields}: smoothing spline with regularization controlled via degrees of freedom, df, or lambda (former preferred)
- Several other packages are also available:
  - spline{stats}
  - bs{splines}, ns{splines}
  - mars{mda}: multidimensional splines (see later lecture)
- see section 8.7 of Venables & Ripley for non-intuitive use of ns() and bs()!

Multidimensional splines
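
The relation between leave-one-out CV and GCV can be illustrated with a toy linear smoother. The sketch assumes the standard shortcut for linear smoothers, in which the leave-one-out residual equals the ordinary residual divided by (1 - S_ii), and uses the global mean as smoother (so S_ii = 1/N) instead of a spline. The function names and data are illustrative.

```python
def loocv_score(y, fitted, s_diag):
    """Leave-one-out CV from a single fit, using the diagonal S_ii of the smoother matrix."""
    n = len(y)
    return sum(((yi - fi) / (1.0 - sii)) ** 2
               for yi, fi, sii in zip(y, fitted, s_diag)) / n


def gcv_score(y, fitted, trace_s):
    """GCV: every S_ii is replaced by the mean diagonal element, trace(S)/N."""
    n = len(y)
    return sum(((yi - fi) / (1.0 - trace_s / n)) ** 2
               for yi, fi in zip(y, fitted)) / n


def brute_force_loocv_mean(y):
    """Refit the global-mean smoother N times, leaving one point out each time."""
    n = len(y)
    total = sum(y)
    return sum((yi - (total - yi) / (n - 1)) ** 2 for yi in y) / n


y = [1.0, 2.0, 4.0, 3.0, 5.0]   # illustrative data
n = len(y)
fitted = [sum(y) / n] * n       # toy smoother: every fitted value is the mean
s_diag = [1.0 / n] * n          # its smoother matrix has S_ii = 1/N

loo = loocv_score(y, fitted, s_diag)
gcv = gcv_score(y, fitted, trace_s=sum(s_diag))
brute = brute_force_loocv_mean(y)
print(loo, gcv, brute)
```

For this smoother all diagonal elements are equal, so GCV and leave-one-out CV coincide and both agree with the N explicit refits. For a smoothing spline the S_ii differ, and GCV is then only an approximation.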


Tensor product basis of B-splines

Multidimensional smoothing splines: thin-plate splines

Multidimensional splines
- Number of basis functions grows exponentially with the dimension: curse of dimensionality!
- Thus need to control
  - order of functions
  - which terms to include
  - what basis set to use
- Ideally do this automatically (later lectures...)

Summary
- Bias-variance trade-off
  - low d.o.f: smooth fit, high bias, low variance, function underfit
  - high d.o.f: complex fit, low bias, high variance, function overfit
- Basis expansions permit nonlinear modelling
  - models can still be linear in the response variables (y) and give linear (analytic) solutions for the parameters
- Smoothing (cubic) splines
  - knots at every data point
  - regularization (smoothing) controlled via effective degrees of freedom, df
  - multidimensional versions exist, but basis set grows exponentially with dimension
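
The exponential growth of a tensor product basis can be seen by building one explicitly. The sketch below uses a cubic polynomial basis as a stand-in for the B-splines named on the slide, an assumption made for brevity, with M = 4 functions per dimension. The function name is illustrative.

```python
from itertools import product


def tensor_product_basis(basis_1d, point):
    """Evaluate every product h_j1(x_1) * ... * h_jd(x_d) at a d-dimensional point."""
    per_dim = [[h(coord) for h in basis_1d] for coord in point]
    values = []
    for combo in product(*per_dim):
        term = 1.0
        for factor in combo:
            term *= factor
        values.append(term)
    return values


# Stand-in 1-D basis (cubic polynomials, not B-splines): M = 4 functions per dimension.
basis_1d = [lambda t: 1.0, lambda t: t, lambda t: t ** 2, lambda t: t ** 3]

counts = {}
for d in (1, 2, 3):
    counts[d] = len(tensor_product_basis(basis_1d, [0.5] * d))
    print(d, counts[d])  # M**d terms: 4, 16, 64
```

With M functions per dimension the full tensor product has M**d terms, which is why the order of the functions and the choice of terms have to be controlled in higher dimensions.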
