Machine learning, pattern recognition and statistical data modelling
Lecture 12. The last lecture
Coryn Bailer-Jones

What is machine learning?

Data description and interpretation
- finding simpler relationships between variables (predictors and responses)
- finding natural groups or classes in data
- relating observables to physical quantities

Prediction
- capturing the relationship between inputs and outputs for a set of labelled data, with the goal of predicting outputs for unlabelled data ("pattern recognition")

Learning from data
- dealing with noise
- coping with high dimensions (many potentially relevant variables)
- fitting models to data
- generalizing
Concepts: types of problems

Supervised learning
- predictors (x) and responses (y)
- infer P(y|x), perhaps modelled as f(x; w)
- discrete y is a classification problem; real-valued y is regression

Unsupervised learning
- no distinction between predictors and responses
- infer P(x), or things about it, e.g.
  - no. of modes/classes (mixture modelling, peak finding)
  - low-dimensional projections (descriptions) (PCA, SOM, MDS)
  - outlier detection (discovery)

Concepts: probabilities and Bayes

P(y|x) = P(x|y) P(y) / P(x)

- P(x|y): likelihood of x given y
- P(y|x): posterior probability of y given x
- P(y): prior over y
- P(x): evidence for the model

(A numerical sketch of this rule follows below.)
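As an illustration of the Bayes slide above, here is a minimal numerical sketch of the rule for a two-class problem; the Gaussian likelihoods, the priors and the observed x are invented for the example, not taken from the lecture.

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Gaussian density, used here as the class-conditional likelihood P(x|y)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = 1.2                               # observed predictor (assumed)
prior = {0: 0.7, 1: 0.3}              # prior over y, P(y) (assumed)
like = {0: gaussian(x, 0.0, 1.0),     # P(x|y=0), assumed N(0, 1)
        1: gaussian(x, 2.0, 1.0)}     # P(x|y=1), assumed N(2, 1)

evidence = sum(like[y] * prior[y] for y in prior)              # P(x)
posterior = {y: like[y] * prior[y] / evidence for y in prior}  # P(y|x)
print(posterior)   # posterior class probabilities, summing to 1
```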
Concepts: solution procedure

1. Need some kind of expression for P(y|x) or P(x), e.g. f(x; w) = P(y|x)
2. Parametric, semi-parametric, or non-parametric. E.g. for density estimation and nonlinear regression:
   - parametric: Gaussian distribution P(x), spline f(x)
   - semi-parametric: sum of several Gaussians, additive model, local regression
   - non-parametric: k-nn, kernel estimate
3. Parametric models: fit to data
   a) need to infer the adjustable parameters, w, from the data
   b) generally minimize a loss function on a labelled data set w.r.t. w (see the sketch below)
4. Compare different models

Concepts: objective function
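A minimal sketch of steps 1-3: choose a parametric form f(x; w) and infer w by minimizing a squared-error loss on labelled data. The straight-line model and the synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 0.5 + 2.0 * x + rng.normal(0, 0.1, x.size)   # synthetic labelled data

def f(x, w):
    return w[0] + w[1] * x              # parametric model f(x; w): a line

def loss(w):
    return np.sum((y - f(x, w)) ** 2)   # squared-error loss on the data set

w_fit = minimize(loss, x0=[0.0, 0.0]).x   # minimize the loss w.r.t. w
print(w_fit)                              # should recover roughly [0.5, 2.0]
```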
Loss functions

Models: linear modelling (linear least squares)
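The equations on the linear least squares slide do not survive the extraction; as a stand-in, here is the same fit solved directly, minimizing ||y - Xw||^2 analytically rather than by iterative optimization. The data are again synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 50)
y = 1.0 + 3.0 * x + rng.normal(0, 0.2, x.size)   # synthetic data (assumed)

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept column
w, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min_w ||y - Xw||^2
print(w)                                    # roughly [1.0, 3.0]
```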
Concepts: maximum likelihood (as a loss function)
(see the derivation sketch below)

Concepts: generalization and regularization
- given a specific set of data, we nonetheless want a general solution
- therefore, we must make some kind of assumption(s):
  - smoothness in functions
  - priors on model parameters (or functions, or predictions)
  - restricting the model space
- regularization involves a free parameter, although this can also be inferred from the data
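The "maximum likelihood (as a loss function)" slide above lost its equations in extraction; the standard connection it refers to is that, for i.i.d. Gaussian noise of fixed variance, maximizing the likelihood is the same as minimizing a sum-of-squares loss:

```latex
% Gaussian likelihood of one labelled point (x_i, y_i) under the model f(x; w):
\begin{align}
  P(y_i \mid x_i, w) &= \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left[-\frac{\left(y_i - f(x_i; w)\right)^2}{2\sigma^2}\right] \\
% Negative log-likelihood of the whole (i.i.d.) data set:
  -\ln L(w) &= \frac{1}{2\sigma^2} \sum_i \left[y_i - f(x_i; w)\right]^2
      + \frac{N}{2}\ln\left(2\pi\sigma^2\right)
\end{align}
% For fixed sigma, maximizing L(w) therefore minimizes the sum of squared residuals.
```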
Models: penalized linear modelling (ridge regression)

Models: ridge regression (as regularization)
- the regularization projects the data onto the principal components and downweights ("shrinks") them in inverse proportion to their variance
- this limits the model space
- one free parameter, λ: large λ implies a large degree of regularization, and the effective degrees of freedom df(λ) is small (see the sketch below)
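A minimal ridge-regression sketch on assumed synthetic data: the penalty λ||w||^2 adds λI to the normal equations, and the effective degrees of freedom df(λ) from the slide can be computed from the singular values of X.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.normal(0, 0.3, 40)

lam = 1.0   # the single free regularization parameter, lambda (assumed)
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Effective degrees of freedom df(lambda) = sum_j d_j^2 / (d_j^2 + lambda),
# where d_j are the singular values of X: large lambda -> small df(lambda).
d = np.linalg.svd(X, compute_uv=False)
df = np.sum(d**2 / (d**2 + lam))
print(w_ridge, df)
```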
Models: ridge regression coefficients vs. df(λ) [figure from Hastie, Tibshirani, Friedman (2001)]

Models: splines [figure from Hastie, Tibshirani, Friedman (2001)]
Concepts: regularization (in splines)
- avoid knot selection by using all data points as knots
- avoid overfitting via regularization, that is, minimise a penalized sum-of-squares:
  RSS(f, λ) = Σ_i [y_i - f(x_i)]² + λ ∫ f''(t)² dt

Concepts: regularization (in smoothing splines)
Concepts: regularization (in smoothing splines, continued)

Concepts: regularization in ANNs and SVMs
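Since the smoothing-spline slides above lost their equations and figures, here is a sketch using scipy on synthetic data. Note that scipy's s argument is a target on the residual sum of squares rather than the penalty weight λ in the formula above, but it controls the same smoothness trade-off.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 60)
y = np.sin(x) + rng.normal(0, 0.2, x.size)   # noisy synthetic responses

# Cubic smoothing spline: all points act as candidate knots and the
# smoothing factor s regularizes the fit (larger s -> smoother curve).
spline = UnivariateSpline(x, y, k=3, s=2.0)
y_smooth = spline(x)
```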
Concepts: model comparison and selection

- cross validation: n-fold, leave-one-out, generalized
  - compare and select models using just the training set (see the sketch below)
- account for model complexity plus bias from a finite-sized training set:
  - Bayes Information Criterion: BIC = -2 ln L + k ln N
  - Akaike Information Criterion: AIC = -2 ln L + 2k
  - k is the no. of parameters; N is the no. of training vectors
  - the smallest BIC or AIC corresponds to the optimal model
- Bayesian evidence for model (hypothesis) H, P(D|H)
  - the probability that the data arise from the model, marginalized over all model parameters

Concepts: Occam's razor and Bayesian evidence

- D = data, H = hypothesis (model), w = model parameters
- P(D|H) = ∫ P(D|w, H) P(w|H) dw
- a simpler model, H1, predicts less of the data space
- the evidence naturally penalizes more complex models (after MacKay 1992)
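A sketch of n-fold cross validation for model selection, comparing polynomial degrees on a synthetic training set; the data, fold count and candidate degrees are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 60)
y = 1 - 2 * x + x**2 + rng.normal(0, 0.2, x.size)   # true model is quadratic

def cv_error(degree, n_folds=5):
    """Mean held-out squared error of a polynomial fit, averaged over n folds."""
    folds = np.array_split(rng.permutation(x.size), n_folds)
    errs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        w = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((y[test] - np.polyval(w, x[test])) ** 2))
    return np.mean(errs)

for degree in (1, 2, 5, 9):
    print(degree, cv_error(degree))   # degree 2 should score near-best
```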
Concepts: curse of dimensionality

- to retain a given density of points, the no. of vectors must grow exponentially with the no. of dimensions
- generally we cannot do this
- overcome the curse in various ways by making assumptions: structured regression
  - limit the model space
  - generalized additive models
  - basis functions and kernels

Models: basis expansions

- linear model
- quadratic terms
- higher-order terms
- other transformations, e.g. split the range with an indicator function
- generalized additive models

(A basis-expansion sketch follows below.)
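A basis-expansion sketch: the model stays linear in w while the inputs are expanded into quadratic terms and an indicator function splitting the range, as on the slide. The data and the split point at x = 1 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
x = np.linspace(0, 2, 80)
y = np.where(x < 1, x, 2 - x) + rng.normal(0, 0.1, x.size)  # kinked function

B = np.column_stack([
    np.ones_like(x),         # constant term
    x,                       # linear term
    x**2,                    # quadratic term
    (x > 1).astype(float),   # indicator function splitting the range at x = 1
])
w, *_ = np.linalg.lstsq(B, y, rcond=None)   # still linear least squares in w
```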
Models: MLP neural network basis functions

Models: radial basis function neural networks
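A minimal radial basis function network sketch: with the Gaussian basis centres and widths held fixed (chosen here by assumption), the output weights follow from linear least squares; an MLP would instead adapt sigmoidal basis functions by nonlinear optimization.

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)

centres = np.linspace(0, 1, 10)   # fixed basis centres (assumed)
width = 0.1                       # fixed basis width (assumed)

# Design matrix of Gaussian radial basis functions, one column per centre.
Phi = np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * width**2))
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # linear output weights
y_pred = Phi @ w
```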
Concepts: optimization

With gradient information
- gradient descent (see the sketch below)
- add second-derivative (Hessian) information: Newton, quasi-Newton, Levenberg-Marquardt, conjugate gradients
- pure gradient methods get stuck in local minima:
  - random restart
  - committee/ensemble of models
  - momentum terms (non-gradient info.)

Without gradient information
- expectation-maximization (EM) algorithm
- simulated annealing
- genetic algorithms

Concepts: marginalization (Bayes again)
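To make the optimization slide concrete, a plain gradient-descent sketch on an assumed quadratic loss; random restarts would rerun the loop from several starting points, and a momentum term would mix the previous update into each step.

```python
import numpy as np

def grad(w):
    """Gradient of the assumed loss L(w) = (w0 - 1)^2 + 2 * (w1 + 2)^2."""
    return np.array([2 * (w[0] - 1), 4 * (w[1] + 2)])

w = np.array([5.0, 5.0])   # starting point (assumed)
eta = 0.1                  # learning rate (assumed)
for _ in range(200):
    w = w - eta * grad(w)  # step downhill along the negative gradient
print(w)                   # converges to the minimum at [1, -2]
```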