Introduction to machine learning, pattern recognition and statistical data modelling
Coryn Bailer-Jones

What is machine learning?
- Data interpretation
  - describing the relationship between predictors and responses
  - finding natural groups or classes in data
  - relating observables (or a function thereof) to physical quantities
- Prediction
  - capturing the relationship between inputs and outputs for a set of labelled data, with the goal of predicting outputs for unlabelled data (pattern recognition)
- Learning from data

What is statistics?
- the systematic and quantitative way of making inferences from data
  - data are considered the outcome of a random event
  - variability is expressed by a probability distribution
  - mathematics is used to manipulate probability distributions
- allows us to write down statistical models for the data and solve for the parameters of interest

Many names...
- machine learning
- statistical learning
- pattern recognition
- statistical data modelling
- data mining (although this has other meanings)
- multivariate data analysis
- ...
Parameter estimation from stellar spectra
- learn the mapping from spectra to stellar parameters using labelled examples
- a multidimensional (few hundred dimensions), nonlinear inverse problem

[Figures: Willemsen et al. (2005)]

Galaxy spectral classification
- want to classify galaxies based on the appearance of their observed spectra
- can simulate spectra of known classes
- much variance within these basic classes

[Figures, Tsalmantza et al. (2007), www.astro.princeton.edu. Right: optical synthetic spectra of the 8 basic galaxy types (SDSS filters overlaid). Top right: colour-colour diagram showing the 8 basic types compared to the locus of SDSS galaxies. Above: locus of 10,000 simulated galaxies in which various parameters have been varied.]
Course objectives
- learn the basic concepts of machine learning
- learn the basic tools and methods of machine learning
  - identify appropriate methods
  - interpret results in light of the methods used
  - recognise inherent limitations and assumptions
  - linear and nonlinear methods; methods for high-dimensional data
- become familiar with a freely available package for modelling data (R)

[Figures, Tsalmantza et al. (2007). Right: optical synthetic spectra of the 8 basic galaxy types (SDSS filters overlaid). Top right: simulated Gaia spectra of the 8 basic galaxy types. Above: the 10,000 simulated galaxy spectra projected into the space of the first 3 principal components (these plus the mean explain 99.7% of the variance).]

Lecture schedule

Online material and texts
- http://www.mpia.de/homes/calj/ss2007_mlpr.html
  - viewgraphs
  - R scripts used in lectures
  - bibliography, links to articles
  - links to R tutorials
- recommended book: The Elements of Statistical Learning, Hastie et al. (2001), Springer
The course
- assumed knowledge
  - simple probability theory and statistics (distributions, hypothesis testing, least squares, ...)
  - linear algebra (matrices, eigenvalue problems, ...)
  - elementary calculus
- interact: learn by being active, not passive! question, criticize, etc.
- hands-on: learn a package and play with data

R (S, S-PLUS)
- http://www.r-project.org
- a language and environment for statistical computing and graphics
- open source; runs on Linux, Windows, MacOS
- operations for vectors and matrices
- large number of statistical and machine learning packages
- can link to C, C++ and Fortran code
- good book on using R for statistics: Modern Applied Statistics with S, Venables & Ripley, 2002, Springer

Supervised and unsupervised learning
- Supervised learning
  - for each observed vector (the predictors, inputs, or independent variables), x, there are one or more dependent variables (responses, outputs), y, or two or more classes, C
  - regression problems: the goal is to learn a function, $y = f(x; \theta)$, where $\theta$ is a set of parameters to be inferred from a training set of pre-labelled vectors {x, y}
  - classification problems: the goal is either to define decision boundaries between objects of different classes, or to model the density of the class probabilities over the data, i.e. $P(C = C_1) = f(x; \theta)$, where $\theta$ parametrizes the probability density function (PDF) and is learned from a training set of pre-classified vectors {x, C}
- Unsupervised learning
  - no pre-labelled data or pre-defined dependent variables or classes
  - the goal is to find either natural classes/clusterings in the data, or simpler (e.g. lower-dimensional) variables which explain the data
  - examples: PCA, K-means clustering

A simple problem of data fitting
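The supervised/unsupervised distinction above can be sketched in code. The course's hands-on work uses R; the following is an illustrative Python/NumPy version (the data, seed and variable names are invented for the example): a least-squares fit to labelled pairs {x, y}, followed by a minimal K-means clustering of unlabelled points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Supervised regression: infer theta in y = f(x; theta) from labelled pairs {x, y}
x = rng.uniform(0, 1, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, 50)     # noisy linear relation
X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept
theta, *_ = np.linalg.lstsq(X, y, rcond=None)   # should be close to (1.0, 2.0)

# Unsupervised clustering: find natural groups with no labels (K-means, K = 2)
data = np.vstack([rng.normal(0.0, 0.2, (30, 2)),
                  rng.normal(3.0, 0.2, (30, 2))])
centres = np.array([data[0], data[-1]])         # crude initialization
for _ in range(20):
    # assign each point to its nearest centre, then recompute the centres
    labels = np.argmin(((data[:, None] - centres) ** 2).sum(-1), axis=1)
    centres = np.array([data[labels == k].mean(axis=0) for k in range(2)])
```

Note that the regression step uses the labels y, while K-means never sees any; that is the essential difference between the two modes of learning.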
Learning, generalization and regularization

Learning from data
- see R scripts on web page
- make assumptions about the smoothness of the function (regularization)
- generalization: take into account the variance (errors) in the data
- domain knowledge helps (if it's reliable...)
- this is data interpolation; extrapolation is less constrained

Notation (as used by Hastie et al.)
- input variables are denoted $X$, output variables $Y$; there are $i = 1..N$ observations with $j = 1..p$ dimensions
- upper case refers to generic aspects of variables; if a vector, subscripts access components; specific observed values are written in lower case
- bold is used for vectors (lower case) and matrices (upper case); Hastie et al. do not use bold for p-vectors
- parameters use Greek letters and can, of course, also be vectors

Fitting a model: linear least squares
- $\beta$ is a $p \times 1$ column vector of parameters; $X$ is an $N \times p$ matrix of inputs; $y$ is an $N \times 1$ column vector of outputs
- determine the parameters by minimizing the sum-of-squares error over all N training data:
  $\mathrm{RSS}(\beta) = \sum_{i=1}^{N} (y_i - x_i^T \beta)^2 = (y - X\beta)^T (y - X\beta)$
- this is quadratic in $\beta$, so it always has a minimum; differentiating w.r.t. $\beta$ and setting to zero gives the normal equations $X^T X \beta = X^T y$
- if $X^T X$ (the information matrix) is non-singular, then the unique solution is $\hat{\beta} = (X^T X)^{-1} X^T y$
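The least-squares solution above is easy to check numerically. A minimal Python/NumPy sketch (the simulated data and variable names are our own; the course's scripts are in R), solving the normal equations and comparing against the true parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

N, p = 100, 3
X = rng.normal(size=(N, p))              # N x p matrix of inputs
beta_true = np.array([0.5, -1.0, 2.0])   # p x 1 parameter vector
y = X @ beta_true + rng.normal(0, 0.05, N)

# Normal equations: X^T X beta = X^T y; the solution is unique when the
# information matrix X^T X is non-singular.
info = X.T @ X
beta_hat = np.linalg.solve(info, X.T @ y)  # numerically safer than an explicit inverse
```

In practice one solves the linear system (or uses a QR/SVD-based routine such as np.linalg.lstsq) rather than forming $(X^T X)^{-1}$ explicitly.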
2-class classification: linear decision boundary
- code the two classes as $G_i = 0$ for the green class and $G_i = 1$ for the red class, and fit a linear model $\hat{y} = x^T \hat{\beta}$ by least squares
- the decision boundary is the set of points where $x^T \hat{\beta} = 0.5$

[Figure: Hastie, Tibshirani, Friedman (2001)]

2-class classification: K-nearest neighbours

[Figure: Hastie, Tibshirani, Friedman (2001)]

Comparison
- Linear model
  - makes a very strong assumption about the data, viz. that it is well-approximated by a globally linear function
  - stable but biased
  - learns the relationship between (X, y) and encapsulates it in the parameters $\beta$
- K-nearest neighbours
  - no assumption about the functional form of the relationship (X, y), i.e. it is nonparametric, but it does assume that the function is well-approximated by a locally constant function
  - less stable but less biased
  - no free parameters to learn, so application to new data is relatively slow: a brute-force search for the neighbours takes O(N)

Which solution is optimal?
- if we know nothing about how the data were generated (underlying model, noise), we don't know
- if the data are drawn from two uncorrelated Gaussians, a linear decision boundary is almost optimal
- if the data are drawn from a mixture of multiple distributions, a linear boundary is not optimal (the optimal one is nonlinear, possibly disjoint)
- what is optimal? smallest generalization error; a simple solution (interpretability)
- more complex models permit lower errors on the training data, but we want models to generalize, so we need to control complexity / nonlinearity (regularization)
- with enough training data, wouldn't k-nn be best?
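The k-nearest-neighbours classifier compared above is simple enough to write out in full. A Python/NumPy sketch under the same 0/1 class coding as the linear-model slide (the two Gaussian classes and the function name are invented for the example); note the brute-force O(N) neighbour search mentioned in the comparison:

```python
import numpy as np

def knn_predict(X_train, y_train, X_new, k=15):
    """Classify each row of X_new by majority vote among its k nearest
    training points (brute-force neighbour search, O(N) per query)."""
    preds = []
    for x in X_new:
        d2 = ((X_train - x) ** 2).sum(axis=1)       # squared Euclidean distances
        nearest = np.argsort(d2)[:k]                # indices of the k nearest points
        preds.append(y_train[nearest].mean() > 0.5) # majority vote with 0/1 coding
    return np.array(preds, dtype=int)

rng = np.random.default_rng(2)
# two Gaussian classes, coded G = 0 (green) and G = 1 (red) as in the slides
X0 = rng.normal([0.0, 0.0], 0.5, (50, 2))
X1 = rng.normal([2.0, 2.0], 0.5, (50, 2))
X_train = np.vstack([X0, X1])
y_train = np.array([0] * 50 + [1] * 50)

print(knn_predict(X_train, y_train, np.array([[0.0, 0.0], [2.0, 2.0]])))  # -> [0 1]
```

Because there is nothing to fit, all the cost is deferred to prediction time; a tree structure (e.g. a k-d tree) is the usual remedy for the O(N) search.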
Summary
- supervised vs. unsupervised learning
- regression vs. classification
- parametric vs. non-parametric
- linear regression and k-nearest neighbours; least squares
- generalization and regularization