Topics in Machine Learning - EE 5359
Model Assessment and Selection
Ioannis D. Schizas
Electrical Engineering Department
University of Texas at Arlington
Training and Generalization
Training stage: utilize training data to learn a model for regression or classification.
Generalization stage: given a new input x, use the estimated mapping f̂ to find the output y (prediction or classification).
Generalization performance (RSS or misclassification error) needs to be assessed, since it can guide the selection of the model or the learning method.
Setting: let Y be the target variable, X a vector of inputs, and f̂(X) a prediction model estimated using a training set T.
Typical choices of the loss function: L(Y, f̂(X)) = (Y − f̂(X))² (squared error) or L(Y, f̂(X)) = |Y − f̂(X)| (absolute error).
Training Error vs. Testing Error
Test (generalization) error for a fixed training set T: Err_T = E[L(Y, f̂(X)) | T].
Expected test error: Err = E[Err_T], averaging over training sets as well.
Training error is the average loss over the training sample: ērr = (1/N) Σ_{i=1}^{N} L(y_i, f̂(x_i)).
[Figure: training error and test error curves versus model complexity, for 100 training sets T with N = 50.]
We need to estimate Err; the training error is not a good performance indicator, since it underestimates the true error.
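A minimal numpy sketch of the gap between training and test error, using polynomial fits of increasing degree; the data-generating function, noise level, and sample sizes are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, sigma=0.3):
    # Synthetic regression problem: y = sin(2*pi*x) + noise (illustrative choice)
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n)
    return x, y

x_tr, y_tr = make_data(50)    # training set, N = 50
x_te, y_te = make_data(2000)  # large held-out set approximates Err_T

for degree in [1, 3, 5, 9]:
    coef = np.polyfit(x_tr, y_tr, degree)                       # fit on training data
    err_train = np.mean((y_tr - np.polyval(coef, x_tr)) ** 2)   # training error (ērr)
    err_test = np.mean((y_te - np.polyval(coef, x_te)) ** 2)    # estimate of the test error
    print(f"degree={degree:2d}  train={err_train:.3f}  test={err_test:.3f}")
```

As the degree grows, the training error keeps falling while the test error eventually rises, which is the optimism discussed later in these slides.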
Quantities in Classification
When the response G is categorical with K possible labels, we model p_k(X) = Pr(G = k | X) and decide Ĝ(X) = argmax_k p̂_k(X).
Typical choices of the loss function: 0-1 loss L(G, Ĝ(X)) = I(G ≠ Ĝ(X)), or the deviance L(G, p̂(X)) = −2 log p̂_G(X).
Test error: Err_T = E[L(G, Ĝ(X)) | T].
Training error: ērr = (1/N) Σ_{i=1}^{N} I(g_i ≠ Ĝ(x_i)) for 0-1 loss, or −(2/N) Σ_{i=1}^{N} log p̂_{g_i}(x_i) for the deviance.
Generic Steps
Models have tuning constants, say α, that adjust complexity and affect the predictor.
Model selection: estimate the performance of different models in order to choose the best one.
Model assessment: after choosing a model, estimate its test error on new data.
A typical data split: train (50%) to fit the models; validation (25%) to estimate prediction error for model selection; test (25%) to assess the generalization error of the selected model.
Bias-Variance Decomposition
Assume that Y = f(X) + ε, where E[ε] = 0 and var(ε) = σ_ε².
Then the expected prediction error at an input point X = x_0 is
Err(x_0) = E[(Y − f̂(x_0))² | X = x_0] = σ_ε² + [E f̂(x_0) − f(x_0)]² + E[f̂(x_0) − E f̂(x_0)]² = σ_ε² + Bias²(f̂(x_0)) + Var(f̂(x_0)).
The first term is the irreducible noise variance.
Typically, a more complex model has lower bias but higher variance.
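A short derivation sketch of the decomposition in the notation above; the cross terms vanish because the test noise ε is independent of f̂(x_0) (which depends only on the training data) and has zero mean.

```latex
\begin{aligned}
\mathrm{Err}(x_0) &= E\big[(Y - \hat f(x_0))^2 \mid X = x_0\big]
                   = E\big[(f(x_0) + \varepsilon - \hat f(x_0))^2\big] \\
&= \sigma_\varepsilon^2 + E\big[(\hat f(x_0) - f(x_0))^2\big] \\
&= \sigma_\varepsilon^2 + \big(E\hat f(x_0) - f(x_0)\big)^2
   + E\big[(\hat f(x_0) - E\hat f(x_0))^2\big] \\
&= \sigma_\varepsilon^2 + \mathrm{Bias}^2\big(\hat f(x_0)\big) + \mathrm{Var}\big(\hat f(x_0)\big).
\end{aligned}
```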
k-NN Fit Example
Using a k-nearest-neighbor fit f̂_k(x), the expected prediction error takes the form
Err(x_0) = σ_ε² + [f(x_0) − (1/k) Σ_{ℓ=1}^{k} f(x_{(ℓ)})]² + σ_ε²/k,
assuming the training inputs x_i are fixed and the randomness is in the responses y_i.
Model complexity is inversely related to k (small k gives a more complex fit).
For small k the fit can potentially adapt better to the underlying f(x), giving low bias.
As k goes up the bias will go up, but the variance σ_ε²/k will decrease due to averaging.
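A small Monte Carlo sketch (numpy only) estimating bias² and variance of a 1-D k-NN fit at a single test point x_0 as k varies; the target function, noise level, grid of k values, and the helper knn_predict are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(4 * x)          # true regression function (illustrative)
sigma_eps = 0.5
x = np.linspace(0, 1, 100)           # fixed training inputs x_i
x0, f0 = 0.5, np.sin(4 * 0.5)        # test point and its true value

def knn_predict(x_train, y_train, x_query, k):
    # Average the responses of the k nearest training inputs.
    idx = np.argsort(np.abs(x_train - x_query))[:k]
    return y_train[idx].mean()

for k in [1, 5, 15, 50]:
    preds = []
    for _ in range(2000):            # redraw the noisy responses y_i each replication
        y = f(x) + sigma_eps * rng.normal(size=x.size)
        preds.append(knn_predict(x, y, x0, k))
    preds = np.array(preds)
    bias2 = (preds.mean() - f0) ** 2
    var = preds.var()
    # Variance should track sigma_eps**2 / k, while the squared bias grows with k.
    print(f"k={k:3d}  bias^2={bias2:.4f}  var={var:.4f}  sigma^2/k={sigma_eps**2 / k:.4f}")
```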
Linear Fit Example
In the linear fit case we have f̂_p(x) = xᵀβ̂, with β̂ obtained by least squares.
Then the error is Err(x_0) = σ_ε² + [f(x_0) − E f̂_p(x_0)]² + ||h(x_0)||² σ_ε².
Here h(x_0) = X(XᵀX)⁻¹x_0 is the vector of linear weights that produce the fit f̂_p(x_0) = h(x_0)ᵀy.
Averaging the error over the training inputs gives (1/N) Σ_{i=1}^{N} Err(x_i) = σ_ε² + (1/N) Σ_{i=1}^{N} [f(x_i) − E f̂(x_i)]² + (p/N) σ_ε².
Model complexity is proportional to p, the number of parameters.
Ridge Regression Bias Breakdown
Let β* denote the parameters of the best-fitting linear approximation to f: β* = argmin_β E[(f(X) − XᵀB)²] with B = β.
For ridge regression the error has the same form as before, except that the linear weights are h(x_0) = X(XᵀX + αI)⁻¹x_0.
The average squared bias can be written as E_{x_0}[f(x_0) − E f̂_α(x_0)]² = E_{x_0}[f(x_0) − x_0ᵀβ*]² + E_{x_0}[x_0ᵀβ* − E(x_0ᵀβ̂_α)]².
The first term is the model bias due to the linear fitting; the second is the estimation bias due to the regularization, which introduces extra bias in exchange for smaller variance.
The Behavior of Bias and Variance
[Figure: behavior of squared bias and variance as model complexity varies, illustrating the trade-off of bias for variance.]
Optimism of the Training Error Rate
Given a training set T = {(x_1, y_1), ..., (x_N, y_N)}, the generalization error is Err_T = E_{X⁰,Y⁰}[L(Y⁰, f̂(X⁰)) | T].
This is for a fixed training set, averaging over a new test point (X⁰, Y⁰).
Averaging over training sets gives Err = E_T[Err_T].
The training error is ērr = (1/N) Σ_{i=1}^{N} L(y_i, f̂(x_i)).
Since the fitting method adapts to the training data, the training error is an optimistic (too low) estimate of Err_T.
A Quantification of the Optimism
Consider the in-sample error Err_in = (1/N) Σ_{i=1}^{N} E_{Y⁰}[L(Y_i⁰, f̂(x_i)) | T], averaged over new responses Y_i⁰ (due to noise) at the given training input points x_i, i = 1, ..., N.
The optimism is defined as op = Err_in − ērr.
op is typically positive, since ērr is biased downward as an estimate of Err_in.
The average optimism over training sets (with the inputs held fixed) is ω = E_y[op].
ω is easier to estimate than op.
More Details
For squared-error and 0-1 loss (among others) one can show that the average optimism is ω = (2/N) Σ_{i=1}^{N} Cov(ŷ_i, y_i).
The harder we fit the data, the higher Cov(ŷ_i, y_i) will be, and thus the larger the optimism.
We have the important relation E_y[Err_in] = E_y[ērr] + (2/N) Σ_{i=1}^{N} Cov(ŷ_i, y_i).
If ŷ_i is obtained by a linear fit with d parameters and Y = f(X) + ε, then Σ_{i=1}^{N} Cov(ŷ_i, y_i) = d·σ_ε², so E_y[Err_in] = E_y[ērr] + 2·(d/N)·σ_ε².
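A quick simulation sketch checking the relation above for a least-squares fit: the average optimism should be close to 2·d·σ_ε²/N. The design matrix, dimensions, and number of replications are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, sigma_eps = 50, 5, 1.0
X = rng.normal(size=(N, d))          # fixed design (training inputs)
beta = rng.normal(size=d)
f_true = X @ beta                    # true regression function at the training inputs

opt = []
for _ in range(5000):
    y = f_true + sigma_eps * rng.normal(size=N)        # training responses
    y_new = f_true + sigma_eps * rng.normal(size=N)    # fresh responses at the same x_i
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]    # linear fit with d parameters
    y_hat = X @ beta_hat
    err_bar = np.mean((y - y_hat) ** 2)                # training error
    err_in = np.mean((y_new - y_hat) ** 2)             # in-sample error (new noise)
    opt.append(err_in - err_bar)

print("simulated average optimism:", np.mean(opt))
print("theoretical 2*d*sigma^2/N :", 2 * d * sigma_eps**2 / N)
```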
Ways to Estimate the Testing Error
Estimate the optimism and add it to the training error ērr, using C_p, AIC, or BIC.
Utilize cross-validation or bootstrap techniques: these form direct estimates of the average test error Err.
In-Sample Error Estimation
The in-sample error estimate (averaged over responses) is Êrr_in = ērr + ω̂.
For a linear fit with d parameters this gives the C_p statistic (error estimate): C_p = ērr + 2·(d/N)·σ̂_ε².
The noise variance estimate σ̂_ε² is obtained from the mean-square error of a low-bias model.
Similarity to Akaike's information criterion (AIC): −2·E[log Pr_θ̂(Y)] ≈ −(2/N)·E[loglik] + 2·(d/N), which holds asymptotically as N → ∞.
Here Pr_θ(Y) is a family of densities for Y (including the true density), θ̂ is the maximum-likelihood estimate of θ, and loglik = Σ_{i=1}^{N} log Pr_θ̂(y_i) is the maximized log-likelihood given the training data.
Akaike's Information Criterion (AIC)
Model selection: select the model that results in the smallest AIC.
Given a set of models f_α(x) indexed by a tuning parameter α, consider the training error ērr(α) and the number of parameters d(α). Then AIC gives an estimate of the test error: AIC(α) = ērr(α) + 2·(d(α)/N)·σ̂_ε².
Select the tuning parameter α̂ that minimizes AIC(α).
The second term is exact for linear models with additive errors and squared-error loss; it holds approximately for linear models fit by maximum likelihood (log-likelihood loss).
The formula does not hold in general for 0-1 loss, but it is still used to determine the tuning parameter.
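A minimal sketch of C_p/AIC-style order selection for polynomial regression, with σ̂_ε² taken from a low-bias (high-order) fit as the previous slide suggests; the data-generating model, degrees considered, and the fit_poly helper are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 100
x = rng.uniform(-1, 1, N)
y = 1 + 2 * x - 1.5 * x**2 + 0.3 * rng.normal(size=N)   # true model is quadratic

def fit_poly(deg):
    # Returns the training error and the parameter count d for a degree-`deg` fit.
    coef = np.polyfit(x, y, deg)
    resid = y - np.polyval(coef, x)
    return np.mean(resid ** 2), deg + 1

# Noise variance estimate from a low-bias (high-order) model, using RSS / (N - d).
sigma2_hat = fit_poly(10)[0] * N / (N - 11)

for deg in range(1, 7):
    err_bar, d = fit_poly(deg)
    aic = err_bar + 2 * d / N * sigma2_hat               # AIC / C_p test-error estimate
    print(f"degree={deg}  train err={err_bar:.4f}  AIC={aic:.4f}")
```

The training error keeps decreasing with the degree, while the AIC estimate bottoms out near the true order.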
An Example
AIC used to select the model order in a phoneme recognition problem, using logistic regression in which the coefficient function is modeled as an expansion in basis functions.
[Figure: AIC, training-error, and test-error curves versus model order.]
Effective Number of Parameters
Consider the outcomes y_1, y_2, ..., y_N stacked in a vector y in R^N, and similarly ŷ for the predictions.
Consider a linear fitting method ŷ = Sy, where S is an N×N matrix depending on the inputs x_i but not on the y_i (e.g., linear regression or quadratically penalized smoothing/ridge methods).
Effective number of parameters: df(S) = trace(S).
If S were a projection matrix onto a space spanned by M basis vectors, then trace(S) = M.
If y arises from the additive-error model Y = f(X) + ε with variance σ_ε², then Σ_{i=1}^{N} Cov(ŷ_i, y_i) = trace(S)·σ_ε², or equivalently df(ŷ) = Σ_{i=1}^{N} Cov(ŷ_i, y_i)/σ_ε².
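A brief sketch computing the effective number of parameters trace(S) for ridge regression, where S = X(XᵀX + αI)⁻¹Xᵀ; the dimensions and the grid of α values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 60, 10
X = rng.normal(size=(N, p))

for alpha in [0.0, 1.0, 10.0, 100.0]:
    # Ridge smoother matrix: y_hat = S y with S = X (X'X + alpha*I)^{-1} X'
    S = X @ np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)
    df = np.trace(S)   # effective number of parameters
    print(f"alpha={alpha:6.1f}  df = trace(S) = {df:.2f}")
# alpha = 0 makes S a projection onto the column space of X, so df = p;
# increasing alpha shrinks the effective number of parameters toward zero.
```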
Bayesian Information Criterion (BIC)
BIC, like AIC, applies in settings where the fitting is formulated as a maximum-likelihood problem: BIC = −2·loglik + (log N)·d.
Under the Gaussian model with known variance σ_ε², the first term equals Σ_i (y_i − f̂(x_i))²/σ_ε² (up to a constant), which gives BIC = (N/σ_ε²)·[ērr + (log N)·(d/N)·σ_ε²].
Hence BIC is proportional to AIC with the factor 2 replaced by log N.
Although they look similar, they are motivated in completely different ways; BIC is derived from a Bayesian approach to model selection.
Consider a set of candidate models M_m, m = 1, ..., M, with parameters θ_m and prior distributions Pr(θ_m | M_m).
We need to find the best model.
Posterior Model Probability
The posterior model probability is Pr(M_m | Z) ∝ Pr(M_m)·Pr(Z | M_m), where Z corresponds to the training data {x_i, y_i}_{i=1}^{N}.
To compare models M_m and M_ℓ we form the ratio Pr(M_m | Z)/Pr(M_ℓ | Z) = [Pr(M_m)/Pr(M_ℓ)]·[Pr(Z | M_m)/Pr(Z | M_ℓ)]; the last factor is the Bayes factor.
If the ratio is greater than one we choose M_m; otherwise we choose M_ℓ.
Typically we assume a uniform model prior, so Pr(M_m) is constant.
Approximating the Likelihood
A Laplace approximation to the integral [Ripley 96] gives log Pr(Z | M_m) ≈ log Pr(Z | θ̂_m, M_m) − (d_m/2)·log N + O(1), where θ̂_m is the maximum-likelihood estimate and d_m is the number of free parameters in model M_m.
Note that the BIC criterion can be obtained as BIC_m = −2·log Pr(Z | θ̂_m, M_m) + d_m·log N ≈ −2·log Pr(Z | M_m).
Minimizing BIC is therefore equivalent to selecting the model with the largest (approximate) posterior probability.
Given BIC_m for each model M_m, the posterior can be estimated as Pr(M_m | Z) ≈ exp(−BIC_m/2) / Σ_{ℓ=1}^{M} exp(−BIC_ℓ/2).
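A sketch of BIC-based order selection for Gaussian polynomial models, followed by the posterior-probability approximation above (uniform model prior); the data-generating model, the degree range, and the parameter count d = deg + 2 (coefficients plus the noise variance) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 80
x = rng.uniform(-1, 1, N)
y = 0.5 + 1.5 * x + 0.4 * rng.normal(size=N)   # true model is linear

def bic_for_degree(deg):
    coef = np.polyfit(x, y, deg)
    resid = y - np.polyval(coef, x)
    sigma2 = np.mean(resid ** 2)                               # ML estimate of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)       # maximized Gaussian log-likelihood
    d = deg + 2                                                # coefficients plus variance parameter
    return -2 * loglik + np.log(N) * d

bics = np.array([bic_for_degree(deg) for deg in range(1, 6)])
# Approximate posterior model probabilities: exp(-BIC_m/2), normalized.
post = np.exp(-0.5 * (bics - bics.min()))
post /= post.sum()
for deg, b, p in zip(range(1, 6), bics, post):
    print(f"degree={deg}  BIC={b:8.2f}  approx posterior={p:.3f}")
```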
BIC or AIC
There is no clear choice between BIC and AIC.
BIC is asymptotically consistent: given a family of models that includes the true model, the probability that BIC selects the right model goes to 1 as the number of training samples N → ∞.
AIC is not asymptotically consistent; it tends to choose models that are too complex as N → ∞.
For finite samples AIC may be a better option than BIC, since BIC tends to choose models that are too simple due to its heavy penalty on complexity.
Minimum Description Length (MDL)
MDL is similar to BIC but is motivated from coding theory for data compression.
Think of a datum z as a message that we want to encode and transmit.
Our selected model is a way of encoding the datum, and we will choose the most parsimonious model (the shortest code).
Consider a model M with parameters θ, and data Z = (X, y) consisting of inputs and outputs, with pdf Pr(y | θ, M, X).
The message length required to transmit the outputs is length = −log Pr(y | θ, M, X) − log Pr(θ | M): the first term transmits the discrepancy between the model and the actual target values, and the second transmits the model parameters.
MDL Example
The MDL principle says we should choose the model that minimizes the message length.
Consider a single target y ~ N(θ, σ²) and model parameter θ ~ N(0, 1).
The message length is then length = constant + log σ + (y − θ)²/(2σ²) + θ²/2.
The smaller σ is, the shorter on average the message length.
Note the similarity to the BIC principle.
Cross-Validation (CV)
CV is the simplest and most widely used method for estimating the EPE.
K-fold CV: due to scarcity of data, split the dataset into K equally sized parts; in turn, K − 1 parts are used to train and the remaining part is used to test. E.g., K = 5.
For the kth part (say k = 3) we fit the model using the other K − 1 parts and calculate the prediction error of the fitted model when predicting the kth part; this is done repeatedly for k = 1, 2, ..., K.
CV Details
Consider the partition mapping κ: {1, ..., N} → {1, ..., K}, indicating to which partition observation i is allocated (by randomization).
Let f̂^(−k)(x) denote the fitted function computed with the kth part of the data removed.
The CV estimate of the EPE is CV(f̂) = (1/N) Σ_{i=1}^{N} L(y_i, f̂^(−κ(i))(x_i)).
Typical choices for K are 5 or 10.
K = N gives leave-one-out cross-validation: κ(i) = i, and the fit for observation i is obtained using all the data but the ith.
Selecting a Model via CV
Consider a set of models f(x, α) indexed by a tuning parameter α.
Let f̂^(−k)(x, α) be the αth model fit with the kth part of the data removed; then CV(f̂, α) = (1/N) Σ_{i=1}^{N} L(y_i, f̂^(−κ(i))(x_i, α)).
Find the tuning parameter α̂ that minimizes this test-error curve.
If K = N, then CV is close to unbiased but has high variance, since all the training sets look very similar; it also has a high computational burden.
How to pick K? It depends on the slope of the learning curve, which is not known in advance.
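A minimal K-fold CV sketch for selecting a polynomial degree, using the partition mapping κ described above; the data-generating function, K, and the candidate degrees are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(6)
N, K = 100, 5
x = rng.uniform(-1, 1, N)
y = np.sin(2 * x) + 0.3 * rng.normal(size=N)

# Random partition mapping kappa: observation i -> fold index in {0, ..., K-1}
kappa = rng.permutation(np.repeat(np.arange(K), N // K))

def cv_error(degree):
    losses = np.empty(N)
    for k in range(K):
        test = (kappa == k)
        coef = np.polyfit(x[~test], y[~test], degree)          # fit with kth part removed
        losses[test] = (y[test] - np.polyval(coef, x[test])) ** 2
    return losses.mean()

cv = {deg: cv_error(deg) for deg in range(1, 9)}
for deg, err in cv.items():
    print(f"degree={deg}  CV error={err:.4f}")
print("selected degree:", min(cv, key=cv.get))
```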
Selecting K and N
If N = 200 and K = 5, there is not much difference between the CV estimate and the actual EPE.
If N = 50 and K = 5, CV will give a significantly higher estimate of the EPE (the learning curve still has a high slope at these training-set sizes).
CV estimates tend to overestimate (are biased upward relative to) the actual EPE.
Generalized CV
GCV is an approximation to leave-one-out CV for linear fitting under squared-error loss.
In linear fitting the prediction takes the form ŷ = Sy.
For many linear fitting methods the leave-one-out CV error is (1/N) Σ_{i=1}^{N} [(y_i − f̂(x_i)) / (1 − S_ii)]².
The GCV approximation replaces each S_ii by its average: GCV(f̂) = (1/N) Σ_{i=1}^{N} [(y_i − f̂(x_i)) / (1 − trace(S)/N)]².
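A sketch comparing the GCV approximation with the exact leave-one-out formula for ridge regression (a linear smoother ŷ = Sy); the problem dimensions, sparsity of the true coefficients, and the α grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
N, p = 80, 15
X = rng.normal(size=(N, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]                 # a few nonzero true coefficients
y = X @ beta + rng.normal(size=N)

for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    S = X @ np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T)   # linear smoother: y_hat = S y
    y_hat = S @ y
    gcv = np.mean(((y - y_hat) / (1 - np.trace(S) / N)) ** 2)   # GCV approximation
    loo = np.mean(((y - y_hat) / (1 - np.diag(S))) ** 2)        # exact LOO-CV for a linear fit
    print(f"alpha={alpha:7.2f}  GCV={gcv:.4f}  LOO={loo:.4f}")
```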
Bootstrap Methods
The bootstrap is a general tool for assessing statistical accuracy and can be used to estimate the EPE.
Training data: Z = (z_1, ..., z_N) with z_i = (x_i, y_i). Draw B bootstrap datasets Z*¹, ..., Z*ᴮ, each of size N, by sampling with replacement from Z.
S(Z): any quantity computed from the data Z.
E.g., the variance of S(Z) can be estimated as Var̂[S(Z)] = (1/(B − 1)) Σ_{b=1}^{B} (S(Z*ᵇ) − S̄*)², a Monte Carlo estimate using the empirical distribution of the data.
Bootstrap for EPE Estimation
For each observation, keep track of the predictions from bootstrap samples not containing that observation (a leave-one-out idea): Êrr⁽¹⁾ = (1/N) Σ_{i=1}^{N} (1/|C⁻ⁱ|) Σ_{b ∈ C⁻ⁱ} L(y_i, f̂*ᵇ(x_i)).
C⁻ⁱ is the set of indices of the bootstrap samples b that do not contain observation i; B should be large enough that no |C⁻ⁱ| is zero.
Êrr⁽¹⁾ suffers from bias (each bootstrap fit uses fewer distinct observations); better performance is obtained with the 0.632 estimator Êrr⁽⁰·⁶³²⁾ = 0.368·ērr + 0.632·Êrr⁽¹⁾.
An improved estimator, 0.632+, uses a data-dependent weight based on the relative overfitting rate: Êrr⁽⁰·⁶³²⁺⁾ = (1 − ŵ)·ērr + ŵ·Êrr⁽¹⁾, with ŵ = 0.632/(1 − 0.368·R̂).
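A sketch of the leave-one-out bootstrap estimate and the 0.632 combination for a fixed model (a cubic polynomial fit here, as an illustrative choice); the data-generating function, B, and the loss (squared error) are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(8)
N, B, degree = 60, 200, 3
x = rng.uniform(-1, 1, N)
y = np.sin(2 * x) + 0.4 * rng.normal(size=N)

coef_full = np.polyfit(x, y, degree)
err_bar = np.mean((y - np.polyval(coef_full, x)) ** 2)       # training error on the full data

loss_sums = np.zeros(N)   # accumulated loss for observation i over samples not containing it
counts = np.zeros(N)      # |C^{-i}|

for b in range(B):
    idx = rng.integers(0, N, N)                # bootstrap sample drawn with replacement
    coef = np.polyfit(x[idx], y[idx], degree)
    out = np.setdiff1d(np.arange(N), idx)      # observations not in this bootstrap sample
    loss_sums[out] += (y[out] - np.polyval(coef, x[out])) ** 2
    counts[out] += 1

valid = counts > 0                              # B should be large enough that all counts > 0
err_loo_boot = np.mean(loss_sums[valid] / counts[valid])     # leave-one-out bootstrap estimate
err_632 = 0.368 * err_bar + 0.632 * err_loo_boot             # 0.632 estimator
print(f"training error     = {err_bar:.4f}")
print(f"leave-one-out boot = {err_loo_boot:.4f}")
print(f"0.632 estimator    = {err_632:.4f}")
```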