Model Assessment and Selection. Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer




Training and Testing Error
[Figure: training data is used to fit a model, giving the training error rate; the model is applied to testing data, giving the testing error rate.]
Good performance on testing data, which is independent of the training data, is what matters most for a model. It serves as the basis for model selection.

Bias and Variance
Understanding how different sources of error lead to bias and variance helps us improve model learning.
Error due to bias: the difference between the expected (average) prediction of the model and the correct value (label). Imagine you could repeat the whole model learning process many times. Because of randomness in the data sets, the learned models will produce a range of predictions.
Error due to variance: the variability of the model's prediction for a given data point, that is, how much the predictions for that point vary across repetitions.

Bias and Variance
The center of the target is a model that predicts the correct values. Repeating the model learning process gives a number of separate hits.
[Figure: four bullseye targets arranged by low/high variance and low/high bias.]

Bias and Variance
We attempt to reduce both the bias and the variance. In practice, however, with imperfect models and finite data, there is a trade-off between minimizing the bias and minimizing the variance.

Bias and Variance Example: Voter Party Registration
Suppose we predict voter party registration from wealth and religiousness.
[Figure: scatter plot of voters with axes Wealth and Religiousness.]

Bias and Variance Example: Voter Party Registration
We apply the k-nearest-neighbor method, because we believe the relationships are nonlinear.
[Figure: 1-NN for training data.]

Bias and Variance Example: Voter Party Registration
Prediction regions for testing data.
[Figure: 1-NN for testing data.]

Bias and Variance Example: Voter Party Registration
The dotted line shows the true decision boundary.
[Figure: 1-NN for testing data.]

Bias and Variance Example: Voter Party Registration
We can try different values of k to investigate. Lighter colors indicate less certainty about the predictions.
Demo: available on the course web server or, optionally, locally as knnbiasvariance2.html

Bias and Variance Example: Voter Party Registration
Increasing k averages more voters into each prediction, which results in smoother prediction curves.
With k = 1, the separation between the two classes is very jagged. Jaggedness and islands are signs of high variance; the bias, however, is not high (the model still achieves some degree of accuracy).
At large k, the variance is small but the bias is high.
When k is around 20, the model matches the true decision boundary quite well: low bias and low variance.

Bias and Variance
For k-NN, the model complexity is low when k is large. As k decreases, the model complexity increases, and it is highest when k = 1.
[Figure: under-fitting at large k, over-fitting at small k.]
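
The link between k and model complexity can be seen in a small experiment. The following is a minimal sketch, not the course demo: the data generator `make_data`, the 20% label noise, and the sample sizes are assumptions made for illustration. It compares training and testing error rates of a k-NN classifier for several values of k.

```python
import random

def knn_predict(train_x, train_y, x0, k):
    """Majority vote among the k training points closest to x0 (labels 0/1)."""
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: (train_x[i][0] - x0[0]) ** 2 + (train_x[i][1] - x0[1]) ** 2,
    )[:k]
    votes = sum(train_y[i] for i in nearest)
    return 1 if 2 * votes > k else 0

def error_rate(train_x, train_y, eval_x, eval_y, k):
    """Fraction of evaluation points that k-NN misclassifies."""
    wrong = sum(knn_predict(train_x, train_y, x, k) != y
                for x, y in zip(eval_x, eval_y))
    return wrong / len(eval_x)

def make_data(n, rng):
    """Hypothetical two-class data: the class depends on x1 + x2,
    and each label is flipped with probability 0.2 (assumed noise level)."""
    xs = [(rng.random(), rng.random()) for _ in range(n)]
    ys = [int((x1 + x2 > 1.0) != (rng.random() < 0.2)) for x1, x2 in xs]
    return xs, ys

rng = random.Random(1)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)

for k in (1, 5, 21, 101):  # odd values of k avoid tied votes
    train_err = error_rate(train_x, train_y, train_x, train_y, k)
    test_err = error_rate(train_x, train_y, test_x, test_y, k)
    print(f"k={k:3d}  training error={train_err:.3f}  testing error={test_err:.3f}")
```

With k = 1 every training point is its own nearest neighbor, so the training error is zero even though the labels are noisy. That is the over-fitting end of the complexity scale.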

Bias-variance trade-off
[Figure only.]

Goals in model building
(1) Model selection: estimate the performance of different models in order to choose the best one.
(2) Model assessment: estimate the prediction error of the chosen model on new data.

Goals in model building
Ideally, we'd like to have enough data to divide into three sets:
Training set: to fit the models
Validation set: to estimate the prediction error of the models, for the purpose of model selection
Test set: to assess the generalization error of the final model
A typical split: [Figure: the data divided into Train, Validation, and Test blocks.]
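
A three-way split can be sketched as below. The function name `three_way_split` and the default proportions (50% train, 25% validation, 25% test) are assumptions for illustration. The referenced text mentions this as one typical choice, and there is no general rule.

```python
import random

def three_way_split(data, frac_train=0.5, frac_val=0.25, seed=0):
    """Shuffle the data once, then cut it into train / validation / test.
    The fractions are illustrative defaults, not a recommendation."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * frac_train)
    n_val = int(len(items) * frac_val)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # whatever remains
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

Shuffling before cutting matters whenever the data are stored in a meaningful order (for example sorted by class or by time of collection).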

Goals in model building
What's the difference between the validation set and the test set?
The validation set is used repeatedly, on all models. Our selection of the model is based on this set.
The test set should be protected and used only once, to obtain an unbiased estimate of the error rate.
When there is insufficient data to split into three parts, the validation step is approximated:
Analytically, e.g. AIC
By sample re-use, e.g. cross-validation

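Bias and Variance
Denote the true relationship between the input data $X$ and the target value $Y$ by
$Y = f(X) + \varepsilon$
where $\varepsilon$ is the error term, normally distributed with mean zero and variance $\sigma_\varepsilon^2$.
We learn a model $\hat f(X)$, and the expected prediction error at a point $X = x_0$ is:
$\mathrm{Err}(x_0) = E[(Y - \hat f(x_0))^2 \mid X = x_0] = \sigma_\varepsilon^2 + [E\hat f(x_0) - f(x_0)]^2 + E[\hat f(x_0) - E\hat f(x_0)]^2$
$\mathrm{Err}(x_0)$ = Irreducible Error + Squared Bias + Variance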

Bias and Variance
$\mathrm{Err}(x_0)$ = Irreducible Error + Squared Bias + Variance
The first term is the noise term, which cannot be reduced no matter how well we estimate $f(x_0)$.
The second term is the squared bias: the amount by which the average of our estimate differs from the true mean.
The third term is the variance: the expected squared deviation of $\hat f(x_0)$ around its mean.
Typically, the more complex we make the model, the lower the (squared) bias but the higher the variance.
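
The three terms can be estimated by simulation, by literally repeating the learning process many times. This is a minimal sketch under stated assumptions: the true function `f`, the noise level `sigma`, the fixed grid of training inputs, and the use of k-NN regression as the learner are all choices made for illustration.

```python
import math
import random

def f(x):
    """Hypothetical true regression function (an assumption for this sketch)."""
    return math.sin(2 * math.pi * x)

def knn_regress(xs, ys, x0, k):
    """Average the responses of the k training inputs closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def decompose(x0, k, sigma=0.3, n=50, reps=2000, seed=0):
    """Monte Carlo estimate of squared bias and variance of the fit at x0."""
    rng = random.Random(seed)
    xs = [(i + 0.5) / n for i in range(n)]  # fixed training inputs
    preds = []
    for _ in range(reps):  # repeat the whole learning process
        ys = [f(x) + rng.gauss(0, sigma) for x in xs]
        preds.append(knn_regress(xs, ys, x0, k))
    mean_pred = sum(preds) / reps
    bias2 = (mean_pred - f(x0)) ** 2
    variance = sum((p - mean_pred) ** 2 for p in preds) / reps
    mse = sum((p - f(x0)) ** 2 for p in preds) / reps  # error of the fit around f(x0)
    return {"noise": sigma ** 2, "bias2": bias2, "variance": variance,
            "mse": mse, "err": sigma ** 2 + bias2 + variance}

for k in (1, 15):
    parts = decompose(0.25, k)
    print(k, {name: round(value, 4) for name, value in parts.items()})
```

The simulated mean squared error of the fit around $f(x_0)$ equals squared bias plus variance. Adding the noise variance gives the expected prediction error for a new response at $x_0$. Raising k from 1 to 15 lowers the variance and raises the squared bias.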

Bias-variance trade-off
K-nearest-neighbor classifier:
The higher the k, the lower the model complexity (the estimate becomes more global, and the space is partitioned into larger patches).
For small k, the estimate can potentially adapt itself better to the underlying function.
As k increases, the variance term decreases and the bias term increases.
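
For reference, the decomposition can be written out explicitly for a k-nearest-neighbor regression fit. This is the form given in the referenced text (Section 7.3). It assumes the training inputs $x_i$ are held fixed, so that the randomness comes only from the $y_i$. The subscript $(\ell)$ indexes the k nearest neighbors of $x_0$.

```latex
\mathrm{Err}(x_0)
  = E\big[(Y - \hat f_k(x_0))^2 \mid X = x_0\big]
  = \sigma_\varepsilon^2
  + \Big[f(x_0) - \frac{1}{k}\sum_{\ell=1}^{k} f(x_{(\ell)})\Big]^2
  + \frac{\sigma_\varepsilon^2}{k}
```

The variance term $\sigma_\varepsilon^2/k$ shrinks as k grows. The bias term typically grows, because the average is taken over neighbors that lie farther from $x_0$.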

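Bias-variance trade-off
For a linear model fit $\hat f_p(x) = x^T\hat\beta$ with p coefficients,
$\mathrm{Err}(x_0) = \sigma_\varepsilon^2 + [f(x_0) - E\hat f_p(x_0)]^2 + \|h(x_0)\|^2 \sigma_\varepsilon^2$
where $h(x_0) = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0$ is the vector of linear weights that produces the fit at $x_0$.
Although $\|h(x_0)\|^2$ depends on $x_0$, its average over the sample values $x_i$ is $p/N$, so the average variance is $(p/N)\,\sigma_\varepsilon^2$.
Model complexity is directly associated with p.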

Bias-variance trade-off
[Figure only.]

Bias-variance trade-off
An example: 80 observations and 20 predictors, uniformly distributed in the hypercube $[0, 1]^{20}$.
Y is 0 if $X_1 \le 1/2$ and 1 if $X_1 > 1/2$, and we apply k-nearest neighbors.
[Figure legend: orange = prediction error, green = squared bias, blue = variance.]

Estimates of Prediction Error
With limited data, we need to approximate the (hidden) testing error as closely as we can, and/or perform model selection.
Two general approaches:
(1) Analytical: the $C_p$ statistic, AIC
(2) Resampling-based: cross-validation

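Estimates of Prediction Error
$C_p$ statistic:
$C_p = \overline{\mathrm{err}} + 2\,\frac{d}{N}\,\hat\sigma_\varepsilon^2$
where
$\overline{\mathrm{err}}$ is the training error,
$d$ is the number of parameters,
$N$ is the number of training instances,
$\hat\sigma_\varepsilon^2$ is an estimate of the noise variance.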

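Estimates of Prediction Error
Akaike Information Criterion (AIC):
When a log-likelihood loss function is used, there is a relationship that holds asymptotically as $N \to \infty$:
$-2\,E[\log \Pr_{\hat\theta}(Y)] \approx -\frac{2}{N}\,E[\mathrm{loglik}] + 2\,\frac{d}{N}$
where $\Pr_\theta(Y)$ is a family of densities for Y, $\hat\theta$ is the maximum-likelihood estimate of $\theta$, and loglik is the maximized log-likelihood
$\mathrm{loglik} = \sum_{i=1}^{N} \log \Pr_{\hat\theta}(y_i)$
For example, for logistic regression, using the binomial log-likelihood, we have:
$\mathrm{AIC} = -\frac{2}{N}\,\mathrm{loglik} + 2\,\frac{d}{N}$
For the Gaussian model, AIC is equivalent to the $C_p$ statistic.
For nonlinear and other complex models, we need to replace $d$ by some measure of model complexity.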
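
As a small worked illustration of the logistic-regression case, the sketch below computes the binomial log-likelihood from fitted probabilities and plugs it into the AIC formula. The labels `y`, the fitted probabilities `p_hat`, and the parameter count `d=2` are made-up values for illustration, not output from a real fit.

```python
import math

def binomial_loglik(y, p_hat):
    """Log-likelihood of 0/1 labels y under fitted probabilities p_hat = Pr(Y = 1)."""
    return sum(math.log(p) if yi == 1 else math.log(1.0 - p)
               for yi, p in zip(y, p_hat))

def aic(loglik, d, n):
    """AIC = -(2/N) * loglik + 2 * d / N, with d the number of parameters."""
    return -2.0 / n * loglik + 2.0 * d / n

# Hypothetical labels and fitted probabilities (assumed, for illustration only).
y = [1, 0, 1, 1, 0, 0, 1, 0]
p_hat = [0.8, 0.3, 0.7, 0.6, 0.2, 0.4, 0.9, 0.1]

loglik = binomial_loglik(y, p_hat)
score = aic(loglik, d=2, n=len(y))
print(f"loglik = {loglik:.4f}, AIC = {score:.4f}")
```

The first term rewards fit (a larger log-likelihood lowers AIC), and the second term charges for each additional parameter.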

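Estimates of Prediction Error
Given a set of models $f_\alpha(x)$ indexed by a tuning parameter $\alpha$, denote by $\overline{\mathrm{err}}(\alpha)$ and $d(\alpha)$ the training error and number of parameters for each model.
The function
$\mathrm{AIC}(\alpha) = \overline{\mathrm{err}}(\alpha) + 2\,\frac{d(\alpha)}{N}\,\hat\sigma_\varepsilon^2$
provides an estimate of the test error curve, and we find the tuning parameter $\hat\alpha$ that minimizes it.
Our final model is $f_{\hat\alpha}(x)$.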
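
The selection step amounts to evaluating the criterion for each candidate and taking the minimizer. In the sketch below, the training errors, parameter counts, `n`, and `sigma2_hat` are invented numbers chosen only to show the mechanics.

```python
def aic_curve(train_err, n_params, n, sigma2_hat):
    """AIC(alpha) = err(alpha) + 2 * d(alpha) / N * sigma2_hat for every alpha."""
    return {alpha: train_err[alpha] + 2.0 * n_params[alpha] / n * sigma2_hat
            for alpha in train_err}

# Hypothetical candidates: training error falls as the parameter count grows.
train_err = {1: 0.40, 2: 0.30, 3: 0.29, 4: 0.285}
n_params = {1: 2, 2: 4, 3: 8, 4: 16}

curve = aic_curve(train_err, n_params, n=100, sigma2_hat=0.25)
alpha_hat = min(curve, key=curve.get)  # tuning parameter with the smallest AIC
print(curve, "-> chosen alpha:", alpha_hat)
```

Training error alone would always pick the most complex model here. The penalty term moves the choice to the point where extra parameters stop paying for themselves.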

AIC in action
[Figure only.]

Cross-validation
The goal is to directly estimate the extra-sample error (the error on an independent test set).
K-fold cross-validation:
Split the data into K roughly equal-sized parts.
For each of the K parts, fit the model to the other K − 1 parts, and calculate the prediction error on the part that is left out.
[Figure: data split into five parts, Train | Train | Validation | Train | Train.]

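Cross-validation
Let $\kappa: \{1, \dots, N\} \to \{1, \dots, K\}$ be an indexing function indicating the partition to which observation $i$ is allocated.
Denote by $\hat f^{-k}(x)$ the fitted function, computed with the $k$th part of the data removed.
The cross-validation estimate of prediction error:
$\mathrm{CV}(\hat f) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat f^{-\kappa(i)}(x_i)\big)$
Typical choices of $K$ are 5 or 10.
The case $K = N$ is known as leave-one-out cross-validation: $\kappa(i) = i$, and the prediction for the $i$th observation comes from a fit that uses all the data except the $i$th.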
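
The definition translates almost line by line into code. The sketch below is one possible implementation, assuming squared-error loss and a one-predictor least-squares line as the model; `make_folds`, `fit_line`, and `cross_validation_error` are names introduced here for illustration.

```python
import random

def make_folds(n, k, seed=0):
    """Indexing function kappa: kappa[i] is the fold (0..k-1) of observation i."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    kappa = [0] * n
    for position, i in enumerate(order):
        kappa[i] = position % k  # deal observations out to folds in turn
    return kappa

def fit_line(xs, ys):
    """Least-squares line through (xs, ys); returns a prediction function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def cross_validation_error(xs, ys, k, fit=fit_line, seed=0):
    """CV estimate: average squared error of each point under the model
    fitted with that point's fold removed."""
    n = len(xs)
    kappa = make_folds(n, k, seed)
    total = 0.0
    for fold in range(k):
        train = [i for i in range(n) if kappa[i] != fold]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        total += sum((ys[i] - model(xs[i])) ** 2
                     for i in range(n) if kappa[i] == fold)
    return total / n

rng = random.Random(3)
xs = [i / 10 for i in range(40)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]  # hypothetical noisy line
print("5-fold CV error:", cross_validation_error(xs, ys, k=5))
print("leave-one-out CV error:", cross_validation_error(xs, ys, k=len(xs)))
```

Setting `k` equal to the number of observations gives leave-one-out cross-validation, at the cost of one model fit per observation.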

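Cross-validation
Given a set of models $f(x, \alpha)$ indexed by a tuning parameter $\alpha$, denote by $\hat f^{-k}(x, \alpha)$ the $\alpha$th model fit with the $k$th part of the data removed. Then,
$\mathrm{CV}(\hat f, \alpha) = \frac{1}{N} \sum_{i=1}^{N} L\big(y_i, \hat f^{-\kappa(i)}(x_i, \alpha)\big)$
The function $\mathrm{CV}(\hat f, \alpha)$ provides an estimate of the test error curve.
We find the tuning parameter $\hat\alpha$ that minimizes it.
Our final chosen model is $f(x, \hat\alpha)$, which we then fit to all the data.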
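
Tuning by cross-validation can be sketched as follows. The tuning parameter here is the number of neighbors in k-NN regression, with squared-error loss. The data generator, the candidate list, and the helper names are assumptions made for this example.

```python
import math
import random

def knn_predict(xs, ys, x0, k):
    """k-NN regression: average response of the k inputs closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k

def cv_curve(xs, ys, alphas, n_folds=5, seed=0):
    """Cross-validation error CV(alpha) for each candidate tuning value."""
    n = len(xs)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[j::n_folds] for j in range(n_folds)]  # same folds for every alpha
    curve = {}
    for alpha in alphas:
        total = 0.0
        for fold in folds:
            held_out = set(fold)
            train_x = [xs[i] for i in range(n) if i not in held_out]
            train_y = [ys[i] for i in range(n) if i not in held_out]
            total += sum((ys[i] - knn_predict(train_x, train_y, xs[i], alpha)) ** 2
                         for i in fold)
        curve[alpha] = total / n
    return curve

rng = random.Random(42)
xs = [rng.random() for _ in range(100)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]  # hypothetical data

candidates = [1, 3, 5, 9, 15, 25, 41]
curve = cv_curve(xs, ys, candidates)
alpha_hat = min(curve, key=curve.get)

def final_model(x0):
    """Chosen model, refit to all the data (for k-NN, this means using all points)."""
    return knn_predict(xs, ys, x0, alpha_hat)

print({k: round(v, 4) for k, v in curve.items()}, "-> chosen k:", alpha_hat)
```

Using the same folds for every candidate keeps the comparison between tuning values from being blurred by different random partitions.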

Cross-validation
[Figure: prediction error (orange) and tenfold cross-validation curve (blue), estimated from a single training set, for a linear classification model (80 observations and 20 predictors).]

Cross-validation
Consider a task with samples in two equal-sized classes, and predictors that are independent of the class labels.
The true (and therefore test) error rate of any classifier is 50%.
A tempting procedure:
1. Select the 100 predictors that correlate best with the class labels.
2. Build a 1-nearest-neighbor classifier based on these 100 predictors.
3. Estimate its error by cross-validation. The 100 predictors don't change from fold to fold.
Repeat 50 times. CV error rate: 3%?
Problem: the predictors have an unfair advantage, because they were chosen on the basis of all of the samples, including the ones later held out.

Cross-validation
Correct K-fold CV (here K = 5):
Divide the samples into K cross-validation folds.
For each fold k:
1. Select a subset of good predictors that show fairly strong correlation with the class labels, using all samples except those in fold k.
2. Using just this subset of predictors, build a classifier, using all samples except those in fold k.
3. Use the classifier to predict the class labels for the samples in fold k.
The error estimates from step 3 are then accumulated over all K folds to produce the cross-validation estimate of prediction error.
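
The difference between selecting predictors once on all samples and selecting them inside each fold can be checked by simulation. This is a minimal sketch, not the experiment from the slides: the sizes `N`, `P`, `M`, the random seed, and all function names are assumptions. The labels are independent of every predictor, so the true error rate is 50%.

```python
import random

def abs_corr(col, y, idx):
    """Absolute Pearson correlation of one predictor with the labels,
    computed only on the samples listed in idx."""
    n = len(idx)
    mean_x = sum(col[i] for i in idx) / n
    mean_y = sum(y[i] for i in idx) / n
    sxy = sum((col[i] - mean_x) * (y[i] - mean_y) for i in idx)
    sxx = sum((col[i] - mean_x) ** 2 for i in idx)
    syy = sum((y[i] - mean_y) ** 2 for i in idx)
    return abs(sxy) / (sxx * syy) ** 0.5

def top_predictors(cols, y, idx, m):
    """Indices of the m predictors most correlated with y on samples idx."""
    scores = [abs_corr(col, y, idx) for col in cols]
    return sorted(range(len(cols)), key=lambda j: -scores[j])[:m]

def one_nn_errors(cols, y, feats, train, test):
    """Number of test samples misclassified by 1-NN fitted on train."""
    def dist(a, b):
        return sum((cols[j][a] - cols[j][b]) ** 2 for j in feats)
    return sum(y[min(train, key=lambda t: dist(i, t))] != y[i] for i in test)

def cv_error(cols, y, folds, m, select_inside):
    """CV error rate; predictors are chosen inside each fold only if select_inside."""
    everyone = list(range(len(y)))
    fixed = None if select_inside else top_predictors(cols, y, everyone, m)
    wrong = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in everyone if i not in held_out]
        feats = top_predictors(cols, y, train, m) if select_inside else fixed
        wrong += one_nn_errors(cols, y, feats, train, fold)
    return wrong / len(y)

rng = random.Random(0)
N, P, M, K = 50, 5000, 100, 5  # illustrative sizes (assumed)
y = [0] * (N // 2) + [1] * (N // 2)
rng.shuffle(y)
cols = [[rng.gauss(0, 1) for _ in range(N)] for _ in range(P)]  # pure noise
order = list(range(N))
rng.shuffle(order)
folds = [order[k::K] for k in range(K)]

wrong_way = cv_error(cols, y, folds, M, select_inside=False)
right_way = cv_error(cols, y, folds, M, select_inside=True)
print(f"selection on all samples: {wrong_way:.2f}")
print(f"selection inside folds:   {right_way:.2f}")
```

When the predictors are chosen inside each fold, the held-out samples play no part in the selection, so the estimate should land near the true 50%. Selecting on all samples first produces an estimate that is far too optimistic.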

Cross-validation
[Figure only.]
