Lecture 7. CS4442/9542b: Artificial Intelligence II Prof. Olga Veksler. Outline. Machine Learning: Cross Validation. Performance evaluation methods

Size: px

Start display at page:

Download "Lecture 7. CS4442/9542b: Artificial Intelligence II Prof. Olga Veksler. Outline. Machine Learning: Cross Validation. Performance evaluation methods"

Marsha Pearson
5 years ago
Views:

1 CS4442/9542b: Artificial Intelligence II Prof. Olga Veksler Lecture 7 Machine Learning: Cross Validation Outline Performance evaluation methods test/train sets cross-validation k-fold Leave-one-out 1

2 A Regression Problem = f() + noise Can we learn f from this data? Let s consider three methods Linear Regression 2

3 Quadratic Regression Join-the-dots Also known as piecewise linear nonparametric regression if that makes ou feel better 3

4 Which is best? Wh not choose the method with the best fit to the data? What do we reall want? Wh not choose the method with the best fit to the data? How well are ou going to predict future data drawn from the same distribution? 4

5 The test set method 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set The test set method (Linear regression eample) 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set 3. Perform our regression on the training set 5

6 The test set method (Linear regression eample) Mean Squared Error = Randoml choose 30% of the data to be in a test set 2. The remainder is a training set 3. Perform our regression on the training set 4. Estimate our future performance with the test set The test set method 1. Randoml choose 30% of the data to be in a test set 2. The remainder is a training set 3. Perform our regression on the training set (Quadratic regression eample) 4. Estimate our future performance with the test Mean Squared Error = 0.9 set 6

7 The test set method (Join the dots eample) Mean Squared Error = Randoml choose 30% of the data to be in a test set 2. The remainder is a training set 3. Perform our regression on the training set 4. Estimate our future performance with the test set The test set method Good news: Ver ver simple Can then simpl choose the method with the best test-set score Bad news: What s the downside? 7

8 The test set method Good news: Ver ver simple Can then simpl choose the method with the best test-set score Bad news: Wastes data: we get an estimate of the best method to appl to 30% less data if we don t have much data, our testset might just be luck or unluck We sa the test-set estimator of performance has high variance LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 8

9 LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R-1 datapoints 9

10 LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R-1 datapoints 4. Note our error ( k, k ) LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R-1 datapoints 4. Note our error ( k, k ) When ou ve done all points, report the mean error. 10

11 LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R-1 datapoints 4. Note our error ( k, k ) When ou ve done all points, report the mean error. MSE LOOCV = 2.12 LOOCV for Quadratic Regression For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R-1 datapoints 4. Note our error ( k, k ) When ou ve done all points, report the mean error. MSE LOOCV =

12 LOOCV for Join The Dots For k=1 to R 1. Let ( k, k ) be the k th record 2. Temporaril remove ( k, k ) from the dataset 3. Train on the remaining R-1 datapoints 4. Note our error ( k, k ) When ou ve done all points, report the mean error. MSE LOOCV =3.33 Which kind of Cross Validation? Test-set Leaveone-out Downside Variance: unreliable estimate of future performance Epensive Upside Cheap Doesn t waste data..can we get the best of both worlds? 12

13 k-fold Cross Validation Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) k-fold Cross Validation Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the blue partition: Train on all the points not in the blue partition. Find blue points. 13

14 k-fold Cross Validation Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the blue partition: Train on all the points not in the blue partition. Find blue points. For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points. k-fold Cross Validation Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the blue partition: Train on all the points not in the blue partition. Find blue points. For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points. For the gra partition: Train on all the points not in the gra partition. Find gra points. 14

15 k-fold Cross Validation Linear Regression MSE 3FOLD =2.05 Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the blue partition: Train on all the points not in the blue partition. Find blue points. For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points. For the gra partition: Train on all the points not in the gra partition. Find gra points. Then report the mean error k-fold Cross Validation Quadratic Regression MSE 3FOLD =1.11 Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the blue partition: Train on all the points not in the blue partition. Find blue points. For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points. For the gra partition: Train on all the points not in the gra partition. Find gra points. Then report the mean error 15

16 k-fold Cross Validation Joint-the-dots MSE 3FOLD =2.93 Randoml break the dataset into k partitions (in our eample we ll have k=3 partitions colored Red Green and Blue) For the blue partition: Train on all the points not in the blue partition. Find blue points. For the green partition: Train on all the points not in the green partition. Find the test-set sum of errors on the green points. For the gra partition: Train on all the points not in the gra partition. Find gra points. Then report the mean error Which kind of Cross Validation? Test-set Leaveone-out 10-fold 3-fold N-fold Downside Variance: unreliable estimate of future performance Epensive Wastes 10% of the data. 10 times more epensive than test set Wastier than 10-fold. Epensivier than test set Identical to Leave-one-out Upside Cheap Doesn t waste data Onl wastes 10%. Onl 10 times more epensive instead of R times. Slightl better than testset 16

17 CV-based Model Selection We re tring to decide which algorithm to use. We train each machine and make a table i f i f 1 f 2 f 3 f 4 f 5 f 6 TRAINERR 10-FOLD-CV-ERR Choice CV-based Model Selection Eample: Choosing number of hidden units in a onehidden-laer neural net. Step 1: Compute 10-fold CV error for si different model classes: Algorithm 0 hidden units 1 hidden units 2 hidden units 3 hidden units 4 hidden units 5 hidden units TRAINERR 10-FOLD-CV-ERR Choice Step 2: Whichever model class gave best CV score: train it with all the data, and that s the predictive model ou ll use. 17

18 CV-based Model Selection Eample: Choosing k for a k-nearest-neighbor regression. Step 1: Compute LOOCV error for si different model classes: Algorithm K=1 TRAINERR 10-fold-CV-ERR Choice K=2 K=3 K=4 K=5 K=6 Step 2: Whichever model class gave best CV score: train it with all the data, and that s the predictive model ou ll use. CV-based Model Selection Eample: Choosing k for a k-nearest-neighbor regression. Step 1: Compute LOOCV error for si different model classes: Algorithm K=1 K=2 K=3 K=4 K=5 K=6 Wh did we use 10-fold-CV for neural nets and LOOCV for k- nearest neighbor? TRAINERR And wh stop LOOCV-ERR at K=6 Are we guaranteed that a local optimum of K vs LOOCV will be the global optimum? What should we do if we are depressed at the epense of doing LOOCV for K= 1 through 1000? The reason is Computational. For k- NN (and all other nonparametric methods) LOOCV happens to be as cheap as regular predictions. No good reason, ecept it looked like things were getting worse as K was increasing Choice Sadl, no. And in fact, the relationship can be ver bump. Step 2: Whichever model class gave best CV score: train it with all the data, and that s the predictive model ou ll use. Idea One: K=1, K=2, K=4, K=8, K=16, K=32, K=64 K=1024 Idea Two: Hillclimbing from an initial guess at K 18

19 CV-based Algorithm Choice Eample: Choosing which regression algorithm to use Step 1: Compute 10-fold-CV error for si different model classes: Algorithm TRAINERR 10-fold-CV-ERR Choice 1-NN 10-NN Linear Reg n Quad reg n LWR, KW=0.1 LWR, KW=0.5 Step 2: Whichever algorithm gave best CV score: train it with all the data, and that s the predictive model ou ll use. Cross-validation for classification Instead of computing the sum squared errors on a test set, ou should compute The total number of misclassifications on a testset What s LOOCV of 1-NN? What s LOOCV of 3-NN? What s LOOCV of 22-NN? 19

LECTURE 6: CROSS VALIDATION

LECTURE 6: CROSS VALIDATION CSCI 4352 Machine Learning Dongchul Kim, Ph.D. Department of Computer Science A Regression Problem Given a data set, how can we evaluate our (linear) model? Cross Validation