Model Selection and Assessment

1 Model Selection and Assessment
CS4780/5780 Machine Learning, Fall 2014
Thorsten Joachims, Cornell University
Reading:
  Mitchell, Chapter 5
  Dietterich, T. G. (1998). Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Computation, 10(7).

2 Outline
Model Selection
  Controlling overfitting in decision trees
  Train, validation, test
  K-fold cross validation
Evaluation
  What is the true error of classification rule $h$?
  Is rule $h_1$ more accurate than $h_2$?
  Is learning algorithm $A_1$ better than $A_2$?

3 Learning as Prediction

4 Overfitting
Note: Accuracy = 1.0 - Error [Mitchell]

5 Controlling Overfitting in Decision Trees
Early Stopping: Stop growing the tree and introduce a leaf when splitting is no longer reliable.
  Restrict size of tree (e.g., number of nodes, depth)
  Minimum number of examples in node
  Threshold on splitting criterion
Post Pruning: Grow full tree, then simplify.
  Reduced-error tree pruning
  Rule post-pruning
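
The three early-stopping criteria can be read as a single test applied before each split. The following is a minimal sketch, not the lecture's implementation: the names `StoppingRule` and `should_stop` and all threshold values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class StoppingRule:
    """Hypothetical early-stopping thresholds; the default values are illustrative only."""
    max_depth: int = 5        # restrict the size of the tree via its depth
    min_examples: int = 10    # minimum number of examples in a node
    min_gain: float = 0.01    # threshold on the splitting criterion


def should_stop(rule, depth, n_examples, best_gain):
    """Return True if the node should become a leaf instead of being split."""
    if depth >= rule.max_depth:
        return True
    if n_examples < rule.min_examples:
        return True
    if best_gain < rule.min_gain:
        return True
    return False


rule = StoppingRule()
print(should_stop(rule, depth=2, n_examples=50, best_gain=0.20))  # no criterion fires
print(should_stop(rule, depth=2, n_examples=4, best_gain=0.20))   # too few examples
```

A tree learner would call such a test at every node before searching for a split. Suitable thresholds are typically chosen on a validation sample rather than fixed in advance.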

6 Reduced-Error Pruning

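7 Model Selection
[Diagram: a real-world process generates a sample, drawn i.i.d.; the sample is split randomly into a train sample $S_{train}$ and a test sample $S_{test}$; $S_{train}$ is split randomly again into a train sample and a validation sample $S_{val}$; learners 1, ..., m produce $\hat{h}_1, \dots, \hat{h}_m$, and validation on $S_{val}$ yields the selected $\hat{h}$.]
Training: Run the learning algorithm m times (e.g., with different parameters).
Validation Error: $\mathrm{Err}_{S_{val}}(\hat{h}_i)$ is an estimate of $\mathrm{Err}_P(\hat{h}_i)$ for each $\hat{h}_i$.
Selection: Use the $\hat{h}_i$ with minimum $\mathrm{Err}_{S_{val}}(\hat{h}_i)$ for prediction on test examples.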
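
The train/validate/select procedure can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the toy k-nearest-neighbour learner `train_knn`, the synthetic 1-D data, the split sizes, and the candidate parameters 1, 5, and 15.

```python
import random


def error_rate(h, sample):
    """Fraction of examples (x, y) in sample that h misclassifies."""
    return sum(h(x) != y for x, y in sample) / len(sample)


def train_knn(train, k):
    """Toy learner (assumption): k-nearest neighbours on 1-D inputs."""
    def h(x):
        nearest = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
        votes = sum(y for _, y in nearest)
        return 1 if 2 * votes > len(nearest) else 0
    return h


def select_model(train, val, params):
    """Train one hypothesis per parameter; return the one with minimum validation error."""
    candidates = [(p, train_knn(train, p)) for p in params]
    return min(candidates, key=lambda c: error_rate(c[1], val))


def draw(rng, n):
    """Synthetic data: label is 1 when x > 0.5, flipped with probability 0.1."""
    sample = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y
        sample.append((x, y))
    return sample


rng = random.Random(0)
sample = draw(rng, 300)
s_test, s_val, s_train = sample[:90], sample[90:150], sample[150:]

best_k, h_hat = select_model(s_train, s_val, params=[1, 5, 15])
print("selected k:", best_k)
print("validation error:", error_rate(h_hat, s_val))
print("test error:", error_rate(h_hat, s_test))
```

The validation error of the selected hypothesis tends to be optimistic because it was used to make the choice, which is why the sketch reports the error on the untouched test sample as well.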

8 K-fold Cross Validation
Given
  Sample of labeled instances $S$
  Learning algorithm $A$
Compute
  Randomly partition $S$ into k equally sized subsets $S_1, \dots, S_k$
  For i from 1 to k
    Train $A$ on $S_1, \dots, S_{i-1}, S_{i+1}, \dots, S_k$ and get $\hat{h}$.
    Apply $\hat{h}$ to $S_i$ and compute $\mathrm{Err}_{S_i}(\hat{h})$.
Estimate
  The average of $\mathrm{Err}_{S_i}(\hat{h})$ over the k folds is an estimate of the average prediction error of rules produced by $A$, namely $E_S(\mathrm{Err}_P(A(S_{train})))$.
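
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: `k_fold_error` and the toy `majority_learner` are hypothetical names, a learner is assumed to be any function that maps a training list of `(x, y)` pairs to a prediction function, and folds are formed by striding over a shuffled copy of the sample.

```python
import random


def k_fold_error(sample, learner, k, seed=0):
    """Average held-out error of `learner` over a random k-fold partition of `sample`."""
    data = list(sample)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k subsets whose sizes differ by at most one
    fold_errors = []
    for i in range(k):
        # Train on every fold except fold i, then evaluate on fold i.
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h = learner(train)
        mistakes = sum(h(x) != y for x, y in folds[i])
        fold_errors.append(mistakes / len(folds[i]))
    return sum(fold_errors) / k, fold_errors


def majority_learner(train):
    """Toy learner (assumption): always predict the most frequent training label."""
    ones = sum(y for _, y in train)
    label = 1 if 2 * ones >= len(train) else 0
    return lambda x: label


# 70 positive and 30 negative examples; the toy learner ignores x.
sample = [(i, 1) for i in range(70)] + [(i, 0) for i in range(30)]
avg_err, per_fold = k_fold_error(sample, majority_learner, k=10)
print(avg_err, per_fold)
```

Each example is used for testing exactly once and for training k-1 times, so the estimate uses the whole sample while every $\hat{h}$ is still evaluated on data it did not see.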

9 Text Classification Example: Corporate Acquisitions
Results
  Method                                        Size        Training Error  Test Error
  Unpruned Tree (ID3 Algorithm)                 437 nodes   0.0%            11.0%
  Early Stopping Tree (ID3 Algorithm)           299 nodes   2.6%            9.8%
  Reduced-Error Tree Pruning (C4.5 Algorithm)   167 nodes   4.0%            10.8%
  Rule Post-Pruning (C4.5 Algorithm)            164 tests   3.1%            10.3%
Examples of rules
  IF vs = 1 THEN - [99.4%]
  IF vs = 0 & export = 0 & takeover = 1 THEN + [93.6%]

10 Evaluating Learned Hypotheses
[Diagram: a real-world process generates a sample $S$, drawn i.i.d.; $S$ is split randomly into a training sample $S_{train} = (x_1,y_1), \dots, (x_n,y_n)$ and a test sample $S_{test} = (x_1,y_1), \dots, (x_k,y_k)$; the learner (incl. model selection) runs on $S_{train}$ and outputs $\hat{h}$.]
Goal: Find $h$ with small prediction error $\mathrm{Err}_P(h)$ over $P(X,Y)$.
Question: How good is $\mathrm{Err}_P(\hat{h})$ of the $\hat{h}$ found on training sample $S_{train}$?
Training Error: $\mathrm{Err}_{S_{train}}(\hat{h})$ is the error on the training sample.
Test Error: $\mathrm{Err}_{S_{test}}(\hat{h})$ is an estimate of $\mathrm{Err}_P(\hat{h})$.

11 What is the True Error of a Hypothesis?
Given
  Sample of labeled instances $S$
  Learning algorithm $A$
Setup
  Partition $S$ randomly into $S_{train}$ (70%) and $S_{test}$ (30%).
  Train learning algorithm $A$ on $S_{train}$; the result is $\hat{h}$.
  Apply $\hat{h}$ to $S_{test}$ and compare predictions against true labels.
Test
  The error on the test sample, $\mathrm{Err}_{S_{test}}(\hat{h})$, is an estimate of the true error $\mathrm{Err}_P(\hat{h})$.
  Compute confidence interval.
[Diagram: the learner is trained on $S_{train} = (x_1,y_1), \dots, (x_n,y_n)$ and outputs $\hat{h}$, which is evaluated on $S_{test} = (x_1,y_1), \dots, (x_k,y_k)$.]

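12 Binomial Distribution
The probability of observing $x$ heads in a sample of $n$ independent coin tosses, where in each toss the probability of heads is $p$, is
  $P(X = x \mid p, n) = \frac{n!}{x!\,(n-x)!}\, p^x (1-p)^{n-x}$
Normal approximation: For $np(1-p) \ge 5$ the binomial can be approximated by the normal distribution with
  Expected value: $E(X) = np$
  Variance: $\mathrm{Var}(X) = np(1-p)$
With probability $\delta$, the observation $x$ falls in the interval $E(X) \pm z_\delta \sqrt{\mathrm{Var}(X)}$
  $\delta$      50%   68%   80%   90%   95%   98%   99%
  $z_\delta$    0.67  1.00  1.28  1.64  1.96  2.33  2.58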
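
Applied to test errors, the normal approximation gives an approximate confidence interval for the true error. The sketch below assumes the usual plug-in step of using the observed error rate in place of the unknown p. The function names and the example counts (12 errors on 100 test examples) are hypothetical, and the z values are the standard two-sided normal quantiles.

```python
import math

# Two-sided z values for the normal approximation (standard normal quantiles).
Z = {0.50: 0.67, 0.68: 1.00, 0.80: 1.28, 0.90: 1.64, 0.95: 1.96, 0.98: 2.33, 0.99: 2.58}


def binomial_pmf(x, n, p):
    """Probability of exactly x heads in n independent tosses with head probability p."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)


def error_confidence_interval(errors, n, confidence=0.95):
    """Approximate interval for the true error given `errors` mistakes on n test examples."""
    e = errors / n  # observed test error, used as the plug-in estimate of p
    half_width = Z[confidence] * math.sqrt(e * (1 - e) / n)
    return e - half_width, e + half_width


low, high = error_confidence_interval(errors=12, n=100)
print(f"observed error 0.12, approximate 95% interval [{low:.3f}, {high:.3f}]")
```

The interval is only as good as the approximation, so it is worth checking that $np(1-p) \ge 5$ holds for the observed error rate before relying on it.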

13 Text Classification Example: Results
Data
  Training Sample: 2000 examples
  Test Sample: 600 examples
  Method               Size        Training Error  Test Error
  Unpruned Tree        437 nodes   0.0%            11.0%
  Early Stopping Tree  299 nodes   2.6%            9.8%
  Post-Pruned Tree     167 nodes   4.0%            10.8%
  Rule Post-Pruning    164 tests   3.1%            10.3%
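
As a worked example, the normal approximation to the binomial can be applied to two of the test errors above with n = 600 test examples. This is an illustrative calculation, assuming a 95% interval (z = 1.96) and the observed error plugged in for p:

```latex
\begin{align*}
\text{Unpruned tree:} \quad
  & 0.110 \pm 1.96 \sqrt{\frac{0.110\,(1 - 0.110)}{600}} \approx 0.110 \pm 0.025 \\
\text{Early stopping tree:} \quad
  & 0.098 \pm 1.96 \sqrt{\frac{0.098\,(1 - 0.098)}{600}} \approx 0.098 \pm 0.024
\end{align*}
```

The resulting intervals, roughly [0.085, 0.135] and [0.074, 0.122], overlap substantially. Read informally, a gap of 1.2 percentage points is within the sampling noise of a 600-example test sample; a formal comparison on the same test sample calls for a paired test.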

14 Is Rule $h_1$ More Accurate than $h_2$? (Same Test Sample)
Given
  Sample of labeled instances $S$
  Learning algorithms $A_1$ and $A_2$
Setup
  Partition $S$ randomly into $S_{train}$ (70%) and $S_{test}$ (30%).
  Train learning algorithms $A_1$ and $A_2$ on $S_{train}$; the results are $\hat{h}_1$ and $\hat{h}_2$.
  Apply $\hat{h}_1$ and $\hat{h}_2$ to $S_{test}$ and compute $\mathrm{Err}_{S_{test}}(\hat{h}_1)$ and $\mathrm{Err}_{S_{test}}(\hat{h}_2)$.
Test
  Decide whether $\mathrm{Err}_P(\hat{h}_1) \ne \mathrm{Err}_P(\hat{h}_2)$.
  Null Hypothesis: $\mathrm{Err}_{S_{test}}(\hat{h}_1)$ and $\mathrm{Err}_{S_{test}}(\hat{h}_2)$ come from binomial distributions with the same p.
  Binomial Sign Test (McNemar's Test)
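
One common way to carry out the sign test on a shared test sample is to look only at the examples where exactly one of the two rules errs, and ask whether those disagreements are split like fair coin tosses. The sketch below uses the exact binomial form rather than the chi-squared approximation often associated with McNemar's test. The function name and the toy predictions are hypothetical.

```python
import math


def sign_test_p_value(y_true, pred1, pred2):
    """Two-sided binomial sign test on the examples where exactly one rule errs."""
    triples = list(zip(y_true, pred1, pred2))
    d1 = sum(p1 != y and p2 == y for y, p1, p2 in triples)  # only rule 1 is wrong
    d2 = sum(p1 == y and p2 != y for y, p1, p2 in triples)  # only rule 2 is wrong
    n = d1 + d2
    if n == 0:
        return d1, d2, 1.0  # the rules never differ in correctness
    # Under the null hypothesis each disagreement is a fair coin toss (p = 0.5).
    tail = sum(math.comb(n, i) for i in range(min(d1, d2) + 1)) / 2**n
    return d1, d2, min(1.0, 2 * tail)


# Toy data: rule 1 errs on the first 8 examples, rule 2 errs only on the ninth.
y_true = [1] * 20
pred1 = [0] * 8 + [1] * 12
pred2 = [1] * 8 + [0] + [1] * 11
d1, d2, p_value = sign_test_p_value(y_true, pred1, pred2)
print(d1, d2, p_value)
```

Examples on which both rules are right, or both are wrong, carry no information about which rule is better and are ignored by the test.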

15 Is Rule $h_1$ More Accurate than $h_2$? (Different Test Samples)
Given
  Samples of labeled instances $S_1$ and $S_2$
  Learning algorithms $A_1$ and $A_2$
Setup
  Partition $S_1$ randomly into $S_{train1}$ (70%) and $S_{test1}$ (30%).
  Partition $S_2$ randomly into $S_{train2}$ (70%) and $S_{test2}$ (30%).
  Train learning algorithm $A_1$ on $S_{train1}$ and $A_2$ on $S_{train2}$; the results are $\hat{h}_1$ and $\hat{h}_2$.
  Apply $\hat{h}_1$ to $S_{test1}$ and $\hat{h}_2$ to $S_{test2}$ and get $\mathrm{Err}_{S_{test1}}(\hat{h}_1)$ and $\mathrm{Err}_{S_{test2}}(\hat{h}_2)$.
Test
  Decide whether $\mathrm{Err}_P(\hat{h}_1) \ne \mathrm{Err}_P(\hat{h}_2)$.
  Null Hypothesis: $\mathrm{Err}_{S_{test1}}(\hat{h}_1)$ and $\mathrm{Err}_{S_{test2}}(\hat{h}_2)$ come from binomial distributions with the same p.
  t-test (z-test) [see Mitchell book]
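
With two independent test samples, a z-test on the difference of the two observed error rates is one way to run this comparison. The sketch below is an illustration under assumptions: it uses the normal approximation with unpooled variance estimates, the function name and the counts are hypothetical, and it assumes at least one observed error rate lies strictly between 0 and 1 so the standard deviation is nonzero.

```python
import math


def two_sample_z_test(errors1, n1, errors2, n2):
    """z statistic and two-sided p-value for a difference between two error rates."""
    e1, e2 = errors1 / n1, errors2 / n2
    # Variance of the difference of two independent, approximately normal estimates.
    std = math.sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    z = (e1 - e2) / std
    # Two-sided tail of the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 30 errors on 300 test examples versus 45 errors on 300.
z, p_value = two_sample_z_test(errors1=30, n1=300, errors2=45, n2=300)
print(f"z = {z:.3f}, two-sided p-value = {p_value:.3f}")
```

Because the test samples are independent, the variances add. That makes this test less sensitive than a paired test on a shared test sample.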

16 Is Learning Algorithm $A_1$ Better than $A_2$?
Given
  k samples $S_1, \dots, S_k$ of labeled instances, all i.i.d. from $P(X,Y)$
  Learning algorithms $A_1$ and $A_2$
Setup
  For i from 1 to k
    Partition $S_i$ randomly into $S_{train}$ (70%) and $S_{test}$ (30%).
    Train learning algorithms $A_1$ and $A_2$ on $S_{train}$; the results are $\hat{h}_1$ and $\hat{h}_2$.
    Apply $\hat{h}_1$ and $\hat{h}_2$ to $S_{test}$ and compute $\mathrm{Err}_{S_{test}}(\hat{h}_1)$ and $\mathrm{Err}_{S_{test}}(\hat{h}_2)$.
Test
  Decide whether $E_S(\mathrm{Err}_P(A_1(S_{train}))) \ne E_S(\mathrm{Err}_P(A_2(S_{train})))$.
  Null Hypothesis: $\mathrm{Err}_{S_{test}}(A_1(S_{train}))$ and $\mathrm{Err}_{S_{test}}(A_2(S_{train}))$ come from the same distribution over samples $S$.
  t-test (z-test) or Wilcoxon Signed-Rank Test [see Mitchell book]
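
The k pairs of test errors can be compared with a paired t-test. The sketch below computes only the t statistic. The function name and the error values are hypothetical, and it assumes the k differences are not all identical, so the standard deviation is nonzero.

```python
import math
import statistics


def paired_t_statistic(errors1, errors2):
    """t statistic (k - 1 degrees of freedom) for paired per-sample error rates."""
    diffs = [a - b for a, b in zip(errors1, errors2)]
    k = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(k)  # uses the sample standard deviation
    return mean_diff / std_err


# Hypothetical test errors of A1 and A2 on k = 5 samples.
err_a1 = [0.12, 0.10, 0.15, 0.11, 0.13]
err_a2 = [0.10, 0.09, 0.12, 0.11, 0.10]
t = paired_t_statistic(err_a1, err_a2)
print(f"t = {t:.3f} with {len(err_a1) - 1} degrees of freedom")
```

The statistic is then compared against the critical value of the t distribution with k - 1 degrees of freedom. For 4 degrees of freedom and a two-sided test at the 95% level that value is 2.776.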

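17 Approximation via K-fold Cross Validation
Given
  Sample of labeled instances $S$
  Learning algorithms $A_1$ and $A_2$
Compute
  Randomly partition $S$ into k equally sized subsets $S_1, \dots, S_k$
  For i from 1 to k
    Train $A_1$ and $A_2$ on $S_1, \dots, S_{i-1}, S_{i+1}, \dots, S_k$ and get $\hat{h}_1$ and $\hat{h}_2$.
    Apply $\hat{h}_1$ and $\hat{h}_2$ to $S_i$ and compute $\mathrm{Err}_{S_i}(\hat{h}_1)$ and $\mathrm{Err}_{S_i}(\hat{h}_2)$.
Estimate
  The average of $\mathrm{Err}_{S_i}(\hat{h}_1)$ is an estimate of $E_S(\mathrm{Err}_P(A_1(S_{train})))$.
  The average of $\mathrm{Err}_{S_i}(\hat{h}_2)$ is an estimate of $E_S(\mathrm{Err}_P(A_2(S_{train})))$.
  Count how often $\mathrm{Err}_{S_i}(\hat{h}_1) > \mathrm{Err}_{S_i}(\hat{h}_2)$ and how often $\mathrm{Err}_{S_i}(\hat{h}_1) < \mathrm{Err}_{S_i}(\hat{h}_2)$.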
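
The comparison can be sketched by running both algorithms on the same folds and recording per-fold errors and win counts. Everything concrete here is an assumption for illustration: the function name, the two toy learners, and the synthetic 1-D sample.

```python
import random


def compare_by_cross_validation(sample, learner1, learner2, k, seed=0):
    """Average fold errors of two learners on identical folds, plus per-fold win counts."""
    data = list(sample)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    errs1, errs2 = [], []
    for i in range(k):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        for learner, errs in ((learner1, errs1), (learner2, errs2)):
            h = learner(train)
            errs.append(sum(h(x) != y for x, y in folds[i]) / len(folds[i]))
    wins1 = sum(a < b for a, b in zip(errs1, errs2))  # folds where learner1 has lower error
    wins2 = sum(a > b for a, b in zip(errs1, errs2))  # folds where learner2 has lower error
    return sum(errs1) / k, sum(errs2) / k, wins1, wins2


def stump_learner(train):
    """Toy learner (assumption): threshold halfway between the two classes on 1-D inputs."""
    max_neg = max(x for x, y in train if y == 0)
    min_pos = min(x for x, y in train if y == 1)
    threshold = (max_neg + min_pos) / 2
    return lambda x: int(x > threshold)


def majority_learner(train):
    """Toy learner (assumption): always predict the most frequent training label."""
    ones = sum(y for _, y in train)
    label = 1 if 2 * ones >= len(train) else 0
    return lambda x: label


sample = [(x, int(x >= 50)) for x in range(100)]
avg1, avg2, wins1, wins2 = compare_by_cross_validation(
    sample, stump_learner, majority_learner, k=10)
print(avg1, avg2, wins1, wins2)
```

The win counts can be fed into a sign test, and the paired per-fold errors into a paired t-test. Both are approximations here, because the k training sets overlap and the folds are therefore not independent samples.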
