Classification Part II
Instructor: Wei Ding

Practical Issues of Classification
- Underfitting and Overfitting
- Missing Values
- Costs of Classification
Underfitting and Overfitting (Example)
- 500 circular and 500 triangular data points.
- Circular points: 0.5 <= sqrt(x1^2 + x2^2) <= 1
- Triangular points: sqrt(x1^2 + x2^2) > 1 or sqrt(x1^2 + x2^2) < 0.5

Underfitting and Overfitting
- Overfitting: once the tree becomes too large, its test error rate begins to increase even though its training error rate continues to decrease.
- Underfitting: when the model is too simple, both training and test errors are large.
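A minimal sketch of how such a data set could be generated (NumPy assumed; the sampling window and the helper name sample are illustrative choices, not from the original slides):

# Sketch: generate 500 circular and 500 triangular points as described above.
# The sampling window [-1.5, 1.5]^2 is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

def sample(label_fn, n):
    pts = []
    while len(pts) < n:
        x = rng.uniform(-1.5, 1.5, size=2)
        if label_fn(np.sqrt(x[0]**2 + x[1]**2)):
            pts.append(x)
    return np.array(pts)

circles   = sample(lambda r: 0.5 <= r <= 1.0, 500)     # circular class
triangles = sample(lambda r: r > 1.0 or r < 0.5, 500)  # triangular class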
Understanding the Overfitting Phenomenon
- Training error of a model can be reduced by increasing the model complexity.
- The leaf nodes of the tree can be expanded until it perfectly fits the training data.
- The training error for such a complex tree can be zero, but the test error can be large because the tree may contain nodes that accidentally fit some of the noise points in the training data.
- Such nodes degrade the performance of the tree because they do not generalize well to the test examples.

Overfitting due to Noise
- The decision boundary is distorted by noise points.
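The effect can be reproduced with a short experiment; a sketch assuming scikit-learn, with an illustrative synthetic data set and 10% injected label noise:

# Sketch: training error keeps falling as the tree grows, while test error
# typically bottoms out and then rises once the tree starts fitting noise.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.uniform(-1.5, 1.5, size=(1000, 2))
y = ((np.hypot(X[:, 0], X[:, 1]) >= 0.5) &
     (np.hypot(X[:, 0], X[:, 1]) <= 1.0)).astype(int)
y ^= rng.random(1000) < 0.1            # flip ~10% of the labels (noise)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)
for depth in (1, 2, 4, 8, 16, None):   # None lets the tree fit the data perfectly
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_tr, y_tr)
    print(depth, 1 - tree.score(X_tr, y_tr), 1 - tree.score(X_te, y_te))

With the depth limit relaxed, training error approaches zero while test error typically climbs again.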
Overfitting due to Lack of Representative Samples
- Problem: a small number of training records, i.e., a lack of representative samples in the training data.
- Learning algorithms continue to refine their models even when few training records are available.

Overfitting due to Insufficient Examples
- Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels of that region.
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary.
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- We need new ways of estimating errors.

Overfitting and the Multiple Comparison Procedure
- Model overfitting is related to a methodology known as the multiple comparison procedure.
- Probability that one given stock analyst's prediction is correct at least 8 out of 10 times over the next ten trading days: very low.
- Probability that at least one of 50 stock analysts makes a correct prediction at least 8 out of 10 times: very high.
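These probabilities follow from simple binomial arithmetic, assuming each daily prediction is a fair coin flip; a quick check:

# Probability that one analyst is right on >= 8 of 10 days purely by chance,
# and that at least one of 50 independent analysts achieves this.
from math import comb

p_one = sum(comb(10, k) for k in (8, 9, 10)) / 2**10   # ~0.0547
p_any_of_50 = 1 - (1 - p_one)**50                      # ~0.94
print(p_one, p_any_of_50)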
Overfitting and the Multiple Comparison Procedure
- During decision tree induction, attribute x can be added to the tree if the observed gain, Δ(T0, Tx), is greater than some predefined threshold α.
- In practice, more than one test condition is available, and the decision tree algorithm must choose the best attribute xmax from a set of candidates {x1, x2, ..., xk} to partition the data.
- In this situation, the algorithm is actually using a multiple comparison procedure to decide whether a decision tree should be extended.
- Unless the gain function or threshold α is modified to account for k, the algorithm may inadvertently add spurious nodes to the model, which leads to model overfitting.

Estimating Generalization Errors
- Re-substitution errors: error on training set (Σ e(t))
- Generalization errors: error on testing set (Σ e'(t))
- Methods for estimating generalization errors:
  - Optimistic approach: e'(t) = e(t)
  - Pessimistic approach:
    - For each leaf node: e'(t) = e(t) + 0.5
    - Total errors: e'(T) = e(T) + N x 0.5 (N: number of leaf nodes)
    - For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances): training error = 10/1000 = 1%; generalization error = (10 + 30 x 0.5)/1000 = 2.5%
  - Reduced error pruning (REP): uses a validation data set to estimate generalization error
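The pessimistic estimate above is easy to compute directly; a minimal sketch (the helper name is illustrative):

# Pessimistic generalization-error estimate for the example above:
# 10 training errors, 30 leaves, 1000 instances, penalty 0.5 per leaf.
def pessimistic_error(train_errors, n_leaves, n_instances, penalty=0.5):
    return (train_errors + n_leaves * penalty) / n_instances

print(pessimistic_error(10, 30, 1000))   # 0.025, i.e. 2.5%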
Occam's Razor
- Given two models with similar generalization errors, one should prefer the simpler model over the more complex one.
- For complex models, there is a greater chance that the model was fitted accidentally by errors in the data.
- Therefore, one should include model complexity when evaluating a model.

How to Address Overfitting
- Pre-pruning (early stopping rule)
  - Stop the algorithm before it becomes a fully-grown tree.
  - Typical stopping conditions for a node:
    - Stop if all instances belong to the same class.
    - Stop if all the attribute values are the same.
  - More restrictive conditions:
    - Stop if the number of instances is less than some user-specified threshold.
    - Stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test).
    - Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
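Several of these pre-pruning conditions map onto standard hyperparameters; a sketch assuming scikit-learn, with illustrative threshold values:

# Sketch of pre-pruning via early-stopping parameters; the thresholds
# below are illustrative choices, not recommendations.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
tree = DecisionTreeClassifier(
    min_samples_split=20,        # stop if fewer than 20 instances at a node
    min_impurity_decrease=0.01,  # stop if the split barely improves impurity
    random_state=0,
).fit(X, y)
print(tree.get_n_leaves())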
How to Address Overfitting
- Post-pruning
  - Grow the decision tree to its entirety.
  - Trim the nodes of the decision tree in a bottom-up fashion.
  - If the generalization error improves after trimming, replace the sub-tree by a leaf node.
  - The class label of the leaf node is determined from the majority class of instances in the sub-tree.

Example of Post-Pruning
- Before splitting: Class = Yes: 20, Class = No: 10
  - Training error (before splitting) = 10/30
  - Pessimistic error = (10 + 0.5)/30 = 10.5/30
- After splitting on A into A1, A2, A3, A4:

  A1: Class = Yes: 8, Class = No: 4
  A2: Class = Yes: 3, Class = No: 4
  A3: Class = Yes: 4, Class = No: 1
  A4: Class = Yes: 5, Class = No: 1

  - Training error (after splitting) = 9/30
  - Pessimistic error (after splitting) = (9 + 4 x 0.5)/30 = 11/30
- Since 11/30 > 10.5/30: PRUNE!
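The pruning decision in the example reduces to comparing pessimistic errors; a minimal sketch (the helper name is illustrative):

# Sketch of the pessimistic-error pruning test from the example above:
# replace the subtree by a leaf if that does not raise the pessimistic error.
def should_prune(errors_before, errors_after, n_leaves_after, n, penalty=0.5):
    pess_before = (errors_before + penalty) / n                 # as a leaf
    pess_after = (errors_after + n_leaves_after * penalty) / n  # as a subtree
    return pess_after >= pess_before

# Before split: 10 errors; after splitting on A into 4 children: 9 errors.
print(should_prune(10, 9, 4, 30))   # True -> prune (11/30 >= 10.5/30)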
Examples of Post-Pruning
- Case 1: left node (C0: 11, C1: 3), right node (C0: 2, C1: 4)
- Case 2: left node (C0: 14, C1: 3), right node (C0: 2, C1: 2)
- Optimistic error? Don't prune for either case. (Case 1: 3 + 2 = 5 errors split vs. 7 pruned; Case 2: 3 + 2 = 5 errors split vs. 5 pruned.)
- Pessimistic error? Don't prune case 1, prune case 2. (Case 1: 5 + 2(0.5) = 6 split vs. 7 + 0.5 = 7.5 pruned; Case 2: 5 + 2(0.5) = 6 split vs. 5 + 0.5 = 5.5 pruned.)
- Reduced error pruning? Depends on the validation set.

Model Evaluation
- Metrics for Performance Evaluation: how to evaluate the performance of a model?
- Methods for Performance Evaluation: how to obtain reliable estimates?
- Methods for Model Comparison: how to compare the relative performance among competing models?
Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than how fast it classifies or builds models, scalability, etc.
- Confusion matrix:

                      PREDICTED CLASS
                      Class=Yes   Class=No
  ACTUAL  Class=Yes   a (TP)      b (FN)
  CLASS   Class=No    c (FP)      d (TN)

- a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation
- Most widely-used metric:

  Accuracy = (a + d) / (a + b + c + d) = (TP + TN) / (TP + TN + FP + FN)

Limitation of Accuracy
- Consider a 2-class problem: number of Class 0 examples = 9990, number of Class 1 examples = 10.
- If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%.
- Accuracy is misleading because the model does not detect any class 1 example.
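The limitation is easy to reproduce; a sketch of the always-class-0 predictor (NumPy assumed):

# Accuracy on the imbalanced example above: a classifier that always
# predicts class 0 is 99.9% accurate yet finds no class-1 example.
import numpy as np

y_true = np.array([0] * 9990 + [1] * 10)
y_pred = np.zeros_like(y_true)            # always predict class 0

tp = np.sum((y_true == 1) & (y_pred == 1))
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
print((tp + tn) / (tp + tn + fp + fn))    # 0.999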
Cost Matrix
- C(i|j): cost of misclassifying a class j example as class i.

                      PREDICTED CLASS
  C(i|j)              Class=Yes    Class=No
  ACTUAL  Class=Yes   C(Yes|Yes)   C(No|Yes)
  CLASS   Class=No    C(Yes|No)    C(No|No)

Computing Cost of Classification
- Cost matrix:

  C(i|j)      predicted +   predicted -
  actual +        -1            100
  actual -         1              0

- Model M1:

              predicted +   predicted -
  actual +        150            40
  actual -         60           250

  Accuracy = 80%, Cost = 3910

- Model M2:

              predicted +   predicted -
  actual +        250            45
  actual -          5           200

  Accuracy = 90%, Cost = 4255
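The costs above come from summing the element-wise product of each confusion matrix with the cost matrix; a quick check (NumPy assumed):

# Cost of classification for the two models above.
import numpy as np

cost = np.array([[-1, 100],   # actual +: C(+|+), C(-|+)
                 [ 1,   0]])  # actual -: C(+|-), C(-|-)
m1 = np.array([[150, 40], [60, 250]])
m2 = np.array([[250, 45], [ 5, 200]])
print(np.sum(m1 * cost), np.sum(m2 * cost))   # 3910, 4255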
Cost vs Accuracy
- Accuracy is proportional to cost if:
  1. C(Yes|No) = C(No|Yes) = q
  2. C(Yes|Yes) = C(No|No) = p
- With confusion-matrix counts a, b, c, d and N = a + b + c + d, Accuracy = (a + d)/N and:

  Cost = p (a + d) + q (b + c)
       = p (a + d) + q (N - a - d)
       = q N - (q - p)(a + d)
       = N [q - (q - p) x Accuracy]

Cost-Sensitive Measures
- Precision (p) = a / (a + c)
- Recall (r) = a / (a + b)
- F-measure (F) = 2rp / (r + p) = 2a / (2a + b + c)
- Precision is biased towards C(Yes|Yes) and C(Yes|No); recall is biased towards C(Yes|Yes) and C(No|Yes); F-measure is biased towards all except C(No|No).
- Weighted Accuracy = (w1 a + w4 d) / (w1 a + w2 b + w3 c + w4 d)
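A minimal sketch of these measures (the helper name is illustrative), reusing model M1's counts from the cost example as input:

# Precision, recall, and F-measure from confusion-matrix counts.
def precision_recall_f(a, b, c):       # a = TP, b = FN, c = FP
    p = a / (a + c)
    r = a / (a + b)
    f = 2 * r * p / (r + p)            # equals 2a / (2a + b + c)
    return p, r, f

print(precision_recall_f(a=150, b=40, c=60))   # ~ (0.714, 0.789, 0.75)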
Interpretation of Precision and Recall
- Precision is the fraction of records that actually turn out to be positive among the records the classifier has declared positive. The higher the precision, the lower the number of false positive errors committed by the classifier.
- Recall measures the fraction of positive examples correctly predicted by the classifier. Classifiers with large recall have very few positive examples misclassified as the negative class.
Methods for Performance Evaluation
- How to obtain a reliable estimate of performance?
- Performance of a model may depend on other factors besides the learning algorithm:
  - Class distribution
  - Cost of misclassification
  - Size of training and test sets

Learning Curve
- A learning curve shows how accuracy changes with varying sample size.
- Requires a sampling schedule for creating the learning curve:
  - Arithmetic sampling (Langley et al.)
  - Geometric sampling (Provost et al.); see the sketch below
- Effect of small sample size:
  - Bias in the estimate
  - Variance of the estimate
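A sketch of a learning curve under a geometric sampling schedule, assuming scikit-learn with an illustrative data set and model:

# Train on 50, 100, 200, ... examples (doubling each step) and record
# accuracy on a fixed held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=4000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

n = 50
while n <= len(X_tr):                  # geometric schedule: double each step
    model = DecisionTreeClassifier(random_state=0).fit(X_tr[:n], y_tr[:n])
    print(n, model.score(X_te, y_te))
    n *= 2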
Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross validation (see the sketch below):
  - Partition data into k disjoint subsets
  - k-fold: train on k-1 partitions, test on the remaining one
  - Leave-one-out: k = n
- Stratified sampling: oversampling vs. undersampling
- Bootstrap: sampling with replacement
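A sketch of 10-fold cross-validation, assuming scikit-learn (data set and model are illustrative):

# Train on k-1 partitions, test on the held-out one, and average
# the k accuracy estimates.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=10).split(X, y):
    model = DecisionTreeClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print(np.mean(scores))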
ROC (Receiver Operating Characteristic)
- Developed in the 1950s for signal detection theory to analyze noisy signals.
- Characterizes the trade-off between positive hits and false alarms.
- The ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis).
- The performance of each classifier is represented as a point on the ROC curve; changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point.

ROC Curve
- Example: a 1-dimensional data set containing 2 classes (positive and negative); any point located at x > t is classified as positive.
- At threshold t: TPR = 0.5, FNR = 0.5, FPR = 0.12, TNR = 0.88.
ROC Curve
- Key points, written as (TP rate, FP rate):
  - (0, 0): declare everything to be the negative class
  - (1, 1): declare everything to be the positive class
  - (1, 0): ideal
- Diagonal line: random guessing
- Below the diagonal line: prediction is the opposite of the true class

Using ROC for Model Comparison
- No model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR.
- Area under the ROC curve:
  - Ideal: area = 1
  - Random guess: area = 0.5
How to Construct an ROC Curve
- Use a classifier that produces a posterior probability P(+|A) for each test instance.
- Sort the instances according to P(+|A) in decreasing order.
- Apply a threshold at each unique value of P(+|A).
- Count the number of TP, FP, TN, FN at each threshold:
  - TP rate, TPR = TP / (TP + FN)
  - FP rate, FPR = FP / (FP + TN)

  Instance   P(+|A)   True Class
  1          0.95     +
  2          0.93     +
  3          0.87     -
  4          0.85     -
  5          0.85     -
  6          0.85     +
  7          0.76     -
  8          0.53     +
  9          0.43     -
  10         0.25     +

How to Construct an ROC Curve (continued)

  Threshold >=  0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  TP             5     4     4     3     3     3     3     2     2     1     0
  FP             5     5     4     4     3     2     1     1     0     0     0
  TN             0     0     1     1     2     3     4     4     5     5     5
  FN             0     1     1     2     2     2     2     3     3     4     5
  TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
  FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

- Plotting the (FPR, TPR) pairs yields the ROC curve.
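A sketch that reproduces the (FPR, TPR) points for this example; note that the three instances tied at 0.85 collapse into a single threshold here, rather than being expanded into three columns as in the table:

# Sweep a threshold over the unique scores and compute (TPR, FPR).
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ['+', '+', '-', '-', '-', '+', '-', '+', '-', '+']
P = labels.count('+')
N = labels.count('-')

for t in sorted(set(scores)) + [1.00]:
    tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == '+')
    fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == '-')
    print(f"t={t:.2f}  TPR={tp / P:.1f}  FPR={fp / N:.1f}")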
Test of Significance
- Given two models:
  - Model M1: accuracy = 85%, tested on 30 instances
  - Model M2: accuracy = 75%, tested on 5000 instances
- Can we say M1 is better than M2?
  - How much confidence can we place on the accuracies of M1 and M2?
  - Can the difference in the performance measure be explained as a result of random fluctuations in the test set?
- Test the statistical significance of the observed deviation; estimate the confidence interval of a given model's accuracy.

Confidence Interval for Accuracy
- Prediction can be regarded as a Bernoulli trial.
  - A Bernoulli trial has 2 possible outcomes; for prediction: correct or wrong.
  - A collection of Bernoulli trials has a binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions.
  - Example: toss a fair coin 50 times; how many heads would turn up? Expected number of heads = N x p = 50 x 0.5 = 25.
- Given x (number of correct predictions), or equivalently acc = x/N, and N (number of test instances), can we predict p (the true accuracy of the model)?
Confidence Interval for Accuracy
- For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1-p)/N:

  P( Z_{α/2} < (acc - p) / sqrt(p(1-p)/N) < Z_{1-α/2} ) = 1 - α

- Confidence interval for p:

  p = ( 2 N acc + Z_{α/2}^2 ± Z_{α/2} sqrt(Z_{α/2}^2 + 4 N acc - 4 N acc^2) ) / ( 2 (N + Z_{α/2}^2) )

Confidence Interval for Accuracy
- Example: consider a model that produces an accuracy of 80% when evaluated on 100 test instances: N = 100, acc = 0.8. Let 1 - α = 0.95 (95% confidence); from the probability table, Z_{α/2} = 1.96.

  1-α:   0.99   0.98   0.95   0.90
  Z:     2.58   2.33   1.96   1.65

  N:          50     100    500    1000   5000
  p(lower):   0.670  0.711  0.763  0.774  0.789
  p(upper):   0.888  0.866  0.833  0.824  0.811
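The interval formula can be checked directly; a sketch (the helper name is illustrative) reproducing the N = 100 row of the table:

# Confidence interval for the true accuracy p, from the formula above.
from math import sqrt

def accuracy_interval(acc, n, z=1.96):
    center = 2 * n * acc + z**2
    spread = z * sqrt(z**2 + 4 * n * acc - 4 * n * acc**2)
    denom = 2 * (n + z**2)
    return (center - spread) / denom, (center + spread) / denom

print(accuracy_interval(0.8, 100))   # ~ (0.711, 0.867)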
Comparing Performance of 2 Models
- Given two models, say M1 and M2, which is better?
  - M1 is tested on D1 (size = n1), with error rate e1.
  - M2 is tested on D2 (size = n2), with error rate e2.
  - Assume D1 and D2 are independent.
  - If n1 and n2 are sufficiently large, then e1 ~ N(μ1, σ1) and e2 ~ N(μ2, σ2).
  - Approximate: σ̂_i^2 = e_i (1 - e_i) / n_i

Comparing Performance of 2 Models
- To test whether the performance difference is statistically significant: d = e1 - e2, with d ~ N(d_t, σ_t), where d_t is the true difference.
- Since D1 and D2 are independent, their variances add up:

  σ_t^2 = σ1^2 + σ2^2 ≈ σ̂1^2 + σ̂2^2 = e1(1 - e1)/n1 + e2(1 - e2)/n2

- At the (1 - α) confidence level: d_t = d ± Z_{α/2} σ̂_t
An Illustrative Example
- Given: M1: n1 = 30, e1 = 0.15; M2: n2 = 5000, e2 = 0.25.
- d = |e2 - e1| = 0.1 (2-sided test)

  σ̂_d^2 = 0.15(1 - 0.15)/30 + 0.25(1 - 0.25)/5000 = 0.0043

- At the 95% confidence level, Z_{α/2} = 1.96:

  d_t = 0.100 ± 1.96 x sqrt(0.0043) = 0.100 ± 0.128

- The interval contains 0, so the difference may not be statistically significant.

Comparing Performance of 2 Algorithms
- Each learning algorithm may produce k models:
  - L1 may produce M11, M12, ..., M1k
  - L2 may produce M21, M22, ..., M2k
- If the models are generated on the same test sets D1, D2, ..., Dk (e.g., via cross-validation):
  - For each set, compute d_j = e_{1j} - e_{2j}.
  - d_j has mean d_t and variance σ_t^2.
  - Estimate:

    σ̂_t^2 = Σ_{j=1}^{k} (d_j - d̄)^2 / (k(k - 1))

    d_t = d̄ ± t_{1-α, k-1} σ̂_t
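A quick check of the illustrative example's arithmetic:

# Significance test for the difference in error rates of M1 and M2
# at the 95% confidence level.
from math import sqrt

n1, e1 = 30, 0.15
n2, e2 = 5000, 0.25
d = abs(e2 - e1)
var_d = e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2    # ~0.0043
half_width = 1.96 * sqrt(var_d)                    # ~0.128
print(d - half_width, d + half_width)              # interval contains 0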