Lecture Notes for Chapter 4 Part III. Introduction to Data Mining

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4, Part III
Introduction to Data Mining by Tan, Steinbach, Kumar
Adapted by Qiang Yang (2010)
Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Practical Issues of Classification
- Underfitting and overfitting
- Missing values
- Costs of classification

Underfitting and Overfitting (Example)
- 500 circular and 500 triangular data points.
- Circular points: $0.5 \le \sqrt{x_1^2 + x_2^2} \le 1$
- Triangular points: $\sqrt{x_1^2 + x_2^2} < 0.5$ or $\sqrt{x_1^2 + x_2^2} > 1$
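The class regions of this example can be sketched in a few lines of Python. This is only an illustration: the function name `shape_of` is hypothetical, and it assumes the triangular class is exactly the complement of the ring that holds the circular points.

```python
import math

def shape_of(x1: float, x2: float) -> str:
    """Label a 2-D point by its distance from the origin."""
    r = math.hypot(x1, x2)  # sqrt(x1^2 + x2^2)
    # Assumption: circles lie in the ring 0.5 <= r <= 1, triangles everywhere else.
    return "circle" if 0.5 <= r <= 1.0 else "triangle"

print(shape_of(0.6, 0.0))  # a point inside the ring
print(shape_of(0.0, 0.0))  # a point inside the inner disc
print(shape_of(1.0, 1.0))  # a point outside the outer radius
```

No axis-parallel split separates a ring from its surroundings in one step, which is why a decision tree needs many nodes to fit this data.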

Underfitting and Overfitting
- Underfitting: when the model is too simple, both training and test errors are large.
- Overfitting: when the model is too complex, training error keeps decreasing while test error increases.

Overfitting due to Noise
- The decision boundary is distorted by a noise point.

Notes on Overfitting
- Overfitting results in decision trees that are more complex than necessary.
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- New ways of estimating errors are needed.

Estimating Generalization Errors
- Re-substitution errors: error on training data, $\sum e(t)$
- Generalization errors: error on test data, $\sum e'(t)$
- Methods for estimating generalization errors:
  - Optimistic approach: $e'(t) = e(t)$
  - Pessimistic approach:
    - For each leaf node: $e'(t) = e(t) + 0.5$
    - Total error count: $e'(T) = e(T) + N \times 0.5$ ($N$: number of leaf nodes)
    - For a tree with 30 leaf nodes and 10 errors on training (out of 1000 instances): training error = 10/1000 = 1%; generalization error = (10 + 30 × 0.5)/1000 = 2.5%
  - Reduced error pruning (REP): uses a validation data set to estimate generalization error
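The pessimistic estimate is simple enough to check numerically. The sketch below reproduces the 30-leaf example; the function name and the `penalty` parameter are illustrative choices, not part of the slides.

```python
def pessimistic_error_rate(training_errors, num_leaves, num_instances, penalty=0.5):
    """Pessimistic generalization error: add a fixed penalty for every leaf node."""
    return (training_errors + num_leaves * penalty) / num_instances

training_rate = 10 / 1000                            # optimistic (re-substitution) estimate
estimate = pessimistic_error_rate(10, 30, 1000)      # (10 + 30 * 0.5) / 1000
print(training_rate, estimate)
```

Setting `penalty=0` gives back the optimistic estimate, so the two approaches differ only in how much each leaf is charged.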

Occam's Razor
- Given two models with similar generalization errors, one should prefer the simpler model over the more complex one.
- A complex model has a greater chance of having been fitted accidentally to errors in the data.
- Therefore, one should include model complexity when evaluating a model.

Minimum Description Length (MDL)
[Figure: a table of records X1, ..., Xn with known labels y; a decision tree with tests A?, B?, C?; and the same records with unknown labels (y = ?).]
- Cost(Model, Data) = Cost(Data | Model) + Cost(Model)
- Cost is the number of bits needed for encoding.
- We should search for the least costly model.
- Cost(Data | Model) encodes the errors on training data.
- Cost(Model) estimates model complexity, or future error.

How to Address Overfitting in Decision Trees
- Pre-pruning (early stopping rule)
  - Stop the algorithm before it becomes a fully grown tree.
  - Typical stopping conditions for a node:
    - Stop if all instances belong to the same class.
    - Stop if all the attribute values are the same.
  - More restrictive conditions:
    - Stop if the number of instances is less than some user-specified threshold.
    - Stop if the class distribution of instances is independent of the available features (e.g., using a $\chi^2$ test).
    - Stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
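A minimal sketch of the first three stopping conditions follows. The function name `should_stop`, the representation of instances as attribute tuples, and the default threshold of 5 are assumptions made for illustration; the chi-square and impurity-gain conditions are left out because they depend on the split-search code.

```python
def should_stop(rows, labels, min_instances=5):
    """Decide whether to stop growing the tree at a node.

    rows   -- list of attribute tuples, one per instance at this node
    labels -- list of class labels, aligned with rows
    """
    if len(set(labels)) <= 1:        # all instances belong to the same class
        return True
    if len(set(rows)) <= 1:          # all attribute values are the same
        return True
    if len(labels) < min_instances:  # fewer instances than the user-specified threshold
        return True
    return False

print(should_stop([(1, 0), (0, 1)], ["yes", "yes"]))               # pure node
print(should_stop([(1, 0), (0, 1), (1, 1)], ["yes", "no", "yes"],
                  min_instances=2))                                # keep splitting
```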

How to Address Overfitting
- Post-pruning
  - Grow the decision tree to its entirety.
  - Trim the nodes of the decision tree in a bottom-up fashion.
  - If generalization error improves after trimming, replace the sub-tree by a leaf node.
  - The class label of the leaf node is determined from the majority class of instances in the sub-tree.
  - Heuristic: generalization error count = error count + 0.5 × N, where N is the number of leaf nodes. This heuristic is used in some algorithms, but there are other approaches based on statistics.

Post-Pruning Based on Leaves
Before splitting (a single leaf): Class = Yes: 20, Class = No: 10
- Training error (before splitting) = 10/30
- Pessimistic error (before splitting) = (10 + 1 × 0.5)/30 = 10.5/30
After splitting on A? into A1, A2, A3, A4:

              A1   A2   A3   A4
  Class=Yes    8    3    4    5
  Class=No     4    4    1    1

- Training error (after splitting) = 9/30
- Pessimistic error (after splitting) = (9 + 4 × 0.5)/30 = 11/30
Post-pruning decision: PRUNE!
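The pruning decision above can be reproduced with a short sketch. The helper names are hypothetical, and the rule "prune only when the pruned estimate is strictly smaller" is an assumption that matches the statement that a sub-tree is replaced when the generalization error improves.

```python
def leaf_errors(counts):
    """Training errors at a leaf that predicts its majority class."""
    return sum(counts) - max(counts)

def prune_decision(parent_counts, children_counts, penalty=0.5):
    """Compare error counts of the pruned node (one leaf) and the split node."""
    before = leaf_errors(parent_counts) + penalty  # single leaf
    after = sum(leaf_errors(c) for c in children_counts) + penalty * len(children_counts)
    return before < after, before, after

parent = (20, 10)                            # (Yes, No) before splitting
children = [(8, 4), (3, 4), (4, 1), (5, 1)]  # (Yes, No) in A1..A4
prune, before, after = prune_decision(parent, children)
print(prune, before, after)  # True 10.5 11.0
```

Calling `prune_decision(parent, children, penalty=0)` gives the optimistic comparison instead (10 errors against 9), under which the split would be kept.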

Examples of Post-pruning
Case 1: two children with (C0: 11, C1: 3) and (C0: 2, C1: 4)
Case 2: two children with (C0: 14, C1: 3) and (C0: 2, C1: 2)
- Optimistic error? Don't prune in either case (case 1: 7 errors as a leaf vs. 5 when split; case 2: 5 vs. 5, so pruning brings no improvement).
- Pessimistic error? Don't prune case 1 (7.5 vs. 6), prune case 2 (5.5 vs. 6).

Data Fragmentation
- The number of instances gets smaller as you traverse down the tree.
- The number of instances at the leaf nodes could be too small to make any statistically significant decision.
- Solution: require the number of instances per leaf node to be >= a user-given value n.

Decision Trees: Feature Construction
[Figure: the test x + y < 1 separating Class = + from Class = -.]
- A test condition may involve multiple attributes, but this is hard to automate!
- Finding better node test features is a difficult research issue.

Model Evaluation
- Metrics for Performance Evaluation: how to evaluate the performance of a model?
- Methods for Performance Evaluation: how to obtain reliable estimates?
- Methods for Model Comparison: how to compare the relative performance among competing models?

Metrics for Performance Evaluation
- Focus on the predictive capability of a model, rather than how long it takes to classify or build models, scalability, etc.
- Confusion matrix (counts or percentages):

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL   Class=Yes    a           b
  CLASS    Class=No     c           d

  a: TP (true positive)
  b: FN (false negative)
  c: FP (false positive)
  d: TN (true negative)

Metrics for Performance Evaluation

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL   Class=Yes    a (TP)      b (FN)
  CLASS    Class=No     c (FP)      d (TN)

- Most widely used metric:
  $\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$

Limitation of Accuracy
- Consider a 2-class problem:
  - Number of Class 0 examples = 9990
  - Number of Class 1 examples = 10
- If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%.
- Accuracy is misleading because the model does not detect any class 1 example.
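A quick numerical check makes the point concrete. In this sketch class 1 is treated as the positive class (an assumption for illustration), and the model predicts class 0 for every record.

```python
def accuracy(tp, fn, fp, tn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + fn + fp + tn)

# All 10 class-1 records are missed; all 9990 class-0 records are correct.
tp, fn, fp, tn = 0, 10, 0, 9990
print(accuracy(tp, fn, fp, tn))  # 0.999
print(tp / (tp + fn))            # 0.0, the fraction of class-1 records detected
```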

Cost Matrix

                        PREDICTED CLASS
  C(i|j)                Class=Yes    Class=No
  ACTUAL   Class=Yes    C(Yes|Yes)   C(No|Yes)
  CLASS    Class=No     C(Yes|No)    C(No|No)

- C(i|j): cost of misclassifying a class j example as class i
- Example applications: medical diagnosis, customer segmentation

Computing Cost of Classification

Cost matrix:
                    PREDICTED CLASS
  C(i|j)            +       -
  ACTUAL    +      -1     100
  CLASS     -       1       0

Confusion matrix, Model M1:
                    PREDICTED CLASS
                    +       -
  ACTUAL    +     150      40
  CLASS     -      60     250
  Accuracy = 80%, Cost = 3910

Confusion matrix, Model M2:
                    PREDICTED CLASS
                    +       -
  ACTUAL    +     250      45
  CLASS     -       5     200
  Accuracy = 90%, Cost = 4255
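The two cost figures can be recomputed by weighting each confusion-matrix cell by its cost. The sketch below stores each matrix as nested lists with rows for the actual class (+, -) and columns for the predicted class (+, -); the function names are illustrative.

```python
def total_cost(confusion, cost):
    """Sum of count * cost over the four (actual, predicted) cells."""
    return sum(confusion[i][j] * cost[i][j] for i in range(2) for j in range(2))

def accuracy(confusion):
    """Diagonal cells (correct predictions) over all cells."""
    correct = confusion[0][0] + confusion[1][1]
    return correct / sum(sum(row) for row in confusion)

cost = [[-1, 100], [1, 0]]
m1 = [[150, 40], [60, 250]]
m2 = [[250, 45], [5, 200]]

print(accuracy(m1), total_cost(m1, cost))  # 0.8 3910
print(accuracy(m2), total_cost(m2, cost))  # 0.9 4255
```

M2 is more accurate but more expensive, because each of its 45 false negatives costs 100.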

Information Retrieval Measures

                        PREDICTED CLASS
                        Class=Yes   Class=No
  ACTUAL   Class=Yes    a           b
  CLASS    Class=No     c           d

- Precision: $p = \frac{a}{a + c}$
- Recall: $r = \frac{a}{a + b}$
- F-measure: $F = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$
- Let C be the cost (it can be the count in our example):
  - Precision is biased towards C(Yes|Yes) and C(Yes|No).
  - Recall is biased towards C(Yes|Yes) and C(No|Yes).
  - F-measure is biased towards all except C(No|No).
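As a small check that the two forms of the F-measure agree, the sketch below computes precision, recall, and F from the cells a, b, c. The counts are those of Model M1 from the cost example (a = 150, b = 40, c = 60); the function name is illustrative, and the code assumes a > 0 so that no division by zero occurs.

```python
def precision_recall_f(a, b, c):
    """a = TP, b = FN, c = FP. Returns (precision, recall, F-measure)."""
    p = a / (a + c)
    r = a / (a + b)
    f = 2 * r * p / (r + p)
    return p, r, f

a, b, c = 150, 40, 60
p, r, f = precision_recall_f(a, b, c)
print(p, r, f)
print(2 * a / (2 * a + b + c))  # closed form of F, should match f
```

Note that d (the true negatives) never appears, which is why none of the three measures rewards correct "No" predictions.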

Methods of Estimation
- Holdout
  - Reserve 2/3 for training and 1/3 for testing.
- Cross validation
  - Partition the data into k disjoint subsets.
  - k-fold: train on k-1 partitions, test on the remaining one.
  - Leave-one-out: k = n
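The k-fold partitioning can be sketched as an index generator. This is a bare-bones illustration: the function name is hypothetical, and records are assigned to folds in interleaved order without shuffling, whereas a real experiment would usually shuffle (and often stratify) first.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k disjoint folds over range(n)."""
    folds = [list(range(i, n, k)) for i in range(k)]  # k disjoint subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for train, test in k_fold_indices(6, 3):
    print("train:", train, "test:", test)

# Leave-one-out is the special case k = n: every test fold has one record.
loo = list(k_fold_indices(4, 4))
```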

Test of Significance (Sections 4.5, 4.6 of TSK Book)
- Given two models:
  - Model M1: accuracy = 85%, tested on 30 instances
  - Model M2: accuracy = 75%, tested on 5000 instances
- Can we say M1 is better than M2?
  - How much confidence can we place on the accuracy of M1 and M2?
  - Can the difference in performance measure be explained as a result of random fluctuations in the test set?

Confidence Interval for Accuracy
- A prediction can be regarded as a Bernoulli trial.
  - A Bernoulli trial has 2 possible outcomes.
  - Possible outcomes for a prediction: correct or wrong.
  - A collection of Bernoulli trials has a Binomial distribution: $x \sim \text{Bin}(N, p)$, where x is the number of correct predictions.
  - E.g., toss a fair coin 50 times; how many heads would turn up? Expected number of heads = N × p = 50 × 0.5 = 25.
- Given x (the number of correct predictions), or equivalently acc = x/N, and N (the number of test instances), can we predict p (the true accuracy of the model)?

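Confidence Interval for Accuracy
- For large N, acc has a normal distribution with mean p and variance p(1-p)/N.
- Let $1 - \alpha$ be the confidence level:
  $P\left(Z_{\alpha/2} < \frac{\mathit{acc} - p}{\sqrt{p(1-p)/N}} < Z_{1-\alpha/2}\right) = 1 - \alpha$
[Figure: standard normal density; the area between $Z_{\alpha/2}$ and $Z_{1-\alpha/2}$ is $1 - \alpha$.]
- Confidence interval for p:
  $p = \frac{2N \cdot \mathit{acc} + Z_{\alpha/2}^2 \pm Z_{\alpha/2}\sqrt{Z_{\alpha/2}^2 + 4N \cdot \mathit{acc} - 4N \cdot \mathit{acc}^2}}{2\,(N + Z_{\alpha/2}^2)}$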

Confidence Interval for Accuracy
- Consider a model that produces an accuracy of 80% when evaluated on 100 test instances:
  - N = 100, acc = 0.8
  - Let $1 - \alpha = 0.95$ (95% confidence)
  - From the probability table, $Z_{\alpha/2} = 1.96$

  1-α     Z
  0.99    2.58
  0.98    2.33
  0.95    1.96
  0.90    1.65

  N          50      100     500     1000    5000
  p(lower)   0.670   0.711   0.763   0.774   0.789
  p(upper)   0.888   0.866   0.833   0.824   0.811
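The interval bounds in the table can be recomputed with the sketch below. It assumes the interval is obtained by solving the normal approximation for p, which gives a factor of Z in front of the square root; the function name and default `z=1.96` are illustrative.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Lower and upper bounds on the true accuracy p at the confidence level implied by z."""
    z2 = z * z
    centre = 2 * n * acc + z2
    spread = z * math.sqrt(z2 + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z2)
    return (centre - spread) / denom, (centre + spread) / denom

for n in (50, 100, 500, 1000, 5000):
    lower, upper = accuracy_interval(0.8, n)
    print(n, round(lower, 3), round(upper, 3))
```

The printed bounds agree with the table to within about 0.001, and the interval narrows around 0.8 as N grows, which is the point of the example.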

ROC (Receiver Operating Characteristic)
- See page 298 of the TSK book.
- Many applications care about ranking (giving a queue from the most likely to the least likely).
- Examples: which ranking order is better?
- ROC was developed in the 1950s for signal detection theory, to analyze noisy signals.
  - It characterizes the trade-off between positive hits and false alarms.
- The ROC curve plots the TP rate (on the y-axis) against the FP rate (on the x-axis).
- The performance of each classifier is represented as a point on the ROC curve.
  - Changing the threshold of the algorithm, the sample distribution, or the cost matrix changes the location of the point.

How to Construct an ROC Curve

  Instance   P(+|A)   True Class
  1          0.95     +
  2          0.93     +
  3          0.87     -
  4          0.85     -
  5          0.85     -
  6          0.85     +
  7          0.76     -
  8          0.53     +
  9          0.43     -
  10         0.25     +

  (P(+|A) is predicted by the classifier; True Class is the ground truth.)

- Use a classifier that produces a posterior probability P(+|A) for each test instance A.
- Sort the instances according to P(+|A) in decreasing order.
- Apply a threshold at each unique value of P(+|A).
- Count the number of TP, FP, TN, FN at each threshold.
  - TP rate: TPR = TP/(TP + FN)
  - FP rate: FPR = FP/(FP + TN)

How to Construct an ROC Curve

  Class          +     -     +     -     -     -     +     -     +     +
  Threshold >=   0.25  0.43  0.53  0.76  0.85  0.85  0.85  0.87  0.93  0.95  1.00
  TP             5     4     4     3     3     3     3     2     2     1     0
  FP             5     5     4     4     3     2     1     1     0     0     0
  TN             0     0     1     1     2     3     4     4     5     5     5
  FN             0     1     1     2     2     2     2     3     3     4     5
  TPR            1     0.8   0.8   0.6   0.6   0.6   0.6   0.4   0.4   0.2   0
  FPR            1     1     0.8   0.8   0.6   0.4   0.2   0.2   0     0     0

  (The three instances tied at 0.85 are removed one at a time, which is why 0.85 appears in three columns.)

[Figure: ROC curve plotting TPR against FPR for the thresholds above.]
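The counting procedure can be sketched directly from the ten scored instances. The function name `roc_points` is hypothetical. Unlike the table, this sketch applies one threshold per unique score, so the three instances tied at 0.85 produce a single row, and a final row with an infinite threshold stands in for the 1.00 column.

```python
def roc_points(scores, labels):
    """Return (threshold, TP, FP, TPR, FPR) for each unique score threshold."""
    pos = sum(1 for y in labels if y == "+")
    neg = len(labels) - pos
    rows = []
    for t in sorted(set(scores)):
        # An instance is predicted positive when its score is >= the threshold.
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "+")
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == "-")
        rows.append((t, tp, fp, tp / pos, fp / neg))
    rows.append((float("inf"), 0, 0, 0.0, 0.0))  # threshold above every score
    return rows

scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]

for t, tp, fp, tpr, fpr in roc_points(scores, labels):
    print(t, tp, fp, tpr, fpr)
```

TN and FN follow as (number of negatives - FP) and (number of positives - TP), so they are not stored separately.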

Using ROC for Model Comparison
- No model consistently outperforms the other:
  - M1 is better for small FPR.
  - M2 is better for large FPR.
- Area under the ROC curve (AUC):
  - Ideal: area = 1
  - Random guess: area = 0.5
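The area under a piecewise-linear ROC curve can be approximated with the trapezoidal rule. The sketch below is one simple way to do it, assuming the curve is given as (FPR, TPR) points that include (0, 0) and (1, 1); the function name `auc` is illustrative.

```python
def auc(points):
    """Trapezoidal area under a curve given as (FPR, TPR) points."""
    pts = sorted(points)  # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

print(auc([(0, 0), (0, 1), (1, 1)]))  # ideal classifier: 1.0
print(auc([(0, 0), (1, 1)]))          # random guessing (diagonal): 0.5
```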

ROC Curve
- Points given as (TP rate, FP rate):
  - (0,0): declare everything to be the negative class
  - (1,1): declare everything to be the positive class
  - (1,0): ideal
- Diagonal line: random guessing
- Below the diagonal line: the prediction is the opposite of the true class