Data Mining: Model Evaluation
April 16, 2013
Issues: Evaluating Classification Methods
- Accuracy
  - classifier accuracy: predicting class label
  - predictor accuracy: guessing value of predicted attributes
- Speed
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency in disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Predictor Error Measures
- Measure predictor accuracy: measure how far off the predicted value is from the actual known value
- Loss function: measures the error between the actual value $y_i$ and the predicted value $y_i'$
  - Absolute error: $|y_i - y_i'|$
  - Squared error: $(y_i - y_i')^2$
- Test error (generalization error): the average loss over the test set
  - Mean absolute error: $\frac{1}{d}\sum_{i=1}^{d}|y_i - y_i'|$
  - Mean squared error: $\frac{1}{d}\sum_{i=1}^{d}(y_i - y_i')^2$
  - Relative absolute error: $\frac{\sum_{i=1}^{d}|y_i - y_i'|}{\sum_{i=1}^{d}|y_i - \bar{y}|}$
  - Relative squared error: $\frac{\sum_{i=1}^{d}(y_i - y_i')^2}{\sum_{i=1}^{d}(y_i - \bar{y})^2}$
- The mean squared error exaggerates the presence of outliers
- Popularly used: the (square) root mean squared error; similarly, the root relative squared error
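To make the definitions concrete, here is a minimal sketch of these measures in Python/NumPy (not part of the original slides); the arrays y_true and y_pred in the example are hypothetical actual and predicted values:

```python
import numpy as np

def error_measures(y_true, y_pred):
    """Compute the predictor error measures defined above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    abs_err = np.abs(y_true - y_pred)           # |y_i - y_i'|
    sq_err = (y_true - y_pred) ** 2             # (y_i - y_i')^2
    mean_dev = np.abs(y_true - y_true.mean())   # |y_i - y_bar|, for the relative errors
    return {
        "mean_absolute_error": abs_err.mean(),
        "mean_squared_error": sq_err.mean(),
        "root_mean_squared_error": np.sqrt(sq_err.mean()),
        "relative_absolute_error": abs_err.sum() / mean_dev.sum(),
        "relative_squared_error": sq_err.sum() / (mean_dev ** 2).sum(),
        "root_relative_squared_error": np.sqrt(sq_err.sum() / (mean_dev ** 2).sum()),
    }

# Hypothetical example: actual vs. predicted values on a small test set
print(error_measures([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0]))
```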
Evaluating the Accuracy of a Classifier or Predictor (I)
- Holdout method
  - Given data is randomly partitioned into two independent sets
    - Training set (e.g., 2/3) for model construction
    - Test set (e.g., 1/3) for accuracy estimation
  - Random sampling: a variation of holdout
    - Repeat holdout k times; accuracy = avg. of the accuracies obtained
- Cross-validation (k-fold, where k = 10 is most popular; see the sketch after this list)
  - Randomly partition the data into k mutually exclusive subsets, each of approximately equal size
  - At the i-th iteration, use D_i as the test set and the others as the training set
  - Leave-one-out: k folds where k = # of tuples, for small-sized data
  - Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
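A minimal sketch of k-fold cross-validation in Python/NumPy, assuming a hypothetical train_fn(X, y) that returns a fitted model exposing a predict method:

```python
import numpy as np

def k_fold_cross_validation(X, y, train_fn, k=10, seed=0):
    """Randomly partition the data into k mutually exclusive folds of
    roughly equal size; at the i-th iteration, fold i is the test set
    and the remaining k-1 folds form the training set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))          # random partition of tuple indices
    folds = np.array_split(idx, k)
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])     # hypothetical training routine
        acc = np.mean(model.predict(X[test]) == y[test])
        accuracies.append(acc)
    return np.mean(accuracies)                   # avg. accuracy over the k folds
```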
Evaluating the Accuracy of a Classifier or Predictor (II)
- Bootstrap
  - Works well with small data sets
  - Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
- There are several bootstrap methods; a common one is the .632 bootstrap
  - Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since $(1 - 1/d)^d \approx e^{-1} = 0.368$)
  - Repeat the sampling procedure k times; overall accuracy of the model:
    $acc(M) = \frac{1}{k}\sum_{i=1}^{k}\left(0.632 \times acc(M_i)_{test\_set} + 0.368 \times acc(M_i)_{train\_set}\right)$
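A matching sketch of the .632 bootstrap under the same assumption of a hypothetical train_fn returning a model with a predict method:

```python
import numpy as np

def bootstrap_632(X, y, train_fn, k=10, seed=0):
    """.632 bootstrap: sample d tuples with replacement as the training
    set; tuples never drawn form the test set (~36.8% of the data)."""
    rng = np.random.default_rng(seed)
    d = len(y)
    scores = []
    for _ in range(k):
        train = rng.integers(0, d, size=d)            # sample d times with replacement
        test = np.setdiff1d(np.arange(d), train)      # tuples that never made it in
        if len(test) == 0:                            # extremely unlikely for realistic d
            continue
        model = train_fn(X[train], y[train])          # hypothetical training routine
        acc_test = np.mean(model.predict(X[test]) == y[test])
        acc_train = np.mean(model.predict(X[train]) == y[train])
        scores.append(0.632 * acc_test + 0.368 * acc_train)
    return np.mean(scores)                            # overall accuracy estimate
```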
Model Evaluation
- Metrics for Performance Evaluation
  - How to evaluate the performance of a model?
- Methods for Performance Evaluation
  - How to obtain reliable estimates?
- Methods for Model Comparison
  - How to compare the relative performance among competing models?
Metrics for Performance Evaluation
- Focus on the predictive capability of a model
  - rather than how long it takes to classify or build models, scalability, etc.
- Confusion matrix:

                          PREDICTED CLASS
                          Class=Yes    Class=No
  ACTUAL    Class=Yes     a (TP)       b (FN)
  CLASS     Class=No      c (FP)       d (TN)

- a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)
Metrics for Performance Evaluation

                          PREDICTED CLASS
                          Class=Yes    Class=No
  ACTUAL    Class=Yes     a (TP)       b (FN)
  CLASS     Class=No      c (FP)       d (TN)

- Most widely-used metric:
  $\text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN}$
Classifier Accuracy Measures

                         Predicted class
  classes             buy_computer=yes   buy_computer=no   total   recognition(%)
  buy_computer=yes          6954                46          7000        99.34
  buy_computer=no            412              2588          3000        86.27
  total                     7366              2634         10000        95.42

- Accuracy of a classifier M, acc(M): percentage of test-set tuples that are correctly classified by the model M
- Error rate (misclassification rate) of M = 1 - acc(M)
- Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the # of tuples in class i that are labeled by the classifier as class j
- Alternative accuracy measures (e.g., for cancer diagnosis):
  - sensitivity = TP/(TP + FN)  /* true positive recognition rate */
  - specificity = TN/(TN + FP)  /* true negative recognition rate */
- This model can also be used for cost-benefit analysis
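The measures above can be checked against the buy_computer matrix with a few lines of Python (the counts come straight from the table; everything else is arithmetic):

```python
# Counts from the buy_computer confusion matrix above
TP, FN = 6954, 46     # actual class = yes
FP, TN = 412, 2588    # actual class = no

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 0.9542
error_rate  = 1 - accuracy                      # 0.0458
sensitivity = TP / (TP + FN)                    # 0.9934, true positive recognition rate
specificity = TN / (TN + FP)                    # 0.8627, true negative recognition rate
print(accuracy, error_rate, sensitivity, specificity)
```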
Limitation of Accuracy
- Consider a 2-class problem
  - Number of Class 0 examples = 9990
  - Number of Class 1 examples = 10
- If the model predicts everything to be class 0, accuracy is 9990/10000 = 99.9%
- Accuracy is misleading because the model does not detect any class 1 example
Cost Matrix

                          PREDICTED CLASS
  C(i|j)                  Class=Yes     Class=No
  ACTUAL    Class=Yes     C(Yes|Yes)    C(No|Yes)
  CLASS     Class=No      C(Yes|No)     C(No|No)

- C(i|j): cost of misclassifying a class j example as class i
Computing Cost of Classification

  Cost matrix:
                      PREDICTED CLASS
  C(i|j)              +        -
  ACTUAL    +        -1      100
  CLASS     -         1        0

  Model M1 (Accuracy = 80%, Cost = 3910):
                      PREDICTED CLASS
                      +        -
  ACTUAL    +       150       40
  CLASS     -        60      250

  Model M2 (Accuracy = 90%, Cost = 4255):
                      PREDICTED CLASS
                      +        -
  ACTUAL    +       250       45
  CLASS     -         5      200
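A short Python check of the numbers above: multiply each confusion matrix elementwise by the cost matrix C(i|j) and sum, and take the trace over the total for accuracy.

```python
import numpy as np

# Cost matrix C(i|j) and the two confusion matrices from the slide;
# rows = actual class (+, -), columns = predicted class (+, -)
cost = np.array([[-1, 100],
                 [ 1,   0]])
M1 = np.array([[150,  40],
               [ 60, 250]])
M2 = np.array([[250,  45],
               [  5, 200]])

for name, cm in [("M1", M1), ("M2", M2)]:
    accuracy = np.trace(cm) / cm.sum()      # correct predictions / all predictions
    total_cost = (cm * cost).sum()          # elementwise product, then sum
    print(name, accuracy, total_cost)       # M1: 0.80, 3910; M2: 0.90, 4255
```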
Cost vs Accuracy

  Count:
                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL    Class=Yes        a           b
  CLASS     Class=No         c           d

  Cost:
                          PREDICTED CLASS
                          Class=Yes   Class=No
  ACTUAL    Class=Yes        p           q
  CLASS     Class=No         q           p

- Accuracy is proportional to cost if
  1. C(Yes|No) = C(No|Yes) = q
  2. C(Yes|Yes) = C(No|No) = p
- N = a + b + c + d
- Accuracy = (a + d)/N
- Cost = p(a + d) + q(b + c)
       = p(a + d) + q(N - a - d)
       = qN - (q - p)(a + d)
       = N[q - (q - p) × Accuracy]
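A quick numeric sanity check of this identity, using hypothetical confusion-matrix counts and the assumed costs p (correct prediction) and q (misclassification):

```python
# Verify Cost = N * [q - (q - p) * Accuracy] for uniform costs p and q
p, q = 0, 1                       # assumed costs: correct = 0, wrong = 1
a, b, c, d = 40, 10, 5, 45        # hypothetical confusion-matrix counts
N = a + b + c + d
accuracy = (a + d) / N
direct_cost = p * (a + d) + q * (b + c)        # cost summed cell by cell
closed_form = N * (q - (q - p) * accuracy)     # cost via the derived formula
assert direct_cost == closed_form              # both equal 15 here
```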
Cost-Sensitive Measures
- Precision: $p = \frac{a}{a + c}$
- Recall: $r = \frac{a}{a + b}$
- F-measure: $F = \frac{2rp}{r + p} = \frac{2a}{2a + b + c}$
- Precision is biased towards C(Yes|Yes) & C(Yes|No)
- Recall is biased towards C(Yes|Yes) & C(No|Yes)
- F-measure is biased towards all except C(No|No)
- Weighted accuracy: $\frac{w_1 a + w_4 d}{w_1 a + w_2 b + w_3 c + w_4 d}$
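A minimal sketch computing these three measures, reusing the buy_computer counts from the earlier slide as an illustration:

```python
def precision_recall_f(a, b, c):
    """a = TP, b = FN, c = FP (d = TN is not used by these measures)."""
    p = a / (a + c)              # precision
    r = a / (a + b)              # recall
    f = 2 * r * p / (r + p)      # F-measure, equivalently 2a / (2a + b + c)
    return p, r, f

# buy_computer example: precision ~0.944, recall ~0.993, F ~0.968
print(precision_recall_f(6954, 46, 412))
```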
Model Evaluation
- Metrics for Performance Evaluation
  - How to evaluate the performance of a model?
- Methods for Performance Evaluation
  - How to obtain reliable estimates?
- Methods for Model Comparison
  - How to compare the relative performance among competing models?
Methods for Performance Evaluation
- How to obtain a reliable estimate of performance?
- Performance of a model may depend on other factors besides the learning algorithm:
  - Class distribution
  - Cost of misclassification
  - Size of training and test sets
Learning Curve
- A learning curve shows how accuracy changes with varying sample size
- Requires a sampling schedule for creating the learning curve:
  - Arithmetic sampling (Langley et al.)
  - Geometric sampling (Provost et al.)
- Effect of small sample size:
  - Bias in the estimate
  - Variance of the estimate
Holdout Methods of Estimation
- Holdout: reserve 2/3 for training and 1/3 for testing
- Random subsampling: repeated holdout
- Cross-validation
  - Partition data into k disjoint subsets
  - k-fold: train on k-1 partitions, test on the remaining one
  - Leave-one-out: k = n
- Stratified sampling: oversampling vs. undersampling
- Bootstrap: sampling with replacement
Model Evaluation
- Metrics for Performance Evaluation
  - How to evaluate the performance of a model?
- Methods for Performance Evaluation
  - How to obtain reliable estimates?
- Methods for Model Comparison
  - How to compare the relative performance among competing models?
ROC (Receiver Operating Characteristic)
- Developed in the 1950s for signal detection theory to analyze noisy signals
  - Characterizes the trade-off between positive hits and false alarms
- ROC curve plots TP (on the y-axis) against FP (on the x-axis)
- Performance of each classifier is represented as a point on the ROC curve
  - Changing the threshold of the algorithm, sample distribution, or cost matrix changes the location of the point
ROC Curve
- 1-dimensional data set containing 2 classes (positive and negative)
- Any point located at x > t is classified as positive
- At threshold t: TP = 0.5, FN = 0.5, FP = 0.12, TN = 0.88
ROC Curve
- (TP, FP):
  - (0,0): declare everything to be negative class
  - (1,1): declare everything to be positive class
  - (1,0): ideal
- Diagonal line: random guessing
  - Below the diagonal line: prediction is opposite of the true class
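A minimal sketch of how such ROC points can be generated by sweeping the decision threshold over classifier scores (the scores and labels in the example are hypothetical):

```python
import numpy as np

def roc_points(scores, labels):
    """Sweep the threshold over all distinct scores and return the
    (FP, TP) rate pair for each threshold; labels are 1 (pos) / 0 (neg)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    P, N = (labels == 1).sum(), (labels == 0).sum()
    points = [(0.0, 0.0)]                        # declare everything negative
    for t in np.sort(np.unique(scores))[::-1]:   # thresholds, high to low
        pred = scores >= t                       # classify as positive above t
        tp = (pred & (labels == 1)).sum() / P    # TP rate (y-axis)
        fp = (pred & (labels == 0)).sum() / N    # FP rate (x-axis)
        points.append((fp, tp))
    return points                                # ends at (1,1): everything positive

# Hypothetical classifier scores, higher = more likely positive
print(roc_points([0.9, 0.8, 0.7, 0.4, 0.3], [1, 1, 0, 1, 0]))
```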
Using ROC for Model Comparison
- In general, no model consistently outperforms the other
  - M1 is better for small FPR
  - M2 is better for large FPR