Interpretation and evaluation
(P. Berka, 2012)

1. Descriptive tasks

Evaluation based on novelty, interestingness, usefulness and understandability.

Qualitative evaluation:
- obvious (common-sense) knowledge
- knowledge that corresponds to the knowledge of domain experts
- novel, interesting knowledge that brings new insight
- knowledge that must be further analysed by domain experts
- knowledge that contradicts the knowledge of domain experts

Quantitative evaluation:
- e.g. support and confidence of association rules
- not everything that is statistically significant is also interesting!
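As a rough illustration of the quantitative measures just mentioned, the sketch below computes the support and confidence of a rule A => B over a small list of transactions; the items and data are made up for illustration and are not from the lecture.

```python
# Minimal sketch: support and confidence of an association rule A => B,
# computed over a list of transactions (illustrative data).
def support(transactions, itemset):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # conf(A => B) = supp(A and B) / supp(A)
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)

transactions = [{"beer", "wine"}, {"beer"}, {"wine"}, {"beer", "wine"}]
rule_a, rule_b = {"beer"}, {"wine"}
print(support(transactions, rule_a | rule_b))    # 0.5
print(confidence(transactions, rule_a, rule_b))  # 0.666...
```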
Examples of discovered rules:
- IF the patient regularly consumed beer (up to 1 liter), THEN he regularly consumed wine
- IF the education of the patient was university AND his height was 176-180 cm, THEN the cause of death was tumor
- IF the patient has no hypercholesterolemia AND he sometimes follows his diet, THEN the patient will have no hypercholesterolemia within the next 40 months
- IF the patient regularly consumed beer, THEN the atherosclerosis risk decreases
- With increasing education the number of smokers decreases
2. Classification tasks

Evaluation based on classification accuracy on preclassified data.

Testing models:
- testing on the training set
- random split of the data into a training and a testing set
- cross-validation
- leave-one-out
- bootstrap (random selection with replacement to get the training set)
- testing on the testing set

The aim is to find out how often the classification agrees (resp. contradicts) with the correct class given in the data.

Confusion matrix:

                  Classifier
                  +       -
  Data   +       TP      FN
         -       FP      TN
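A minimal sketch of how the confusion-matrix counts can be obtained from actual and predicted labels; the label values and data are illustrative assumptions.

```python
# Minimal sketch: building a 2x2 confusion matrix from actual and predicted
# class labels ("+" / "-"); the data below are illustrative.
def confusion_matrix(actual, predicted, positive="+"):
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive: tp += 1
            else:             fn += 1
        else:
            if p == positive: fp += 1
            else:             tn += 1
    return tp, fn, fp, tn

actual    = ["+", "+", "-", "-", "+", "-"]
predicted = ["+", "-", "-", "+", "+", "-"]
print(confusion_matrix(actual, predicted))  # (2, 1, 1, 2)
```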
Evaluation using one/two numbers

Overall accuracy (or overall error):

  Acc = (TP + TN) / (TP + TN + FP + FN)
  Err = (FP + FN) / (TP + TN + FP + FN)

The overall accuracy lies in the interval [Acc_def, Acc_max], where
  Acc_def ... accuracy when classifying all examples into the majority class
  Acc_max ... maximal possible accuracy for the given data

Counting only the number of errors, or taking costs and benefits into account:
  error without costs:  Err = 1 - Acc
  error with costs:     Err = FP * c(p,n) + FN * c(n,p)

Cost and benefit (SAS EM demo)
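A small sketch of the accuracy and cost-weighted error formulas above; the counts and the cost values c(p,n), c(n,p) are illustrative assumptions.

```python
# Minimal sketch of overall accuracy and error with costs; the cost values
# passed in are illustrative assumptions.
def accuracy(tp, fn, fp, tn):
    return (tp + tn) / (tp + tn + fp + fn)

def error_with_costs(fp, fn, cost_fp, cost_fn):
    # Err = FP * c(p,n) + FN * c(n,p)
    return fp * cost_fp + fn * cost_fn

tp, fn, fp, tn = 2, 1, 1, 2
print(accuracy(tp, fn, fp, tn))            # overall accuracy
print(1 - accuracy(tp, fn, fp, tn))        # error without costs
print(error_with_costs(fp, fn, 1.0, 5.0))  # error with (assumed) costs
```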
Accuracy for each class separately:

  Acc+ = TP / (TP + FP)
  Acc- = TN / (TN + FN)

More suitable for unbalanced classes:

Precision and recall:
  precision = TP / (TP + FP)
  recall    = TP / (TP + FN)

Can be combined into the F-measure (harmonic mean):
  F = 2 * precision * recall / (precision + recall) = 2TP / (2TP + FP + FN)

Sensitivity and specificity:
  sensitivity = TP / (TP + FN)
  specificity = TN / (TN + FP)
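A minimal sketch of precision, recall and the F-measure computed directly from the confusion-matrix counts; the numbers are illustrative.

```python
# Minimal sketch of precision, recall and F-measure from confusion-matrix counts.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_measure(tp, fp, fn):
    # harmonic mean of precision and recall = 2TP / (2TP + FP + FN)
    return 2 * tp / (2 * tp + fp + fn)

tp, fn, fp, tn = 2, 1, 1, 2
print(precision(tp, fp), recall(tp, fn), f_measure(tp, fp, fn))
```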
Numeric classes (p_i ... predicted value, s_i ... actual value)

  MSE  = [ (p_1 - s_1)^2 + ... + (p_n - s_n)^2 ] / n
  RMSE = sqrt( [ (p_1 - s_1)^2 + ... + (p_n - s_n)^2 ] / n )
  MAE  = [ |p_1 - s_1| + ... + |p_n - s_n| ] / n
  RSE  = [ (p_1 - s_1)^2 + ... + (p_n - s_n)^2 ] / [ (s_1 - s')^2 + ... + (s_n - s')^2 ],
         where s' = (sum_i s_i) / n

  correlation coefficient = S_ps / sqrt(S_p^2 * S_s^2), where
         S_ps  = sum_i (p_i - p')(s_i - s') / (n - 1),
         S_p^2 = sum_i (p_i - p')^2 / (n - 1),
         S_s^2 = sum_i (s_i - s')^2 / (n - 1)
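A minimal sketch of the error measures for numeric classes; the predicted and actual values are illustrative.

```python
# Minimal sketch of MSE, RMSE, MAE and RSE; p = predicted, s = actual values
# (illustrative data).
from math import sqrt

def mse(p, s):
    return sum((pi - si) ** 2 for pi, si in zip(p, s)) / len(s)

def rmse(p, s):
    return sqrt(mse(p, s))

def mae(p, s):
    return sum(abs(pi - si) for pi, si in zip(p, s)) / len(s)

def rse(p, s):
    s_mean = sum(s) / len(s)
    return sum((pi - si) ** 2 for pi, si in zip(p, s)) / sum((si - s_mean) ** 2 for si in s)

p = [2.5, 0.0, 2.1, 7.8]
s = [3.0, -0.5, 2.0, 7.0]
print(mse(p, s), rmse(p, s), mae(p, s), rse(p, s))
```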
Evaluation in the form of graphs

Learning curve:
- relation between classification accuracy and the number of examples
- relation between classification accuracy and the number of iterations
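A rough sketch of how learning-curve points (accuracy vs. number of training examples) might be computed; the 1-nearest-neighbour base classifier and the data are illustrative assumptions, not the lecture's setup.

```python
# Minimal sketch of learning-curve points: accuracy of a simple 1-nearest-
# neighbour classifier on a fixed test set as the training set grows.
def one_nn(train, x):
    # predict the class of the nearest training example
    return min(train, key=lambda t: abs(t[0] - x))[1]

def learning_curve(train, test):
    points = []
    for size in range(1, len(train) + 1):
        subset = train[:size]
        acc = sum(one_nn(subset, x) == y for x, y in test) / len(test)
        points.append((size, acc))
    return points  # (training-set size, accuracy on the test set)

train = [(1.0, "-"), (5.0, "+"), (2.0, "-"), (6.0, "+"), (1.5, "-"), (5.5, "+")]
test  = [(1.2, "-"), (5.8, "+"), (2.2, "-"), (4.9, "+")]
print(learning_curve(train, test))
```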
Lift curve:
- relation between successful classification and the weight of classification

Return on Investment (ROI) curve:
- relation between benefit and the weight of classification
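One possible reading of the lift curve is a cumulative-gains computation: examples are sorted by the classifier's score (its "weight of classification") and the fraction of positives captured is tracked. This interpretation, the scores and the labels below are assumptions for illustration.

```python
# Minimal sketch: points of a cumulative-gains/lift-style curve. Examples are
# sorted by classifier score; for each prefix we record the fraction of all
# positives captured. Scores and labels are illustrative.
def lift_points(scores, labels):
    ranked = sorted(zip(scores, labels), key=lambda x: -x[0])
    total_pos = sum(labels)
    points, hits = [], 0
    for i, (_, label) in enumerate(ranked, start=1):
        hits += label
        points.append((i / len(ranked), hits / total_pos))
    return points  # (fraction of examples addressed, fraction of positives found)

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]   # 1 = positive class
print(lift_points(scores, labels))
```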
ROC curve:
- relation between the TP rate and the FP rate for different settings of the classifier

  TP% = TP / (TP + FN)
  FP% = FP / (FP + TN)

  TP% = sensitivity,  1 - FP% = specificity
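A minimal sketch of ROC-curve points obtained by varying the decision threshold over the classifier's scores; the scores and labels are illustrative.

```python
# Minimal sketch: (FP rate, TP rate) points of a ROC curve obtained by varying
# the decision threshold on the classifier's scores (illustrative data).
def roc_points(scores, labels):
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / neg, tp / pos))
    return points

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(roc_points(scores, labels))
```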
Variant (KEX):
- evaluate the reliability of classification using meta-learning; information about the correctness of classification is introduced as a new attribute
- dependency of correctness and of the number of classifications on the threshold; a decision is made only if the classification weight w reaches the threshold
Visualization

1. Visualization of models
- decision trees (MineSet)
- associations (Clementine)
Single rule: IF unemployed(no) THEN loan(yes)

                     loan(yes)   loan(no)
  unemployed(no)         5           1        6
  unemployed(yes)        3           3        6
                         8           4       12
2. Visualization of classifications
- general logic diagrams (Michalski)
- using GIS (KEX)
Model comparison

T-test:

  t(x,y) = (x' - y') / ( s(x,y) * sqrt(1/m + 1/n) ), where

  x' = (sum_i x_i) / m,  y' = (sum_i y_i) / n  and
  s^2(x,y) = [ (m-1) * s_x^2 + (n-1) * s_y^2 ] / (m + n - 2)

  Model A is better than model B iff t(Acc_A, Acc_B) > t(1 - alpha/2, m + n - 2)

ROC curves

Occam's razor (minimum description length, MDL)
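A small sketch of the t-statistic above applied to two lists of accuracies (e.g. per-fold accuracies of models A and B); the numbers are illustrative.

```python
# Minimal sketch of the two-sample t-statistic used for model comparison;
# the accuracy values below are illustrative.
from math import sqrt

def t_statistic(x, y):
    m, n = len(x), len(y)
    x_mean, y_mean = sum(x) / m, sum(y) / n
    s_x2 = sum((xi - x_mean) ** 2 for xi in x) / (m - 1)
    s_y2 = sum((yi - y_mean) ** 2 for yi in y) / (n - 1)
    s_pooled2 = ((m - 1) * s_x2 + (n - 1) * s_y2) / (m + n - 2)
    return (x_mean - y_mean) / (sqrt(s_pooled2) * sqrt(1 / m + 1 / n))

acc_a = [0.81, 0.79, 0.83, 0.80, 0.82]
acc_b = [0.76, 0.78, 0.77, 0.75, 0.79]
print(t_statistic(acc_a, acc_b))  # compare with the critical value t(1 - alpha/2, m + n - 2)
```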
Selecting the best algorithm

Characteristics of algorithms vs. characteristics of data:
- example representation, expressive power
- ability to process numeric/categorical data
- ability to process noisy or missing data
- ability to work with a cost matrix
- independence assumptions
- crisp or fuzzy classification

Empirical studies; meta-learning on top of the results of different algorithms

STATLOG (1991-1994):
- for large data, discriminant analysis (linear, quadratic) is suitable
- no great difference between standard and logistic discriminant analysis
- for large data the slowest method is k-nearest neighbor
- all TDIDT algorithms gave similar results; the criterion for selecting the splitting attribute does not play a great role
- neural networks gave good results for problems where no cost matrix was used

METAL (2000 - ): focus also on pre-processing
Model combination

Various variants of voting.

Bagging (bootstrap aggregating):
- several equally large training sets created by random selection with replacement (bootstrap)
- same learning algorithm
- all models have the same strength of vote

Boosting:
- the next model focuses on the data wrongly classified by the previous models
- same algorithm
- models in the sequence have an increasing strength of vote

AdaBoost algorithm (a runnable sketch follows this outline)

Learning:
1. Assign equal weight to all training examples.
2. For each iteration (model built):
   2.1. create a model
   2.2. compute the error err on the weighted data
   2.3. if err = 0 or err > 0.5, end
   2.4. for every example: if classified correctly, then weight w = w * err / (1 - err)
   2.5. normalize the weights of the examples (the sum of the new weights equals the sum of the old weights)

Classification:
1. Assign weight 0 to every class.
2. For each model: add the weight -log(err / (1 - err)) to the class predicted by the model.
3. Return the class with the highest weight.
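A compact sketch of the AdaBoost weighting scheme outlined above, using a trivial one-dimensional decision stump as the base learner; the stump, the helper names and the data are illustrative assumptions, not part of the lecture.

```python
# Compact sketch of the AdaBoost scheme above. The base learner is a trivial
# one-dimensional decision stump; data and helper names are illustrative.
from math import log

def train_stump(xs, ys, weights):
    # pick the threshold/polarity with the smallest weighted error
    best = None
    for threshold in xs:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (1 if polarity * x >= polarity * threshold else -1) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    err, threshold, polarity = best
    predict = lambda x: 1 if polarity * x >= polarity * threshold else -1
    return predict, err

def adaboost(xs, ys, iterations=5):
    weights = [1 / len(xs)] * len(xs)
    models = []
    for _ in range(iterations):
        predict, err = train_stump(xs, ys, weights)
        if err == 0 or err > 0.5:
            break
        # down-weight correctly classified examples, then renormalize
        weights = [w * (err / (1 - err)) if predict(x) == y else w
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
        models.append((predict, -log(err / (1 - err))))
    return models

def classify(models, x):
    # each model votes for its predicted class with weight -log(err/(1-err))
    votes = {}
    for predict, vote in models:
        votes[predict(x)] = votes.get(predict(x), 0) + vote
    return max(votes, key=votes.get)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [-1, 1, -1, 1, 1, 1]   # not separable by a single stump
models = adaboost(xs, ys)
print([classify(models, x) for x in xs])
```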
Stacking:
- various algorithms
- the reliability of the particular models is recognized using meta-learning

Weka Stacking
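The lecture points to Weka's Stacking; as a rough analogue, here is a minimal sketch using scikit-learn's StackingClassifier (an assumed substitute, with illustrative base learners and meta-learner).

```python
# Minimal sketch of stacking, assuming scikit-learn is available; the chosen
# base learners and meta-learner are illustrative.
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# level-0 models: different algorithms; level-1 (meta) model learns from their predictions
stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()), ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression(max_iter=1000),
)
print(cross_val_score(stack, X, y, cv=5).mean())
```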