Quick review: Data Mining Tasks... Classification [Predictive] Regression [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive]


1 Evaluation

2 Quick review: Data Mining Tasks... Classification [Predictive] Regression [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive]

3 Classification: Definition Given a collection of records (training set) Each record contains a set of attributes; one of the attributes is the class. Find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

4 Decision Tree Classification Task [Diagram: a training set of records (Tid, Attrib1, Attrib2, Attrib3, Class; Tids 1-10) is fed to a tree induction algorithm (Induction / Learn Model) to produce a decision tree model; the model is then applied (Deduction / Apply Model) to a test set (Tids 11-15, class unknown).]

5 Examples of Classification Task Prediction: predicting stock prices Predicting tumor cells as benign or malignant Recognizing anomalies: classifying credit card transactions as legitimate or fraudulent Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil Categorizing news stories as finance, weather, entertainment, sports, etc.

6 Change of orientation It is very hard to write programs to solve problems like recognizing a three-dimensional object in a scenario where lighting conditions change We do not know how it is done in our brain Even if we had an idea how to do it, the program might be horrendously complicated It is hard to write a program to compute the probability that a credit card transaction is fraudulent There may not be any rules that are both simple and reliable Fraud is a moving target, so the program needs to keep changing

7 Model Evaluation Metrics for Performance Evaluation: How to evaluate the performance of a model? Methods for Performance Evaluation: How to obtain reliable estimates? Methods for Model Comparison: How to compare the relative performance among competing models?

8 Model Evaluation Metrics for Performance Evaluation: How to evaluate the performance of a model? Methods for Performance Evaluation: How to obtain reliable estimates? Methods for Model Comparison: How to compare the relative performance among competing models?

9 Difference in Error costs For any prediction scheme there will be successes and failures (correct predictions and errors) Two kinds of successes: Correct predictions of positive, true positives, TP Correct predictions of negative, true negatives, TN Quite often, the cost (i.e., the benefit) of the two kinds of successes is taken to be the same You can deal with TP + TN together Two kinds of errors: Incorrect predictions of positive, false positives, FP Incorrect predictions of negative, false negatives, FN In virtually every applied situation the costs of false positives and false negatives materially differ

10 Difference in Error costs Consider a mailing to a predicted potential customer who doesn't respond Which type is it? What is the cost? Consider a mailing that was never sent to what would have been a customer Which type is it? What is the cost?

11 Confusion Matrix A graphical way of summarizing information about successes and failures in prediction The entries in the matrix are the counts of the different kinds of successes and failures for a given situation

                          PREDICTED CLASS
                          Class=Yes   Class=No
ACTUAL CLASS  Class=Yes       a           b
              Class=No        c           d

a: TP (true positive)  b: FN (false negative)  c: FP (false positive)  d: TN (true negative)

12 Confusion Matrix: Example

                     Predicted class
                     a     b     c   total
Actual class   a    88    10     2     100
               b    14    40     6      60
               c    18    10    12      40
        total      120    60    20     200

13 Evaluating performance with a confusion matrix The idea is that performance can be evaluated on the basis of how much better it is than what would be achieved by random results

14 Comparing vs random classifier Consider the previous table Is this a fair measure of overall success? How many agreements would you get by chance? The previous predictor predicted 120 a's, 60 b's and 20 c's What if you had a random predictor that predicted the same total number of each of the three classes but kept the original proportions? E.g. it divides the 100 actual a's keeping those proportions: 100*120/200 = 60, 100*60/200 = 30, 100*20/200 = 10

15 Evaluating performance: random classifier

Actual classifier:
                     Predicted class
                     a     b     c   total
Actual class   a    88    10     2     100
               b    14    40     6      60
               c    18    10    12      40
        total      120    60    20     200

Random classifier (same predicted totals, each row split in the proportion 120:60:20):
                     Predicted class
                     a     b     c   total
Actual class   a    60    30    10     100
               b    36    18     6      60
               c    24    12     4      40
        total      120    60    20     200

16 Evaluating performance: random classifier The random predictor is the point of comparison Can you make a quantitative comparison between the performance of the actual classification scheme and the random predictor? If your classification scheme isn't better than random, you've got a big problem. Your predictor is sort of an anti-predictor.

17 Comparing performances Sum the counts on the main diagonal of the matrix for the actual scheme It got 88 + 40 + 12 = 140 correct Do the same for the random predictor It got 60 + 18 + 4 = 82 correct Clearly the actual predictor is better Can this be quantified? There are 200 instances altogether The random predictor left 200 - 82 = 118 remaining incorrectly predicted The actual predictor got 140 - 82 = 58 more correct than the random predictor The actual predictor got 58/118 = 49.2% of those remaining instances correct

18 Kappa statistic This computation is known as the Kappa statistic. 0 means no better than random 1 means perfect prediction What would happen if the predictor is worse than a random predictor? Kappa statistic = (observed correct - correct expected by chance) / (total - correct expected by chance), e.g. (140 - 82) / (200 - 82) = 58/118 ≈ 0.49
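
A minimal R sketch of this computation, using the confusion matrix counts as reconstructed in the worked example above (the cell values are assumptions consistent with the stated totals):

```r
# Kappa from the confusion matrix of the previous slides (illustrative sketch).
conf <- matrix(c(88, 10, 2,
                 14, 40, 6,
                 18, 10, 12),
               nrow = 3, byrow = TRUE,
               dimnames = list(actual = c("a", "b", "c"),
                               predicted = c("a", "b", "c")))

n        <- sum(conf)                                 # 200 instances
observed <- sum(diag(conf))                           # 140 correct predictions
# Correct predictions expected by chance: row totals x column totals / n, summed over the diagonal
expected <- sum(rowSums(conf) * colSums(conf) / n)    # 82
kappa    <- (observed - expected) / (n - expected)    # (140 - 82) / (200 - 82) ~ 0.49
kappa
```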

19 Imbalance problem Data sets with imbalanced class distributions are quite common in many real applications (credit card fraud detection, manufacturing inspection, ...) The accuracy measure may not be well suited Several solutions to handle these problems are presented in the following slides

20 Exercise Analyze the results of the balance dataset with the J48 algorithm

21 Classification with costs Two cost matrices: one to include cost in the classifier, one to evaluate its performance Success rate is replaced by average cost per prediction To assign a class to a leaf: instead of selecting the majority, take into account the cost of successful and unsuccessful predictions Cost is given by the appropriate entry in the cost matrix When you use the default cost matrices you are simply counting

22 Exercise Repeat the analysis of the balance.arff data set by adding cost to the wrong classifications of class b. Keep increasing the cost until the misclassifications of b change significantly. Use the CostSensitiveClassifier

23 Bias to cost If you want accurate performance estimates, the distribution of classifications in the test set should match the distribution of classifications in the training set If there was a mismatch between training and test set some test set instances might not classify correctly The mismatch in the sets means that the training set was biased If it was biased against certain classifications, this suggests that it was biased in favor of other classifications Use sampling to bias the classification towards the underrepresented class

24 Bias to cost Suppose that false positives are more costly than false negatives Over-represent "no" instances in the training set When you run the classification algorithm on this training set, it will overtrain on "no" That means the rules derived by the algorithm will be more likely to reach a conclusion of "no" Incidentally, this will increase the number of false negatives However, it will decrease the number of false positives for these classifications In this way, cost has been taken into account in the rules derived, reducing the likelihood of expensive FP results

25 Sampling Oversampling: Intentionally bias the training set by increasing the representation of some classifications Replicate the minority examples The algorithm will produce rules more likely to correctly predict these classifications Some noise records may be replicated many times: overfitting Or create new records according to some criterion: distribution of values, neighbors, ... Undersampling: fewer records are considered for training Some useful negative examples may not be chosen Hybrid approach: ??
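
A minimal R sketch of random oversampling and undersampling (not SMOTE's synthetic generation); the data frame `d` and its columns are hypothetical example names, not from the slides:

```r
# Random over/undersampling on a hypothetical imbalanced data frame d with a factor column "class".
set.seed(1)
d <- data.frame(x = rnorm(100),
                class = factor(c(rep("maj", 90), rep("min", 10))))

min_rows <- which(d$class == "min")
maj_rows <- which(d$class == "maj")

# Oversampling: replicate minority examples (sampling with replacement)
over  <- d[c(maj_rows, sample(min_rows, length(maj_rows), replace = TRUE)), ]

# Undersampling: keep only a random subset of the majority class
under <- d[c(sample(maj_rows, length(min_rows)), min_rows), ]

table(over$class); table(under$class)
```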

26 Undersampling and oversampling

27 Weights in Weka As of Weka 3.5.8 or later, a weight can be associated with an instance in a standard ARFF file by appending it to the end of the line for that instance and enclosing the value in curly braces, e.g.: 0, X, 0, Y, "class A", {5}

28 SMOTE Normally used in combination with random undersampling Undersampling the majority class Oversampling the minority class Uses k nearest neighbors to generate new values (k=5) In the SMOTE paper they analyze several combinations of undersampling and oversampling There is no golden rule for how much oversampling and undersampling to apply

29 SMOTE

30 SMOTE

31 Exercise Apply undersampling of x% to the majority classes of the balance dataset and then apply SMOTE to the minority class Use Resample to sample without replacement, biasing towards uniform classes (1.0) and generating a sample of 70% of the original size. Use SMOTE to oversample the minority class up to approximately the same number of instances as the majority classes

32 Lift charts In practice, costs are rarely known Decisions are usually made by comparing possible scenarios Example: promotional mailout to 1,000,000 households Mail to all; 1,000 respond (0.1%) Identify a subset of the 400,000 most promising; here the number of responses is 800 (0.2%) A lift chart allows a visual comparison in order to identify subpopulation samples with a greater likelihood of "yes" Needs a learning scheme that outputs probabilities The increase in the response rate is known as the lift factor

33 Lift factor This idea is summarized in the following table:

Sample       Yeses   Response Rate   Lift Factor
1,000,000    1,000        0.1%            1
  400,000      800        0.2%            2
  100,000      400        0.4%            4

34 Finding lift factors Given a learning scheme that outputs probabilities for the predicted class Rank all of the instances by their probability and keep track of their actual class Note that we're operating under the (reasonable) assumption that the data mining algorithm really works The higher the predicted probability, the more likely an instance really is to take on a certain value You can find the lift factor for a given sample size, or you can find the sample size for a given lift factor
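
A small R sketch of this ranking procedure; `prob` and `actual` are hypothetical example vectors standing in for a learner's scores and the true responses:

```r
# Lift factor for the top fraction of instances ranked by predicted probability of "yes".
set.seed(1)
n      <- 10000
actual <- rbinom(n, 1, 0.1)                          # 1 = yes response
prob   <- runif(n) * 0.5 + actual * runif(n) * 0.5   # scores loosely related to the class

lift_at <- function(prob, actual, fraction) {
  ord <- order(prob, decreasing = TRUE)              # rank by predicted probability
  top <- ord[seq_len(round(fraction * length(ord)))]
  mean(actual[top]) / mean(actual)                   # response rate in sample / overall rate
}

lift_at(prob, actual, 0.4)   # lift factor for the most promising 40%
```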

35 Finding lift factors

36 Lift Chart

37 Lift Chart If the algorithm is any good at all, the curve should be above the diagonal Otherwise, the algorithm is determining the "yes" probability for instances with a success rate lower than a random sample In general, the closer the lift chart curve comes to the upper left hand corner, the better A minimal sample size and a maximum response rate is good The hypothetical ideal would be a mailing only to those who would respond yes, namely a 100% success rate, with no one left out (no false negatives)

38 Cost Benefit Curves The costs and benefits of differing sample sizes/mailing scenarios will differ E.g. the mailing scenario: The cost of interest now is the cost of mailing an individual item We'll assume that the cost is constant per item The benefit of interest is the value of the business generated per "yes" response We'll assume that the benefit is constant per positive response It is now possible to form a cost/benefit curve across the same domain (percent of sample size) as the lift chart

39 Cost Benefit curves

40 Exercise With the balance data set, analyze the cost/benefit for the L class if the benefit of achieving a true positive would be 15 and the cost of a false positive 7

41 ROC Curves Stands for receiver operating characteristic Used to show the tradeoff between hit rate and false alarm rate over a noisy channel Similar to lift charts It provides a way of drawing conclusions about one data mining algorithm Requires that the algorithm returns a numeric value Analyzes the evolution of the TPR and FPR with different thresholds Differences to the lift chart: y axis shows the percentage of true positives in the sample rather than the absolute number, TPR = 100*TP/(TP+FN); x axis shows the percentage of false positives in the sample rather than the sample size, FPR = 100*FP/(FP+TN)

42 ROC Curve Points (TPR, FPR): (0,0): declare everything to be the negative class (1,1): declare everything to be the positive class (1,0): ideal Diagonal line: random guessing Below the diagonal line: the prediction is the opposite of the true class

43 How to construct a ROC curve Rank the instances by their predicted score; for each threshold, predict positive for instances with score >= threshold, count TP, FP, TN and FN, and compute TPR and FPR; plotting the (FPR, TPR) pairs over all thresholds yields the ROC curve
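
A minimal R sketch of this construction; `score` and `class` are hypothetical example vectors (class: 1 = positive), not the values from the slide:

```r
# TPR and FPR at each score threshold.
score <- c(0.95, 0.93, 0.87, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25)
class <- c(1, 1, 0, 1, 0, 1, 0, 0, 1)

roc_points <- t(sapply(sort(unique(score), decreasing = TRUE), function(thr) {
  pred <- as.numeric(score >= thr)           # predict positive at or above the threshold
  TP <- sum(pred == 1 & class == 1); FP <- sum(pred == 1 & class == 0)
  FN <- sum(pred == 0 & class == 1); TN <- sum(pred == 0 & class == 0)
  c(threshold = thr, TPR = TP / (TP + FN), FPR = FP / (FP + TN))
}))
roc_points
plot(roc_points[, "FPR"], roc_points[, "TPR"], type = "b",
     xlab = "FPR", ylab = "TPR")              # the ROC curve
```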

44 How is the ROC Curve of this data?

45 ROC Curve The area under the curve represents the probability that a randomly chosen negative example will have a smaller estimated probability of belonging to the positive class than a randomly chosen positive example

46 ROC Curve

47 Exercise Obtain the ROC curve for the L class value and the following algorithms: IBK (k=3), NaiveBayes and J48 Visualize the B class also and compare results

48 Recall and precision The general idea behind lift charts and ROC curves is trade-off They are measures of good outcomes vs. unsuccessful outcomes Trying to measure the tradeoff between desirable and undesirable outcomes occurs in many different problem domains In the area of information retrieval, two measures are used: recall and precision

49 Recall and precision Recall = TP/(TP+FN) Precision = TP/(TP+FP) General idea: You can increase the number of relevant documents you retrieve by increasing the total number of documents you retrieve But as you do so, the proportion of relevant documents falls

50 Recall and precision Consider the extreme case How would you be guaranteed to always retrieve all relevant documents?

51 Sensitivity and specificity Medical testing domain Similar idea For a given medical test: Sensitivity = proportion of people with the disease who test positive Specificity = proportion of people without the disease who test negative Sensitivity = TP/(TP+FN) (Recall/TPR) Specificity = TN/(TN+FP) For both measures high is good

52 Measures explained Sensitivity/recall/TPR (TP/(TP+FN)): How good a test is at detecting the positives. How can a test cheat this measure? Specificity/1-FPR (TN/(TN+FP)): How good a test is at avoiding false alarms. How can a test cheat this measure? Precision (TP/(TP+FP)): How many of the positively classified were relevant. How can a test cheat this measure? False positive rate/1-Specificity (FP/(FP+TN)): How many negatives are detected as positive. How can a test cheat this measure? Accuracy: (TP+TN)/(TP+TN+FP+FN) or (TP+TN)/N Error Rate: (FP+FN)/(FP+FN+TP+TN) or (FP+FN)/N
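
These measures are simple ratios of the four confusion-matrix counts; a short R sketch with hypothetical example counts:

```r
# The measures on this slide computed from raw TP/FP/TN/FN counts (hypothetical values).
TP <- 70; FN <- 30; FP <- 20; TN <- 80
N  <- TP + FN + FP + TN

sensitivity <- TP / (TP + FN)   # recall / TPR
specificity <- TN / (TN + FP)
precision   <- TP / (TP + FP)
fpr         <- FP / (FP + TN)   # 1 - specificity
accuracy    <- (TP + TN) / N
error_rate  <- (FP + FN) / N

c(sensitivity = sensitivity, specificity = specificity, precision = precision,
  FPR = fpr, accuracy = accuracy, error_rate = error_rate)
```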

53 One figure measures In addition to 2-dimensional graphs there are also techniques for trying to express the goodness of a scheme in a single number by combining several measures For example, in information retrieval there is the concept of average recall measures: averages of precision over several recall values Three-point average recall is the average precision for recall figures of 20%, 50%, and 80% Eleven-point average recall is the average precision for recall figures of 0%-100% in steps of 10% F-measure: 1/F = (1/2)(1/recall + 1/precision), i.e. F = 2*precision*recall / (precision + recall) = 2TP / (2TP + FN + FP) Success rate/accuracy: (TP+TN)/(TP+FN+TN+FP) Error Rate: (FP+FN)/(FP+FN+TP+TN) or (FP+FN)/N

54 Area under ROC Curve (AUC) To summarize ROC curves in a single quantity Roughly speaking, the larger the better It represents the probability that a randomly chosen negative example will have a smaller estimated probability of belonging to the positive class than a randomly chosen positive example Perfect value = 1.0 Random guessing = 0.5
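
That probabilistic reading of the AUC can be computed directly by comparing positive and negative scores pairwise; a small R sketch with hypothetical example data:

```r
# AUC estimated as the probability that a randomly chosen positive gets a
# higher score than a randomly chosen negative (ties count 1/2).
set.seed(1)
class <- rbinom(200, 1, 0.4)                # 1 = positive
score <- runif(200) + 0.5 * class           # scores loosely correlated with the class

pos <- score[class == 1]
neg <- score[class == 0]
pairs <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc <- mean(pairs)
auc                                         # ~0.5 = random guessing, 1.0 = perfect ranking
```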

55 Model Evaluation Metrics for Performance Evaluation: How to evaluate the performance of a model? Methods for Performance Evaluation: How to obtain reliable estimates? Methods for Model Comparison: How to compare the relative performance among competing models?

56 Training and testing Natural performance measure for classification problems: error rate Success: instance's class is predicted correctly Error: instance's class is predicted incorrectly Error rate: proportion of errors made over the whole set of instances Resubstitution error: error rate obtained from training data Resubstitution error is (hopelessly) optimistic!

57 Training and testing II Test set: independent instances that have played no part in the formation of the classifier Assumption: both training data and test data are representative samples of the underlying problem Test and training data may differ in nature

58 Splitting the data Once evaluation is complete, all the data can be used to build the final classifier Generally, the larger the training data the better the classifier (but returns diminish) The larger the test data the more accurate the error estimate Holdout procedure: method of splitting original data into training and test set Dilemma: ideally both training set and test set should be large!

59 Holdout estimation What to do if the amount of data is limited? The holdout method reserves a certain amount for testing and uses the remainder for training Usually: one third for testing, the rest for training Problem: the samples might not be representative Example: a class might be missing in the test data Advanced version uses stratification Ensures that each class is represented with approximately equal proportions in both subsets
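
A minimal R sketch of a stratified holdout split (one third for testing), sampling within each class so the class proportions are preserved; the built-in iris data is used only as a convenient stand-in dataset:

```r
# Stratified holdout: draw one third of each class for the test set.
set.seed(1)
test_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                          function(rows) sample(rows, length(rows) %/% 3)))
test  <- iris[test_idx, ]
train <- iris[-test_idx, ]
table(train$Species); table(test$Species)   # roughly equal class proportions in both subsets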

60 Repeated holdout method Holdout estimate can be made more reliable by repeating the process with different subsamples In each iteration, a certain proportion is randomly selected for training (possibly with stratification) The error rates on the different iterations are averaged to yield an overall error rate This is called the repeated holdout method Still not optimal: the different test sets overlap Can we prevent overlapping?

61 Cross-validation Cross-validation avoids overlapping test sets First step: split data into k subsets of equal size Second step: use each subset in turn for testing, the remainder for training Called k-fold cross-validation

62 10 fold cross-validation Standard method for evaluation: stratified ten-fold cross-validation The data is divided randomly into 10 parts in which the class is represented in approximately the same proportions Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate There is also some theoretical evidence for this Stratification reduces the estimate's variance Lots of data? Use percentage split Else stratified 10-fold cross-validation
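
A minimal R sketch of (unstratified) 10-fold cross-validation of an error rate; the 1-nearest-neighbour classifier from the class package and the iris data are stand-ins, any learning scheme could be plugged in:

```r
# 10-fold cross-validation of an error rate.
library(class)                                        # provides knn()
set.seed(1)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(iris)))    # random fold assignment

errs <- sapply(1:k, function(f) {
  test  <- iris[folds == f, ]
  train <- iris[folds != f, ]
  pred  <- knn(train[, 1:4], test[, 1:4], cl = train$Species, k = 1)
  mean(pred != test$Species)                          # error rate on the held-out fold
})
mean(errs)                                            # cross-validated error estimate
```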

63 Leave-one-out cross-validation This is basically n-fold cross-validation taken to the max For a data set with n instances, you hold out one for testing and train on the remaining (n - 1) You do this for each of the instances and then average the results Advantages: It's deterministic: there's no sampling In a sense, you maximize the information you can squeeze out of the data set Disadvantages: It's computationally intensive By definition, a holdout of 1 can't be stratified

64 X-fold CV vs leave-one-out The Elements of Statistical Learning, Chapter 7: "On the other hand, leave-one-out cross-validation has low bias but can have high variance. Overall, five- or tenfold cross-validation are recommended as a good compromise: see Breiman and Spector (1992) and Kohavi (1995)"

65 The bootstrap This term comes from the phrase "pulling oneself up by one's bootstraps", a metaphor for accomplishing a task without any outside help Originally used to estimate a parameter from only a sample Idea: Take a bootstrap sample (a random sample taken with replacement from the original sample, of the same size) Calculate the bootstrap statistic on the bootstrap sample Repeat these steps many times to create a bootstrap distribution

66 The bootstrap CV uses sampling without replacement The same instance, once selected, cannot be selected again for a particular training/test set The bootstrap uses sampling with replacement to form the training set Sample a dataset of n instances n times with replacement to form a new dataset of n instances Use this data as the training set Use the instances from the original dataset that don't occur in the new training set for testing

67 Bootstrap A particular instance has a probability of 1 - 1/n of not being picked in a single draw Thus its probability of ending up in the test data (i.e., never being picked for the training set) is (1 - 1/n)^n ≈ e^(-1) ≈ 0.368 This means the training data will contain approximately 63.2% of the instances
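
A quick numerical check of this figure in R:

```r
# The chance that an instance is never picked in n draws with replacement converges to e^-1.
n <- 1000
(1 - 1/n)^n       # ~0.368: probability of ending up in the test data
exp(-1)
1 - (1 - 1/n)^n   # ~0.632: probability of appearing in the bootstrap training set
```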

68 The bootstrap The error estimate on the test data will be very pessimistic: the model was trained on just ~63% of the instances Therefore, combine it with the resubstitution error: err = 0.632 * e(test instances) + 0.368 * e(training instances) The resubstitution error gets less weight than the error on the test data Repeat the process several times with different replacement samples; average the results

69 More on the bootstrap Probably the best way of estimating performance for very small datasets (<30) However, it has some problems Compared to basic CV, the bootstrap increases the variance that can occur in each fold [Efron and Tibshirani, 1993] It could be argued that this is desirable since it is more realistic Consider a random dataset with two classes of equal size True error rate is 50% for any prediction rule A perfect memorizer will achieve 0% resubstitution error and ~50% error on test data Bootstrap estimate for this classifier: err = 0.632*50% + 0.368*0% = 31.6% Misleadingly optimistic

70 Model Evaluation Metrics for Performance Evaluation: How to evaluate the performance of a model? Methods for Performance Evaluation: How to obtain reliable estimates? Methods for Model Comparison: How to compare the relative performance among competing models?

71 Comparing models Frequent question: which of two learning schemes performs better? Note: this is domain dependent! Obvious way: compare 10-fold CV estimates Generally sufficient in applications (we don't lose much if the chosen method is not truly better) However, what about machine learning research? Need to show convincingly that a particular method works better

72 Comparing models Want to show that scheme A is better than scheme B in a particular domain For a given amount of training data On average, across all possible training sets However, just using the mean values is not enough

73 Hypothesis testing In inferential statistics sample data are employed in two ways to draw inferences about one or more populations: hypothesis testing and estimation of population parameters Hypothesis testing is a procedure in which sample data are employed to evaluate a hypothesis In order to evaluate a research hypothesis, it is restated within the framework of two statistical hypotheses Null hypothesis (H0): statement of no effect or no difference Alternative hypothesis (H1): statistical statement indicating the presence of an effect or a difference

74 Hypothesis testing These types of tests allow a researcher to determine whether or not the result of a study is statistically significant This implies that one is determining whether an obtained difference in an experiment is likely to be due to chance or due to the presence of a genuine experimental effect Normally a statistic (a measure/characteristic of a sample) is obtained and compared in reference to a sampling distribution: the theoretical distribution of all the possible values the test statistic can assume

75 Analyzing the statistic value

76 Hypothesis testing Scientific convention has established that in order to declare a difference statistically significant there should be no more than a 5% likelihood that the difference is due to chance (1% in medical domains) Within the framework of hypothesis testing it is possible for a researcher to commit two types of errors Type I: when a true null hypothesis is rejected One concludes that there are differences when in reality there are none Represented by α Type II: when a false null hypothesis is retained One concludes that a true alternative hypothesis is false Represented by β Power = 1 - β They are inversely related: as one decreases the other one increases

77

78 α vs β What does the graphical representation of α vs β look like?

79 α vs β

80 Paired t-test In practice we have limited data and a limited number of estimates for computing the mean Student's t-test tells whether the means of two samples are significantly different In our case the samples are cross-validation estimates for different datasets from the domain Use a paired t-test because the individual samples are paired We test the null hypothesis H0: µ1 = µ2 against H1: µ1 ≠ µ2

81 T-test Used with interval/ratio data Used when the researcher does not know the value of the population standard deviation and must estimate it by computing the sample standard deviation, as opposed to the z-test The t distribution is used, in contrast to the normal distribution of the z-test Assumptions: The sample has been randomly selected from the population it represents The distribution of the data in the underlying population the sample represents is normal Homoscedasticity of variances: the relationship between the X and Y variables is of equal strength across the whole range

82 T-test for paired samples Used when there are two samples that have been matched or paired If we are comparing two models constructed from the same dataset the samples are matched We are going to compute a statistic based on the difference between the samples

83 Standard Error of the mean Sampling distribution of the sample mean: the distribution of the mean of a sample over all possible samples of a given size from the same population The standard error of the mean is the standard deviation of the sample-mean estimate of a population mean It is usually estimated by using the sample standard deviation divided by the square root of the sample size: SE = ŝ/√n, where SD = √((ΣX² - (ΣX)²/n)/n) and the estimated population standard deviation is ŝ = SD·√(n/(n-1))

84 Distribution of the differences Let m_d = m_x - m_y The difference of the means (m_d) has a Student's distribution with k - 1 degrees of freedom Let σ_d² be the variance of the difference The standardized version of m_d is called the t-statistic: t = (m_d - 0) / √(σ_d² / k) We use t to perform the t-test
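
A short R sketch computing this paired t statistic by hand and comparing it with t.test(); x and y are hypothetical per-fold accuracy estimates for two schemes:

```r
# Paired t statistic by hand vs. t.test(..., paired = TRUE).
x <- c(0.81, 0.79, 0.84, 0.78, 0.82, 0.80, 0.83, 0.77, 0.80, 0.82)  # scheme A
y <- c(0.78, 0.77, 0.80, 0.76, 0.81, 0.79, 0.80, 0.75, 0.78, 0.79)  # scheme B

d      <- x - y
k      <- length(d)
t_stat <- mean(d) / sqrt(var(d) / k)     # t = m_d / sqrt(sigma_d^2 / k)
t_stat
2 * pt(-abs(t_stat), df = k - 1)         # two-sided p-value with k - 1 degrees of freedom

t.test(x, y, paired = TRUE)              # reports the same t and p-value
```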

85 Student's distribution With small samples (k < 100) the mean follows Student's distribution with k - 1 degrees of freedom With infinite degrees of freedom the distribution is the same as the normal distribution

86 Student's distribution

87 Performing the test Fix a significance level If a difference is significant at the a% level, there is a (100-a)% chance that the true means differ Look up the value for z that corresponds to the significance level If t ≥ z or t ≤ -z then the difference is significant, i.e. the null hypothesis (that the difference is zero) can be rejected

88 Non-parametric tests What to do when the distribution does not satisfy the t-test requirements? Normality: Shapiro test Homoscedasticity of variances (the relationship between the X and Y variables is of equal strength across the whole range; the values are not spread equally): Levene test If the assumptions do not hold, the probability of obtaining a spuriously significant result may be greater Use non-parametric tests: they do not assume any distribution

89 Wilcoxon test Good results when no distribution can be assumed Null hypothesis: the median of the difference scores equals zero, H0: θ_D = 0, H1: θ_D ≠ 0 Directional alternative hypotheses: H1: θ_D > 0 or H1: θ_D < 0

90 Wilcoxon test: Computation First, compute the differences d_i between the performance scores of the two classifiers on the i-th out of N data sets The differences are ranked according to their absolute values Average ranks are assigned in case of ties Let R+ be the sum of ranks for the data sets on which the second algorithm outperformed the first Let R- be the sum of ranks for the opposite Ranks of d_i = 0 are split evenly (if there is an odd number one is ignored) Let T be the smaller of the sums, T = min(R+, R-)

91 Wilcoxon test For N up to 25 there are tables of exact critical values; T must be equal to or less than the tabled critical value For larger N, the statistic z = (T - N(N+1)/4) / √(N(N+1)(2N+1)/24) is distributed approximately normally
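
In R the whole procedure is available as wilcox.test; a sketch on the same hypothetical paired accuracy vectors as above:

```r
# Wilcoxon signed-rank test on paired performance scores.
x <- c(0.81, 0.79, 0.84, 0.78, 0.82, 0.80, 0.83, 0.77, 0.80, 0.82)
y <- c(0.78, 0.77, 0.80, 0.76, 0.81, 0.79, 0.80, 0.75, 0.78, 0.79)

# Note: the R function is wilcox.test; with tied differences it warns and
# falls back to the normal approximation.
wilcox.test(x, y, paired = TRUE)
```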

92 Wilcoxon test: Example

93 Wilcoxon test: Example

94 Analysis of variance In a set of k independent samples, do at least two of the samples represent populations with different mean values? If the computed test statistic is significant, it indicates there is a significant difference between at least two of the sample means in the set of k means: there is a high likelihood that at least two of the samples represent populations with different mean values To compute the test statistic, the total variability is divided into between-groups variability and within-groups variability Between-groups: a measure of the variability of the means of the k samples Within-groups: (the average of the variance within each group) is attributable to chance factors (random error)

95 Analysis of variance: computation

96 Example

97 Analysis of variance: computation Mean square between groups (MSB) Mean square within groups (MSW) F ratio = MSB / MSW, compared against the F distribution

98 Analysis of variance assumptions The sample has been randomly selected from the population it represents The distribution of the data in the underlying population the sample represents is normal Homoscedasticity of variances: the relationship between the X and Y variables is of equal strength across the whole range What to do when the assumptions are not satisfied? An alternative is the non-parametric Friedman test

99 Friedman Test Non-parametric equivalent of the ANOVA test Based on rank performances rather than the actual performance estimates All classifiers are ranked according to their performance in ascending order for each data set and the mean rank of a classifier i, AR_i = (1/K) Σ_{j=1..K} r_i^j, is computed across all data sets. The test statistic of the Friedman test is calculated as χ²_F = (12K / (L(L+1))) [ Σ_{i=1..L} AR_i² - L(L+1)²/4 ], with K representing the overall number of data sets, L the number of classifiers and r_i^j the rank of classifier i on data set j. The statistic is distributed according to the chi-square distribution with L-1 degrees of freedom
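
In R this is friedman.test; a sketch on a hypothetical K x L matrix of accuracy estimates (rows = data sets, columns = classifiers), matching the notation above:

```r
# Friedman test on a matrix of performance estimates (hypothetical values).
acc <- matrix(c(0.81, 0.78, 0.75,
                0.79, 0.80, 0.72,
                0.84, 0.81, 0.77,
                0.78, 0.76, 0.74,
                0.82, 0.83, 0.79,
                0.80, 0.78, 0.73),
              ncol = 3, byrow = TRUE,
              dimnames = list(paste0("dataset", 1:6),
                              c("clf1", "clf2", "clf3")))

friedman.test(acc)   # chi-squared statistic with L - 1 = 2 degrees of freedom
```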

100 Multiple tests comparisons When conducting several pairwise tests to obtain a global conclusion (e.g. Model 1 is better than Model 2, Model 3, ...) the family-wise error rate should be taken into account If there are 10 models and each of the comparisons Model 1 vs Model 2, Model 1 vs Model 3, ... has a 5% probability of affirming that there is a significant difference when there is none (Type I error), what is the probability of committing a Type I error if we say that Model 1 is better than the remaining 9 models? Probability ≈ 1 - 0.95^9 ≈ 37% We need to correct for it; in the example the corrected significance threshold would be 0.05/9 ≈ 0.0056 (simple Bonferroni correction) Other alternatives: Bonferroni-Dunn, Holm, Hochberg, ...
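
In R these corrections are provided by p.adjust; a sketch with hypothetical p-values for the 9 pairwise comparisons:

```r
# Adjusting pairwise p-values for the family-wise error rate (hypothetical p-values).
p_raw <- c(0.001, 0.004, 0.012, 0.020, 0.030, 0.041, 0.049, 0.060, 0.200)

p.adjust(p_raw, method = "bonferroni")
p.adjust(p_raw, method = "holm")
p.adjust(p_raw, method = "hochberg")

1 - 0.95^9   # ~0.37: family-wise Type I error with 9 uncorrected tests at alpha = 0.05
```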

101 Tests comparisons: R t.test wilcox.test Checking normality: shapiro.test Checking homoscedasticity: leveneTest (requires library car) aov (ANOVA) friedman.test

102 Some R examples
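
The code on this slide did not survive transcription; below is a minimal illustrative sketch exercising the functions listed on the previous slide, on hypothetical paired performance estimates (x, y, z), assuming the car package is installed:

```r
# Worked examples for the tests listed on the previous slide.
library(car)                      # provides leveneTest()

set.seed(1)
x <- rnorm(10, mean = 0.80, sd = 0.02)   # model A estimates
y <- rnorm(10, mean = 0.78, sd = 0.02)   # model B estimates

shapiro.test(x - y)               # normality of the paired differences
leveneTest(c(x, y), factor(rep(c("A", "B"), each = 10)))   # homoscedasticity

t.test(x, y, paired = TRUE)       # parametric paired comparison
wilcox.test(x, y, paired = TRUE)  # non-parametric alternative

# Three or more models: ANOVA and its non-parametric counterpart
z   <- rnorm(10, mean = 0.75, sd = 0.02)
acc <- data.frame(value   = c(x, y, z),
                  model   = factor(rep(c("A", "B", "C"), each = 10)),
                  dataset = factor(rep(1:10, times = 3)))
summary(aov(value ~ model, data = acc))
friedman.test(value ~ model | dataset, data = acc)
```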


More information

Predic'ng ALS Progression with Bayesian Addi've Regression Trees

Predic'ng ALS Progression with Bayesian Addi've Regression Trees Predic'ng ALS Progression with Bayesian Addi've Regression Trees Lilly Fang and Lester Mackey November 13, 2012 RECOMB Conference on Regulatory and Systems Genomics The ALS Predic'on Prize Challenge: Predict

More information

I211: Information infrastructure II

I211: Information infrastructure II Data Mining: Classifier Evaluation I211: Information infrastructure II 3-nearest neighbor labeled data find class labels for the 4 data points 1 0 0 6 0 0 0 5 17 1.7 1 1 4 1 7.1 1 1 1 0.4 1 2 1 3.0 0 0.1

More information

Tree-based methods for classification and regression

Tree-based methods for classification and regression Tree-based methods for classification and regression Ryan Tibshirani Data Mining: 36-462/36-662 April 11 2013 Optional reading: ISL 8.1, ESL 9.2 1 Tree-based methods Tree-based based methods for predicting

More information

The Prac)cal Applica)on of Knowledge Discovery to Image Data: A Prac))oners View in The Context of Medical Image Mining

The Prac)cal Applica)on of Knowledge Discovery to Image Data: A Prac))oners View in The Context of Medical Image Mining The Prac)cal Applica)on of Knowledge Discovery to Image Data: A Prac))oners View in The Context of Medical Image Mining Frans Coenen (http://cgi.csc.liv.ac.uk/~frans/) 10th Interna+onal Conference on Natural

More information

Model Assessment and Selection. Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer

Model Assessment and Selection. Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer Model Assessment and Selection Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Model Training data Testing data Model Testing error rate Training error

More information

CS 584 Data Mining. Classification 1

CS 584 Data Mining. Classification 1 CS 584 Data Mining Classification 1 Classification: Definition Given a collection of records (training set ) Each record contains a set of attributes, one of the attributes is the class. Find a model for

More information

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson

Search Engines. Informa1on Retrieval in Prac1ce. Annotations by Michael L. Nelson Search Engines Informa1on Retrieval in Prac1ce Annotations by Michael L. Nelson All slides Addison Wesley, 2008 Retrieval Models Provide a mathema1cal framework for defining the search process includes

More information

Evaluating Classifiers

Evaluating Classifiers Evaluating Classifiers Charles Elkan elkan@cs.ucsd.edu January 18, 2011 In a real-world application of supervised learning, we have a training set of examples with labels, and a test set of examples with

More information

The Prac)cal Applica)on of Knowledge Discovery to Image Data: A Prac))oners View in the Context of Medical Image Diagnos)cs

The Prac)cal Applica)on of Knowledge Discovery to Image Data: A Prac))oners View in the Context of Medical Image Diagnos)cs The Prac)cal Applica)on of Knowledge Discovery to Image Data: A Prac))oners View in the Context of Medical Image Diagnos)cs Frans Coenen (http://cgi.csc.liv.ac.uk/~frans/) University of Mauri0us, June

More information

Large Scale Data Analysis Using Deep Learning

Large Scale Data Analysis Using Deep Learning Large Scale Data Analysis Using Deep Learning Machine Learning Basics - 1 U Kang Seoul National University U Kang 1 In This Lecture Overview of Machine Learning Capacity, overfitting, and underfitting

More information

Pa#ern Recogni-on for Neuroimaging Toolbox

Pa#ern Recogni-on for Neuroimaging Toolbox Pa#ern Recogni-on for Neuroimaging Toolbox Pa#ern Recogni-on Methods: Basics João M. Monteiro Based on slides from Jessica Schrouff and Janaina Mourão-Miranda PRoNTo course UCL, London, UK 2017 Outline

More information

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset.

Analytical model A structure and process for analyzing a dataset. For example, a decision tree is a model for the classification of a dataset. Glossary of data mining terms: Accuracy Accuracy is an important factor in assessing the success of data mining. When applied to data, accuracy refers to the rate of correct values in the data. When applied

More information

Data Mining: Classifier Evaluation. CSCI-B490 Seminar in Computer Science (Data Mining)

Data Mining: Classifier Evaluation. CSCI-B490 Seminar in Computer Science (Data Mining) Data Mining: Classifier Evaluation CSCI-B490 Seminar in Computer Science (Data Mining) Predictor Evaluation 1. Question: how good is our algorithm? how will we estimate its performance? 2. Question: what

More information

CLASSIFICATION JELENA JOVANOVIĆ. Web:

CLASSIFICATION JELENA JOVANOVIĆ.   Web: CLASSIFICATION JELENA JOVANOVIĆ Email: jeljov@gmail.com Web: http://jelenajovanovic.net OUTLINE What is classification? Binary and multiclass classification Classification algorithms Naïve Bayes (NB) algorithm

More information