Data Mining: Classifier Evaluation
I211: Information Infrastructure II
3-Nearest Neighbor

Given labeled data, find class labels for the 4 unlabeled data points. [Data table omitted; only the nearest distances are recoverable from the source.]

The closest 3 labeled data points to the first unlabeled data point are at distances 1, 1.7, and 2.1. We predict that the label is 0 based on the majority rule (0 twice, 1 once).
3-Nearest Neighbor to Output Probabilities

Given labeled data, find class labels for the 4 unlabeled data points. [Data table omitted; only the nearest distances are recoverable from the source.]

The closest 3 labeled data points to the first unlabeled data point are at distances 1, 1.7, and 2.1. We predict that the probability of class 0 is 2/3 and the probability of class 1 is 1/3.
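The majority-rule prediction and the vote-fraction probabilities above can be sketched in Python. The distances 1, 1.7, and 2.1 are from the example; the label attached to each distance is an assumption (the slides only say that class 0 received two of the three votes), and the farther points are invented for illustration:

```python
from collections import Counter

def knn_predict(distances_and_labels, k=3):
    """Majority-vote k-NN prediction plus class probabilities.

    distances_and_labels: (distance, label) pairs giving each labeled
    data point's distance to one unlabeled data point.
    """
    # take the k labeled points closest to the unlabeled point
    nearest = sorted(distances_and_labels)[:k]
    votes = Counter(label for _, label in nearest)
    predicted = votes.most_common(1)[0][0]                     # majority rule
    probabilities = {c: votes.get(c, 0) / k for c in (0, 1)}   # vote fractions
    return predicted, probabilities

# Distances from the slide; label-to-distance pairing is assumed.
pred, probs = knn_predict([(1.0, 0), (1.7, 1), (2.1, 0), (3.2, 1), (7.1, 1)])
```

With these assumed labels, `pred` is 0 and the probabilities are 2/3 for class 0 and 1/3 for class 1, matching the slide.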
K-Nearest Neighbor Algorithm for Regression

Same as for classification, except that we predict the target variable using the mean of the neighbors' targets instead of the majority rule.

Example: given labeled data with a real-valued target variable, find target values for the 4 unlabeled data points. [Data table omitted; not recoverable from the source.]
3-Nearest Neighbor for Regression

Given labeled data with targets, find target values for the 4 unlabeled data points. [Data table omitted; not recoverable from the source.]

The closest 3 labeled data points to the first unlabeled data point are at distances 1, 1.7, and 2.1, with targets 13.2, 17.8, and 0.5. We predict that the target is 10.5, obtained as (13.2 + 17.8 + 0.5) / 3.
K-Nearest Neighbor Algorithm (Regression)

Find the K closest data points and use the mean rule to assign the target. More formally:

Given:
- a data set D = {(x_i, y_i), i = 1, ..., n}, where x_i ∈ R^k and y_i ∈ R is the target variable
- an integer K (an odd number is preferable for classification, to avoid ties)
find the target for every unlabeled data point u ∈ R^k.

Algorithm: for each unlabeled data point u,
1. find the K closest data points to u;
2. assign the target to be the mean of the targets of those K nearest neighbors.
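The two steps above can be sketched in Python. The targets 13.2, 17.8, and 0.5 are taken from the 3-NN regression example; the one-dimensional feature values are invented so that those three points end up nearest to the query:

```python
def knn_regress(labeled, u, k=3):
    """k-NN regression: mean of the targets of the k nearest neighbors.

    labeled: list of (x, y) pairs, x a tuple of features, y a real target.
    u: feature tuple of the unlabeled data point.
    """
    def dist(a, b):
        # Euclidean distance between two feature vectors
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    neighbors = sorted(labeled, key=lambda xy: dist(xy[0], u))[:k]
    return sum(y for _, y in neighbors) / k   # mean rule, not majority rule

# Targets from the slide's example; feature values are illustrative only.
target = knn_regress(
    [((0.0,), 13.2), ((2.0,), 17.8), ((4.0,), 0.5), ((10.0,), 24.1)],
    u=(1.0,))
```

Here the three nearest neighbors have targets 13.2, 17.8, and 0.5, so `target` is (13.2 + 17.8 + 0.5) / 3 = 10.5, as in the slide.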
Classifier Evaluation

Question 1: How good is our algorithm? How will we estimate its performance?
Question 2: What performance measure should we use?
- Classification: we will use accuracy.
- Regression: we will use mean squared error and the R² measure.
4-Fold Cross-Validation

Data set D: 20 data points.
4-Fold Cross-Validation

Data set D: 20 data points. Randomly and evenly split them into 4 non-overlapping partitions.
How to Make a Random Split? Use a random number generator!

Example: D has 5 data points.

>> x = rand(1, 5);
>> [a b] = sort(x)
a = 0.1576  0.4854  0.8003  0.9572  0.9706
b = 1  4  5  3  2
>> p1 = b(1 : 2 : length(b))
p1 = 1  5  2        % partition 1
>> p2 = b(2 : 2 : length(b))
p2 = 4  3           % partition 2
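The same random-split idea can be sketched in Python (function name illustrative): shuffling replaces sorting the `rand()` values, and strided slicing replaces the MATLAB `b(1:2:end)` / `b(2:2:end)` indexing:

```python
import random

def random_split(n, n_parts=2, seed=0):
    """Split indices 1..n into n_parts disjoint random partitions."""
    rng = random.Random(seed)        # seeded for reproducibility
    idx = list(range(1, n + 1))
    rng.shuffle(idx)                 # random permutation of the indices
    # deal the shuffled indices round-robin into the partitions
    return [idx[p::n_parts] for p in range(n_parts)]

p1, p2 = random_split(5)
```

As in the MATLAB example, the two partitions are disjoint and together cover all 5 data points (with 5 points, one partition gets 3 and the other 2).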
4-Fold Cross-Validation

Data set D: 20 data points, randomly and evenly split into 4 non-overlapping partitions.
Partition 1: data points 1, 3, 5, 15, 16
Partition 2: data points 6, 10, 11, 14, 17
Partition 3: data points 4, 9, 12, 19, 20
Partition 4: data points 2, 7, 8, 13, 18
4-Fold Cross-Validation, Step 1: use partition 1 as the test set and partitions 2-4 as the training set. Pretend the class (target) is not known for the data points in the test set. Use the training set to predict the class (target) for the data points in the test set. Count the number of misses (classification) or the error (regression).
4-Fold Cross-Validation, Step 2: use partition 2 as the test set and partitions 1, 3, and 4 as the training set. Again pretend the class (target) is not known for the test set, predict it from the training set, and count the misses or the error.
4-Fold Cross-Validation, Step 3: use partition 3 as the test set and partitions 1, 2, and 4 as the training set. Proceed as before.
4-Fold Cross-Validation, Step 4: use partition 4 as the test set and partitions 1-3 as the training set. Proceed as before.
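The four steps can be sketched as one loop in Python. This is a minimal sketch, assuming any `predict_fn(train, x)` supplied by the caller; the 1-nearest-neighbor predictor and the toy two-cluster data set at the bottom are invented for illustration:

```python
import random

def cross_validate(data, predict_fn, n_folds=4, seed=0):
    """Estimate accuracy by n-fold cross-validation.

    data: list of (x, y) pairs; predict_fn(train, x) returns a label.
    """
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)                                  # random, even split
    folds = [idx[f::n_folds] for f in range(n_folds)]

    misses = 0
    for f in range(n_folds):
        # pretend the class is unknown for the current test fold
        test = [data[i] for i in folds[f]]
        train = [data[i] for g in range(n_folds) if g != f for i in folds[g]]
        misses += sum(1 for x, y in test if predict_fn(train, x) != y)
    return 1 - misses / len(data)     # accuracy over all predictions

# Hypothetical 1-nearest-neighbor predictor on a toy two-cluster data set
def nn_predict(train, x):
    return min(train, key=lambda xy: abs(xy[0] - x))[1]

toy = [(i, 0) for i in range(10)] + [(100 + i, 1) for i in range(10)]
acc = cross_validate(toy, nn_predict)
```

Because the two clusters are far apart, every held-out point is classified correctly here and `acc` is 1.0.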
4-Fold Cross-Validation

After all 4 steps, each of the 20 data points has a true class and a predicted class:
True class: 1, predicted class: 1
True class: 0, predicted class: 0
...
True class: 1, predicted class: 0
Confusion Matrix (Binary Class)

                      True class
                       0      1
Predicted class  0    n00    n01
                 1    n10    n11

n_pq is the number of data points with predicted class p and true class q; for example, n10 is the number of data points whose true class was 0 but whose predicted class was 1.

Accuracy = (n00 + n11) / (n00 + n01 + n10 + n11)
Error = 1 - Accuracy
Classification Accuracy and Error

Accuracy = n_correct / n
  n_correct: number of correctly classified data points
  n: total number of data points
Error = 1 - Accuracy

Another naming convention:
  n00 = true negatives
  n01 = false negatives
  n10 = false positives
  n11 = true positives
More on Accuracy: Binary Class Case

sn = n11 / (n01 + n11): sensitivity, the accuracy on data points whose class is 1. Also called the true positive rate.
sp = n00 / (n10 + n00): specificity, the accuracy on data points whose class is 0. Also called the true negative rate.
1 - sn = n01 / (n01 + n11): the false negative rate.
1 - sp = n10 / (n10 + n00): the false positive rate.
Accuracy_B = (sn + sp) / 2: balanced-sample accuracy.
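These quantities can be computed directly from paired true and predicted labels. A minimal sketch (function name illustrative); the counts in the usage example reproduce the 20-point confusion matrix used later in the slides (n00 = 10, n01 = 1, n10 = 4, n11 = 5):

```python
def binary_metrics(true, pred):
    """Confusion counts n[p, q] (p = predicted class, q = true class)
    and the derived rates."""
    n = {(p, q): 0 for p in (0, 1) for q in (0, 1)}
    for t, y in zip(true, pred):
        n[(y, t)] += 1
    n00, n01 = n[(0, 0)], n[(0, 1)]
    n10, n11 = n[(1, 0)], n[(1, 1)]
    sn = n11 / (n01 + n11)            # sensitivity, true positive rate
    sp = n00 / (n10 + n00)            # specificity, true negative rate
    accuracy = (n00 + n11) / len(true)
    balanced = (sn + sp) / 2          # balanced-sample accuracy
    return accuracy, sn, sp, balanced

# 10 true negatives, 1 false negative, 4 false positives, 5 true positives
true = [0] * 10 + [1] * 1 + [0] * 4 + [1] * 5
pred = [0] * 10 + [0] * 1 + [1] * 4 + [1] * 5
acc, sn, sp, bal = binary_metrics(true, pred)
```

For these counts, accuracy is 15/20 = 0.75, sensitivity is 5/6, and specificity is 10/14.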
Trivial and Random Classifiers

Suppose 16 data points have class 0 (the majority class) and 4 data points have class 1 (the minority class).

Trivial classifier: always predict the majority class. Its accuracy is 16/20 = 80%.

Random classifier: predict class 0 with probability 0.8 and class 1 with probability 0.2. Its accuracy is 0.8² + 0.2² = 0.68, i.e., 68%.
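Both baseline accuracies follow from the class proportions alone; a minimal sketch (function names illustrative):

```python
def trivial_accuracy(p_majority):
    # always predicting the majority class is correct with this probability
    return p_majority

def random_accuracy(p0):
    # guess class 0 with prob. p0 and class 1 with prob. 1 - p0;
    # a guess is correct when it matches the true class of the point
    return p0 ** 2 + (1 - p0) ** 2

trivial = trivial_accuracy(16 / 20)   # 0.80, as in the slide
rand_acc = random_accuracy(0.8)       # 0.68, as in the slide
```

Note that the random classifier is strictly worse than the trivial one whenever the classes are imbalanced.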
K-Fold Cross-Validation

Is there a chance that our accuracy estimate depends on the random partition we made at the beginning? Yes! One way to overcome this is to generate several random splits into folds, estimate accuracy for each split, and finally average these estimates.
Leave-One-Out Performance Estimation

Step 1, Step 2, ..., Step n: each data point takes its turn as the (single-point) test set. Leave-one-out is the extreme case of K-fold cross-validation in which K equals the number of data points in data set D.
In Short...

1. Whatever K you use for K-fold cross-validation, every data point will be used exactly once as a test point.
2. Each data point will thus have one true value and one predicted value.
3. From these two values we compute accuracy.
4. Terminology: we always say that we are estimating the classifier's performance. If the data set is large enough and representative, this will be a good estimate. The truth is, we can only hope that our test cases are representative and that our estimate is meaningful.
Error Bars on Accuracy Estimates

For example, the data set contains 20 data points and the confusion matrix is

                      True class
                       0      1
Predicted class  0    10      1
                 1     4      5

Accuracy = 15/20 = 0.75
σ = sqrt(Accuracy × (1 - Accuracy) / n) = sqrt(0.75 × 0.25 / 20) ≈ 0.10
How to Report Accuracy Estimates

For example, the estimated accuracy is 0.75. Let's see what the error bar is for n = 10, 100, and 1000:

σ_10 = sqrt(0.75 × 0.25 / 10) ≈ 0.14
σ_100 = sqrt(0.75 × 0.25 / 100) ≈ 0.04
σ_1000 = sqrt(0.75 × 0.25 / 1000) ≈ 0.01

For n = 100 we would report the accuracy as: Accuracy = 0.75 ± 0.04.
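The error-bar formula treats each of the n test predictions as an independent Bernoulli trial; a minimal sketch (function name illustrative):

```python
def accuracy_error_bar(accuracy, n):
    """Standard error of an accuracy estimated from n test predictions,
    treating each prediction as an independent Bernoulli trial."""
    return (accuracy * (1 - accuracy) / n) ** 0.5

# For accuracy 0.75 and n = 100, report Accuracy = 0.75 +/- 0.04
sigma_100 = accuracy_error_bar(0.75, 100)
```

The values for n = 10, 100, and 1000 round to 0.14, 0.04, and 0.01, matching the slide.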
Regression

MSE = (1/n) Σ_{i=1}^{n} (p_i - t_i)²
  p_i: predicted target for data point i
  t_i: true target for data point i
  n: total number of data points

R² = 1 - Σ_{i=1}^{n} (p_i - t_i)² / Σ_{i=1}^{n} (t_i - m)²
  m: mean value of the target variable

R² is the fraction (percentage) of the variance in the target explained by the regression method. Values close to 1 indicate a near-perfect predictor.
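The two formulas translate directly into Python (function names illustrative):

```python
def mse(pred, true):
    """Mean squared error between predicted and true targets."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)

def r_squared(pred, true):
    """R^2: 1 minus the residual sum of squares over the total
    sum of squares around the target mean."""
    m = sum(true) / len(true)                          # mean target value
    ss_res = sum((p - t) ** 2 for p, t in zip(pred, true))
    ss_tot = sum((t - m) ** 2 for t in true)
    return 1 - ss_res / ss_tot
```

A perfect predictor gives MSE = 0 and R² = 1, while always predicting the target mean gives R² = 0.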
Trivial Classifiers for Regression

Trivial classifier (predictor): always predict the target mean. R² for this trivial predictor is 0.