Probabilistic Classifiers
DWML, 2007
Conditional class probabilities

Id.  Savings  Assets  Income  Credit risk
1    Medium   High    75      Good
2    Low      Low     50      Bad
3    High     Medium  25      Bad
4    Medium   High    75      Good
5    Low      Medium  100     Good
6    High     High    25      Good
7    Medium   High    75      Bad
8    Medium   Medium  75      Good
...

P(Risk = Good | Savings = Medium, Assets = High, Income = 75) = 2/3
P(Risk = Bad  | Savings = Medium, Assets = High, Income = 75) = 1/3
Empirical Distribution

The training data defines the empirical distribution, which can be represented in a table. Empirical distribution obtained from 1000 data instances:

Gender  Blood Pressure  Weight  Smoker  Stroke  P
m       low             under   no      no      32/1000
m       low             under   no      yes      1/1000
m       low             under   yes     no      27/1000
...
f       normal          normal  no      yes      0/1000
...
f       high            over    yes     yes     54/1000

Such a table is not a suitable probabilistic model, because of
- the size of the representation
- overfitting of the data
Model

View the data as being produced by a random process that is described by a joint probability distribution P on States(A_1, ..., A_n, C), i.e. P assigns a probability P(a_1, ..., a_n, c) ∈ [0,1] to every tuple (a_1, ..., a_n, c) of values for the attribute and class variables, such that

  Σ_{(a_1,...,a_n,c) ∈ States(A_1,...,A_n,C)} P(a_1, ..., a_n, c) = 1

(for discrete attributes; integration instead of summation for continuous attributes).

Conditional Probability

The joint distribution P also defines the conditional probability distribution of C given A_1, ..., A_n, i.e. the values

  P(c | a_1, ..., a_n) := P(a_1, ..., a_n, c) / P(a_1, ..., a_n)
                        = P(a_1, ..., a_n, c) / Σ_{c'} P(a_1, ..., a_n, c')

that represent the probability that C = c given that it is known that A_1 = a_1, ..., A_n = a_n.
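As a small illustration (not from the slides), the conditional class probabilities can be computed directly from a joint distribution stored as a table; the tiny joint below is invented for the example.

```python
# Sketch: deriving P(c | a1, ..., an) from a joint distribution stored as a
# dict mapping (a1, ..., an, c) tuples to probabilities. The joint is invented.
joint = {
    ("medium", "high", "good"): 0.30,
    ("medium", "high", "bad"):  0.15,
    ("low",    "low",  "good"): 0.10,
    ("low",    "low",  "bad"):  0.45,
}

def conditional(joint, attrs):
    """P(c | a1, ..., an) = P(a1, ..., an, c) / sum_c' P(a1, ..., an, c')."""
    marginal = sum(p for (*a, _), p in joint.items() if tuple(a) == attrs)
    return {c: p / marginal
            for (*a, c), p in joint.items() if tuple(a) == attrs}

probs = conditional(joint, ("medium", "high"))
# P(good | medium, high) = 0.30 / 0.45 = 2/3
```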
Classification Rule

  C(a_1, ..., a_n) := arg max_{c ∈ States(C)} P(c | a_1, ..., a_n)

In the binary case, e.g. States(C) = {not infected, infected}, also with a variable threshold t:

  C(a_1, ..., a_n) = not infected  :⇔  P(not infected | a_1, ..., a_n) ≥ t

(this can also be generalized to non-binary class variables).
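A minimal sketch of both rules, assuming the conditional class probabilities are already available as a dict (the function names and example numbers are illustrative, not from the slides):

```python
# The arg-max classification rule over conditional class probabilities.
def classify(cond_probs):
    """Return arg max_c P(c | a1, ..., an)."""
    return max(cond_probs, key=cond_probs.get)

# Binary variant with a variable acceptance threshold t.
def classify_binary(cond_probs, positive, threshold=0.5):
    """Predict `positive` iff P(positive | a1, ..., an) >= threshold."""
    if cond_probs[positive] >= threshold:
        return positive
    return next(c for c in cond_probs if c != positive)

print(classify({"good": 2/3, "bad": 1/3}))                 # good
print(classify_binary({"infected": 0.4, "not infected": 0.6},
                      positive="infected", threshold=0.3))  # infected
```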
Naive Bayes

The Naive Bayes Model

Structural assumption:

  P(a_1, ..., a_n, c) = P(a_1 | c) · P(a_2 | c) · ... · P(a_n | c) · P(c)

Graphical representation as a Bayesian network:

[Figure: class node C with an arrow to each attribute node A_1, ..., A_7]

Interpretation: Given the true class label, the different attributes take their values independently.
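The factorization can be sketched in a few lines; the prior and conditional parameter values below are invented for illustration, not taken from the slides.

```python
# Minimal sketch of the naive Bayes factorization
# P(a1, ..., an, c) = P(c) * prod_i P(ai | c).
from math import prod

prior = {"good": 0.6, "bad": 0.4}
# cond[i][c][value] = P(A_i = value | C = c), one dict per attribute
cond = [
    {"good": {"medium": 0.5, "low": 0.2, "high": 0.3},
     "bad":  {"medium": 0.3, "low": 0.4, "high": 0.3}},
    {"good": {"high": 0.7, "medium": 0.2, "low": 0.1},
     "bad":  {"high": 0.2, "medium": 0.3, "low": 0.5}},
]

def nb_joint(attrs, c):
    """P(a1, ..., an, c) under the naive Bayes assumption."""
    return prior[c] * prod(cond[i][c][a] for i, a in enumerate(attrs))

def nb_classify(attrs):
    """arg max_c P(a1, ..., an, c), equivalent to arg max_c P(c | a1, ..., an)."""
    return max(prior, key=lambda c: nb_joint(attrs, c))
```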
The naive Bayes assumption I

[Figure: a handwritten symbol drawn on a 3×3 grid of cells, numbered 1–9]

For example:

  P(Cell-2 = b | Cell-5 = b, Symbol = 1) > P(Cell-2 = b | Symbol = 1)

The attributes are not independent given Symbol = 1!
The naive Bayes assumption II

For the spam example, e.g.:

  P(Body_nigeria = y | Body_confidential = y, Spam = y) > P(Body_nigeria = y | Spam = y)

The attributes are not independent given Spam = yes!

The naive Bayes assumption is often not realistic. Nevertheless, naive Bayes is often successful.
Learning a Naive Bayes Classifier

Determine the parameters P(a_i | c) (a_i ∈ States(A_i), c ∈ States(C)) from empirical counts in the data.

- Missing values are easily handled: instances for which A_i is missing are ignored when estimating P(a_i | c).
- Discrete and continuous attributes can be mixed.
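A sketch of parameter estimation from empirical counts, including the stated treatment of missing values (the toy data rows are invented; `None` marks a missing attribute value):

```python
# Estimating P(c) and P(a_i | c) from counts; instances where A_i is missing
# are simply skipped when estimating P(a_i | c).
from collections import Counter, defaultdict

data = [  # (savings, assets, class) -- invented toy data
    ("medium", "high", "good"),
    ("low",    None,   "bad"),
    ("medium", "high", "good"),
    ("high",   "low",  "bad"),
]

class_counts = Counter(c for *_, c in data)
prior = {c: n / len(data) for c, n in class_counts.items()}

def estimate(attr_index):
    """Return {c: {value: P(A_i = value | c)}} from empirical counts."""
    counts = defaultdict(Counter)
    for row in data:
        value, c = row[attr_index], row[-1]
        if value is not None:      # instance with A_i missing is ignored
            counts[c][value] += 1
    return {c: {v: n / sum(cnt.values()) for v, n in cnt.items()}
            for c, cnt in counts.items()}

p_assets = estimate(1)
# P(Assets = high | good) = 2/2; the missing value in the "bad" row is skipped
```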
The paradoxical success of Naive Bayes

One explanation for the surprisingly good performance of Naive Bayes in many domains: classification does not require the exact distribution, only the right decision boundaries [Domingos, Pazzani 97].

[Figure: the real P(C = c | a_1, ..., a_n) and the Naive Bayes estimate plotted over States(A_1, ..., A_n); the two curves differ, but cross the 0.5 threshold at the same points]
When Naive Bayes must fail

No Naive Bayes classifier can produce the following classification:

A    B    Class
yes  yes  +
yes  no   −
no   yes  −
no   no   +

because, assuming it did:

1. P(A = y | +) P(B = y | +) P(+) > P(A = y | −) P(B = y | −) P(−)
2. P(A = y | −) P(B = n | −) P(−) > P(A = y | +) P(B = n | +) P(+)
3. P(A = n | −) P(B = y | −) P(−) > P(A = n | +) P(B = y | +) P(+)
4. P(A = n | +) P(B = n | +) P(+) > P(A = n | −) P(B = n | −) P(−)
Multiplying the four left sides and the four right sides of these inequalities:

  ∏_{i=1}^{4} (left side of i.) > ∏_{i=1}^{4} (right side of i.)

But this is false, because both products are actually equal.
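To see why the two products are equal, one can write them out (a short check, with + and − denoting the two classes):

```latex
\prod_{i=1}^{4}(\text{left side of } i.)
 = \underbrace{P(A{=}y\mid+)\,P(A{=}y\mid-)\,P(A{=}n\mid+)\,P(A{=}n\mid-)}_{\text{all four $A$-factors}}
   \cdot
   \underbrace{P(B{=}y\mid+)\,P(B{=}y\mid-)\,P(B{=}n\mid+)\,P(B{=}n\mid-)}_{\text{all four $B$-factors}}
   \cdot P(+)^2\,P(-)^2
```

The right sides multiply to exactly the same collection of factors (each conditional appears once on each side, and each prior twice), so a strict inequality between the two products is impossible.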
Tree Augmented Naive Bayes

Model: all Bayesian network structures where
- the class node is a parent of each attribute node
- the substructure on the attribute nodes is a tree

[Figure: class node C pointing to attribute nodes A_1, ..., A_7, with additional tree edges among the attributes]

Learning a TAN classifier: learning the tree structure and the parameters. The optimal tree structure can be found efficiently (Chow, Liu 1968; Friedman et al. 1997).
TAN classifier for

A    B    Class
yes  yes  +
yes  no   −
no   yes  −
no   no   +

[Network: C → A, C → B, A → B]

P(C):      P(+) = 0.5, P(−) = 0.5
P(A | C):  P(A = yes | +) = P(A = yes | −) = 0.5
P(B | A, C):

C  A    P(B = yes)  P(B = no)
+  yes  1.0         0.0
+  no   0.0         1.0
−  yes  0.0         1.0
−  no   1.0         0.0
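A quick sketch evaluating this TAN on the four A/B cases, using the CPTs as read off the slide (P(C) and P(A | C) uniform, P(B | A, C) deterministic: B = A under class +, B ≠ A under class −):

```python
# TAN inference: joint P(a, b, c) = P(c) * P(a | c) * P(b | a, c),
# classified by arg max over c. CPTs reconstructed from the slide.
p_c = {"+": 0.5, "-": 0.5}
p_a = {c: {"yes": 0.5, "no": 0.5} for c in p_c}
p_b = {("+", "yes"): {"yes": 1.0, "no": 0.0},
       ("+", "no"):  {"yes": 0.0, "no": 1.0},
       ("-", "yes"): {"yes": 0.0, "no": 1.0},
       ("-", "no"):  {"yes": 1.0, "no": 0.0}}

def tan_classify(a, b):
    joint = {c: p_c[c] * p_a[c][a] * p_b[(c, a)][b] for c in p_c}
    return max(joint, key=joint.get)
```

Unlike naive Bayes, the extra edge A → B lets the model represent the parity pattern exactly.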
Evaluating Classifiers
Validation

Evaluation: estimate the performance of a classifier on future data. The estimate is obtained by measuring performance on a validation set (distinct from the test set used for parameter tuning!), or by cross-validation.

Classification Error

A classifier C (e.g. a decision tree) is used to classify instances a_1, ..., a_N with true class labels c_1, ..., c_N. Class labels assigned by C: c'_1, ..., c'_N.

Classification error:  |{i ∈ {1, ..., N} : c'_i ≠ c_i}| / N
Expected Loss

A more detailed picture is provided by the confusion matrix and a cost function (e.g. for States(C) = {a, b, c} and n = 150):

Confusion matrix (fractions of cases with each true/predicted combination):

true\predicted  a       b       c
a               45/150  4/150   3/150
b               2/150   39/150  1/150
c               3/150   7/150   46/150

Loss matrix:

true\predicted  a   b   c
a               -5  3   3
b               12  -1  3
c               4   3   0

Expected loss:  Σ_{x,y ∈ {a,b,c}} Confusion(x, y) · Loss(x, y)

When a cost function is given, try to minimize the expected loss (minimizing the classification error is the special case of 0-1 loss: Loss(x, x) = 0 and Loss(x, y) = 1 for x ≠ y)!
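The expected-loss sum can be sketched as an elementwise product of the two matrices, using the numbers from the slide:

```python
# Expected loss = sum over (true, predicted) pairs of
# Confusion(x, y) * Loss(x, y), with the confusion entries as fractions.
labels = ["a", "b", "c"]
confusion = [[45, 4, 3],    # counts; rows = true, columns = predicted
             [2, 39, 1],
             [3, 7, 46]]
loss = [[-5, 3, 3],
        [12, -1, 3],
        [4, 3, 0]]

n = sum(sum(row) for row in confusion)
expected_loss = sum(confusion[i][j] / n * loss[i][j]
                    for i in range(3) for j in range(3))
# here: -183/150 = -1.22 (negative loss, i.e. an expected gain)
```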
Classifiers with Confidence

Most classifiers (implicitly) provide a numeric measure of the likelihood of class label c for an instance a:

- Probabilistic classifier: probability of c given a.
- Decision tree: frequency of label c (among training cases) in the leaf reached by a.
- k-nearest-neighbor: frequency of label c among the k nearest neighbors of a.
- Neural network: output value of the output neuron for c, given input a.
Quantiles

For a given class label c, sort the instances according to decreasing confidence in c:

Instance:  a_3   a_5   a_1   a_7   a_8   a_4   a_2   a_10  a_6   a_9
P(c):      0.96  0.91  0.86  0.83  0.74  0.55  0.51  0.42  0.11  0.06
c_i = c:   yes   yes   no    yes   yes   no    yes   no    no    no

The 40% quantile consists of the 40% of cases with the highest confidence in c. Given the correct class labels, we can compute the accuracy in the 40% quantile (3/4), and the ratio of this accuracy to the base rate of the label c:

  Lift(40%, C, c) = (3/4) / (5/10) = 1.5
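The lift computation for the table above can be sketched as: sort by confidence, take the top q fraction, and divide its accuracy by the base rate of c.

```python
# Lift(q, C, c) = accuracy in the q-quantile / base rate of label c.
def lift(confidences, is_c, q):
    order = sorted(range(len(is_c)), key=lambda i: -confidences[i])
    top = order[:int(q * len(is_c))]          # q-quantile by confidence
    accuracy = sum(is_c[i] for i in top) / len(top)
    base_rate = sum(is_c) / len(is_c)
    return accuracy / base_rate

conf = [0.96, 0.91, 0.86, 0.83, 0.74, 0.55, 0.51, 0.42, 0.11, 0.06]
hits = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]   # c_i = c? in the sorted order
print(lift(conf, hits, 0.4))             # (3/4) / (5/10) = 1.5
```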
Lift Charts

Lift plotted for different quantiles:

[Lift chart: Lift(C, c) on the y-axis (0.8–2), quantile on the x-axis (10%–100%)]

Lift for a classifier C generating a perfect ordering:

Instance:  a_7   a_5   a_2   a_3   a_8   a_9   a_1   a_10  a_6   a_4
P(c):      0.98  0.97  0.97  0.87  0.74  0.34  0.29  0.12  0.11  0.02
c_i = c:   yes   yes   yes   yes   yes   no    no    no    no    no
Lift and Costs

What is better: predicting C = c for all instances in the 40% quantile (say lift = 1.5) and C ≠ c for all others, or predicting C = c for all instances in the 60% quantile (say lift = 1.333) and C ≠ c for all others?

That depends on the cost function! The first option will be better when wrong predictions of C = c are very expensive; the second option will be better when wrong predictions of C ≠ c are very expensive.
ROC Space

Confusion matrix for binary classification problems:

predicted\true  pos                   neg
pos             true positives (tp)   false positives (fp)
neg             false negatives (fn)  true negatives (tn)

True positive rate (tpr):  tp / (tp + fn)
False positive rate (fpr): fp / (fp + tn)

Each classifier (applied to some dataset) defines a point in ROC space:

[Figure: ROC space with fpr on the x-axis and tpr on the y-axis; "always classify positive" at (1,1), "always classify negative" at (0,0), "classify positive with probability q" at (q,q), "perfect classification" at (0,1)]
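The two rates can be sketched directly from predicted and true binary labels (the example labels are invented):

```python
# Compute (tpr, fpr) for one classifier on one dataset, i.e. its ROC point.
def roc_point(true, pred):
    tp = sum(t and p for t, p in zip(true, pred))
    fn = sum(t and not p for t, p in zip(true, pred))
    fp = sum(not t and p for t, p in zip(true, pred))
    tn = sum(not t and not p for t, p in zip(true, pred))
    return tp / (tp + fn), fp / (fp + tn)

print(roc_point([1, 1, 0, 0], [1, 1, 1, 0]))  # (1.0, 0.5)
print(roc_point([1, 1, 0, 0], [1, 1, 1, 1]))  # "always positive": (1.0, 1.0)
```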
Comparison

One classifier is strictly better than another if its tpr/fpr point is to the left of and above the other's in ROC space:

[Figure: points C_1, C_2, C_3 in ROC space; C_1 lies above and to the left of C_2]

C_1 is better than C_2. C_3 is incomparable with C_1 and C_2.
ROC Curves

Probabilistic classifiers (and many others) are parameterized by an acceptance threshold. Plotting the tpr/fpr values for all parameter values (and a given dataset) gives a ROC curve:

[Figure: a ROC curve from (0,0) to (1,1), bowed toward the upper left]

Performance measure for a (parameterized family of) classifiers: area under the (ROC) curve (AUC).
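A minimal sketch: sweep the acceptance threshold through the sorted scores to collect ROC points, then integrate with the trapezoidal rule to get the AUC (the scores and labels are invented; ties in scores are not handled specially here):

```python
# ROC curve as the sequence of (fpr, tpr) points obtained by lowering the
# acceptance threshold one instance at a time, plus trapezoidal AUC.
def roc_curve(scores, labels):
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in pairs:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

pts = roc_curve([0.9, 0.8, 0.7, 0.6], [1, 1, 0, 1])
```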
Optimizing Predictive Performance

Overfitting again

[Figure: performance measure plotted against the model parameter, with separate curves for training data and future data]

Possible performance measures:
- Misclassification rate
- Expected loss
- AUC
- ...

Model parameters:
- Pruning parameter for decision trees
- k in k-nearest neighbor
- Complexity of the probabilistic model (e.g. Naive Bayes, TAN, ...)
- ...

How do we determine the model that performs best on future data?
Test Set

- Set aside part (e.g. one third) of the available data as a test set.
- Learn models with different parameter settings, using the remaining data as the training data.
- Measure the performance of each learned model on the test set.
- Choose the parameter setting with the best performance.
- Learn the final model with the chosen parameter setting, using the whole available data.

Problem: for small datasets one cannot afford to set aside a test set.
Cross Validation

- Partition the data into n subsets or folds (typically n = 10).
- For each model parameter setting:
  - for i = 1 to n: learn a model using folds 1, ..., i-1, i+1, ..., n as training data; measure performance on fold i
  - model performance = average performance on the n test folds
- Choose the parameter setting with the best performance.
- Learn the final model with the chosen parameter setting, using the whole available data.
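The fold loop can be sketched with placeholder `learn` and `performance` callables (hypothetical stand-ins, not a real API; the toy "learner" just computes a mean):

```python
# n-fold cross-validation: train on all folds but one, evaluate on the
# held-out fold, and average the n performance scores.
def cross_validate(data, learn, performance, n=10):
    folds = [data[i::n] for i in range(n)]
    scores = []
    for i in range(n):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = learn(train)
        scores.append(performance(model, folds[i]))
    return sum(scores) / n

# Toy check: "learn" the mean of numbers; performance = negative squared error
data = list(range(20))
score = cross_validate(
    data,
    learn=lambda train: sum(train) / len(train),
    performance=lambda m, test: -sum((x - m) ** 2 for x in test) / len(test),
    n=5,
)
```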