Probabilistic Classifiers (DWML)
1 Probabilistic Classifiers
2-3 Probabilistic Classifiers: Conditional class probabilities

  Id.  Savings  Assets   Income  Credit risk
  1    Medium   High     75      Good
  2    Low      Low      50      Bad
  3    High     Medium   25      Bad
  4    Medium   High     75      Good
  5    Low      Medium   100     Good
  6    High     High     25      Good
  7    Medium   High     75      Bad
  8    Medium   Medium   75      Good

P(Risk = Good | Savings = Medium, Assets = High, Income = 75) = 2/3
P(Risk = Bad  | Savings = Medium, Assets = High, Income = 75) = 1/3
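As a concrete illustration of how these conditional class probabilities fall out of the data, the following is a minimal Python sketch (not from the slides; the function name is illustrative) that simply counts matching instances in the credit-risk table:

```python
from collections import Counter

# Credit-risk training data from the slide: (Savings, Assets, Income, Risk)
data = [
    ("Medium", "High",   75,  "Good"),
    ("Low",    "Low",    50,  "Bad"),
    ("High",   "Medium", 25,  "Bad"),
    ("Medium", "High",   75,  "Good"),
    ("Low",    "Medium", 100, "Good"),
    ("High",   "High",   25,  "Good"),
    ("Medium", "High",   75,  "Bad"),
    ("Medium", "Medium", 75,  "Good"),
]

def conditional_class_probs(data, savings, assets, income):
    """Empirical P(Risk = c | Savings, Assets, Income) for each class c."""
    matching = [risk for s, a, i, risk in data if (s, a, i) == (savings, assets, income)]
    counts = Counter(matching)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

# Reproduces the slide: {'Good': 0.667, 'Bad': 0.333}
print(conditional_class_probs(data, "Medium", "High", 75))
```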
4 Probabilistic Classifiers: Empirical Distribution

The training data defines the empirical distribution, which can be represented in a table. Empirical distribution obtained from 1000 data instances:

  Gender  Blood Pressure  Weight  Smoker  Stroke  P
  m       low             under   no      no      32/1000
  m       low             under   no      yes     1/1000
  m       low             under   yes     no      27/1000
  ...
  f       normal          normal  no      yes     0/1000
  ...
  f       high            over    yes     yes     54/1000

Such a table is not a suitable probabilistic model, because
- the size of the representation grows exponentially in the number of attributes, and
- it overfits the data.
5 Probabilistic Classifiers: Model

View the data as being produced by a random process that is described by a joint probability distribution P on States(A_1, ..., A_n, C), i.e. P assigns a probability P(a_1, ..., a_n, c) \in [0, 1] to every tuple (a_1, ..., a_n, c) of values for the attribute and class variables, such that

    \sum_{(a_1, \dots, a_n, c) \in \mathrm{States}(A_1, \dots, A_n, C)} P(a_1, \dots, a_n, c) = 1

(for discrete attributes; integration instead of summation for continuous attributes).

Conditional Probability

The joint distribution P also defines the conditional probability distribution of C given A_1, ..., A_n, i.e. the values

    P(c \mid a_1, \dots, a_n) := \frac{P(a_1, \dots, a_n, c)}{P(a_1, \dots, a_n)} = \frac{P(a_1, \dots, a_n, c)}{\sum_{c'} P(a_1, \dots, a_n, c')}

that represent the probability that C = c given that it is known that A_1 = a_1, ..., A_n = a_n.
6 Probabilistic Classifiers: Classification Rule

    C(a_1, \dots, a_n) := \arg\max_{c \in \mathrm{States}(C)} P(c \mid a_1, \dots, a_n)

In the binary case, e.g. States(C) = {not infected, infected}, also with a variable threshold t:

    C(a_1, \dots, a_n) = \text{not infected} \iff P(\text{not infected} \mid a_1, \dots, a_n) \geq t.

(This can also be generalized for non-binary attributes.)
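The rule above translates directly into code. Below is a minimal sketch (not from the slides; the function name and labels are illustrative) of the arg-max rule with an optional threshold for the binary case:

```python
def classify(cond_probs, threshold=None, positive_label=None):
    """Arg-max classification rule over conditional class probabilities.

    cond_probs maps each class label c to P(c | a_1, ..., a_n).
    If a threshold is given (binary case), predict positive_label exactly
    when its conditional probability is at least the threshold.
    """
    if threshold is not None and positive_label is not None:
        others = [c for c in cond_probs if c != positive_label]
        return positive_label if cond_probs[positive_label] >= threshold else others[0]
    return max(cond_probs, key=cond_probs.get)

print(classify({"Good": 2/3, "Bad": 1/3}))                                         # arg-max: "Good"
print(classify({"Good": 2/3, "Bad": 1/3}, threshold=0.8, positive_label="Good"))   # threshold rule: "Bad"
```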
7 Naive Bayes: The Naive Bayes Model

Structural assumption:

    P(a_1, \dots, a_n, c) = P(a_1 \mid c) \cdot P(a_2 \mid c) \cdots P(a_n \mid c) \cdot P(c)

Graphical representation as a Bayesian network:

[Figure: class node C with an arrow to each attribute node A_1, ..., A_7.]

Interpretation: given the true class labels, the different attributes take their values independently.
8 Naive Bayes: The naive Bayes assumption I

For example:

    P(Cell-2 = b | Cell-5 = b, Symbol = 1) > P(Cell-2 = b | Symbol = 1)

The attributes are not independent given Symbol = 1!
9 Naive Bayes: The naive Bayes assumption II

For the spam example, e.g.:

    P(Body_nigeria = y | Body_confidential = y, Spam = y) > P(Body_nigeria = y | Spam = y)

The attributes are not independent given Spam = yes! The Naive Bayes assumption is often not realistic. Nevertheless, Naive Bayes is often successful.
10 Naive Bayes: Learning a Naive Bayes Classifier

Determine the parameters P(a_i | c) (a_i \in States(A_i), c \in States(C)) from empirical counts in the data.
- Missing values are easily handled: instances for which A_i is missing are ignored for P(a_i | c).
- Discrete and continuous attributes can be mixed.
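A minimal sketch of this count-based learning together with the corresponding arg-max prediction (not from the slides; it treats every attribute, including Income, as discrete, uses no smoothing, and all names are illustrative):

```python
from collections import Counter, defaultdict

def fit_naive_bayes(rows, labels):
    """Estimate P(c) and P(a_i | c) from empirical counts.

    rows: attribute tuples (None marks a missing value); labels: class labels.
    Instances with a missing A_i are simply ignored when estimating P(a_i | c).
    """
    class_counts = Counter(labels)
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    cond = defaultdict(Counter)   # cond[(i, c)]: counts of values of attribute i within class c
    for row, c in zip(rows, labels):
        for i, a in enumerate(row):
            if a is not None:
                cond[(i, c)][a] += 1
    def p_attr(i, a, c):
        counts = cond[(i, c)]
        total = sum(counts.values())
        return counts[a] / total if total else 0.0
    return priors, p_attr

def predict_naive_bayes(priors, p_attr, row):
    """Arg-max over c of P(c) * prod_i P(a_i | c), the Naive Bayes factorization."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for i, a in enumerate(row):
            if a is not None:
                score *= p_attr(i, a, c)
        scores[c] = score
    return max(scores, key=scores.get)

# Credit-risk data from slide 2, with Income treated as a discrete attribute
rows = [("Medium", "High", 75), ("Low", "Low", 50), ("High", "Medium", 25),
        ("Medium", "High", 75), ("Low", "Medium", 100), ("High", "High", 25),
        ("Medium", "High", 75), ("Medium", "Medium", 75)]
labels = ["Good", "Bad", "Bad", "Good", "Good", "Good", "Bad", "Good"]
priors, p_attr = fit_naive_bayes(rows, labels)
print(predict_naive_bayes(priors, p_attr, ("Medium", "High", 75)))   # "Good"
```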
11-12 Naive Bayes: The paradoxical success of Naive Bayes

One explanation for the surprisingly good performance of Naive Bayes in many domains: one does not require the exact distribution for classification, only the right decision boundaries [Domingos, Pazzani 97].

[Figure: P(C | a_1, ..., a_n) plotted over States(A_1, ..., A_n), once for the real distribution and once for the Naive Bayes approximation.]
13 Naive Bayes: When Naive Bayes must fail

No Naive Bayes classifier can produce the following classification:

  A    B    Class
  yes  yes  +
  yes  no   -
  no   yes  -
  no   no   +

because, assume it did; then:

1. P(A = y | +) P(B = y | +) P(+) > P(A = y | -) P(B = y | -) P(-)
2. P(A = y | -) P(B = n | -) P(-) > P(A = y | +) P(B = n | +) P(+)
3. P(A = n | -) P(B = y | -) P(-) > P(A = n | +) P(B = y | +) P(+)
4. P(A = n | +) P(B = n | +) P(+) > P(A = n | -) P(B = n | -) P(-)
14 Naive Bayes

Multiplying the four left sides and the four right sides of these inequalities gives

    \prod_{i=1}^{4} (\text{left side of } i) > \prod_{i=1}^{4} (\text{right side of } i)

But this is false, because both products are actually equal (both sides contain exactly the same factors, just distributed differently over the four inequalities).
15 Naive Bayes: Tree Augmented Naive Bayes (TAN)

Model: all Bayesian network structures where
- the class node is a parent of each attribute node, and
- the substructure on the attribute nodes is a tree.

[Figure: network with class node C pointing to attribute nodes A_1, ..., A_7, plus tree edges among the attribute nodes.]

Learning a TAN classifier means learning the tree structure and the parameters. The optimal tree structure can be found efficiently (Chow, Liu 1968; Friedman et al. 1997).
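A rough sketch of the Chow-Liu-style structure step used for TAN (Friedman et al. 1997): weight each attribute pair by the empirical conditional mutual information given the class, then take a maximum-weight spanning tree. This is not the slides' code; the function names are illustrative, smoothing is omitted, and directing the tree from a chosen root is left out:

```python
from collections import Counter
from itertools import combinations
from math import log

def conditional_mutual_information(x, y, c):
    """Empirical I(X; Y | C) from three parallel lists of discrete values."""
    n = len(c)
    joint = Counter(zip(x, y, c))
    xc, yc, cc = Counter(zip(x, c)), Counter(zip(y, c)), Counter(c)
    mi = 0.0
    for (xv, yv, cv), n_xyc in joint.items():
        # p(x,y,c) * log( p(x,y|c) / (p(x|c) p(y|c)) ), written with raw counts
        mi += (n_xyc / n) * log(n_xyc * cc[cv] / (xc[(xv, cv)] * yc[(yv, cv)]))
    return mi

def tan_tree(columns, labels):
    """Maximum-weight spanning tree over attributes, weighted by I(A_i; A_j | C).

    columns: one list of values per attribute; labels: the class values.
    Returns a list of undirected tree edges (i, j) over attribute indices.
    """
    n_attr = len(columns)
    weighted = sorted(
        ((conditional_mutual_information(columns[i], columns[j], labels), i, j)
         for i, j in combinations(range(n_attr), 2)),
        reverse=True)
    parent = list(range(n_attr))              # union-find for Kruskal's algorithm
    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    tree = []
    for _, i, j in weighted:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Example with the credit-risk attributes (Savings, Assets, Income) and Risk as the class
columns = [["Medium", "Low", "High", "Medium", "Low", "High", "Medium", "Medium"],
           ["High", "Low", "Medium", "High", "Medium", "High", "High", "Medium"],
           [75, 50, 25, 75, 100, 25, 75, 75]]
labels = ["Good", "Bad", "Bad", "Good", "Good", "Good", "Bad", "Good"]
print(tan_tree(columns, labels))   # two edges connecting the three attributes
```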
16 Naive Bayes

TAN classifier for the classification from slide 13 (A = yes/yes/no/no, B = yes/no/yes/no, Class = +/-/-/+):

[Figure: a TAN network over C, A and B together with its conditional probability tables (the class prior P(C), P(A | C) and P(B | A, C)); the numerical table entries are not preserved in this transcription.]
17 Evaluating Classifiers
18 Evaluating Classifiers: Validation

Evaluation: estimate of the performance of a classifier on future data. The estimate is obtained by measuring performance on a validation set (distinct from the test set used for parameter tuning!), or by cross-validation.

Classification Error

A classifier C (e.g. a decision tree) is used to classify instances a_1, ..., a_N with true class labels c_1, ..., c_N. The class labels assigned by C are c'_1, ..., c'_N. Classification error:

    |{i \in {1, ..., N} : c_i \neq c'_i}| / N
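A one-line sketch of this error measure (not from the slides):

```python
def classification_error(true_labels, predicted_labels):
    """Fraction of instances whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)

print(classification_error(["Good", "Bad", "Good"], ["Good", "Good", "Good"]))  # 0.333...
```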
19 Evaluating Classifiers: Expected Loss

A more detailed picture is provided by the confusion matrix and a cost function (e.g. for States(C) = {a, b, c} and n = 150):

Confusion matrix (fractions of cases with each true/predicted combination):

                 true
  predicted      a        b        c
  a              45/150   4/150    3/150
  b              2/150    39/150   1/150
  c              3/150    7/150    46/150

Loss matrix: Loss(x, y) gives the cost of each true/predicted combination (the numerical entries are not preserved in this transcription).

Expected Loss:

    \sum_{x, y \in \{a, b, c\}} \mathrm{Confusion}(x, y) \cdot \mathrm{Loss}(x, y)

When a cost function is given, try to minimize the expected loss (minimizing the classification error is the special case of 0-1 loss: Loss(x, x) = 0 and Loss(x, y) = 1 for x \neq y)!
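A minimal sketch of the expected-loss computation (not from the slides), using the confusion matrix above and a 0-1 loss for illustration:

```python
import numpy as np

def expected_loss(confusion, loss):
    """Sum of Confusion(x, y) * Loss(x, y) over all true/predicted combinations."""
    return float(np.sum(np.asarray(confusion) * np.asarray(loss)))

# Confusion matrix from the slide (rows: predicted a, b, c; columns: true a, b, c)
confusion = np.array([[45, 4, 3],
                      [2, 39, 1],
                      [3, 7, 46]]) / 150

zero_one_loss = 1 - np.eye(3)   # Loss(x, x) = 0, Loss(x, y) = 1 for x != y
print(expected_loss(confusion, zero_one_loss))  # classification error: 20/150
```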
20 Evaluating Classifiers: Classifiers with Confidence

Most classifiers (implicitly) provide a numeric measurement of the likelihood of class label c for instance a:
- Probabilistic classifier: probability of c given a.
- Decision tree: frequency of label c (among training cases) in the leaf reached by a.
- k-nearest-neighbor: frequency of label c among the k nearest neighbors of a.
- Neural network: output value of the c output neuron given input a.
21-23 Evaluating Classifiers: Quantiles

For a given class label c, sort the instances according to decreasing confidence P(c):

  Instance:  a_3   a_5   a_1   a_7   a_8   a_4   a_2   a_10  a_6   a_9
  c_i = c:   yes   yes   no    yes   yes   no    yes   no    no    no

The 40% quantile consists of the 40% of cases with the highest confidence in c. Given the correct class labels, we can compute the accuracy in the 40% quantile (3/4) and the ratio of this accuracy to the base rate of label c:

    Lift(40%, C, c) = (3/4) / (5/10) = 1.5
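A minimal sketch of the lift computation (not from the slides). The confidence values below are made up to reproduce the ordering on the slide, since the actual P(c) values are not preserved here:

```python
def lift(confidences, is_target, quantile):
    """Accuracy within the top `quantile` of cases (ranked by confidence in c),
    divided by the base rate of label c."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    k = round(quantile * len(order))
    top_hits = [is_target[i] for i in order[:k]]
    accuracy_in_quantile = sum(top_hits) / k
    base_rate = sum(is_target) / len(is_target)
    return accuracy_in_quantile / base_rate

conf = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]   # hypothetical P(c) values
hits = [True, True, False, True, True, False, True, False, False, False]
print(lift(conf, hits, 0.4))  # 1.5
```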
24-25 Evaluating Classifiers: Lift Charts

Lift plotted for different quantiles gives a lift chart.

[Figure: lift charts, Lift(C, c) plotted against the quantile.]

Lift for a classifier C generating a perfect ordering:

  Instance:  a_7   a_5   a_2   a_3   a_8   a_9   a_1   a_10  a_6   a_4
  c_i = c:   yes   yes   yes   yes   yes   no    no    no    no    no
26 Evaluating Classifiers: Lift and Costs

What is better: predicting C = c for all instances in the 40% quantile (say lift = 1.5) and C \neq c for all others, or predicting C = c for all instances in the 60% quantile (say lift = 1.333) and C \neq c for all others?

That depends on the cost function! The first option will be better when wrong predictions of C = c are very expensive; the second option will be better when wrong predictions of C \neq c are very expensive.
27-31 Evaluating Classifiers: ROC Space

Confusion matrix for binary classification problems:

                 true
  predicted      pos                    neg
  pos            true positives (tp)    false positives (fp)
  neg            false negatives (fn)   true negatives (tn)

True positive rate (tpr): tp / (tp + fn)
False positive rate (fpr): fp / (fp + tn)

Each classifier (applied to some dataset) defines a point (fpr, tpr) in ROC space.

[Figure: ROC space with fpr on the x-axis and tpr on the y-axis; marked points: "always classify positive" at (1, 1), "always classify negative" at (0, 0), "classify positive with probability q" at (q, q), and "perfect classification" at (0, 1).]
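A minimal sketch of the (fpr, tpr) point computation (not from the slides):

```python
def roc_point(true_labels, predicted_labels, positive=True):
    """Return the (fpr, tpr) point of a binary classifier on a dataset."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return fp / (fp + tn), tp / (tp + fn)

print(roc_point([True, True, False, False], [True, False, True, False]))  # (0.5, 0.5)
```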
32 Evaluating Classifiers: Comparison

One classifier is strictly better than another if its tpr/fpr point lies to the left of and above the other's in ROC space.

[Figure: ROC space with three classifiers C_1, C_2, C_3. C_1 is better than C_2; C_3 is incomparable with C_1 and C_2.]
33-34 Evaluating Classifiers: ROC Curves

Probabilistic classifiers (and many others) are parameterized by an acceptance threshold. Plotting the tpr/fpr values for all parameter values (and a given dataset) gives a ROC curve.

[Figure: a ROC curve in ROC space, fpr on the x-axis and tpr on the y-axis.]

Performance measure for a (parameterized family of) classifiers: the area under the ROC curve (AUC).
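A minimal sketch of sweeping the threshold and computing AUC with the trapezoidal rule (not from the slides; it assumes distinct scores and does not handle ties):

```python
import numpy as np

def roc_curve_points(scores, labels):
    """Sweep the acceptance threshold over all scores; return arrays of (fpr, tpr) points."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, bool)
    order = np.argsort(-scores)            # decreasing confidence in the positive class
    labels = labels[order]
    tpr = np.concatenate(([0.0], np.cumsum(labels) / labels.sum()))
    fpr = np.concatenate(([0.0], np.cumsum(~labels) / (~labels).sum()))
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve (trapezoidal rule)."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
fpr, tpr = roc_curve_points(scores, labels)
print(auc(fpr, tpr))  # 0.875
```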
35 Optimizing Predictive Performance: Overfitting again

[Figure: a performance measure plotted against a model parameter, with separate curves for training data and future data.]

Possible performance measures:
- Misclassification rate
- Expected loss
- AUC
- ...

Model parameters:
- Pruning parameter for decision trees
- k in k-nearest neighbor
- Complexity of the probabilistic model (e.g. Naive Bayes, TAN, ...)
- ...

How do we determine the model performing best on future data?
36 Optimizing Predictive Performance: Test Set

- Set aside part (e.g. one third) of the available data as a test set.
- Learn models with different parameter settings, using the remaining data as training data.
- Measure the performance of each learned model on the test set.
- Choose the parameter setting with the best performance.
- Learn the final model with the chosen parameter setting using the whole available data.

Problem: for small datasets one cannot afford to set aside a test set.
37 Optimizing Predictive Performance: Cross-Validation

- Partition the data into n subsets or folds (typically n = 10).
- For each model parameter setting:
  - for i = 1 to n: learn a model using folds 1, ..., i-1, i+1, ..., n as training data and measure its performance on fold i
  - model performance = the average performance over the n test folds
- Choose the parameter setting with the best performance.
- Learn the final model with the chosen parameter setting using the whole available data.

A minimal sketch of this selection loop is given below.
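This sketch is not from the slides; learn_with_param and evaluate are placeholders for the actual learner and performance measure (e.g. accuracy, or the negative expected loss):

```python
import random

def cross_validation_score(learn, evaluate, data, n_folds=10, seed=0):
    """Average performance of a learner over n folds (higher is better)."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = learn(train)
        scores.append(evaluate(model, folds[i]))
    return sum(scores) / n_folds

def select_parameter(learn_with_param, evaluate, data, parameter_values, n_folds=10):
    """Pick the parameter with the best cross-validated score,
    then learn the final model on the whole dataset."""
    best = max(parameter_values,
               key=lambda p: cross_validation_score(
                   lambda train: learn_with_param(train, p), evaluate, data, n_folds))
    return best, learn_with_param(data, best)
```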