Probabilistic Classifiers DWML, 2007 1/27

Probabilistic Classifiers
Conditional class probabilities

Id.  Savings  Assets  Income  Credit risk
1    Medium   High    75      Good
2    Low      Low     50      Bad
3    High     Medium  25      Bad
4    Medium   High    75      Good
5    Low      Medium  100     Good
6    High     High    25      Good
7    Medium   High    75      Bad
8    Medium   Medium  75      Good
...  ...      ...     ...     ...

P(Risk = Good | Savings = Medium, Assets = High, Income = 75) = 2/3
P(Risk = Bad | Savings = Medium, Assets = High, Income = 75) = 1/3

DWML, 2007 2/27
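As a small illustration (a Python sketch; the data below mirrors the table on this slide), such conditional class probabilities can be read directly off the training data by counting matching rows:

```python
from collections import Counter

# Toy data echoing the credit-risk table above: (Savings, Assets, Income, Risk).
data = [
    ("Medium", "High", 75, "Good"), ("Low", "Low", 50, "Bad"),
    ("High", "Medium", 25, "Bad"),  ("Medium", "High", 75, "Good"),
    ("Low", "Medium", 100, "Good"), ("High", "High", 25, "Good"),
    ("Medium", "High", 75, "Bad"),  ("Medium", "Medium", 75, "Good"),
]

def conditional_class_probs(data, attributes):
    """Empirical P(class | attributes), obtained by counting the matching rows."""
    matching = [row[-1] for row in data if row[:-1] == attributes]
    counts = Counter(matching)
    total = len(matching)
    return {c: n / total for c, n in counts.items()}

print(conditional_class_probs(data, ("Medium", "High", 75)))
# -> {'Good': 0.666..., 'Bad': 0.333...}
```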

Probabilistic Classifiers
Empirical Distribution

The training data defines the empirical distribution, which can be represented in a table. Empirical distribution obtained from 1000 data instances:

Gender  Blood Pressure  Weight  Smoker  Stroke  P
m       low             under   no      no      32/1000
m       low             under   no      yes     1/1000
m       low             under   yes     no      27/1000
...     ...             ...     ...     ...     ...
f       normal          normal  no      yes     0/1000
...     ...             ...     ...     ...     ...
f       high            over    yes     yes     54/1000

Such a table is not a suitable probabilistic model, because
- the size of the representation is prohibitive
- it overfits the data

DWML, 2007 3/27

Probabilistic Classifiers
Model

View the data as being produced by a random process that is described by a joint probability distribution P on States(A_1,...,A_n,C), i.e. P assigns a probability P(a_1,...,a_n,c) ∈ [0,1] to every tuple (a_1,...,a_n,c) of values for the attribute and class variables, such that

    ∑_{(a_1,...,a_n,c) ∈ States(A_1,...,A_n,C)} P(a_1,...,a_n,c) = 1

(for discrete attributes; integration instead of summation for continuous attributes).

Conditional Probability

The joint distribution P also defines the conditional probability distribution of C given A_1,...,A_n, i.e. the values

    P(c | a_1,...,a_n) := P(a_1,...,a_n,c) / P(a_1,...,a_n) = P(a_1,...,a_n,c) / ∑_{c'} P(a_1,...,a_n,c')

that represent the probability that C = c given that it is known that A_1 = a_1,..., A_n = a_n.

DWML, 2007 4/27

Probabilistic Classifiers
Classification Rule

    C(a_1,...,a_n) := arg max_{c ∈ States(C)} P(c | a_1,...,a_n)

In the binary case, e.g. States(C) = {not infected, infected}, also with a variable threshold t:

    C(a_1,...,a_n) = not infected  :⇔  P(not infected | a_1,...,a_n) ≥ t

(this can also be generalized for non-binary attributes).

DWML, 2007 5/27
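A minimal sketch of the classification rule in Python, assuming the conditional probabilities P(c | a_1,...,a_n) are already available as a dictionary (all names are illustrative):

```python
def classify(cond_probs):
    """Arg-max rule: pick the class with the highest conditional probability."""
    return max(cond_probs, key=cond_probs.get)

def classify_binary(cond_probs, negative_label, t=0.5):
    """Thresholded binary rule, mirroring the rule above: predict the 'negative'
    label (e.g. 'not infected') iff its conditional probability is at least t."""
    return negative_label if cond_probs[negative_label] >= t else \
        next(c for c in cond_probs if c != negative_label)

probs = {"not infected": 0.7, "infected": 0.3}
print(classify(probs))                                 # not infected
print(classify_binary(probs, "not infected", t=0.9))   # infected
```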

Naive Bayes
The Naive Bayes Model

Structural assumption:

    P(a_1,...,a_n,c) = P(a_1 | c) · P(a_2 | c) · ... · P(a_n | c) · P(c)

Graphical representation as a Bayesian network:

[Figure: class node C with an arrow to each attribute node A_1,...,A_7]

Interpretation: given the true class label, the different attributes take their values independently.

DWML, 2007 6/27
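A minimal sketch of the factorization: the joint probability is the class prior times the per-attribute conditionals (all numbers below are made up for illustration):

```python
def naive_bayes_joint(prior, cond, attrs, c):
    """P(a_1,...,a_n,c) = P(c) * prod_i P(a_i | c) under the naive Bayes assumption."""
    p = prior[c]
    for i, a in enumerate(attrs):
        p *= cond[i][c][a]
    return p

prior = {"Good": 0.6, "Bad": 0.4}           # P(C) (made-up numbers)
cond = [                                    # P(A_i | C), one dict per attribute
    {"Good": {"Medium": 0.5, "Low": 0.2, "High": 0.3},
     "Bad":  {"Medium": 0.4, "Low": 0.3, "High": 0.3}},
    {"Good": {"High": 0.6, "Medium": 0.3, "Low": 0.1},
     "Bad":  {"High": 0.3, "Medium": 0.3, "Low": 0.4}},
]
print(naive_bayes_joint(prior, cond, ("Medium", "High"), "Good"))  # 0.6*0.5*0.6 = 0.18
```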

Naive Bayes
The naive Bayes assumption I

[Figure: a symbol drawn on a grid of cells numbered 1-9]

For example:

    P(Cell-2 = b | Cell-5 = b, Symbol = 1) > P(Cell-2 = b | Symbol = 1)

The attributes are not independent given Symbol = 1!

DWML, 2007 7/27

Naive Bayes
The naive Bayes assumption II

For the spam example, e.g.:

    P(Body_nigeria = y | Body_confidential = y, Spam = y) > P(Body_nigeria = y | Spam = y)

The attributes are not independent given Spam = yes!

The naive Bayes assumption is often not realistic. Nevertheless, naive Bayes is often successful.

DWML, 2007 8/27

Naive Bayes
Learning a Naive Bayes Classifier

Determine the parameters P(a_i | c) (a_i ∈ States(A_i), c ∈ States(C)) from empirical counts in the data.

Missing values are easily handled: instances for which A_i is missing are simply ignored when estimating P(a_i | c).

Discrete and continuous attributes can be mixed.

DWML, 2007 9/27
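A sketch of the parameter estimation from empirical counts, assuming discrete attributes and using None to mark a missing value (which is then simply skipped for the affected attribute, as described above); no smoothing of zero counts is applied:

```python
from collections import defaultdict

def learn_naive_bayes(rows, labels):
    """Estimate P(c) and P(a_i | c) from counts; rows with a missing (None)
    value for attribute i are ignored when estimating P(a_i | c)."""
    n = len(labels)
    class_counts = defaultdict(int)
    for c in labels:
        class_counts[c] += 1
    prior = {c: k / n for c, k in class_counts.items()}

    n_attrs = len(rows[0])
    cond = [defaultdict(lambda: defaultdict(int)) for _ in range(n_attrs)]
    seen = [defaultdict(int) for _ in range(n_attrs)]   # non-missing count per class
    for row, c in zip(rows, labels):
        for i, a in enumerate(row):
            if a is None:                # missing value: skip this attribute
                continue
            cond[i][c][a] += 1
            seen[i][c] += 1
    # normalize the counts into conditional probabilities
    cond_probs = [{c: {a: k / seen[i][c] for a, k in acounts.items()}
                   for c, acounts in cond[i].items()} for i in range(n_attrs)]
    return prior, cond_probs

rows = [("Medium", "High"), ("Low", None), ("High", "Medium"), ("Medium", "High")]
labels = ["Good", "Bad", "Bad", "Good"]
prior, cond_probs = learn_naive_bayes(rows, labels)
print(prior)          # {'Good': 0.5, 'Bad': 0.5}
print(cond_probs[1])  # P(attribute 2 | class); the Bad estimate uses only 1 non-missing row
```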

Naive Bayes
The paradoxical success of Naive Bayes

One explanation for the surprisingly good performance of Naive Bayes in many domains: we do not require the exact distribution for classification, only the right decision boundaries [Domingos, Pazzani 97].

[Figure: the real conditional class probability P(C = c | a_1,...,a_n) and the Naive Bayes estimate, plotted over States(A_1,...,A_n) with the 0.5 decision threshold marked]

DWML, 2007 10/27

Naive Bayes
When Naive Bayes must fail

No Naive Bayes classifier can produce the following classification:

A    B    Class
yes  yes  +
yes  no   −
no   yes  −
no   no   +

because assume it did; then:

1. P(A = y | +) P(B = y | +) P(+) > P(A = y | −) P(B = y | −) P(−)
2. P(A = y | −) P(B = n | −) P(−) > P(A = y | +) P(B = n | +) P(+)
3. P(A = n | −) P(B = y | −) P(−) > P(A = n | +) P(B = y | +) P(+)
4. P(A = n | +) P(B = n | +) P(+) > P(A = n | −) P(B = n | −) P(−)

DWML, 2007 11/27

Naive Bayes

Multiplying the four left sides and the four right sides of these inequalities gives

    ∏_{i=1}^{4} (left side of i.) > ∏_{i=1}^{4} (right side of i.)

But this is false, because both products are actually equal: every factor P(A = · | ·), P(B = · | ·) and P(·) appears exactly once in each product.

DWML, 2007 12/27

Naive Bayes
Tree Augmented Naive Bayes (TAN)

Model: all Bayesian network structures where
- the class node is a parent of each attribute node, and
- the substructure on the attribute nodes is a tree.

[Figure: Bayesian network with class node C pointing to the attribute nodes A_1,...,A_7, which are additionally connected by a tree]

Learning a TAN classifier: learning the tree structure and the parameters. The optimal tree structure can be found efficiently (Chow, Liu 1968; Friedman et al. 1997).

DWML, 2007 13/27
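A rough sketch of the structure-learning step in the spirit of Chow-Liu: assuming the conditional mutual information values I(A_i; A_j | C) have already been computed as edge weights, a maximum-weight spanning tree over the attributes can be built with a simple Kruskal-style procedure (the weights below are hypothetical):

```python
def maximum_spanning_tree(n_attrs, weights):
    """weights: dict mapping attribute pairs (i, j) to I(A_i; A_j | C).
    Returns the edges of a maximum-weight spanning tree (Kruskal with union-find)."""
    parent = list(range(n_attrs))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    tree = []
    for (i, j), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ri, rj = find(i), find(j)
        if ri != rj:                        # adding this edge creates no cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Hypothetical conditional mutual information values for 4 attributes:
weights = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.4, (1, 3): 0.3, (2, 3): 0.8}
print(maximum_spanning_tree(4, weights))    # [(0, 1), (2, 3), (1, 2)]
```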

Naive Bayes

TAN classifier for

A    B    Class
yes  yes  +
yes  no   −
no   yes  −
no   no   +

Structure: C → A, C → B, A → B, with parameters

P(C):       P(+) = 0.5, P(−) = 0.5
P(A | C):   P(A = yes | +) = 0.5, P(A = no | +) = 0.5, P(A = yes | −) = 0.5, P(A = no | −) = 0.5
P(B | C, A):
  C  A    B = yes  B = no
  +  yes  1.0      0.0
  +  no   0.0      1.0
  −  yes  0.0      1.0
  −  no   1.0      0.0

DWML, 2007 14/27

Evaluating Classifiers DWML, 2007 15/27

Evaluating Classifiers
Validation

Evaluation: estimate the performance of a classifier on future data. The estimate is obtained by measuring performance on a validation set (distinct from the test set used for parameter tuning!), or by cross-validation.

Classification Error

A classifier C (e.g. a decision tree) is used to classify instances a_1,...,a_N with true class labels c_1,...,c_N. The class labels assigned by C are ĉ_1,...,ĉ_N.

Classification error:  |{i ∈ {1,...,N} : c_i ≠ ĉ_i}| / N

DWML, 2007 16/27
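In code, the classification error is simply the fraction of mismatching labels (a trivial sketch):

```python
def classification_error(true_labels, predicted_labels):
    """Fraction of instances whose predicted label differs from the true one."""
    n = len(true_labels)
    return sum(c != c_hat for c, c_hat in zip(true_labels, predicted_labels)) / n

print(classification_error(["a", "b", "a", "c"], ["a", "b", "c", "c"]))  # 0.25
```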

Evaluating Classifiers
Expected Loss

A more detailed picture is provided by the confusion matrix and a cost function (e.g. for States(C) = {a, b, c} and n = 150):

Confusion matrix (fraction of cases with each predicted/true combination):

                 true
predicted   a        b        c
a           45/150   4/150    3/150
b           2/150    39/150   1/150
c           3/150    7/150    46/150

Loss matrix:

                 true
predicted   a    b    c
a           -5   3    3
b           12   -1   3
c           4    3    0

Expected loss:

    ∑_{x,y ∈ {a,b,c}} Confusion(x, y) · Loss(x, y)

When a cost function is given, try to minimize the expected loss (minimizing the classification error is the special case of 0-1 loss: Loss(x, x) = 0 and Loss(x, y) = 1 for x ≠ y)!

DWML, 2007 17/27
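A sketch of the expected-loss computation for the example above, with both matrices stored as nested dictionaries indexed by predicted and true label; the 0-1 loss at the end recovers the classification error:

```python
labels = ["a", "b", "c"]

# Confusion(x, y): fraction of cases predicted x whose true class is y (example above).
confusion = {
    "a": {"a": 45/150, "b": 4/150,  "c": 3/150},
    "b": {"a": 2/150,  "b": 39/150, "c": 1/150},
    "c": {"a": 3/150,  "b": 7/150,  "c": 46/150},
}
# Loss(x, y): cost of predicting x when the true class is y (example above).
loss = {
    "a": {"a": -5, "b": 3,  "c": 3},
    "b": {"a": 12, "b": -1, "c": 3},
    "c": {"a": 4,  "b": 3,  "c": 0},
}

expected_loss = sum(confusion[x][y] * loss[x][y] for x in labels for y in labels)
print(round(expected_loss, 3))   # -1.22

# 0-1 loss recovers the classification error (the off-diagonal mass):
zero_one = {x: {y: 0 if x == y else 1 for y in labels} for x in labels}
error = sum(confusion[x][y] * zero_one[x][y] for x in labels for y in labels)
print(round(error, 3))           # 0.133 = 20/150
```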

Evaluating Classifiers
Classifiers with Confidence

Most classifiers (implicitly) provide a numeric measurement of the likelihood of class label c for an instance a:
- Probabilistic classifier: probability of c given a.
- Decision tree: frequency of label c (among training cases) in the leaf reached by a.
- k-nearest-neighbor: frequency of label c among the k nearest neighbors of a.
- Neural network: output value of the c output neuron given input a.

DWML, 2007 18/27

Evaluating Classifiers
Quantiles

For a given class label c, sort the instances according to decreasing confidence in c:

Instance:  a_3   a_5   a_1   a_7   a_8   a_4   a_2   a_10  a_6   a_9
P(c):      0.96  0.91  0.86  0.83  0.74  0.55  0.51  0.42  0.11  0.06
c_i = c:   yes   yes   no    yes   yes   no    yes   no    no    no

The 40% quantile consists of the 40% of cases with the highest confidence in c.

Given the correct class labels, we can compute the accuracy in the 40% quantile (3/4), and the ratio of this accuracy to the base rate of label c:

    Lift(40%, C, c) = (3/4) / (5/10) = 1.5

DWML, 2007 19/27
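A sketch reproducing the lift computation above, assuming the classifier's output is available as (confidence, is-class-c) pairs:

```python
def lift(scored, q):
    """scored: list of (confidence, is_c) pairs; q: quantile fraction, e.g. 0.4.
    Lift = accuracy for class c within the top-q fraction / base rate of c."""
    scored = sorted(scored, key=lambda t: -t[0])        # decreasing confidence in c
    k = int(round(q * len(scored)))
    top = [is_c for _, is_c in scored[:k]]
    base_rate = sum(is_c for _, is_c in scored) / len(scored)
    return (sum(top) / k) / base_rate

scored = [(0.96, True), (0.91, True), (0.86, False), (0.83, True), (0.74, True),
          (0.55, False), (0.51, True), (0.42, False), (0.11, False), (0.06, False)]
print(lift(scored, 0.4))   # (3/4) / (5/10) = 1.5
```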

Evaluating Classifiers
Lift Charts

Lift plotted for different quantiles:

[Figure: lift chart; Lift(C, c) on the y-axis (0.8-2.0) plotted against the quantile (10%-100%) on the x-axis]

Lift for a classifier C generating a perfect ordering:

Instance:  a_7   a_5   a_2   a_3   a_8   a_9   a_1   a_10  a_6   a_4
P(c):      0.98  0.97  0.97  0.87  0.74  0.34  0.29  0.12  0.11  0.02
c_i = c:   yes   yes   yes   yes   yes   no    no    no    no    no

DWML, 2007 20/27

Evaluating Classifiers
Lift and Costs

What is better: predicting C = c for all instances in the 40% quantile (say lift = 1.5) and C ≠ c for all others, or predicting C = c for all instances in the 60% quantile (say lift = 1.333) and C ≠ c for all others?

That depends on the cost function! The first option will be better when wrong predictions of C = c are very expensive; the second option will be better when wrong predictions of C ≠ c are very expensive.

DWML, 2007 21/27

Evaluating Classifiers
ROC Space

Confusion matrix for binary classification problems:

                 true
predicted   pos                     neg
pos         true positives (tp)     false positives (fp)
neg         false negatives (fn)    true negatives (tn)

True positive rate (tpr):  tp / (tp + fn)
False positive rate (fpr): fp / (fp + tn)

Each classifier (applied to some dataset) defines a point in ROC space:

[Figure: ROC space with fpr on the x-axis and tpr on the y-axis, both from 0 to 1; marked points: "always classify positive" at (1, 1), "always classify negative" at (0, 0), "classify positive with probability q" at (q, q), and "perfect classification" at (0, 1)]

DWML, 2007 22/27
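The two rates in code, computed from raw counts (a minimal sketch with made-up counts):

```python
def roc_point(tp, fp, fn, tn):
    """Return (fpr, tpr): the point this classifier occupies in ROC space."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return fpr, tpr

print(roc_point(tp=40, fp=10, fn=10, tn=40))   # (0.2, 0.8)
```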

Evaluating Classifiers
Comparison

One classifier is strictly better than another if its tpr/fpr point lies to the left of and above the other's in ROC space.

[Figure: ROC space with three classifiers C_1, C_2, C_3 marked as points]

C_1 is better than C_2. C_3 is incomparable with C_1 and C_2.

DWML, 2007 23/27

Evaluating Classifiers
ROC curves

Probabilistic classifiers (and many others) are parameterized by an acceptance threshold. Plotting the tpr/fpr values for all parameter values (and a given dataset) gives a ROC curve:

[Figure: a ROC curve in ROC space, fpr on the x-axis and tpr on the y-axis]

Performance measure for a (parameterized family of) classifier: the area under the (ROC) curve (AUC).

DWML, 2007 24/27
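A sketch of how a ROC curve and its AUC can be obtained by sweeping the acceptance threshold over the confidence scores (trapezoidal integration; the scores and labels are made up, and ties between scores are not treated specially):

```python
def roc_curve(scores, labels):
    """Sweep the acceptance threshold over all scores; labels are True for positive.
    Returns the list of (fpr, tpr) points from (0, 0) to (1, 1)."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for s, y in sorted(zip(scores, labels), key=lambda t: -t[0]):
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]
labels = [True, True, False, True, True, False, False, False]
pts = roc_curve(scores, labels)
print(auc(pts))   # 0.875
```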

Optimizing Predictive Performance
Overfitting again

[Figure: performance measure plotted against the model parameter, with one curve for the training data and one for future data]

Possible performance measures:
- misclassification rate
- expected loss
- AUC
- ...

Model parameter:
- pruning parameter for decision trees
- k in k-nearest neighbor
- complexity of the probabilistic model (e.g. Naive Bayes, TAN, ...)
- ...

How do we determine the model performing best on future data?

DWML, 2007 25/27

Optimizing Predictive Performance
Test Set

- Set aside part (e.g. one third) of the available data as a test set.
- Learn models with different parameter settings using the remaining data as the training data.
- Measure the performance of each learned model on the test set.
- Choose the parameter setting with the best performance.
- Learn the final model with the chosen parameter setting using the whole of the available data.

Problem: for small datasets we cannot afford to set aside a test set.

DWML, 2007 26/27
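A schematic sketch of this procedure; train_model and performance are placeholders for whatever learner and performance measure are used (here, higher scores are assumed to be better):

```python
import random

def select_by_test_set(data, parameter_grid, train_model, performance, test_fraction=1/3):
    """Hold out a test set, train one model per parameter setting on the rest,
    pick the setting that scores best on the test set, then retrain on all data."""
    data = data[:]
    random.shuffle(data)
    n_test = int(len(data) * test_fraction)
    test, train = data[:n_test], data[n_test:]

    best_param, best_score = None, float("-inf")
    for param in parameter_grid:
        model = train_model(train, param)
        score = performance(model, test)      # higher = better in this sketch
        if score > best_score:
            best_param, best_score = param, score

    return train_model(data, best_param)      # final model on all available data
```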

Optimizing Predictive Performance
Cross Validation

- Partition the data into n subsets or folds (typically n = 10).
- For each model parameter setting:
  - for i = 1 to n: learn a model using folds 1,...,i-1,i+1,...,n as training data and measure its performance on fold i
  - model performance = average performance on the n test sets
- Choose the parameter setting with the best performance.
- Learn the final model with the chosen parameter setting using the whole of the available data.

DWML, 2007 27/27
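A corresponding sketch of n-fold cross-validation for parameter selection, with the same placeholder train_model and performance functions as above:

```python
def cross_validate(data, param, n_folds, train_model, performance):
    """Average performance over n folds: each fold is used once as the test set."""
    folds = [data[i::n_folds] for i in range(n_folds)]   # simple interleaved split
    scores = []
    for i in range(n_folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_model(train, param)
        scores.append(performance(model, folds[i]))
    return sum(scores) / n_folds

def select_by_cross_validation(data, parameter_grid, train_model, performance, n_folds=10):
    """Pick the parameter setting with the best cross-validated performance,
    then learn the final model on the whole dataset."""
    best_param = max(parameter_grid,
                     key=lambda p: cross_validate(data, p, n_folds, train_model, performance))
    return train_model(data, best_param)
```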