Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten's and E. Frank's Data Mining and Jeremy Wyatt and others

Things We'd Like to Do. Spam classification: given an email, predict whether it is spam or not. Medical diagnosis: given a list of symptoms, predict whether a patient has disease X or not. Weather: based on temperature, humidity, etc., predict whether it will rain tomorrow.

Bayesian Classification. Problem statement: given features X_1, X_2, ..., X_n, predict a label Y.

Another Application: Digit Recognition. An image of a handwritten digit is fed to a classifier, which outputs the label (here, 5). X_1, ..., X_n ∈ {0, 1} (black vs. white pixels), Y ∈ {5, 6} (predict whether a digit is a 5 or a 6).

The Bayes Classifier. In class, we saw that a good strategy is to predict the most probable label given the features, argmax_Y P(Y | X_1, ..., X_n) (for example: what is the probability that the image represents a 5 given its pixels?). So how do we compute that?

The Bayes Classifier. Use Bayes' rule! P(Y | X_1, ..., X_n) = P(X_1, ..., X_n | Y) · P(Y) / P(X_1, ..., X_n), where P(X_1, ..., X_n | Y) is the likelihood, P(Y) is the prior, and P(X_1, ..., X_n) is the normalization constant. Why did this help? Well, we think that we might be able to specify how features are generated by the class label.

The Bayes Classifier. Let's expand this for our digit recognition task: P(Y=5 | X_1, ..., X_n) ∝ P(X_1, ..., X_n | Y=5) · P(Y=5) and P(Y=6 | X_1, ..., X_n) ∝ P(X_1, ..., X_n | Y=6) · P(Y=6). To classify, we'll simply compute these two probabilities and predict based on which one is greater.

Model Parameters. For the Bayes classifier, we need to learn two functions, the likelihood and the prior. How many parameters are required to specify the prior for our digit recognition example?

Model Parameters. How many parameters are required to specify the likelihood (supposing that each image is 30×30 pixels)?

Model Parameters. The problem with explicitly modeling P(X_1, ..., X_n | Y) is that there are usually way too many parameters: we'll run out of space, we'll run out of time, and we'll need tons of training data (which is usually not available).

The Naïve Bayes Model. The Naïve Bayes assumption: assume that all features are independent given the class label Y. Equationally speaking: P(X_1, ..., X_n | Y) = ∏_{i=1}^{n} P(X_i | Y). (We will discuss the validity of this assumption later.)

Why is this useful? Number of parameters for modeling P(X_1, ..., X_n | Y): 2(2^n − 1). Number of parameters for modeling P(X_1 | Y), ..., P(X_n | Y): 2n.
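
As a quick sanity check on these counts, here is a hypothetical illustration (not from the original slides), assuming 30×30 binary images and two classes as in the digit example:

    n = 30 * 30                       # number of binary pixel features
    full_joint  = 2 * (2 ** n - 1)    # parameters to model P(X_1,...,X_n | Y) explicitly
    naive_bayes = 2 * n               # parameters to model P(X_1 | Y), ..., P(X_n | Y)
    print(naive_bayes)                # 1800
    print(len(str(full_joint)))       # 272 -- the parameter count itself has 272 decimal digits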

Naïve Bayes Training. Now that we've decided to use a Naïve Bayes classifier, we need to train it with some data, e.g. the MNIST training data.

Naïve Bayes Training. Training in Naïve Bayes is easy: estimate P(Y=v) as the fraction of records with Y=v; estimate P(X_i=u | Y=v) as the fraction of records with Y=v for which X_i=u. (This corresponds to maximum likelihood estimation of the model parameters.)

Naïve Bayes Training. In practice, some of these counts can be zero. Fix this by adding virtual counts (e.g., adding one to every count, as in Laplace smoothing). This is like putting a prior on the parameters and doing MAP (maximum a posteriori) estimation instead of MLE (maximum likelihood estimation). This is called smoothing.
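
A minimal training sketch under these definitions (not code from the lecture; it assumes categorical features stored as dicts mapping feature name to value, and uses k virtual counts per feature value):

    from collections import Counter, defaultdict

    def train_naive_bayes(records, labels, k=1.0):
        """Estimate P(Y=v) and smoothed P(X_i=u | Y=v) by counting."""
        n = len(labels)
        label_counts = Counter(labels)        # counts of each class value v
        feature_counts = defaultdict(float)   # counts of (feature, value, class) triples
        feature_values = defaultdict(set)     # observed values u of each feature X_i
        for record, y in zip(records, labels):
            for feature, value in record.items():
                feature_counts[(feature, value, y)] += 1
                feature_values[feature].add(value)

        prior = {v: c / n for v, c in label_counts.items()}

        def likelihood(feature, value, v):
            # k virtual counts keep unseen (value, class) pairs away from probability zero
            return ((feature_counts[(feature, value, v)] + k) /
                    (label_counts[v] + k * len(feature_values[feature])))

        return prior, likelihood

    def predict(record, prior, likelihood):
        # pick the class maximizing P(Y=v) * prod_i P(X_i = x_i | Y=v)
        def score(v):
            p = prior[v]
            for feature, value in record.items():
                p *= likelihood(feature, value, v)
            return p
        return max(prior, key=score)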

Naïve Bayes Training For binary digits, training amounts to averaging all of the training fives together and all of the training sixes together.

Naïve Bayes Classification

Another Example of the Naïve Bayes Classifier. The weather data, with counts and probabilities:

    Outlook:      sunny  yes 2, no 3 (2/9, 3/5);  overcast  yes 4, no 0 (4/9, 0/5);  rainy  yes 3, no 2 (3/9, 2/5)
    Temperature:  hot    yes 2, no 2 (2/9, 2/5);  mild      yes 4, no 2 (4/9, 2/5);  cool   yes 3, no 1 (3/9, 1/5)
    Humidity:     high   yes 3, no 4 (3/9, 4/5);  normal    yes 6, no 1 (6/9, 1/5)
    Windy:        false  yes 6, no 2 (6/9, 2/5);  true      yes 3, no 3 (3/9, 3/5)
    Play:         yes 9 (9/14), no 5 (5/14)

A new day: outlook = sunny, temperature = cool, humidity = high, windy = true, play = ?

Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 ≈ 0.0053. Likelihood of no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 ≈ 0.0206. Therefore, the prediction is No.
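
The same arithmetic is easy to check directly (a throwaway sketch using the counts from the table above):

    # New day: outlook = sunny, temperature = cool, humidity = high, windy = true
    likelihood_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)    # ~ 0.0053
    likelihood_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)    # ~ 0.0206
    # Normalizing gives the actual class probabilities
    p_yes = likelihood_yes / (likelihood_yes + likelihood_no)  # ~ 0.20
    p_no  = 1 - p_yes                                          # ~ 0.80, so predict No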

The Naive Bayes Classifier for Data Sets with Numerical Attribute Values One common practice to handle numerical attribute values is to assume normal distributions for numerical attributes.

The numeric weather data with summary statistics:

    Outlook:      sunny  yes 2, no 3 (2/9, 3/5);  overcast  yes 4, no 0 (4/9, 0/5);  rainy  yes 3, no 2 (3/9, 2/5)
    Temperature:  yes: 83, 70, 68, 64, 69, 75, 75, 72, 81 (mean 73, std dev 6.2);  no: 85, 80, 65, 72, 71 (mean 74.6, std dev 7.9)
    Humidity:     yes: 86, 96, 80, 65, 70, 80, 70, 90, 75 (mean 79.1, std dev 10.2);  no: 85, 90, 70, 95, 91 (mean 86.2, std dev 9.7)
    Windy:        false  yes 6, no 2 (6/9, 2/5);  true  yes 3, no 3 (3/9, 3/5)
    Play:         yes 9 (9/14), no 5 (5/14)

Let x_1, x_2, ..., x_n be the values of a numerical attribute in the training data set. Estimate the mean and variance as μ = (1/n) Σ_{i=1}^{n} x_i and σ² = (1/(n−1)) Σ_{i=1}^{n} (x_i − μ)², and use the normal density f(x) = 1/(√(2π)·σ) · e^(−(x−μ)²/(2σ²)) as the class-conditional likelihood of the attribute value x.

For example, f(temperature = 66 | yes) = 1/(√(2π)·6.2) · e^(−(66−73)²/(2·6.2²)) = 0.0340. Likelihood of yes = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036. Likelihood of no = 3/5 × 0.0291 × 0.038 × 3/5 × 5/14 = 0.000136.
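
These values can be reproduced in a few lines (a sketch, assuming the normal-density treatment and the summary statistics above):

    import math

    def normal_density(x, mu, sigma):
        # f(x) = 1 / (sqrt(2*pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
        return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

    f_temp_66_yes  = normal_density(66, 73.0, 6.2)    # ~ 0.0340
    f_humid_90_yes = normal_density(90, 79.1, 10.2)   # ~ 0.0221
    likelihood_yes = (2/9) * f_temp_66_yes * f_humid_90_yes * (3/9) * (9/14)   # ~ 0.000036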

Outputting Probabilities. What's nice about Naïve Bayes (and generative models in general) is that it returns probabilities. These probabilities can tell us how confident the algorithm is. So don't throw away those probabilities!

Performance on a Test Set. Naïve Bayes is often a good choice if you don't have much training data!

Naïve Bayes Assumption Recall the Naïve Bayes assumption: that all features are independent given the class label Y Does this hold for the digit recognition problem?

Exclusive-OR Example. For an example where conditional independence fails: Y = XOR(X_1, X_2).

    X_1  X_2   P(Y=0 | X_1,X_2)   P(Y=1 | X_1,X_2)
     0    0          1                  0
     0    1          0                  1
     1    0          0                  1
     1    1          1                  0

Actually, the Naïve Bayes assumption is almost never true. Still, Naïve Bayes often performs surprisingly well even when its assumptions do not hold.

On the naïve Bayesian classifier. Advantages: easy to implement; very efficient; good results obtained in many applications. Disadvantages: it assumes class-conditional independence, so accuracy is lost when the assumption is seriously violated (highly correlated data sets). (Slide credit: CS583, Bing Liu, UIC.)

Naïve Bayesian Classifier: Comments. Advantages: easy to implement; good results obtained in most cases. Disadvantages: the assumption of class-conditional independence causes a loss of accuracy, because in practice dependencies exist among variables. E.g., in hospital data a patient's profile (age, family history, etc.), symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.) are all interdependent, and such dependencies cannot be modeled by a naïve Bayesian classifier. How to deal with these dependencies? Bayesian belief networks.

Numerical Stability. It is often the case that machine learning algorithms need to work with very small numbers. Imagine computing the probability of 2000 independent coin flips: MATLAB thinks that (0.5)^2000 = 0.
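
The same underflow is easy to demonstrate in Python (a throwaway illustration; IEEE double precision cannot represent positive numbers much below about 5e-324):

    import math

    print(0.5 ** 2000)             # 0.0 -- the product underflows to zero
    print(2000 * math.log(0.5))    # ~ -1386.29 -- the log-probability is perfectly representable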

Underflow Prevention. Multiplying lots of probabilities can cause floating-point underflow. Recall that log(xy) = log(x) + log(y), so it is better to sum logs of probabilities than to multiply the probabilities themselves.

Underflow Prevention. The class with the highest final un-normalized log-probability score is still the most probable: c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(x_i | c_j) ].

Numerical Stability. Instead of comparing P(Y=5 | X_1, ..., X_n) with P(Y=6 | X_1, ..., X_n), compare their logarithms.

Recovering the Probabilities. What if we want the probabilities, though? Suppose that, for some constant K, we have computed log P(Y=5 | X_1, ..., X_n) + K and log P(Y=6 | X_1, ..., X_n) + K. How would we recover the original probabilities?

Recovering the Probabilities. Given log scores ℓ_i = log P(Y=i | X_1, ..., X_n) + K, then for any constant C we can exponentiate and renormalize: P(Y=i | X_1, ..., X_n) = e^(ℓ_i + C) / Σ_j e^(ℓ_j + C). One suggestion: set C such that the greatest ℓ_i is shifted to zero.
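
A sketch of this recipe with made-up log scores (the shift by the maximum is the usual log-sum-exp trick; none of these numbers come from the slides):

    import math

    def recover_probabilities(log_scores):
        # Each score is log P(Y = y | X_1,...,X_n) + K for an unknown constant K.
        c = -max(log_scores)                          # shift so the greatest score becomes 0
        exps = [math.exp(s + c) for s in log_scores]  # safe to exponentiate now
        total = sum(exps)
        return [e / total for e in exps]              # K (and C) cancel in the normalization

    print(recover_probabilities([-1002.3, -1004.1]))  # ~ [0.858, 0.142]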

Recap. We defined a Bayes classifier but saw that it's intractable to compute P(X_1, ..., X_n | Y). We then used the Naïve Bayes assumption that everything is independent given the class label Y. A natural question: is there some happy compromise where we only assume that some features are conditionally independent? Stay tuned.

Conclusions. Naïve Bayes is: really easy to implement and often works well; often a good first thing to try; commonly used as a punching bag for smarter algorithms.

Evaluating classification algorithms You have designed a new classifier. You give it to me, and I try it on my image dataset

Evaluating classification algorithms I tell you that it achieved 95% accuracy on my data. Is your technique a success?

Types of errors. But suppose that: the 95% is the correctly classified pixels; only 5% of the pixels are actually edges; and it misses all the edge pixels. How do we count the effect of different types of error?

Types of errors:

                              Ground truth: Edge    Ground truth: Not edge
    Prediction: Edge          True Positive         False Positive
    Prediction: Not edge      False Negative        True Negative

Two parts to each label: whether you got it correct or not, and what you guessed. For example, for a particular pixel our guess might be labelled True Positive. Did we get it correct? True, we did get it correct. What did we say? We said positive, i.e. edge. Or maybe it was labelled as one of the others, say False Negative. Did we get it correct? False, we did not get it correct. What did we say? We said negative, i.e. not edge.

Sensitivity and Specificity. Count up the total number of each label (TP, FP, TN, FN) over a large dataset. In ROC analysis, we use two statistics: Sensitivity = TP / (TP + FN), which can be thought of as the likelihood of spotting a positive case when presented with one, or the proportion of edges we find; and Specificity = TN / (TN + FP), which can be thought of as the likelihood of spotting a negative case when presented with one, or the proportion of non-edges that we find.
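
In code this is just bookkeeping over the four counts (a sketch with made-up numbers):

    def sensitivity(tp, fn):
        return tp / (tp + fn)      # proportion of true edges that we find

    def specificity(tn, fp):
        return tn / (tn + fp)      # proportion of true non-edges that we find

    # hypothetical counts accumulated over a test set
    tp, fp, fn, tn = 55, 10, 35, 90
    print(sensitivity(tp, fn))     # ~ 0.61
    print(specificity(tn, fp))     # = 0.90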

Classifier Accuracy Measures.

                      Predicted C_1      Predicted C_2
    Actual C_1        true positive      false negative
    Actual C_2        false positive     true negative

    classes               buy_computer = yes   buy_computer = no   total    recognition (%)
    buy_computer = yes           6954                  46           7000        99.34
    buy_computer = no             412                2588           3000        86.27
    total                        7366                2634          10000        95.52

Accuracy of a classifier M, acc(M): the percentage of test-set tuples that are correctly classified by the model M. Error rate (misclassification rate) of M = 1 − acc(M). Given m classes, CM_{i,j}, an entry in a confusion matrix, indicates the number of tuples in class i that are labeled by the classifier as class j. Alternative accuracy measures (e.g., for cancer diagnosis): sensitivity = t-pos/pos (true positive recognition rate); specificity = t-neg/neg (true negative recognition rate); precision = t-pos/(t-pos + f-pos); accuracy = sensitivity × pos/(pos + neg) + specificity × neg/(pos + neg). This model can also be used for cost-benefit analysis.

Sensitivity = TP / (TP + FN) = ?
Specificity = TN / (TN + FP) = ?

                        Prediction: 1    Prediction: 0
    Ground truth 1:          60               30
    Ground truth 0:          80               20

60 + 30 = 90 cases in the dataset were class 1 (edge); 80 + 20 = 100 cases were class 0 (non-edge); 90 + 100 = 190 examples (pixels) in the data overall.

The ROC space. [Figure: a plot with sensitivity on the y-axis (0.0 to 1.0) and 1 − specificity on the x-axis (0.0 to 1.0); edge detectors A and B appear as points in this space.]

The ROC Curve. Draw a convex hull around many points in ROC space (sensitivity vs. 1 − specificity). [Figure: several detectors plotted as points with their convex hull; one point lies inside and is not on the convex hull.]

ROC Analysis. All the optimal detectors lie on the convex hull. Which of these is best depends on the ratio of edges to non-edges and on the different costs of misclassification. Any detector on the worse-than-chance side of the ROC space can be turned into a better detector by flipping its output. Take-home point: you should always quote sensitivity and specificity for your algorithm, if possible plotting an ROC graph. Remember also, though, that any statistic you quote should be an average over a suitable range of tests for your algorithm.
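
If the detector outputs a score rather than a hard label, the whole ROC curve can be traced by sweeping a decision threshold; a small sketch using scikit-learn (the labels and scores below are made up):

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    y_true  = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])                        # 1 = edge
    y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1])  # detector scores

    fpr, tpr, thresholds = roc_curve(y_true, y_score)   # fpr = 1 - specificity, tpr = sensitivity
    print(auc(fpr, tpr))                                # area under the ROC curve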

Predictor Error Measures. Measure predictor accuracy: how far off the predicted value is from the actual known value. Loss function: measures the error between y_i and the predicted value y_i'. Absolute error: |y_i − y_i'|. Squared error: (y_i − y_i')². Test error (generalization error): the average loss over the test set. Mean absolute error: (1/d) Σ_{i=1}^{d} |y_i − y_i'|. Mean squared error: (1/d) Σ_{i=1}^{d} (y_i − y_i')². Relative absolute error: Σ_{i=1}^{d} |y_i − y_i'| / Σ_{i=1}^{d} |y_i − ȳ|. Relative squared error: Σ_{i=1}^{d} (y_i − y_i')² / Σ_{i=1}^{d} (y_i − ȳ)². The mean squared error exaggerates the presence of outliers; popularly used are the (square) root mean squared error and, similarly, the root relative squared error.
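
A direct translation of these formulas (a sketch; y_true and y_pred are placeholder lists of actual and predicted numeric values):

    def error_measures(y_true, y_pred):
        d = len(y_true)
        mean_y  = sum(y_true) / d
        abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
        sq_err  = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
        mae  = sum(abs_err) / d                                       # mean absolute error
        mse  = sum(sq_err) / d                                        # mean squared error
        rae  = sum(abs_err) / sum(abs(t - mean_y) for t in y_true)    # relative absolute error
        rse  = sum(sq_err)  / sum((t - mean_y) ** 2 for t in y_true)  # relative squared error
        rmse = mse ** 0.5                                             # root mean squared error
        return mae, mse, rae, rse, rmse

    print(error_measures([3.0, 5.0, 2.5, 7.0], [2.5, 5.0, 4.0, 8.0]))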

Evaluating the Accuracy of a Classifier or Predictor (I). Holdout method: the given data is randomly partitioned into two independent sets, a training set (e.g., 2/3) for model construction and a test set (e.g., 1/3) for accuracy estimation. Random sampling: a variation of holdout; repeat holdout k times, accuracy = average of the accuracies obtained. Cross-validation (k-fold, where k = 10 is most popular): randomly partition the data into k mutually exclusive subsets D_1, ..., D_k, each of approximately equal size; at the i-th iteration, use D_i as the test set and the others as the training set. Leave-one-out: k folds where k = the number of tuples, for small-sized data. Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data.

Holdout estimation What to do if the amount of data is limited? The holdout method reserves a certain amount for testing and uses the remainder for training Usually: one third for testing, the rest for training

Holdout estimation. Problem: the samples might not be representative. Example: a class might be missing in the test data. An advanced version uses stratification, which ensures that each class is represented with approximately equal proportions in both subsets.
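
A stratified holdout split takes only a few lines with scikit-learn (a sketch, not part of the slides; the iris data just stands in for any labelled dataset):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    # Reserve one third for testing; stratify=y keeps class proportions equal in both subsets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=0)

    model = GaussianNB().fit(X_train, y_train)
    print(model.score(X_test, y_test))    # holdout accuracy estimate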

Repeated holdout method. The holdout estimate can be made more reliable by repeating the process with different subsamples. In each iteration, a certain proportion is randomly selected for training (possibly with stratification). The error rates on the different iterations are averaged to yield an overall error rate.

Repeated holdout method Still not optimum: the different test sets overlap Can we prevent overlapping? Of course!

Cross-validation Cross-validation avoids overlapping test sets First step: split data into k subsets of equal size Second step: use each subset in turn for testing, the remainder for training Called k-fold cross-validation

Cross-validation. Often the subsets are stratified before the cross-validation is performed. The error estimates are averaged to yield an overall error estimate.

More on cross-validation. Standard method for evaluation: stratified ten-fold cross-validation. Why ten? Empirical evidence supports this as a good choice to get an accurate estimate, and there is also some theoretical evidence for it. Stratification reduces the estimate's variance. Even better: repeated stratified cross-validation, e.g. ten-fold cross-validation repeated ten times with the results averaged (this reduces the variance).
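
In the same vein, repeated stratified ten-fold cross-validation is a few lines with scikit-learn (a sketch; the iris data is again just a stand-in):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(GaussianNB(), X, y, cv=cv)    # 100 accuracy estimates
    print(scores.mean(), scores.std())                     # averaged estimate and its spread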

Leave-One-Out cross-validation Leave-One-Out: a particular form of cross-validation: Set number of folds to number of training instances I.e., for n training instances, build classifier n times Makes best use of the data Involves no random subsampling Very computationally expensive (exception: NN)
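
Leave-One-Out is just the n-fold special case, so the same tooling applies (a sketch; expect it to be slow on anything but small data):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    scores = cross_val_score(GaussianNB(), X, y, cv=LeaveOneOut())   # one fold per instance
    print(scores.mean())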

Leave-One-Out-CV and stratification Disadvantage of Leave-One-Out-CV: stratification is not possible It guarantees a non-stratified sample because there is only one instance in the test set!