Naïve Bayes. Slides are adapted from Sebastian Thrun (Udacity), Ke Chen, Jonathan Huang, I. H. Witten's and E. Frank's Data Mining, and Jeremy Wyatt.
2 Background. There are three methods to establish a classifier:
a) Model a classification rule directly. Examples: k-NN, decision trees, perceptron, SVM.
b) Model the probability of class memberships given input data. Example: perceptron with the cross-entropy cost.
c) Make a probabilistic model of data within each class. Examples: naive Bayes, model-based classifiers.
a) and b) are examples of discriminative classification; c) is an example of generative classification; b) and c) are both examples of probabilistic classification.
3 Probability Basics. Prior, conditional and joint probability for random variables:
Prior probability: P(x)
Conditional probability: P(x1 | x2), P(x2 | x1)
Joint probability: x = (x1, x2), P(x) = P(x1, x2)
Relationship: P(x1, x2) = P(x2 | x1) P(x1) = P(x1 | x2) P(x2)
Independence: P(x2 | x1) = P(x2), P(x1 | x2) = P(x1), P(x1, x2) = P(x1) P(x2)
Bayes rule: P(c | x) = P(x | c) P(c) / P(x), i.e. Posterior = Likelihood × Prior / Evidence. The left-hand side P(c | x) is the discriminative quantity; the likelihood P(x | c) is the generative one.
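As a concrete illustration of the Bayes rule above, here is a minimal sketch in Python; the two-class spam/ham setup and all the numbers are made up for illustration, not taken from the slides:

```python
# Hypothetical two-class example: c in {spam, ham}, x = "message contains the word 'offer'".
prior = {"spam": 0.3, "ham": 0.7}        # P(c)
likelihood = {"spam": 0.8, "ham": 0.1}   # P(x | c)

# Evidence: P(x) = sum over c of P(x | c) P(c)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior: P(c | x) = P(x | c) P(c) / P(x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)  # {'spam': 0.774..., 'ham': 0.225...}
```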
4 Probabilistic Classification. Establishing a probabilistic model for classification.
Discriminative model: P(c | x), c = 1, ..., L, x = (x1, ..., xn).
[Diagram: a discriminative probabilistic classifier maps an input x = (x1, x2, ..., xn) to outputs P(c1 | x), P(c2 | x), ..., P(cL | x).]
To train a discriminative classifier, regardless of its probabilistic or non-probabilistic nature, all training examples of the different classes must be used jointly to build up a single discriminative classifier. A probabilistic classifier outputs L probabilities for the L class labels, while a non-probabilistic classifier outputs a single label.
5 Probabilistic Classification. Establishing a probabilistic model for classification (cont.)
Generative model (must be probabilistic): P(x | c), c = 1, ..., L, x = (x1, ..., xn).
[Diagram: one generative probabilistic model per class; the model for class ci maps x = (x1, x2, ..., xn) to P(x | ci), for classes 1 through L.]
The L probabilistic models have to be trained independently; each is trained on only the examples of the same label. Together they output L probabilities P(x | c1), ..., P(x | cL) for a given input. "Generative" means that such a model can produce data subject to the distribution via sampling.
6 Probabilistic Classification. Maximum A Posteriori (MAP) classification rule: for an input x, find the largest of the L probabilities P(c1 | x), ..., P(cL | x) output by a discriminative probabilistic classifier, and assign x to label c* if P(c* | x) is the largest.
Generative classification with the MAP rule: apply the Bayes rule to convert the generative outputs into posterior probabilities,
P(ci | x) = P(x | ci) P(ci) / P(x) ∝ P(x | ci) P(ci), for i = 1, 2, ..., L,
then apply the MAP rule to assign a label. P(x) can be dropped because it is a common factor for all L probabilities.
11 Naïve Bayes
12 Things We'd Like to Do. Spam Classification: given an email, predict whether it is spam or not. Medical Diagnosis: given a list of symptoms, predict whether a patient has disease X or not. Weather: based on temperature, humidity, etc., predict if it will rain tomorrow.
13 Bayesian Classification. Problem statement: given features X1, X2, ..., Xn, predict a label Y.
Digit Recognition example: a classifier maps an image of a digit to the label 5. X1, ..., Xn ∈ {0, 1} (black vs. white pixels); Y ∈ {5, 6} (predict whether a digit is a 5 or a 6).
14 The Bayes Classifier. We saw that a good strategy is to predict the most probable label given the features, argmax over y of P(Y = y | X1, ..., Xn) (for example: what is the probability that the image represents a 5 given its pixels?). So how do we compute that?
15 The Bayes Classifier. Use Bayes rule!
P(Y | X1, ..., Xn) = P(X1, ..., Xn | Y) P(Y) / P(X1, ..., Xn), i.e. Posterior = Likelihood × Prior / Normalization Constant.
Why did this help? Well, we think that we might be able to specify how features are generated by the class label.
16 The Bayes Classifier. Let's expand this for our digit recognition task: compute P(Y = 5 | X1, ..., Xn) and P(Y = 6 | X1, ..., Xn). To classify, we'll simply compute these two probabilities and predict based on which one is greater.
17 Model Parameters. For the Bayes classifier, we need to learn two functions, the likelihood and the prior. How many parameters are required to specify the prior for our digit recognition example? (Supposing that each image is 30x30 pixels.) The problem with explicitly modeling P(X1, ..., Xn | Y) is that there are usually way too many parameters: we'll run out of space, we'll run out of time, and we'll need tons of training data (which is usually not available).
18 The Naïve Bayes Model. The Naïve Bayes Assumption: assume that all features are independent given the class label Y. Equationally speaking: P(X1, ..., Xn | Y) = P(X1 | Y) P(X2 | Y) ... P(Xn | Y). (We will discuss the validity of this assumption later.)
19 Why is this useful? Number of parameters for modeling P(X1, ..., Xn | Y): 2(2^n − 1). Number of parameters for modeling P(X1 | Y), ..., P(Xn | Y): 2n.
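To see how dramatic the saving is, here is a quick sketch evaluating both counts for the 30×30 binary-pixel digit example from the earlier slide:

```python
n = 30 * 30  # number of binary features (pixels)

full_joint = 2 * (2**n - 1)  # parameters for P(X1,...,Xn | Y), two classes
naive_bayes = 2 * n          # parameters for P(X1|Y), ..., P(Xn|Y), two classes

print(f"full joint: ~10^{len(str(full_joint)) - 1} parameters")  # ~10^271
print(f"naive Bayes: {naive_bayes} parameters")                  # 1800
```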
20 Naïve Bayes Training. Now that we've decided to use a Naïve Bayes classifier, we need to train it with some data: the MNIST training data.
21 Naïve Bayes Training. Training in Naïve Bayes is easy: estimate P(Y = v) as the fraction of records with Y = v; estimate P(Xi = u | Y = v) as the fraction of records with Y = v for which Xi = u. (This corresponds to Maximum Likelihood estimation of the model parameters.)
22 Naïve Bayes Training. In practice, some of these counts can be zero. Fix this by adding virtual counts to every feature value before normalizing. (This is like putting a prior on the parameters and doing MAP estimation instead of MLE.) This is called smoothing.
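A minimal training sketch combining the counting estimates above with virtual counts; the function name, data layout, and the smoothing constant k are illustrative assumptions, not from the slides:

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels, k=1.0):
    """MLE by counting, with k virtual counts added to every feature value."""
    n = len(labels)
    label_counts = Counter(labels)
    # feature_counts[(i, u, v)] = number of records with X_i = u and Y = v
    feature_counts = Counter()
    feature_values = defaultdict(set)
    for x, y in zip(records, labels):
        for i, u in enumerate(x):
            feature_counts[(i, u, y)] += 1
            feature_values[i].add(u)

    prior = {v: c / n for v, c in label_counts.items()}  # P(Y = v)

    def likelihood(i, u, v):
        # P(X_i = u | Y = v) with smoothing: (count + k) / (total + k * #values)
        return (feature_counts[(i, u, v)] + k) / (
            label_counts[v] + k * len(feature_values[i]))

    return prior, likelihood

# Tiny usage example with made-up records:
prior, likelihood = train_naive_bayes(
    records=[("sunny", "hot"), ("rain", "cool")], labels=["no", "yes"])
print(prior, likelihood(0, "sunny", "no"))  # {'no': 0.5, 'yes': 0.5} 0.666...
```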
23 Naïve Bayes Training. For binary digits, training amounts to averaging all of the training fives together and all of the training sixes together.
24 Naïve Bayes Classification
25 Naïve Bayes Assumption. Recall the Naïve Bayes assumption: that all features are independent given the class label Y. For an example where conditional independence fails, take Y = XOR(X1, X2):

X1  X2  P(Y=0 | X1,X2)  P(Y=1 | X1,X2)
0   0   1               0
0   1   0               1
1   0   0               1
1   1   1               0

Actually, the Naïve Bayes assumption is almost never true. Still, Naïve Bayes often performs surprisingly well even when its assumptions do not hold.
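A quick sketch of why the XOR example defeats Naïve Bayes: given either class, each feature alone is uniform, so the factored per-feature likelihoods carry no information about the label:

```python
from itertools import product

# Y = XOR(X1, X2): class-0 examples are (0,0),(1,1); class-1 are (0,1),(1,0).
data = [((x1, x2), x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

for y in (0, 1):
    xs = [x for x, label in data if label == y]
    for i in (0, 1):
        p1 = sum(x[i] for x in xs) / len(xs)
        print(f"P(X{i+1}=1 | Y={y}) = {p1}")  # 0.5 everywhere

# Every per-feature likelihood is 0.5, so Naive Bayes scores both classes
# identically on every input and cannot represent XOR.
```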
26 Naïve Bayes. Bayes classification: P(c | x) ∝ P(x | c) P(c) = P(x1, ..., xn | c) P(c), for c = 1, ..., L. Difficulty: learning the joint probability P(x1, ..., xn | c) is infeasible!
Naïve Bayes classification: assume all input features are class-conditionally independent! Applying the independence assumption:
P(x1, x2, ..., xn | c) = P(x1 | x2, ..., xn, c) P(x2, ..., xn | c) = P(x1 | c) P(x2, ..., xn | c) = P(x1 | c) P(x2 | c) ... P(xn | c)
Apply the MAP classification rule: assign x' = (a1, a2, ..., an) to c* if
[P(a1 | c*) ... P(an | c*)] P(c*) > [P(a1 | c) ... P(an | c)] P(c), for all c ≠ c*, c = 1, ..., L,
where the bracketed products are the estimates of P(a1, ..., an | c*) and P(a1, ..., an | c).
27 Naïve Bayes. Learning phase: for each target value ci (i = 1, ..., L), estimate P̂(ci) with the examples in S; for every feature value xjk of each feature xj (j = 1, ..., F; k = 1, ..., Nj), estimate P̂(xjk | ci) with the examples in S.
Test phase: given x' = (a1, ..., an), assign x' the label c* if
[P̂(a1 | c*) ... P̂(an | c*)] P̂(c*) > [P̂(a1 | ci) ... P̂(an | ci)] P̂(ci), for all ci ≠ c*, i = 1, ..., L.
28 Example: Play Tennis. [The slide shows the Play Tennis training dataset as a table.]
29 Example. Learning Phase:

Outlook      Play=Yes  Play=No
Sunny        2/9       3/5
Overcast     4/9       0/5
Rain         3/9       2/5

Temperature  Play=Yes  Play=No
Hot          2/9       2/5
Mild         4/9       2/5
Cool         3/9       1/5

Humidity     Play=Yes  Play=No
High         3/9       4/5
Normal       6/9       1/5

Wind         Play=Yes  Play=No
Strong       3/9       3/5
Weak         6/9       2/5

P(Play=Yes) = 9/14, P(Play=No) = 5/14
30 Example. Test Phase: given a new instance x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong), look up the tables achieved in the learning phase:
P(Outlook=Sunny | Play=Yes) = 2/9        P(Outlook=Sunny | Play=No) = 3/5
P(Temperature=Cool | Play=Yes) = 3/9     P(Temperature=Cool | Play=No) = 1/5
P(Humidity=High | Play=Yes) = 3/9        P(Humidity=High | Play=No) = 4/5
P(Wind=Strong | Play=Yes) = 3/9          P(Wind=Strong | Play=No) = 3/5
P(Play=Yes) = 9/14                       P(Play=No) = 5/14
Decision making with the MAP rule:
P(Yes | x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No | x') ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Given the fact that P(Yes | x') < P(No | x'), we label x' to be No.
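The test-phase arithmetic above can be checked with a few lines; the numbers come straight from the learned tables:

```python
# Unnormalized posteriors for x' = (Sunny, Cool, High, Strong)
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)

print(f"P(Yes | x') ∝ {p_yes:.4f}")  # 0.0053
print(f"P(No  | x') ∝ {p_no:.4f}")   # 0.0206
print("predict:", "Yes" if p_yes > p_no else "No")  # No
```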
31 Naïve Bayes Algorithm: Continuous-valued Features. A continuous-valued feature takes numberless (uncountably many) values, so its conditional probability is often modeled with the normal distribution:
P̂(xj | ci) = 1 / (sqrt(2π) σji) · exp( −(xj − μji)² / (2 σji²) )
where μji is the mean (average) of the feature-xj values of the examples for which c = ci, and σji is the standard deviation of the feature-xj values of the examples for which c = ci.
Learning phase: for X = (X1, ..., Xn) and C = c1, ..., cL, output n × L normal distributions and P(C = ci), i = 1, ..., L.
Test phase: given an unknown instance X' = (a1, ..., an), instead of looking up tables, calculate the conditional probabilities with the normal distributions achieved in the learning phase, then apply the MAP rule to assign a label (the same as done for the discrete case).
32 Naïve Bayes Example: Continuous-valued Features. Temperature is naturally of continuous value.
Yes: 25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8
No: 27.3, 30.1, 17.4, 29.5, 15.1
Estimate the mean and standard deviation for each class, μ = (1/N) Σ xn and σ² = (1/(N−1)) Σ (xn − μ)², giving μYes = 21.64, σYes = 2.35 and μNo = 23.88, σNo = 7.09.
Learning phase: output two Gaussian models for P(temp | C):
P̂(x | Yes) = 1 / (2.35 sqrt(2π)) · exp( −(x − 21.64)² / (2 · 2.35²) )
P̂(x | No) = 1 / (7.09 sqrt(2π)) · exp( −(x − 23.88)² / (2 · 7.09²) )
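A minimal sketch of the continuous-feature case, reproducing the estimates above (it assumes the sample standard deviation, i.e. dividing by N−1, which matches the slide's σ values):

```python
import math

yes = [25.2, 19.3, 18.5, 21.7, 20.1, 24.3, 22.8, 23.1, 19.8]
no  = [27.3, 30.1, 17.4, 29.5, 15.1]

def fit_gaussian(xs):
    """Estimate mean and sample standard deviation of a class's feature values."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)
    return mu, math.sqrt(var)

def gaussian_pdf(x, mu, sigma):
    """Class-conditional density P(x | c) under the fitted normal model."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu_y, sd_y = fit_gaussian(yes)  # (21.64, 2.35)
mu_n, sd_n = fit_gaussian(no)   # (23.88, 7.09)
print(gaussian_pdf(22.0, mu_y, sd_y), gaussian_pdf(22.0, mu_n, sd_n))
```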
33 Zero conditional probability. If no training example contains the feature value xj = ajk for a given class ci, then P̂(ajk | ci) = 0, and during test we face a zero conditional probability problem: the whole product P̂(x1 | ci) ... P̂(ajk | ci) ... P̂(xn | ci) = 0.
For a remedy, class conditional probabilities are re-estimated with the m-estimate:
P̂(ajk | ci) = (nc + m p) / (n + m)
where n is the number of training examples for which c = ci; nc is the number of training examples for which c = ci and xj = ajk; p is a prior estimate (usually p = 1/t for t possible values of xj); and m is the weight given to the prior (the number of "virtual" examples, m ≥ 1).
34 Zero conditional probability. Example: P(Outlook=Overcast | No) = 0 in the Play Tennis dataset. Fix it by adding m virtual examples (m: up to 1% of the number of training examples). In this dataset, the number of training examples for the No class is 5, so we can only add m = 1 virtual example in our m-estimate remedy. The Outlook feature can take only 3 values, so p = 1/3. Re-estimating P(Outlook=Overcast | No) with the m-estimate gives (0 + 1 × 1/3) / (5 + 1) = 1/18.
35 Numerical Stability. It is often the case that machine learning algorithms need to work with very small numbers. Imagine computing the probability of 2000 independent coin flips: MATLAB thinks that (.5)^2000 = 0.
36 Underflow Prevention. Multiplying lots of probabilities leads to floating-point underflow. Recall: log(xy) = log(x) + log(y), so it is better to sum logs of probabilities rather than multiplying the probabilities themselves.
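The underflow and its log-space fix are easy to demonstrate; Python floats behave like MATLAB's here:

```python
import math

p = 0.5 ** 2000
print(p)  # 0.0 -- the product has underflowed

log_p = 2000 * math.log(0.5)
print(log_p)  # -1386.29... -- the log of the same quantity, perfectly representable
```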
37 Underflow Prevention. The class with the highest final un-normalized log probability score is still the most probable:
cNB = argmax over cj ∈ C of [ log P(cj) + Σ over positions i of log P(xi | cj) ]
38 Numerical Stability. Instead of comparing P(Y=5 | X1, ..., Xn) with P(Y=6 | X1, ..., Xn), compare their logarithms.
39 Recovering the Probabilities. What if we want the probabilities, though? Suppose that for some unknown constant K, we have only log P(Y=5 | X1, ..., Xn) + K and log P(Y=6 | X1, ..., Xn) + K. How would we recover the original probabilities?
40 Recovering the Probabilities. Given the log scores up to a constant, note that for any constant C, adding C to every log score leaves the normalized probabilities unchanged. One suggestion: set C such that the greatest log score is shifted to zero, then exponentiate and normalize.
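A sketch of the suggested recovery step: shift the log scores so the largest is zero, exponentiate, and normalize. The shift cancels in the normalization, which is why any constant C works; the function name and the example scores are illustrative:

```python
import math

def recover_probabilities(log_scores):
    """Turn un-normalized log posteriors (known only up to a constant K)
    back into probabilities, shifting by the max for numerical safety."""
    c = max(log_scores)                                # shift greatest score to zero
    exp_shifted = [math.exp(s - c) for s in log_scores]
    total = sum(exp_shifted)
    return [e / total for e in exp_shifted]

print(recover_probabilities([-1386.3, -1389.1]))  # ≈ [0.943, 0.057]
```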
41 Recap. We defined a Bayes classifier but saw that it's intractable to compute P(X1, ..., Xn | Y). We then used the Naïve Bayes assumption that everything is independent given the class label Y. A natural question: is there some happy compromise where we only assume that some features are conditionally independent? Stay tuned.
42 Conclusions. Naïve Bayes is: really easy to implement and often works well; often a good first thing to try; commonly used as a punching bag for smarter algorithms.
43 Evaluating classification algorithms. You have designed a new classifier. You give it to me, and I try it on my image dataset.
44 Evaluating classification algorithms. I tell you that it achieved 95% accuracy on my data. Is your technique a success?
45 Types of errors. But suppose that the 95% is the correctly classified pixels, only 5% of the pixels are actually edges, and it misses all the edge pixels. How do we count the effect of different types of error?
46 Types of errors.

                  Ground Truth
Prediction        Edge              Not Edge
Edge              True Positive     False Positive
Not edge          False Negative    True Negative
47 Two parts to each label: whether you got it correct or not, and what you guessed. For example, for a particular pixel, our guess might be labelled True Positive. Did we get it correct? True, we did get it correct. What did we say? We said positive, i.e. edge. Or maybe it was labelled as one of the others, maybe False Negative. Did we get it correct? False, we did not get it correct. What did we say? We said negative, i.e. not edge.
48 Sensitivity and Specificity. Count up the total number of each label (TP, FP, TN, FN) over a large dataset. In ROC analysis, we use two statistics:
Sensitivity = TP / (TP + FN): can be thought of as the likelihood of spotting a positive case when presented with one, or the proportion of edges we find.
Specificity = TN / (TN + FP): can be thought of as the likelihood of spotting a negative case when presented with one, or the proportion of non-edges we find.
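Both statistics are one-liners once the four counts are tallied; a minimal sketch, with hypothetical counts for illustration (not from the slides):

```python
def sensitivity(tp, fn):
    return tp / (tp + fn)  # proportion of actual positives (edges) we find

def specificity(tn, fp):
    return tn / (tn + fp)  # proportion of actual negatives (non-edges) we find

print(sensitivity(tp=80, fn=10))  # 0.888...
print(specificity(tn=95, fp=5))   # 0.95
```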
49 Sensitivity = TP / (TP + FN) = ?  Specificity = TN / (TN + FP) = ?
[Worked example shown as a 2x2 table on the slide: 90 cases in the dataset were class 1 (edge), 100 cases were class 0 (non-edge), 190 examples (pixels) in the data overall.]
50 The ROC space. [Plot: sensitivity vs. 1 − specificity on a unit square, marking edge detector A and edge detector B as points.]
51 The ROC Curve. Draw a convex hull around many points. [Plot: sensitivity vs. 1 − specificity, with a convex hull drawn over the detector points; one point lies below the hull and is not on it.]
52 ROC Analysis. All the optimal detectors lie on the convex hull. [Plot: sensitivity vs. 1 − specificity.] Which of these is best depends on the ratio of edges to non-edges, and on the different costs of misclassification. Any detector on the other side of the chance diagonal can lead to a better detector by flipping its output.
Take-home point: you should always quote sensitivity and specificity for your algorithm, if possible plotting an ROC graph. Remember also, though, that any statistic you quote should be an average over a suitable range of tests for your algorithm.
53 Holdout estimation. What to do if the amount of data is limited? The holdout method reserves a certain amount for testing and uses the remainder for training. Usually: one third for testing, the rest for training.
54 Holdout estimation. Problem: the samples might not be representative. Example: a class might be missing in the test data. An advanced version uses stratification, which ensures that each class is represented with approximately equal proportions in both subsets.
55 Repeated holdout method. Repeating the process with different subsamples makes the estimate more reliable. In each iteration, a certain proportion is randomly selected for training (possibly with stratification). The error rates on the different iterations are averaged to yield an overall error rate.
56 Repeated holdout method. Still not optimum: the different test sets overlap. Can we prevent overlapping? Of course!
57 Cross-validation. Cross-validation avoids overlapping test sets. First step: split the data into k subsets of equal size. Second step: use each subset in turn for testing, and the remainder for training. This is called k-fold cross-validation.
58 Cross-validation. Often the subsets are stratified before the cross-validation is performed. The error estimates are averaged to yield an overall error estimate.
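A minimal k-fold split, assuming only that the data fits in a list; stratification is omitted for brevity, and the round-robin fold assignment is one simple choice among many:

```python
def k_fold_splits(data, k):
    """Yield (train, test) pairs; each item appears in exactly one test set."""
    folds = [data[i::k] for i in range(k)]  # round-robin assignment to k folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

data = list(range(10))
for train, test in k_fold_splits(data, k=5):
    print(test)  # non-overlapping test sets that together cover all of data
```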
59 More on cross-validation. The standard method for evaluation is stratified ten-fold cross-validation. Why ten? Empirical evidence supports this as a good choice to get an accurate estimate, and there is also some theoretical evidence for it. Stratification reduces the estimate's variance. Even better: repeated stratified cross-validation, e.g. ten-fold cross-validation repeated ten times with the results averaged (this reduces the variance).
60 Leave-One-Out cross-validation. Leave-One-Out is a particular form of cross-validation: set the number of folds to the number of training instances, i.e., for n training instances, build the classifier n times. It makes the best use of the data and involves no random subsampling, but it is very computationally expensive (exception: NN).
61 Leave-One-Out CV and stratification. Disadvantage of Leave-One-Out CV: stratification is not possible. It guarantees a non-stratified sample because there is only one instance in the test set!
62 Bayesian Networks. Slides are adapted from Jason Corso.
63 Bayes Nets
64 Example
65 Full Joint distribution