Case Study I: Naïve Bayesian spam filtering

Mike Wiper and Conchi Ausín, Department of Statistics, Universidad Carlos III de Madrid
Advanced Statistics and Data Mining Summer School, 26th - 30th June, 2017

Objective

[Word cloud of common SMS words]

We illustrate how to use Bayes' theorem to design a simple spam detector:

Pr(spam | money) = Pr(money | spam) Pr(spam) / Pr(money)

Spam and ham e-mails

[Examples of spam and ham e-mails]

Example

Assume that we have the following set of e-mails classified as spam or ham:

spam: send us your password
ham: send us your review
ham: password review
spam: review us
spam: send your password
spam: send us your account

We are interested in classifying the following new e-mail as spam or ham:

new: review us now

Using Bayes' theorem

spam: send us your password
ham: send us your review
ham: password review
spam: review us
spam: send your password
spam: send us your account

Prior probabilities are:

Pr(spam) = 4/6, Pr(ham) = 2/6

The posterior probability that an e-mail containing the word review is spam is:

Pr(spam | review) = Pr(review | spam) Pr(spam) / [Pr(review | spam) Pr(spam) + Pr(review | ham) Pr(ham)]
= (1/4 · 4/6) / (1/4 · 4/6 + 2/2 · 2/6) = 1/3
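The arithmetic above can be checked directly in R; this is a quick sketch, with the fractions read off the six training e-mails:

```r
# Prior probabilities from the six training e-mails (4 spam, 2 ham)
p_spam <- 4/6
p_ham  <- 2/6

# Likelihood of the word "review": appears in 1 of 4 spam and 2 of 2 ham e-mails
p_review_spam <- 1/4
p_review_ham  <- 2/2

# Bayes' theorem
p_spam_review <- p_review_spam * p_spam /
  (p_review_spam * p_spam + p_review_ham * p_ham)
p_spam_review  # 1/3
```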

For several words...

             Pr(· | spam)   Pr(· | ham)
review          1/4            2/2
send            3/4            1/2
us              3/4            1/2
your            3/4            1/2
password        2/4            1/2
account         1/4            0/2

Assuming that the words in each message are independent events:

Pr(review us now | spam) = Pr({1, 0, 1, 0, 0, 0} | spam)
= (1/4)(1 - 3/4)(3/4)(1 - 3/4)(1 - 2/4)(1 - 1/4) = 9/2048 ≈ 0.0044

Pr(review us now | ham) = Pr({1, 0, 1, 0, 0, 0} | ham)
= (2/2)(1 - 1/2)(1/2)(1 - 1/2)(1 - 1/2)(1 - 0/2) = 1/16 = 0.0625

Using Bayes' theorem

Then, the posterior probability that the new e-mail review us now is spam is:

Pr(spam | review us now) = Pr(spam | {1, 0, 1, 0, 0, 0})
= Pr({1,0,1,0,0,0} | spam) Pr(spam) / [Pr({1,0,1,0,0,0} | spam) Pr(spam) + Pr({1,0,1,0,0,0} | ham) Pr(ham)]
= (9/2048 · 4/6) / (9/2048 · 4/6 + 1/16 · 2/6) = 9/73 ≈ 0.12

Consequently, the new e-mail will be classified as ham.

Note that the independence assumption between words will not in general be satisfied, but it can be useful as a naive approach.
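The full naive Bayes calculation for the new message can be sketched in R as follows, with the word probabilities taken from the table above:

```r
# Word order: review, send, us, your, password, account
x <- c(1, 0, 1, 0, 0, 0)               # "review us now" ("now" is not in the vocabulary)
p_spam <- 4/6; p_ham <- 2/6            # prior probabilities
theta_spam <- c(1, 3, 3, 3, 2, 1) / 4  # Pr(word | spam)
theta_ham  <- c(2, 1, 1, 1, 1, 0) / 2  # Pr(word | ham)

# Likelihoods under the independence (naive Bayes) assumption
lik_spam <- prod(theta_spam^x * (1 - theta_spam)^(1 - x))
lik_ham  <- prod(theta_ham^x  * (1 - theta_ham)^(1 - x))

# Posterior probability of spam
post_spam <- lik_spam * p_spam / (lik_spam * p_spam + lik_ham * p_ham)
post_spam  # 9/73, about 0.12, so the message is classified as ham
```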

Example: SMS spam data

Consider the file sms.csv, which contains a collection of SMS records classified as spam or ham.

rm(list=ls())
sms <- read.csv("sms.csv")
str(sms)

We want to use a naive Bayes classifier to build a spam filter based on the words in the message.

Prepare the Corpus

A corpus is a collection of documents.

library(tm)
corpus <- Corpus(VectorSource(sms$text))
print(corpus)

Here, VectorSource tells the Corpus function that each document is an entry in the vector.

Clean the Corpus

Different texts may contain Hello!, Hello, hello, etc. We would like to treat all of these as the same word. We clean up the corpus with the tm_map function.

Translate all letters to lower case:

clean_corpus <- tm_map(corpus, content_transformer(tolower))
inspect(clean_corpus[1:3])

Remove numbers:

clean_corpus <- tm_map(clean_corpus, removeNumbers)

Remove punctuation:

clean_corpus <- tm_map(clean_corpus, removePunctuation)

Clean the Corpus

Remove common non-content words, like to, and, the, ... These are called stop words. The function stopwords returns a list of about 175 such words.

stopwords()[1:10]
clean_corpus <- tm_map(clean_corpus, removeWords, stopwords())

Remove excess white space:

clean_corpus <- tm_map(clean_corpus, stripWhitespace)

Word clouds

We create word clouds to visualize the differences between the two message types, ham and spam. First, obtain the indices of spam and ham messages:

spam_indices <- which(sms$type == "spam")
spam_indices[1:3]
ham_indices <- which(sms$type == "ham")
ham_indices[1:3]

Word clouds

library(wordcloud)
wordcloud(clean_corpus[ham_indices], min.freq=40)

wordcloud(clean_corpus[spam_indices], min.freq=40)

Building a spam filter

Divide the data into training and test sets, using 75% for training and 25% for testing:

sms_train <- sms[1:4169,]
sms_test <- sms[4170:5559,]

And the clean corpus:

corpus_train <- clean_corpus[1:4169]
corpus_test <- clean_corpus[4170:5559]

Compute the frequency of terms

Using DocumentTermMatrix, we create a sparse matrix data structure in which the rows of the matrix refer to documents and the columns refer to words.

sms_dtm <- DocumentTermMatrix(clean_corpus)
inspect(sms_dtm[1:4, 30:35])

Divide the matrix into training and test rows:

sms_dtm_train <- sms_dtm[1:4169,]
sms_dtm_test <- sms_dtm[4170:5559,]

Identify frequently used words

Don't muddy the classifier with words that occur only a few times. To identify words appearing at least 5 times:

S <- colSums(as.matrix(sms_dtm_train))
S.index <- which(S > 4)
length(S.index)

Create document-term matrices using only the frequent words:

sms_dtm_train <- sms_dtm_train[, S.index]
sms_dtm_test <- sms_dtm_test[, S.index]

Convert count information to Yes/No

Naive Bayes classification needs presence/absence information for each word in a message, but we have counts of occurrences. To convert the document-term matrices:

convert_count <- function(x){
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
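A quick check of the conversion on a toy count vector (the values here are made up, just to illustrate the mapping from counts to factor levels):

```r
# Map counts to a "No"/"Yes" presence factor, as in the slide above
convert_count <- function(x){
  y <- ifelse(x > 0, 1, 0)
  factor(y, levels = c(0, 1), labels = c("No", "Yes"))
}

out <- convert_count(c(0, 3, 1))  # hypothetical counts for three documents
as.character(out)                 # "No" "Yes" "Yes"
```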

Convert document-term matrices

sms_dtm_train <- apply(sms_dtm_train, 2, convert_count)
sms_dtm_train[1:4, 30:35]
sms_dtm_test <- apply(sms_dtm_test, 2, convert_count)
sms_dtm_test[1:4, 30:35]

Create a naive Bayes classifier

We will use the naive Bayes classifier provided in the package e1071.

library(e1071)

We create the classifier using the training data:

classifier <- naiveBayes(sms_dtm_train, sms_train$type)
class(classifier)

Evaluate the performance on the test data

Given the classifier object, we can use the predict function to test the model on new data:

predictions <- predict(classifier, newdata=sms_dtm_test)

Classifications of messages in the test set are based on the probabilities generated with the training set.

Check the predictions against reality

We have predictions and we have a factor of the real spam/ham classifications. Generate a confusion table of predictions against true types:

table(predictions, sms_test$type)

The filter correctly classifies 99% of the ham and 83% of the spam: a good balance.

A Bayesian approach

Consider the variable Z such that:

Z = 1 if spam, Z = 0 if ham,

with Pr(Z = 1 | θ) = θ. We assume a conjugate prior distribution, θ ~ Beta(a, b).

Given a sample of spam and ham e-mails, z = {z_1, ..., z_n}, the posterior distribution is

θ | z ~ Beta(a + n_S, b + n_H),

where:

n_S = Σ_{i=1}^n I(z_i = 1) is the number of spam e-mails;
n_H = Σ_{i=1}^n I(z_i = 0) is the number of ham e-mails.

A Bayesian approach

The predictive probability for a new e-mail is:

Pr(Z_new | z) = ∫ Pr(Z_new | θ) f(θ | z) dθ
= ∫ θ^{z_new} (1 - θ)^{1 - z_new} θ^{a + n_S - 1} (1 - θ)^{b + n_H - 1} / B(a + n_S, b + n_H) dθ
= B(a + n_S + z_new, b + n_H + 1 - z_new) / B(a + n_S, b + n_H)

Therefore,

Pr(spam | data) = Pr(Z_new = 1 | z) = (a + n_S) / (a + b + n_S + n_H)

and

Pr(ham | data) = Pr(Z_new = 0 | z) = (b + n_H) / (a + b + n_S + n_H).
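The Beta-function ratio and the closed form agree, which can be verified in R with base beta(); the values of a, b, n_S and n_H below are illustrative (a uniform prior and the counts from the running example):

```r
a <- 1; b <- 1; n_S <- 4; n_H <- 2  # illustrative prior and counts

# Predictive probability via the ratio of Beta functions (z_new = 1)
p_pred <- beta(a + n_S + 1, b + n_H) / beta(a + n_S, b + n_H)

# Closed form: (a + n_S) / (a + b + n_S + n_H)
closed <- (a + n_S) / (a + b + n_S + n_H)
c(p_pred, closed)  # both equal 0.625
```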

Using the Bayesian approach

spam: send us your password
ham: send us your review
ham: password review
spam: review us
spam: send your password
spam: send us your account

Given these data, z = {1, 0, 0, 1, 1, 1}, we have that:

Pr(spam | z) = (a + 4) / (a + b + 6)
Pr(ham | z) = (b + 2) / (a + b + 6)

A Bayesian approach

Similarly, we may consider the variable X such that:

X = 1 if the e-mail contains review, X = 0 otherwise,

with

Pr(X = 1 | Z = 1, θ_S) = θ_S
Pr(X = 1 | Z = 0, θ_H) = θ_H.

We assume conjugate prior distributions, θ_S ~ Beta(a_S, b_S) and θ_H ~ Beta(a_H, b_H).

A Bayesian approach

Then, given a sample of e-mails, x = {x_1, ..., x_n} and z = {z_1, ..., z_n}, the posterior distributions are

θ_S | x, z ~ Beta(a_S + n_S1, b_S + n_S0),
θ_H | x, z ~ Beta(a_H + n_H1, b_H + n_H0),

where:

n_S1 = Σ_{i=1}^n I(z_i = 1, x_i = 1) is the number of spam e-mails with review;
n_H1 = Σ_{i=1}^n I(z_i = 0, x_i = 1) is the number of ham e-mails with review;

and n_S0 = n_S - n_S1 and n_H0 = n_H - n_H1.

Given the data:

Pr(review | spam, z, x) = Pr(X_new = 1 | Z_new = 1, z, x) = (a_S + n_S1) / (a_S + b_S + n_S)
Pr(review | ham, z, x) = Pr(X_new = 1 | Z_new = 0, z, x) = (a_H + n_H1) / (a_H + b_H + n_H)

A Bayesian approach

spam: send us your password
ham: send us your review
ham: password review
spam: review us
spam: send your password
spam: send us your account

Given the data, the probability that an e-mail with the word review is spam is:

Pr(spam | review, z, x)
= Pr(review | spam, z, x) Pr(spam | z) / [Pr(review | spam, z, x) Pr(spam | z) + Pr(review | ham, z, x) Pr(ham | z)]

= [(a_S + 1)/(a_S + b_S + 4)] [(a + 4)/(a + b + 6)]
  / { [(a_S + 1)/(a_S + b_S + 4)] [(a + 4)/(a + b + 6)] + [(a_H + 2)/(a_H + b_H + 2)] [(b + 2)/(a + b + 6)] }
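With illustrative uniform Beta(1, 1) priors for every parameter, the formula above can be evaluated numerically; this sketch uses the counts from the six training e-mails:

```r
# Illustrative uniform priors (a = b = a_S = b_S = a_H = b_H = 1)
a <- 1; b <- 1; a_S <- 1; b_S <- 1; a_H <- 1; b_H <- 1
n_S <- 4; n_H <- 2    # spam / ham counts
n_S1 <- 1; n_H1 <- 2  # spam / ham e-mails containing "review"

p_rev_spam <- (a_S + n_S1) / (a_S + b_S + n_S)  # Pr(review | spam, z, x) = 1/3
p_rev_ham  <- (a_H + n_H1) / (a_H + b_H + n_H)  # Pr(review | ham, z, x)  = 3/4
p_spam     <- (a + n_S) / (a + b + n_S + n_H)   # Pr(spam | z) = 5/8
p_ham      <- (b + n_H) / (a + b + n_S + n_H)   # Pr(ham | z)  = 3/8

post <- p_rev_spam * p_spam / (p_rev_spam * p_spam + p_rev_ham * p_ham)
post  # 20/47, about 0.43
```

Note that with flat priors the smoothed answer (about 0.43) sits above the empirical naive Bayes answer of 1/3, since the prior pulls both word probabilities toward 1/2.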

Summary

We have shown how to use Bayes' theorem to construct a spam detector.

First, we constructed a clean corpus based on a collection of texts.
Then, we obtained a word cloud for each group of texts.
Using a (training) data subset, we constructed a document-term matrix using the most frequent words.
We constructed the naive Bayes classifier based on empirical frequencies.
Finally, we showed how to extend the naive Bayes classifier using a Bayesian approach to combine prior beliefs with the data information.


More information

CS 343: Artificial Intelligence

CS 343: Artificial Intelligence CS 343: Artificial Intelligence Naïve Bayes Prof. Scott Niekum The University of Texas at Austin [These slides based on those of Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mining Massive Datasets Jure Leskovec, Stanford University http://cs46.stanford.edu /7/ Jure Leskovec, Stanford C46: Mining Massive Datasets Many real-world problems Web Search and Text Mining Billions

More information

CS 188: Artificial Intelligence Spring Announcements

CS 188: Artificial Intelligence Spring Announcements CS 188: Artificial Intelligence Spring 2011 Lecture 20: Naïve Bayes 4/11/2011 Pieter Abbeel UC Berkeley Slides adapted from Dan Klein. W4 due right now Announcements P4 out, due Friday First contest competition

More information

Using the MyKidsSpending website

Using the MyKidsSpending website Using the MyKidsSpending website Use these links to jump to the portion of the guide discussing that topic: Creating your MyKidsSpending account, or adding more students Logging in to your account I can

More information

Lesson 2 page 1. ipad # 17 Font Size for Notepad (and other apps) Task: Program your default text to be smaller or larger for Notepad

Lesson 2 page 1. ipad # 17 Font Size for Notepad (and other apps) Task: Program your default text to be smaller or larger for Notepad Lesson 2 page 1 1/20/14 Hi everyone and hope you feel positive about your first week in the course. Our WIKI is taking shape and I thank you for contributing. I have had a number of good conversations

More information

ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning

ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning ECE 6504: Advanced Topics in Machine Learning Probabilistic Graphical Models and Large-Scale Learning Topics Bayes Nets (Finish) Parameter Learning Structure Learning Readings: KF 18.1, 18.3; Barber 9.5,

More information

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set.

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set. Evaluate what? Evaluation Charles Sutton Data Mining and Exploration Spring 2012 Do you want to evaluate a classifier or a learning algorithm? Do you want to predict accuracy or predict which one is better?

More information

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings.

Karami, A., Zhou, B. (2015). Online Review Spam Detection by New Linguistic Features. In iconference 2015 Proceedings. Online Review Spam Detection by New Linguistic Features Amir Karam, University of Maryland Baltimore County Bin Zhou, University of Maryland Baltimore County Karami, A., Zhou, B. (2015). Online Review

More information

Data Mining Lecture 2: Recommender Systems

Data Mining Lecture 2: Recommender Systems Data Mining Lecture 2: Recommender Systems Jo Houghton ECS Southampton February 19, 2019 1 / 32 Recommender Systems - Introduction Making recommendations: Big Money 35% of Amazons income from recommendations

More information

What happens to money remaining on the account at the end of the school year? Can I transfer the balance from one student's account to another?

What happens to money remaining on the account at the end of the school year? Can I transfer the balance from one student's account to another? Frequently Asked Questions (FAQ) What happens to money remaining on the account at the end of the school year? Typically, any money remaining on the account is rolled over to the next school year. Please

More information

ANDROID SHORT MESSAGES FILTERING FOR BAHASA USING MULTINOMIAL NAIVE BAYES

ANDROID SHORT MESSAGES FILTERING FOR BAHASA USING MULTINOMIAL NAIVE BAYES ANDROID SHORT MESSAGES FILTERING FOR BAHASA USING MULTINOMIAL NAIVE BAYES Shaufiah, Imanudin and Ibnu Asror Mining Center Laboratory, School of Computing, Telkom University, Bandung, Indonesia E-Mail:

More information

Introduction. Matlab for Psychologists. Overview. Coding v. button clicking. Hello, nice to meet you. Variables

Introduction. Matlab for Psychologists. Overview. Coding v. button clicking. Hello, nice to meet you. Variables Introduction Matlab for Psychologists Matlab is a language Simple rules for grammar Learn by using them There are many different ways to do each task Don t start from scratch - build on what other people

More information

POULIN HUGIN Patterns and Predictions. Version 1.6

POULIN HUGIN Patterns and Predictions. Version 1.6 POULIN HUGIN Patterns and Predictions Version 1.6 P-H Patterns and Predictions Manual Copyright 1990 2007 by Hugin Expert A/S and Poulin Holdings LLC. All Rights Reserved. This manual was prepared using

More information

Text Classification. Dr. Johan Hagelbäck.

Text Classification. Dr. Johan Hagelbäck. Text Classification Dr. Johan Hagelbäck johan.hagelback@lnu.se http://aiguy.org Document Classification A very common machine learning problem is to classify a document based on its text contents We use

More information

Recommender Systems - Introduction. Data Mining Lecture 2: Recommender Systems

Recommender Systems - Introduction. Data Mining Lecture 2: Recommender Systems Recommender Systems - Introduction Making recommendations: Big Money 35% of amazons income from recommendations Netflix recommendation engine worth $ Billion per year And yet, Amazon seems to be able to

More information

Detecting ads in a machine learning approach

Detecting ads in a machine learning approach Detecting ads in a machine learning approach Di Zhang (zhangdi@stanford.edu) 1. Background There are lots of advertisements over the Internet, who have become one of the major approaches for companies

More information

CS294-1 Final Project. Algorithms Comparison

CS294-1 Final Project. Algorithms Comparison CS294-1 Final Project Algorithms Comparison Deep Learning Neural Network AdaBoost Random Forest Prepared By: Shuang Bi (24094630) Wenchang Zhang (24094623) 2013-05-15 1 INTRODUCTION In this project, we

More information

CSEP 573: Artificial Intelligence

CSEP 573: Artificial Intelligence CSEP 573: Artificial Intelligence Machine Learning: Perceptron Ali Farhadi Many slides over the course adapted from Luke Zettlemoyer and Dan Klein. 1 Generative vs. Discriminative Generative classifiers:

More information

1 Document Classification [60 points]

1 Document Classification [60 points] CIS519: Applied Machine Learning Spring 2018 Homework 4 Handed Out: April 3 rd, 2018 Due: April 14 th, 2018, 11:59 PM 1 Document Classification [60 points] In this problem, you will implement several text

More information

Building Bi-lingual Anti-Spam SMS Filter

Building Bi-lingual Anti-Spam SMS Filter Building Bi-lingual Anti-Spam SMS Filter Heba Adel,Dr. Maha A. Bayati Abstract Short Messages Service (SMS) is one of the most popular telecommunication service packages that is used permanently due to

More information

AI32 Guide to Weka. Andrew Roberts 1st March 2005

AI32 Guide to Weka. Andrew Roberts   1st March 2005 AI32 Guide to Weka Andrew Roberts http://www.comp.leeds.ac.uk/andyr 1st March 2005 1 Introduction Weka is an excellent system for learning about machine learning techniques. Of course, it is a generic

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 3 Due Tuesday, October 22, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

1 Machine Learning System Design

1 Machine Learning System Design Machine Learning System Design Prioritizing what to work on: Spam classification example Say you want to build a spam classifier Spam messages often have misspelled words We ll have a labeled training

More information

EECS 349 Machine Learning Homework 3

EECS 349 Machine Learning Homework 3 WHAT TO HAND IN You are to submit the following things for this homework: 1. A SINGLE PDF document containing answers to the homework questions. 2. The WELL COMMENTED MATLAB source code for all software

More information

COMP90051 Statistical Machine Learning

COMP90051 Statistical Machine Learning COMP90051 Statistical Machine Learning Semester 2, 2016 Lecturer: Trevor Cohn 20. PGM Representation Next Lectures Representation of joint distributions Conditional/marginal independence * Directed vs

More information

Outline. Prepare the data Classification and regression Clustering Association rules Graphic user interface

Outline. Prepare the data Classification and regression Clustering Association rules Graphic user interface Data Mining: i STATISTICA Outline Prepare the data Classification and regression Clustering Association rules Graphic user interface 1 Prepare the Data Statistica can read from Excel,.txt and many other

More information

BUSINESS SKILLS LESSON 5: ING OPENING AND CLOSING AN AIM OF THE LESSON: TO LEARN HOW TO OPEN AND CLOSE AN . Version without a key.

BUSINESS SKILLS LESSON 5:  ING OPENING AND CLOSING AN  AIM OF THE LESSON: TO LEARN HOW TO OPEN AND CLOSE AN  . Version without a key. SZKOLENIA JĘZYKOWE DLA FIRM BUSINESS SKILLS LESSON 5: EMAILING OPENING AND CLOSING AN EMAIL AIM OF THE LESSON: TO LEARN HOW TO OPEN AND CLOSE AN EMAIL Version without a key. 1 READING READ THE TEXT and

More information

Dynamic Bayesian network (DBN)

Dynamic Bayesian network (DBN) Readings: K&F: 18.1, 18.2, 18.3, 18.4 ynamic Bayesian Networks Beyond 10708 Graphical Models 10708 Carlos Guestrin Carnegie Mellon University ecember 1 st, 2006 1 ynamic Bayesian network (BN) HMM defined

More information

Machine Learning A WS15/16 1sst KU Version: January 11, b) [1 P] For the probability distribution P (A, B, C, D) with the factorization

Machine Learning A WS15/16 1sst KU Version: January 11, b) [1 P] For the probability distribution P (A, B, C, D) with the factorization Machine Learning A 708.064 WS15/16 1sst KU Version: January 11, 2016 Exercises Problems marked with * are optional. 1 Conditional Independence I [3 P] a) [1 P] For the probability distribution P (A, B,

More information

Calibration and emulation of TIE-GCM

Calibration and emulation of TIE-GCM Calibration and emulation of TIE-GCM Serge Guillas School of Mathematics Georgia Institute of Technology Jonathan Rougier University of Bristol Big Thanks to Crystal Linkletter (SFU-SAMSI summer school)

More information

2016 All Rights Reserved

2016 All Rights Reserved 2016 All Rights Reserved Table of Contents Chapter 1: The Truth About Safelists What is a Safelist Safelist myths busted Chapter 2: Getting Started What to look for before you join a Safelist Best Safelists

More information

10. MLSP intro. (Clustering: K-means, EM, GMM, etc.)

10. MLSP intro. (Clustering: K-means, EM, GMM, etc.) 10. MLSP intro. (Clustering: K-means, EM, GMM, etc.) Rahil Mahdian 01.04.2016 LSV Lab, Saarland University, Germany What is clustering? Clustering is the classification of objects into different groups,

More information

Bayesian Classification Using Probabilistic Graphical Models

Bayesian Classification Using Probabilistic Graphical Models San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Spring 2014 Bayesian Classification Using Probabilistic Graphical Models Mehal Patel San Jose State University

More information

Unstructured Data. Unstructured Data Text Representations String Operations Length, Case, Concatenation, Substring, Split String Text Mining

Unstructured Data. Unstructured Data Text Representations String Operations Length, Case, Concatenation, Substring, Split String Text Mining Lecture 23: Apr 03, 2019 Unstructured Data Unstructured Data Text Representations String Operations Length, Case, Concatenation, Substring, Split String Text Mining James Balamuta STAT 385 @ UIUC Announcements

More information

How to Request a Client using the UCC Self Serve Website. The following provides a detailed description of how to request a client...

How to Request a Client using the UCC Self Serve Website. The following provides a detailed description of how to request a client... The following provides a detailed description of how to request a client... 1. User Info - The first step is to confirm that we have your current information in case we need to contact you. Click on the

More information

Website/Blog Admin Using WordPress

Website/Blog Admin Using WordPress Website/Blog Admin Using WordPress Table of Contents How to login... 2 How to get support... 2 About the WordPress dashboard... 3 WordPress pages vs posts... 3 How to add a new blog post... 5 How to edit

More information

PC-FAX.com Web Customer Center

PC-FAX.com Web Customer Center PC-FAX.com Web Customer Center Web Customer Center is a communication center right in your browser. You can use it anywhere you are. If you are registered by Fax.de, you have received a customer number

More information

CLASSIFICATION JELENA JOVANOVIĆ. Web:

CLASSIFICATION JELENA JOVANOVIĆ.   Web: CLASSIFICATION JELENA JOVANOVIĆ Email: jeljov@gmail.com Web: http://jelenajovanovic.net OUTLINE What is classification? Binary and multiclass classification Classification algorithms Naïve Bayes (NB) algorithm

More information

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Carlos Ordonez, Yiqun Zhang Department of Computer Science, University of Houston, USA Abstract. We study the serial and parallel

More information

University of Sheffield, NLP. Chunking Practical Exercise

University of Sheffield, NLP. Chunking Practical Exercise Chunking Practical Exercise Chunking for NER Chunking, as we saw at the beginning, means finding parts of text This task is often called Named Entity Recognition (NER), in the context of finding person

More information

Machine Learning A W 1sst KU. b) [1 P] For the probability distribution P (A, B, C, D) with the factorization

Machine Learning A W 1sst KU. b) [1 P] For the probability distribution P (A, B, C, D) with the factorization Machine Learning A 708.064 13W 1sst KU Exercises Problems marked with * are optional. 1 Conditional Independence a) [1 P] For the probability distribution P (A, B, C, D) with the factorization P (A, B,

More information

Please try all of the TRY THIS problems throughout this document. When done, do the following:

Please try all of the TRY THIS problems throughout this document. When done, do the following: AP Computer Science Summer Assignment Dr. Rabadi-Room 1315 New Rochelle High School nrabadi@nredlearn.org One great resource for any course is YouTube. Please watch videos to help you with any of the summer

More information

In today s video I'm going show you how you can set up your own online business using marketing and affiliate marketing.

In today s video I'm going show you how you can set up your own online business using  marketing and affiliate marketing. Hey guys, Diggy here with a summary of part two of the four part free video series. If you haven't watched the first video yet, please do so (https://sixfigureinc.com/intro), before continuing with this

More information