Naïve Bayes Classifiers. Jonathan Lee and Varun Mahadevan


Independence Recap. Definition: Two events X and Y are independent if P(XY) = P(X)P(Y). Equivalently, if P(Y) > 0, then P(X|Y) = P(X).

Conditional Independence. Conditional independence tells us that two events A and B are conditionally independent given C if P(AB|C) = P(A|C)P(B|C). Equivalently, if P(B) > 0 and P(C) > 0, then P(A|BC) = P(A|C).

Example: Randomly choose a day of the week.
A = { It is not a Monday }
B = { It is a Saturday }
C = { It is the weekend }
A and B are dependent events: P(A) = 6/7, P(B) = 1/7, P(AB) = 1/7.
Now condition both A and B on C: P(A|C) = 1, P(B|C) = 1/2, P(AB|C) = 1/2.
Since P(AB|C) = P(A|C)P(B|C), A and B are conditionally independent given C.
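To make the arithmetic concrete, here is a minimal Python check of this example; the helper prob and the event names below are illustrative only, not part of the course materials.

from fractions import Fraction

# Days of the week; the weekend is Saturday and Sunday.
days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

A = {d for d in days if d != "Mon"}   # "It is not a Monday"
B = {"Sat"}                           # "It is a Saturday"
C = {"Sat", "Sun"}                    # "It is the weekend"

def prob(event, given=None):
    # P(event | given) under a uniformly random day; given=None means no conditioning.
    space = days if given is None else [d for d in days if d in given]
    return Fraction(sum(d in event for d in space), len(space))

# Unconditionally A and B are dependent: P(AB) = 1/7 but P(A)P(B) = 6/49.
print(prob(A & B), prob(A) * prob(B))

# Conditioned on C they are independent: P(AB|C) = 1/2 = P(A|C)P(B|C).
print(prob(A & B, C), prob(A, C) * prob(B, C))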

Programming Project: Spam Filter. Due: check the calendar. Implement a Naive Bayes classifier for classifying emails as either spam or ham. You may use C, Java, Python, or R; ask if you have a different preference. We've provided starter code in Java, Python, and R. Read Jonathan's notes on the website, start early, and ask for help if you get stuck!

Spam vs. Ham. Spam is (at least in the past) the bane of any email user's existence. You know it when you see it! It is easy for humans to identify, but not necessarily easy for computers. It is less of a problem for consumers now, because spam filters have gotten really good.

The spam classification problem.
- Input: a collection of emails, already labelled spam or ham (someone has to label these by hand!), usually called the training data.
- Use this data to train a model that understands what makes an email spam or ham. We're using a Naïve Bayes classifier, but there are other approaches; this is a Machine Learning problem (take 446 for more!).
- Test your model on emails whose labels aren't known to the model, usually called the test data, and see how well it does.

Naïve Bayes in the real world. It is one of the oldest, simplest models for classification, yet it is still very powerful and used all the time in industry:
- identifying credit card fraud
- identifying fake Amazon reviews
- identifying vandalism on Wikipedia
- preventing spam in Gmail (still used, with modifications)
- facial recognition
- categorizing Google News articles
- even medical diagnosis!

How do we represent an email? There are a lot of different things about an email that might give a computer a hint about whether or not it's spam. Possible features: words in the body, subject line, sender, message header, time sent. For this assignment, we choose to represent an email just as the set of distinct words (x_1, x_2, ..., x_n) in the subject and body. For example:

SUBJECT: Top Secret Business Venture
Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret

becomes the set (top, secret, business, venture, dear, sir, first, I, must, solicit, your, confidence, in, this, transaction, is, by, virture, of, its, nature, as, being, utterly, confidencial, and). Notice that there are no duplicate words!
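As a rough sketch (assuming a simple tokenization of lowercased words; the actual project spec may define tokens differently), extracting this set in Python might look like:

import re

def distinct_words(subject, body):
    # Lowercase the text and keep each word only once, as a set.
    text = (subject + " " + body).lower()
    return set(re.findall(r"[a-z]+", text))

words = distinct_words(
    "Top Secret Business Venture",
    "Dear Sir. First, I must solicit your confidence in this transaction...",
)
print(words)   # every word appears at most once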

Naïve Bayes in theory. The concepts behind Naïve Bayes are nothing new to you -- we'll be using what we've learned in the past few weeks. Specifically:
- Bayes Theorem: P(A|B) = P(B|A)P(A) / P(B)
- Law of Total Probability: P(A) = ∑_n P(A|B_n)P(B_n)
- Conditional Probability: P(A|B) = P(AB) / P(B)
- Chain Rule: P(A_1, ..., A_n) = P(A_1) P(A_2|A_1) ··· P(A_n|A_{n-1}, ..., A_1)
- Conditional Independence: P(AB|C) = P(A|C)P(B|C)

Take the set of distinct words (x_1, x_2, ..., x_n) to represent the text in an email. We are trying to compute
P(Spam | x_1, x_2, ..., x_n) = ???
By applying Bayes Theorem, we can reverse the conditioning. It's easier to find the probability of a word appearing in a spam email than the reverse:
P(Spam | x_1, x_2, ..., x_n) = P(x_1, x_2, ..., x_n | Spam) P(Spam) / P(x_1, x_2, ..., x_n)
and, by the Law of Total Probability, the denominator is
P(x_1, x_2, ..., x_n) = P(x_1, x_2, ..., x_n | Spam) P(Spam) + P(x_1, x_2, ..., x_n | Ham) P(Ham)

Let's take a look at the numerator and apply the rule for Conditional Probability:
P(x_1, x_2, ..., x_n | Spam) P(Spam) = P(x_1, x_2, ..., x_n, Spam)
And now let's use the Chain Rule to decompose this:
P(x_1, x_2, ..., x_n, Spam)
= P(x_1 | x_2, ..., x_n, Spam) P(x_2, ..., x_n, Spam)
= P(x_1 | x_2, ..., x_n, Spam) P(x_2 | x_3, ..., x_n, Spam) P(x_3, ..., x_n, Spam)
= ...
= P(x_1 | x_2, ..., x_n, Spam) ··· P(x_{n-1} | x_n, Spam) P(x_n | Spam) P(Spam)
But this is still hard to compute.

Let's simplify the problem with an assumption. We will assume that the words in the email are conditionally independent of each other, given that we know whether or not the email is spam. This is why we call this Naïve Bayes; the assumption isn't true in real life! So how does this help?
P(x_1, x_2, ..., x_n, Spam)
= P(x_1 | x_2, ..., x_n, Spam) ··· P(x_{n-1} | x_n, Spam) P(x_n | Spam) P(Spam)
≈ P(x_1 | Spam) P(x_2 | Spam) ··· P(x_{n-1} | Spam) P(x_n | Spam) P(Spam)
That is,
P(x_1, x_2, ..., x_n, Spam) ≈ P(Spam) ∏_{i=1}^{n} P(x_i | Spam)

So we know that
P(x_1, x_2, ..., x_n, Spam) ≈ P(Spam) ∏_{i=1}^{n} P(x_i | Spam)
and similarly
P(x_1, x_2, ..., x_n, Ham) ≈ P(Ham) ∏_{i=1}^{n} P(x_i | Ham)
Putting it all together:
P(Spam | x_1, x_2, ..., x_n) ≈ P(Spam) ∏_{i=1}^{n} P(x_i | Spam) / [ P(Spam) ∏_{i=1}^{n} P(x_i | Spam) + P(Ham) ∏_{i=1}^{n} P(x_i | Ham) ]
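As a minimal sketch of this final formula (the function and variable names below are illustrative, not from the provided starter code), assuming we already have P(Spam), P(Ham), and per-word conditional probabilities:

from math import prod

def posterior_spam(words, p_spam, p_ham, p_word_spam, p_word_ham):
    # P(Spam | words) via the formula above: product of per-word probabilities,
    # weighted by the class priors, then normalized.
    spam_term = p_spam * prod(p_word_spam[w] for w in words)
    ham_term = p_ham * prod(p_word_ham[w] for w in words)
    return spam_term / (spam_term + ham_term)

# Tiny made-up numbers, just to show the shape of the computation.
p_word_spam = {"cash": 0.8, "pokemon": 0.1}
p_word_ham = {"cash": 0.2, "pokemon": 0.6}
print(posterior_spam({"cash", "pokemon"}, 0.5, 0.5, p_word_spam, p_word_ham))  # 0.4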

How spammy is a word? We now have a nice formula for the email spam probability, using conditional probabilities of words given ham/spam. P(Spam) and P(Ham) are just the proportions of total emails that are spam and ham. What is P(viagra | Spam) asking? It would be easy to just count up how many spam emails contain this word:
P(w | Spam) = (number of spam emails containing w) / (total number of spam emails)
This seems reasonable, but there's a problem (maybe).
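A sketch of this counting estimate, assuming the training spam emails are given as a list of word sets like the one built earlier (names and sample data are illustrative):

def p_word_given_spam(w, spam_emails):
    # Fraction of spam emails (each a set of distinct words) that contain w.
    containing = sum(1 for email_words in spam_emails if w in email_words)
    return containing / len(spam_emails)

spam_emails = [{"cheap", "pills"}, {"cash", "fast", "cheap"}, {"meet", "singles"}]
print(p_word_given_spam("cheap", spam_emails))    # 2/3
print(p_word_given_spam("pokemon", spam_emails))  # 0.0 -- the problem described next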

Suppose the word Pokemon only ever showed up in ham emails in the training data, never in spam. Then P(Pokemon | Spam) = 0. Since the overall spam probability is the product of a bunch of individual probabilities, if any of those is 0, the whole thing is 0, so any email with the word Pokemon would be assigned a spam probability of 0.
SUBJECT: Get out of debt! Cheap prescription pills! Earn fast cash using this one weird trick! Meet singles near you and get preapproved for a low interest credit card! Pokemon
...definitely not spam, right? What can we do?

Laplace smoothing. Crazy idea: what if we pretend we've seen every outcome once already? Pretend we've seen one more spam email with w, and one more without w:
P(w | Spam) = (number of spam emails containing w + 1) / (total number of spam emails + 2)
Then P(Pokemon | Spam) > 0, and no one word can poison the overall probability too much. This is a general technique to avoid assuming that unseen events will never happen.
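The counting sketch from before, updated with Laplace smoothing (same illustrative setup; the +1/+2 constants follow the formula above):

def p_word_given_spam_smoothed(w, spam_emails):
    # Pretend we saw one extra spam email containing w and one extra without it.
    containing = sum(1 for email_words in spam_emails if w in email_words)
    return (containing + 1) / (len(spam_emails) + 2)

spam_emails = [{"cheap", "pills"}, {"cash", "fast", "cheap"}, {"meet", "singles"}]
print(p_word_given_spam_smoothed("cheap", spam_emails))    # 3/5
print(p_word_given_spam_smoothed("pokemon", spam_emails))  # 1/5, no longer zero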