EECS 349 Machine Learning Homework 3

WHAT TO HAND IN

You are to submit the following things for this homework:

1. A SINGLE PDF document containing answers to the homework questions.
2. The WELL-COMMENTED MATLAB source code for all software you write.

HOW TO HAND IT IN

To submit your assignment:

1. Compress all of the files specified into a .zip file.
2. Name the file in the following manner: firstname_lastname_hw3.zip. For example, Bryan_Pardo_hw3.zip.
3. Submit this .zip file via Blackboard.

DUE DATE: the start of class, as specified in the course calendar.

How I am going to grade your code

Does it work at all? That is, can I run my testing script on it without a crash?

Does it produce the output files where and how it is supposed to? Are they named correctly? Do files get moved to the correct directories?

Does it produce correct output? In other words, are the spams in the spam folder and the hams in the ham folder? Did your dictionary contain the right probabilities after I ran it on a training corpus?

Is it well-commented? This matters a great deal, especially if your code fails to run or fails to produce correct output. Does each function explain what it expects as input and what it will produce as output? Are the internal steps commented? Does typing help <function_name> give me something useful that would let me use your code without having to read the source?

How I am going to grade your answers

Are they clear and concise? Were all the questions answered? Did the ideas make sense? Has the data been presented in a way that is easy for me to understand, with labeled tables or figures and explanations thereof? A screen dump showing 5 pages of output to the command line is neither clear nor concise.

THE PROBLEM

Spam is e-mail that is both unsolicited by the recipient and sent in substantively identical form to many recipients. As of April 2013, spam makes up about 70% of all email. For details, see this report: http://www.securelist.com/en/analysis/204792297/spam_in_q2_2013

Currently, spam is dealt with on the receiving end by automated spam filter programs that attempt to classify each message as either spam (undesired mass email) or ham (mail the user would like to receive). A Naïve Bayes spam filter uses a Naïve Bayes classifier to identify spam email. Many modern mail programs implement Bayesian spam filtering; examples include SpamAssassin and SpamBayes. In this lab, you will create a spam filter based on Naïve Bayes classification.
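A note on the commenting requirement in the grading criteria above: MATLAB's help command prints the first contiguous comment block of a function file, so a header comment in the following style is what makes help <function_name> useful without reading the source. (The function shown is a hypothetical example, not one of the required deliverables.)

    function n = countwords(filename)
    % COUNTWORDS Count the distinct words in an ASCII text file.
    %   n = countwords(filename) reads the text file named by the character
    %   string filename, tokenizes it into lowercase alphanumeric words,
    %   and returns the number of distinct words found.
    words = regexp(lower(fileread(filename)), '[a-z0-9]+', 'match');
    n = numel(unique(words));
    end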

THE MATH OF A NAÏVE BAYES CLASSIFIER

A Naïve Bayesian classifier takes objects described by a feature vector and classifies each object based on the features. Let V be the set of possible classifications. In this case, that will be spam or ham.

    V = \{\mathrm{spam}, \mathrm{ham}\}    (1)

Let a_1 through a_n be Boolean values. We will represent each mail document by these n Boolean values. Here, a_i will be true if the ith word in our dictionary is present in the document, and a_i will otherwise be false.

    a_i \in \mathrm{Dictionary\ words}    (2)

If we have a dictionary consisting of the following set of words: {feature, ham, impossible, zebra}, then the first paragraph of this section would be characterized by a vector of four Boolean values (one for each word):

    \mathrm{paragraph1}: a_1 = T,\ a_2 = T,\ a_3 = F,\ a_4 = F

We write the probability of paragraph1 as:

    P(a_1 = T \wedge a_2 = T \wedge a_3 = F \wedge a_4 = F)

When we don't know the values for the attributes yet, we write

    P(a_1 \wedge a_2 \wedge a_3 \wedge a_4)

...but we don't mean the probability that these variables are TRUE; we mean the probability that these variables take the observed values in the text we're encoding. A Bayesian classifier then finds the classification for the document that produces the maximal probability in the following equation:

    v_{MAP} = \arg\max_{v \in V} P(v \mid a_1 \wedge a_2 \wedge \ldots \wedge a_n)    (3)

Now, to estimate values for our probabilities, it is going to be more convenient to flip these probabilities around, so that we can talk about the probability of each attribute given that something is spam. We can do this by applying Bayes rule:

    v_{MAP} = \arg\max_{v \in V} \frac{P(a_1 \wedge a_2 \wedge \ldots \wedge a_n \mid v)\, P(v)}{P(a_1 \wedge a_2 \wedge \ldots \wedge a_n)}    (4)

Since we are comparing a single text over the set of labels V in this equation, we can remove the divisor (which represents the text we've encoded), since it will be the same no matter what label v is set to:

    v_{MAP} = \arg\max_{v \in V} P(a_1 \wedge a_2 \wedge \ldots \wedge a_n \mid v)\, P(v)    (5)

A Naïve Bayes classifier is naïve because it assumes the likelihood of each feature being present is completely independent of whether any other feature is present.
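As an aside, the Boolean encoding described above is straightforward to realize in MATLAB. A minimal sketch (the tokenization rule and all names here are illustrative assumptions; tokenization is one of the design choices the assignment leaves to you):

    % Encode a text as a Boolean feature vector over a fixed dictionary.
    % Presence is all that matters: repeated occurrences of a word do not
    % change the encoding.
    dictionary = {'feature'; 'ham'; 'impossible'; 'zebra'};
    text = lower(fileread('some_email.txt'));      % hypothetical input file
    words = unique(regexp(text, '[a-z0-9]+', 'match'));
    a = ismember(dictionary, words);               % a(i) true iff word i appears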

Applying that independence assumption to equation (5) gives the Naïve Bayes classifier:

    v_{NB} = \arg\max_{v \in V} P(a_1 \mid v)\, P(a_2 \mid v) \cdots P(a_n \mid v)\, P(v)
           = \arg\max_{v \in V} P(v) \prod_i P(a_i \mid v)    (6)

Now, given this mathematical framework, building a spam classifier is fairly straightforward. Each attribute a_i is a word drawn from a dictionary. To classify a document, generate a feature vector for that document that specifies which words from the dictionary are present and which are absent.

Note about this encoding: repeating a word 100 times in a document won't cause the answer to "Is word X present in document Y?" to be any more true than if the word is there 1 time. This is true BOTH when you build your probabilities in the dictionary AND when you encode an email to see if it is spam.

Then, for each attribute (such as the word "halo"), look up the probability of that attribute being present (or absent), given a particular categorization (spam or ham), and multiply these probabilities together. Don't forget, of course, to determine the base probability of a particular category (how likely is it for any given document to be spam?). Voila, the categorization that produces the highest probability is the category selected for the document.

Oh, one other thing. Your dictionary will have thousands of words in it. If you multiply thousands of probabilities together, what happens to the result? Can you represent it? You might consider using equation (7), instead of (6), when implementing your code.

    v_{NB} = \arg\max_{v \in V} \Big[ \log P(v) + \sum_i \log P(a_i \mid v) \Big]    (7)

A TRAINING CORPUS

In order to train your system, we suggest you use the spam corpus used by the people who built the SpamAssassin program. To get this spam corpus, go to the following link: http://spamassassin.apache.org/publiccorpus/

You should pick a spam file and a ham file with which to test and train your system. Two that seem to work fairly well are 20030228_spam.tar.bz2 and 20021010_easy_ham.tar.bz2, but you don't HAVE to use those files in your experiments. You can unzip these files using the standard Unix utility tar (with the options xf) or from Windows using WinRAR (which can be found here: http://www.rarlab.com/download.htm).

THE DICTIONARY BUILDER (5 points)

You will need to write a MATLAB function that builds a dictionary of words from a set of spam and ham files, along with the associated probability of occurrence of each word. The dictionary builder should accept the parameters shown below:

    makedictionary(spam_directory, ham_directory, dictionary_filename)

The parameter spam_directory is a character string containing the path to a directory containing ONLY spam files. Each file in this directory will be an ASCII text file that is spam. The parameter ham_directory is a character string containing the path to a directory containing ONLY ham files. Each file in this directory will be an ASCII file containing a ham document. Once the dictionary is created, it should save the results to a file named dictionary_filename. This file should be IN THE CURRENT DIRECTORY. Repeat: this file should be IN THE CURRENT DIRECTORY.

For each word in the dictionary, your system should determine the probability of observing that word in a spam document and the probability of observing that word in a ham document. Note: the probability of observing a word in a document is calculated by counting how many documents contain one or more instances of that word. If I have 100 documents, where 99 of them have no instances of "smurf" and 1 document has 100 instances of "smurf", the probability of "smurf" is 0.01, NOT 1.0.

The dictionary file should consist of three space-delineated elements per line. The first element is a word. The second element is the probability of observing that word at least once in a spam document. The third element is the probability of observing the word at least once in a ham document. Lines in the dictionary MUST BE IN ALPHABETICAL ORDER. Assume A-Z is equivalent to a-z (i.e., capitalization does not matter) and that the digits 1-9 precede A-Z in alphabetical order. Here is the dictionary format:

    [word] [P(word|spam)] [P(word|ham)]

Here is an example dictionary of four keywords:

    Bob 0.003 0.4
    break 0.001 0.02
    Cl3an 0.7 0.003
    Supercalafragalisticexpealidocious 0.01 0.0

While you must ignore capitalization, you will still have to make a lot of choices in designing your dictionary maker. Will you treat "feature" and "features" as different? What about "Bob" and "B0b"? Will you ignore the text in the headers of emails? These design choices are up to you, but you must justify them in the question-answering portion of this lab.

THE SPAM CLASSIFIER (5 points)

You will need to write a MATLAB function that classifies each document in a directory of ASCII text documents as either spam (undesired junk mail) or ham (desired document):

    spamsort(mail_directory, spam_directory, ham_directory, dictionary_filename, spam_prior_probability)

Here, mail_directory, spam_directory, ham_directory, and dictionary_filename are all character strings. The variable spam_prior_probability is a real value. The parameter mail_directory specifies the path to a directory where every file is an ASCII text file to be classified as spam or not. The parameter dictionary_filename specifies the name and path of a dictionary file in the same format as that output by the makedictionary function. The parameter spam_prior_probability specifies the base probability that any given document is spam.

Your program must classify each document in the mail directory as either spam or ham. It must then move each file to spam_directory if it is spam, or to ham_directory if it is ham (not spam). When done, mail_directory should be empty, since all the files will have been moved to either spam_directory or ham_directory.
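The counting rule above (document frequency, not token frequency) is easy to get wrong, so here is a minimal sketch of one way to compute it for the spam side, assuming dictionary has already been collected as a cell array of words (all names are illustrative, not required by the assignment):

    % Estimate P(word|spam) for every dictionary word by document frequency:
    % the fraction of spam files containing the word at least once.
    files = dir(fullfile(spam_directory, '*'));
    files = files(~[files.isdir]);
    doc_count = zeros(numel(dictionary), 1);
    for d = 1:numel(files)
        text = lower(fileread(fullfile(spam_directory, files(d).name)));
        words = unique(regexp(text, '[a-z0-9]+', 'match'));
        doc_count = doc_count + ismember(dictionary, words);  % +1 per containing doc
    end
    p_spam = doc_count / numel(files);

Note that raw counts can produce a probability of exactly 0.0 (the example dictionary above contains one), which makes log P(a_i|v) in equation (7) undefined for that word; how you handle this is another design choice to justify in your answers.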
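Equation (7) matters here because double-precision arithmetic underflows long products of small probabilities: prod(repmat(0.01, 1, 5000)) returns 0 in MATLAB, while the equivalent log-space sum, 5000*log(0.01), is about -23026 and is perfectly representable. With that in mind, the decision for a single mail file might look like the following sketch. It reuses the illustrative names from the previous sketch (p_spam, and a p_ham built the same way from the ham directory, as column vectors in dictionary order), assumes a hypothetical path mail_file, and sidesteps the log-of-zero issue just mentioned:

    % Classify one file using the log-space rule of equation (7).
    % Absent words contribute the probability of absence, 1 - P(word|class).
    text = lower(fileread(mail_file));
    words = unique(regexp(text, '[a-z0-9]+', 'match'));
    a = ismember(dictionary, words);       % Boolean feature vector
    log_spam = log(spam_prior_probability) + ...
               sum(log(p_spam(a))) + sum(log(1 - p_spam(~a)));
    log_ham  = log(1 - spam_prior_probability) + ...
               sum(log(p_ham(a)))  + sum(log(1 - p_ham(~a)));
    if log_spam > log_ham
        movefile(mail_file, spam_directory);   % sorted as spam
    else
        movefile(mail_file, ham_directory);    % sorted as ham
    end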

QUESTIONS TO ANSWER (5 points total, 1 point per question)

You will need to test the performance of your system. This means creating a training dataset and a testing dataset. You will train the spam filter on the training set and then evaluate results on the testing set.

1) Design choices:

A. Describe how you decided to build your dictionary. How did you decide to deal with non-alphanumeric characters? How about plurals? What other issues did you build your system to handle? Is every word from the training corpus in your dictionary, or did you remove some words or limit the dictionary size? Why?

B. Describe how you decided to build your spam sorter. How did you calculate the probabilities? Did you end up using equation (6) or (7)? If you used (7), give a proof that it returns the same results as (6).

2) Datasets: In your experiments, what data set did you use to build your dictionary? What data set did you use to test the resulting spam filter? Were they the same? Why did you choose the data sets you chose? How many files are in your data? What is the percentage of spam in the data? Is your data representative of the kinds of spam you might see in the real world? Why or why not?

3) Your testing methodology: Explain how you tested the spam filter to see if it worked better than just using the prior probability of spam learned from your training set. Did you do cross-validation? If so, how many folds? If not, why not? What is your error measure?

4) Experimental results: Does your spam filter do better than just using the prior probability of spam in your data set? Give evidence to back up your assertion. This may include well-labeled graphs, tables, and statistical tests. The idea is to convince a reader that you really have established that your system works (or doesn't).

5) Future work: How might you redesign the system to improve performance in the future? Would you take into account higher-level features (e.g., sentence structure, part-of-speech tagging) when categorizing email as spam? Would you exclude portions of the mail as meaningless noise? Why? Be specific. Saying only "I will use part-of-speech information" is not a meaningful response. I want a description of what improvements you would make, why you think they would help, and how you would implement them.
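For question 3, one reasonable error measure (among several defensible choices) is the misclassification rate on a held-out labeled test split. A minimal sketch, assuming you recorded ground-truth labels for the held-out files and which folder spamsort moved each one to (both variable names are illustrative):

    % Misclassification rate: fraction of held-out files sorted into the
    % wrong folder. is_spam_true and is_spam_pred are logical vectors in
    % the same file order.
    errors = sum(is_spam_true ~= is_spam_pred);
    error_rate = errors / numel(is_spam_true);
    fprintf('misclassified %d of %d files (%.1f%%)\n', ...
            errors, numel(is_spam_true), 100 * error_rate);

A natural baseline to beat, per the question, is the classifier that ignores the text entirely and always chooses whichever class is more probable under the prior.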