CS 170: Algorithms, Fall 2014. David Wagner. HW12, due Dec. 5, 6:00pm.


Instructions. This homework is due Friday, December 5, at 6:00pm, electronically via glookup. It is a programming assignment based on a machine learning application. You may work individually or in a group of two; you may not work with more than one other person. If you work with a partner, you may only turn in code you wrote together via pair programming or code you wrote individually. For example, you might use pair programming for Problem 1, and for Problem 2 only discuss your approaches with each other and write your own implementations. You may not turn in any code you were not involved in writing (for instance, "you implement Problem 1, I'll implement Problem 2" is not allowed). In addition, you may not discuss your approaches with anyone other than your partner.

1. (50 pts.) K-Nearest Neighbors

Digit classification is a classical problem that has been studied in depth by many researchers and computer scientists over the past few decades, and it has many applications: for instance, postal services like the US Postal Service, UPS, and FedEx use pre-trained classifiers to speed up and accurately recognize handwritten addresses. Today, over 95% of all handwritten addresses are correctly classified by a computer rather than by a human reading the address manually.

The problem statement is as follows: given an image of a single handwritten digit, build a classifier that correctly predicts the actual digit value of the image. Thus, your classifier receives as input an image of a digit, and must output a class in the set {0, 1, 2, ..., 9}. For this homework, you will attack this problem using a k-nearest neighbors algorithm.

We will give you a data set (a reduced version of the MNIST handwritten digit data set). Each image of a digit is a 28 x 28 pixel image. We have already extracted features, using a very simple scheme: each pixel is its own feature, so we have 28^2 = 784 features. The value of a feature is the intensity of the corresponding pixel, normalized to be in the range [0, 1]. We have preprocessed and vectorized these images into feature vectors for you. We have split the data set into training, validation, and test sets, and we've provided the class of each image in the training and validation sets. Your job is to infer the class of each image in the test set.

Here are five examples of images that might appear in this data set, to help you visualize the data: [five example digit images omitted].
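Since each image is just a vector of 784 intensities, the basic primitive for k-nearest neighbors is a distance between two such vectors. Here is a minimal sketch, assuming numpy and Euclidean (L2) distance; the assignment does not mandate a particular metric, and the function name is our own illustrative choice:

    import numpy as np

    def distances_to_training(query, train_X):
        """Euclidean distance from one 784-vector `query` to every row of
        an (n_train, 784) array `train_X` of training images."""
        return np.sqrt(((train_X - query) ** 2).sum(axis=1))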

We want you to do the following steps:

(i) Implement the k-nearest neighbors algorithm. You can implement it in any way you like, in any programming language of your choice. For k > 1, decide on a rule for resolving ties (if there is a tie in the majority vote among the k nearest neighbors when classifying a new observation, which class do you choose?). One possible implementation is sketched after this list.

(ii) Using the training set as your training data, compute the class of each digit in the validation set, and compute the error rate on the validation set, for each of the following candidate values of k: k = 1, 2, 5, 10, 25.

(iii) Did your algorithm perform significantly better for k = 2 than for k = 1? Why do you think this is?

(iv) Look at a few examples of digits in the validation set that your k-nearest neighbors algorithm misclassifies. We have provided images of each of the digits, so you might want to look at the corresponding images to see if you can get some intuition for what's going on. Answer the following questions: Which example digits did you look at? Why do you think the algorithm misclassifies these examples? List some additional features we could add that might help classify these examples more accurately.

(v) Using the best k value you achieved on the validation set, compute the class of each image in the test set (0-9) and submit your answer.

(vi) Optional bonus challenge problem: Implement your proposed features from part (iv) and try them out. Did they improve accuracy on the validation set? Try to find a set of features that gives a significant reduction in the error rate; you might need to experiment a bit. List the set of features you used, the total number of features, the best value of k, and the error rate you achieved.

Note that item (vi) is entirely optional. Do it only if you would like an extra challenge.

Your write-up should have the following parts:

(a) The error rate on the validation set for each of k = 1, 2, 5, 10, 25. State which value of k is best.
(b) Describe what tie-breaking rule you used.
(c) Answer the questions under item (iii) above.
(d) Answer the questions under item (iv) above.
(e) Optional bonus challenge problem: Do item (vi) above and answer the questions under item (vi) above.
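To make item (i) concrete, here is one possible sketch in Python; the function name, the use of numpy, and the particular tie-breaking rule (deferring to the single closest neighbor among the tied classes) are all illustrative choices on our part, not requirements:

    import numpy as np
    from collections import Counter

    def knn_classify(query, train_X, train_y, k):
        """Classify one feature vector by majority vote among its k
        nearest training points under Euclidean distance."""
        dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]   # indices of the k closest points
        votes = Counter(train_y[i] for i in nearest)
        top = max(votes.values())
        tied = [c for c, v in votes.items() if v == top]
        if len(tied) == 1:
            return tied[0]
        # One way to break ties: scan the neighbors from closest to
        # farthest and return the first class among the tied classes.
        for i in nearest:
            if train_y[i] in tied:
                return train_y[i]

The validation error rate for a given k is then simply the fraction of validation images whose predicted class disagrees with the provided label.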

2. (50 pts.) Random forests

Spam classification is another problem that has been widely studied and is used in many, if not all, email services; Gmail and Yahoo Mail are two examples of services that currently use sophisticated spam classifiers to separate spam from non-spam email. The problem statement is as follows: build a classifier that, given an email, correctly predicts whether or not this email is spam.

For this homework, you will attack this problem using a random forest algorithm on the email data set we have provided. We have selected 57 useful features and preprocessed the emails to extract the feature values for you, so you only need to deal with feature vectors. A more detailed description of the features can be found in the README file inside the emaildataset directory. We have split the data set into training, validation, and test sets, and we've provided the class of each email in the training and validation sets. There are two classes: 0 (non-spam) and 1 (spam).

Do the following steps:

(i) Implement a random forests classifier with T trees. Use the approach presented in lecture. For each of the T trees, use bagging to select a subsample of the training data set (a different subsample for each tree). Use the greedy algorithm to infer a decision tree. At each node, choose a random subset of the features as candidates, and choose the best feature (among those candidates) and best threshold; we suggest you use a subset of size √57 ≈ 8. Use a different random subset for each tree and each node. Use information gain (i.e., the entropy-based measure) as your impurity measure for greedily selecting the splitting rule at each node of each decision tree; a sketch of this split-selection step appears after the write-up list below. In your greedy decision tree inference algorithm, use any stopping rule of your choice to decide when to stop splitting (you might want to impose a limit on the depth of the tree, or stop splitting once the impurity is below some fixed value, or both). Your classifier will classify a new email by taking a majority vote of the T classes output by the T trees. When T > 1, you'll need to choose a rule for resolving ties.

(ii) For each of the following candidate values of T, T = 1, 2, 5, 10, 25: run your code on the training set provided to learn a random forest of T trees, compute the class of each email in the validation set using the resulting classifier, and compute the error rate on the validation set for this value of T. (T is the number of decision trees. Each value of T will give you a different classifier and thus potentially a different error rate on the validation set.)

(iii) Using the best T value you achieved on your validation set, compute the class of each email in the test set (0 = non-spam, 1 = spam) and submit your answer.

Your write-up should have the following parts:

(a) The error rate on the validation set for each of T = 1, 2, 5, 10, 25. State which value of T is best.
(b) Describe what rule you used to resolve ties.
(c) Describe what you used as the stopping rule.
(d) If you deviated from the suggested approach above in some detail, describe how your implementation differs.
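As referenced in item (i) above, here is a sketch of split selection by information gain in Python. The helper names and the exhaustive scan over observed feature values as candidate thresholds are illustrative assumptions, not part of the assignment:

    import numpy as np

    def entropy(y):
        """Entropy of a 0/1 label array."""
        if len(y) == 0:
            return 0.0
        p = np.mean(y)
        if p == 0.0 or p == 1.0:
            return 0.0
        return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def best_split(X, y, candidate_features):
        """Among a subset of feature indices, return the (feature,
        threshold) pair maximizing information gain, with candidate
        thresholds taken from the values observed in the data."""
        best_f, best_t, best_gain = None, None, -1.0
        base = entropy(y)
        for f in candidate_features:
            for t in np.unique(X[:, f]):
                left, right = y[X[:, f] <= t], y[X[:, f] > t]
                if len(left) == 0 or len(right) == 0:
                    continue
                gain = base - (len(left) * entropy(left)
                               + len(right) * entropy(right)) / len(y)
                if gain > best_gain:
                    best_f, best_t, best_gain = f, t, gain
        return best_f, best_t, best_gain

    # Per the suggestion above, candidate_features would be a fresh random
    # subset of 8 of the 57 feature indices at each node, e.g.:
    #   rng = np.random.default_rng()
    #   feats = rng.choice(57, size=8, replace=False)

Bagging then amounts to drawing, for each tree, n training indices uniformly at random with replacement, and calling best_split with a fresh random feature subset at every node.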

Data sets. The data sets are available for download from the course web page. Detailed information about each data set can be found in its README file.

Programming tips. You may use any programming language you like, and any standard libraries of your choice. Python or Matlab might be convenient, but you can use any language you want. One restriction on this homework: you may not use any existing machine learning library. For instance, it is easy to find implementations of k-nearest neighbors and random forests for download or in existing libraries; you can't just download or invoke an existing implementation of these algorithms. We have some advice to make implementation easier:

- If you use Python, some of our TAs suggest using numpy or scipy to make your implementation faster. This is entirely optional: we won't evaluate or grade you on the speed or asymptotic efficiency of your code. The only benefit is that it might make your code run faster, making it easier to experiment with your algorithm.

- To read in the input files, the easiest approach is probably to split each line on "," (comma). Alternatively, if you prefer, in Python you can use csv.reader (from Python's csv module, which is designed for reading CSV files); see the snippet below.

- As a comparison point, our implementations achieve approximately 10% error rates or less on each problem, and take under an hour to train and evaluate all of the validation instances.
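For instance, a minimal reading sketch, assuming each row of the file is one comma-separated feature vector of floats (load_matrix is a hypothetical helper name; the actual file names and any label columns are described in each data set's README):

    import csv

    def load_matrix(path):
        """Read a CSV file into a list of feature vectors (lists of floats)."""
        with open(path) as f:
            return [[float(v) for v in row] for row in csv.reader(f)]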

Submission. You must turn in the following files through glookup, by putting all of the files in a directory on your instructional class account, cd-ing into that directory, and running the submit hw12 command from within that directory.

1. writeup1.txt: a text file (ASCII file) containing your write-up for Problem 1.

2. writeup2.txt: a text file (ASCII file) containing your write-up for Problem 2.

3. code/: a directory containing all of your source code.

4. name.txt: a text file containing the following:
(a) Line 1: Your first and last name.
(b) Line 2: Your login.

5. collaborator.txt: a text file containing the full name (first and last name) and login of your partner, if you worked in a group of two, or "None" if you worked individually. (Note that if you work in a group of two, both of you must turn in all files.)

6. digitsoutputk.csv: five text files containing the result of running your program on the validation set of the digits data set using the corresponding k value. You should have 5 of these files: digitsoutput1.csv, digitsoutput2.csv, digitsoutput5.csv, digitsoutput10.csv, and digitsoutput25.csv. Each file should have exactly 1000 lines, as follows:
(a) Line 1: Predicted class for the first digit image in the validation set (0-9).
(b) Line 2: Predicted class for the second digit image in the validation set (0-9).
(c) ...
(d) Line 1000: Predicted class for the 1000th digit image in the validation set (0-9).

7. digitsoutput.csv: a text file containing exactly 1000 lines, as follows:
(a) Line 1: Predicted class for the first digit image in the test set (0-9).
(b) Line 2: Predicted class for the second digit image in the test set (0-9).
(c) ...
(d) Line 1000: Predicted class for the 1000th digit image in the test set (0-9).

8. emailoutputT.csv: five text files containing the result of running your program on the validation set of the email data set using the corresponding T value. You should have 5 of these files: emailoutput1.csv, emailoutput2.csv, emailoutput5.csv, emailoutput10.csv, and emailoutput25.csv. Each file should contain exactly 500 lines, as follows:
(a) Line 1: Predicted class for the first email in the validation set (0 or 1).
(b) Line 2: Predicted class for the second email in the validation set (0 or 1).
(c) ...
(d) Line 500: Predicted class for the 500th email in the validation set (0 or 1).

9. emailoutput.csv: a text file containing exactly 500 lines, as follows:
(a) Line 1: Predicted class for the first email in the test set (0 or 1).
(b) Line 2: Predicted class for the second email in the test set (0 or 1).
(c) ...
(d) Line 500: Predicted class for the 500th email in the test set (0 or 1).

10. bonusdigitsoutput.csv: Optional; include this file only if you do the optional bonus challenge problem. It is the same as digitsoutput.csv, except computed using your new set of features: a text file containing 1000 lines, where the ith line contains the predicted class of the ith digit in the test set, as computed with your proposed set of features.

Fun fact. Did you know that finding the optimal decision tree is NP-hard? There is a reduction from 3-dimensional matching; the proof was published in a 3-page paper: http://people.csail.mit.edu/rivest/pubs/hr76.pdf