Homework 2
Due: March 2, 2018 at 7:00PM

Written Questions

Problem 1: Estimator (5 points)

Let x_1, x_2, ..., x_m be an i.i.d. (independent and identically distributed) sample drawn from distribution B(p), where B(p) denotes a Bernoulli distribution. Specifically, P(B = 1) = p, P(B = 0) = 1 − p. Suppose p is an unknown parameter and we estimate it via:

\hat{p} = \frac{1}{m} \sum_{i=1}^{m} x_i

Show that \hat{p} is an unbiased estimator for p. Recall that an estimator is unbiased if its expected value over all possible samples is equal to the parameter it is estimating.

Note: In lecture, we showed this estimator is unbiased for a sample size of 3. For this question, we are asking you to generalize the argument to a sample size of m.

Problem 2: Agnostic PAC Learning (25 points)

Previously, we looked at PAC learning under the assumption that the true hypothesis was a function within our set of hypotheses H (the realizable case). However, this assumption does not always hold. In some cases, the true hypothesis is a function f ∉ H (the unrealizable case). Learning in the unrealizable case is also called agnostic learning. We will examine how PAC learning differs in this scenario.

Hoeffding's inequality can give us a bound on sample size for the unrealizable case in which we have a finite set of hypotheses H. The true function for labeling data is f. Suppose we are labelling data x generated from distribution D. Define the expected error of a hypothesis, err_D(h), as the expected proportion of data incorrectly labelled by the hypothesis h, written as:

err_D(h) = \mathbb{E}_{x \sim D}\left[\mathbf{1}[f(x) \neq h(x)]\right]

We have a sample S. Define the sampling error of a hypothesis, err_S(h), as the proportion of data from the sample S incorrectly labelled by hypothesis h. It may be written as:

err_S(h) = \frac{1}{|S|} \sum_{(x, y) \in S} \mathbf{1}[y \neq h(x)]

Define ERM(H) = argmin_{h ∈ H} err_S(h) and RM(H) = argmin_{h ∈ H} err_D(h) as the empirical risk minimizing hypothesis and the risk minimizing hypothesis, respectively. In the PAC setting, we content ourselves with an algorithm that, with probability at least 1 − δ, returns a hypothesis ĥ whose error over the distribution D is within ɛ of that of the risk minimizing solution, or:

err_D(ĥ) − err_D(RM(H)) ≤ ɛ

Note: We are considering PAC learning on a binary dataset.
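For reference, the standard two-sided form of Hoeffding's inequality for i.i.d. random variables Z_1, ..., Z_m taking values in [0, 1] is (the version stated in lecture may differ slightly in form):

P\left( \left| \frac{1}{m} \sum_{i=1}^{m} Z_i - \mathbb{E}[Z_1] \right| \geq \epsilon \right) \leq 2 \exp\left( -2 m \epsilon^2 \right)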

a. First, consider the realizable case, f ∈ H. What is the PAC learning bound for ĥ = ERM(H)? Express the sample size bound in terms of ɛ, δ, and |H|.

b. Suppose we're stuck with agnostic learning, such that f ∉ H. Our best hypothesis is ĥ = ERM(H). Use Hoeffding's inequality to show that, given ɛ_1 and δ_1, there is an m such that sampling S = {x_1, ..., x_m} ∼ D gives us |err_S(h) − err_D(h)| ≤ ɛ_1, with probability at least 1 − δ_1, for some h.

c. What happens if we use a sample of size m to evaluate all the hypotheses in H? In particular, we want it to be simultaneously true that, for all h ∈ H, |err_S(h) − err_D(h)| ≤ ɛ_1 with probability 1 − δ. Write an upper bound for the true failure probability in terms of δ_1 using the union bound. Then, express δ in terms of δ_1 so that with probability 1 − δ it is true that |err_S(h) − err_D(h)| ≤ ɛ_1 for all h ∈ H.

d. Now we know that our error estimates for all h ∈ H are within ɛ_1 of their true errors (with high probability). If we pick ĥ = ERM(H), how far might ĥ be from RM(H)? Call that ɛ and write ɛ in terms of ɛ_1. That is, find ɛ in terms of ɛ_1 such that err_D(ĥ) − err_D(RM(H)) ≤ ɛ.

e. Putting it all together, define m in terms of ɛ and δ so that, with probability at least 1 − δ (using the previously computed bound), err_D(ERM(H)) − err_D(RM(H)) ≤ ɛ.

Problem 3: Naive Bayes Maximum Likelihood (12 points)

Consider a binary dataset S drawn i.i.d. from D, with observations of the form ((x_j^1, ..., x_j^n), y_j). Define c(y) as a function that counts the number of observations whose label is y:

c(y) = \sum_{(x_j, y_j) \in S} \mathbf{1}[y_j = y]

Define c(i, y) as a function that counts the number of observations whose label is y and for which x^i = 1:

c(i, y) = \sum_{(x_j, y_j) \in S} \mathbf{1}[y_j = y,\ x_j^i = 1]

Define b as P(Y = 1), and b_{iy} as P(X_i = 1 | Y = y). Prove that the following estimators are the MLE for these parameters:

b_{MLE} = \frac{c(1)}{|S|} \qquad \text{and} \qquad b_{iy,MLE} = \frac{c(i, y)}{c(y)}

Problem 4: Gradient Descent (18 points)

We have a convex function f over the closed interval [−b, b] (for some positive number b). Let f' be the derivative of f. Let α be some positive number, which will represent a learning rate parameter. Consider using gradient descent to find the minimum of f: We start at x_0 = 0. Then, at each step, we set x_{t+1} = x_t − αf'(x_t). If x_{t+1} falls below −b, we set it to −b, and if it goes above b, we set it to b.

We say that an optimization algorithm (such as gradient descent) ɛ-converges if, at some point, x_t stays within ɛ of the true minimum. Formally, we have ɛ-convergence at time t if |x_{t'} − x_min| ≤ ɛ for all t' ≥ t, where

x_{min} = argmin_{x ∈ [−b, b]} f(x)
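The update-and-clip procedure described above can be sketched in a few lines of Python. This is only an illustration of the procedure; the function and variable names are ours, it is not part of the stencil code, and it is not needed to answer the parts below.

def projected_gradient_descent(f_prime, alpha, b, num_steps):
    """Start at x_0 = 0, step by -alpha * f'(x_t), and clip back into [-b, b]."""
    x = 0.0
    iterates = [x]
    for _ in range(num_steps):
        x = x - alpha * f_prime(x)
        x = max(-b, min(b, x))  # project back onto the interval [-b, b]
        iterates.append(x)
    return iterates

# Example: f(x) = x^2 has derivative f'(x) = 2x and its minimum at x = 0.
iterates = projected_gradient_descent(lambda x: 2.0 * x, alpha=0.1, b=1.0, num_steps=5)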

a. For α = 0.1, b = 1, and ɛ = 0.001, find a convex function f so that running gradient descent does not ɛ-converge. Specifically, make it so that x_0 = 0, x_1 = b, x_2 = −b, x_3 = b, x_4 = −b, etc.

b. For α = 0.1, b = 1, and ɛ = 0.001, find a convex function f so that gradient descent does ɛ-converge, but only after at least 10,000 steps.

c. Construct a different optimization algorithm that has the property that it will always ɛ-converge (for any convex f) within log_2(2b/ɛ) steps.

d. Unfortunately, even if x_t is within ɛ of x_min, f(x_t) can be arbitrarily greater than f(x_min). However, consider the case where the derivative of f is always between −r and r (that is, ∀x ∈ [−b, b], f'(x) ∈ [−r, r]). In this case, we can make a guarantee about the difference between f(x_t) and f(x_min). Given that |x_t − x_min| ≤ ɛ and that −r ≤ f'(x) ≤ r, find a bound on f(x_t) − f(x_min) in terms of ɛ and r.

Programming Assignment

Introduction

In this assignment, you'll implement both Naive Bayes and Logistic Regression and use these algorithms to classify handwritten digits.

Stencil Code & Data

You can find the stencil code for this assignment on the course website. We have provided the following two files:

- main.py is the entry point of your program; it will read in the data, run the classifiers, and print the results.
- models.py contains the NaiveBayes model and the LogisticRegression model, which you will be implementing.

You should not modify any code in main.py. All the functions you need to fill in reside in models.py, marked by TODOs. You can see a full description of them in the section below. To feed the data into the program successfully, please do not rename the data files, and make sure all the data files are in a directory named data located in the same directory as main.py. To run the program, run python main.py in a terminal.

MNIST Dataset

You will be using the well-known MNIST dataset, which includes 60,000 examples for training and 10,000 for testing. All images are size-normalized to fit in a 20 × 20 pixel box, and this box is centered in a 28 × 28 image based on the center of mass of its pixels. The full description of this dataset is available at http://yann.lecun.com/exdb/mnist/. There are four files:

- train-images-idx3-ubyte.gz: images of the 60,000 training examples
- train-labels-idx1-ubyte.gz: labels of the 60,000 training examples
- t10k-images-idx3-ubyte.gz: images of the 10,000 testing examples
- t10k-labels-idx1-ubyte.gz: labels of the 10,000 testing examples

On a department machine, you can copy the files to your working directory by running the command

cp /course/cs1420/data/hw2/* <DEST DIRECTORY>

where <DEST DIRECTORY> is the directory where you would like to copy the four data files. If you are working locally, you will need to use the scp command to copy the files to your computer over ssh:

scp <login>@ssh.cs.brown.edu:/course/cs1420/data/hw2/* <DEST DIRECTORY>

Data Format

The original feature values in this dataset are integers ranging from 0 to 255 (encoding greyscale values), but we have thresholded those features to make them binary (1 for values higher than 99, 0 otherwise). The labels are the same as in the original dataset: integers in the range [0, 9]. As always, we have written all the preprocessing code for you. The final dataset is represented by a namedtuple with two fields:

- data.inputs is an m × 784 NumPy array that contains the binary features of the m examples.
- data.labels is an m-dimensional NumPy array that contains the labels of the m examples.

You can find more information on namedtuple in the Python collections documentation.

The Assignment

In models.py, there are six functions you will implement, three for each model. They are:

NaiveBayes:
- train() uses maximum likelihood estimation to learn the parameters. Because all the features are binary values, you should use the Bernoulli distribution (as described in lecture) for the features.
- predict() predicts the labels using the inputs of the test data.
- accuracy() computes the percentage of correctly predicted labels over a dataset.

LogisticRegression:
- train() uses stochastic gradient descent to optimize the parameters of the logistic regression model.
- predict() predicts the labels using the inputs of the test data.
- accuracy() computes the percentage of correctly predicted labels over a dataset.

Note: A softmax function has already been implemented. What softmax does is normalize input values from the range (−∞, ∞) into the range [0, 1] so that they add up to 1. You can think of the output of the softmax function as a probability distribution over n classes. The shapes of l, p, and ∇L(w; l) are all [1, n], where n is the number of classes and l denotes the logits, which are used as inputs to the softmax function.

Note: You are not allowed to use any off-the-shelf packages that have already implemented these models, such as scikit-learn. We're asking you to implement them yourself.
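To make the note above concrete, here is a small sketch of what a softmax computation looks like in NumPy. This is our own illustration, not the provided implementation; use the softmax function in the stencil for your solution.

import numpy as np

def softmax_sketch(l):
    """Map a [1, n] array of logits to probabilities in [0, 1] that sum to 1."""
    shifted = l - np.max(l)   # subtracting the max does not change the result; it avoids overflow
    exps = np.exp(shifted)
    return exps / np.sum(exps)

logits = np.array([[2.0, 1.0, 0.1]])   # shape [1, n]
probs = softmax_sketch(logits)         # entries in [0, 1], summing to 1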

Gradient Descent

You will be using Stochastic Gradient Descent (SGD) to train your LogisticRegression model. Below, we have provided pseudocode for SGD on a sample S:

initialize parameters w and learning rate α
repeat until w converges:
    shuffle training examples
    for each training example (x, y) ∈ S:
        l = ⟨x, w⟩
        p = softmax(l)
        for i ∈ the set of classes:
            if i = y:
                ∇L(w; l_i) = p_i − 1
            else:
                ∇L(w; l_i) = p_i
        ∇L(w; x) = ⟨x, ∇L(w; l)⟩
        w = w − α∇L(w; x)

Project Report

As introduced in HW1, each programming assignment in this course will be accompanied by a short report in which you will answer a few guiding questions about the results of your algorithm. To answer these questions, you may need to write new code that generates some output (potentially a value or graph) that you will want to include in your writeup. For example, you may be asked to contrast the results of running the same algorithm on different datasets or to explore the effect of changing a certain hyperparameter in your algorithm. By the end of this course, you will not only be able to implement common machine-learning algorithms but will also develop intuition as to how the results of a given algorithm should be interpreted.

The next section outlines some guiding questions that you should answer in your report. Please leave any code that you use in your final handin, but make sure that it is not run by default when your program is run.

We ask that your final report be a PDF file named report.pdf, which must be handed in along with your code. You may use any program to create the PDF file, but we highly recommend using LaTeX. We have provided an example report, available on our course website, to get you started.

Guiding Questions

- Report the training and testing error of both classifiers. Discuss the results in a short paragraph.
- When training the LogisticRegression model with SGD, graph the loss at each iteration (weight update) of the algorithm. To save time, you can generate a data point after every 1000 iterations (instead of one for each iteration). Produce 2 graphs, each with a different learning rate. In a sentence or two, discuss the graphs.

Grading Breakdown

We expect the accuracies that NaiveBayes and LogisticRegression reach to be above 84% and 86%, respectively. For logistic regression, we will evaluate your implementation on the highest accuracy among five trials, since we are using stochastic gradient descent. As always, you will primarily be graded on the correctness of your code and not on whether it does or does not achieve the accuracy targets. The grading breakdown for the assignment is as follows:

Written Questions      60%
Naive Bayes            16%
Logistic Regression    16%
Report                  8%
Total                 100%

Handing in

Written Questions

Answers to the written questions should be printed, stapled, and labeled with the date, homework number, and your unique course ID. You should place your handin in the CS1420 handin box located on the second floor of the CIT.

Programming Assignment

To hand in the programming component of this assignment, first ensure that your code runs on Python 3 using our course virtualenv. You can activate the virtualenv on a department machine by running the following command in a terminal:

source /course/cs1420/cs142 env/bin/activate

Once the virtualenv is activated, run your program and ensure that there are no errors. We will be using this virtualenv to grade all programming assignments in this course, so we recommend testing your code on a department machine each time before you hand in. Note that handing in code that does not run may result in a significant loss of credit.

To hand in the coding portion of the assignment, run cs142 handin hw2 from the directory containing all of your source code and your report in a file named report.pdf.

Anonymous Grading

You need to be graded anonymously, so do not write your name anywhere on your handin. Instead, you should use the course ID that you generated when filling out the collaboration policy form. If you do not have a course ID, you should email the HTAs as soon as possible.

Obligatory Note on Academic Integrity

Plagiarism: don't do it. As outlined in the Brown Academic Code, attempting to pass off another's work as your own can result in failing the assignment, failing this course, or even dismissal or expulsion from Brown. More than that, you will be missing out on the goal of your education, which is the cultivation of your own mind, thoughts, and abilities. Please review this course's collaboration policy and, if you have any questions, please contact a member of the course staff.

Feedback

Many aspects of this course have been radically redesigned compared to previous iterations. To continually improve this class and future iterations of it, we are offering an online anonymous feedback form. If you have comments, we would love to hear your constructive feedback on this assignment and on any other aspects of the course.