Mini-project 2 CMPSCI 689 Spring 2015 Due: Tuesday, April 07, in class

Guidelines

Submission. Submit a hardcopy of the report containing all the figures and printouts of code in class. For readability you may attach the code printouts at the end of the solutions. Submissions may be up to 48 hours late with a 50% deduction; submissions more than 48 hours after the deadline will be given zero. Late submissions should be emailed to the TA as a PDF.

Plagiarism. We may reuse problem set questions from previous years that are covered by papers and webpages; we expect students not to copy, refer to, or look at those solutions when preparing their answers. Since this is a graduate class, we expect students to want to learn and not google for answers.

Collaboration. The homework must be done individually, except where otherwise noted in the assignments. Individually means each student must hand in their own answers, and each student must write their own code for the programming part of the assignment. It is acceptable, however, for students to collaborate in figuring out answers and to help each other solve the problems. We will assume that, as participants in a graduate course, you will take responsibility for making sure you personally understand the solution to any work arising from such a collaboration.

Using other programming languages. All of the starter code is in Matlab, which is what we expect you to use. You are free to use other languages such as Octave or Python, with the caveat that we won't be able to answer or debug non-Matlab questions.

The MNIST dataset

The dataset for all parts of this homework can be loaded into Matlab by typing load('digits.mat'). This is similar to the one used in mini-project 1, but contains examples from all 10 digit classes. In addition, the data is split into train, val and test sets. For each split, the features and labels are in the variables x and y respectively. For example, data.train.x is an array of size 784 × 1000 containing 1000 digits, one per column of the matrix. There are 100 digits each from classes 0, 1, ..., 9. Class labels are given by the variable data.train.y. The val set contains 50 examples from each class, and the test set contains 100 examples from each class. This is a subset of the much larger MNIST dataset (http://yann.lecun.com/exdb/mnist/). You can visualize the dataset using the montagedigits(x) function; for example, montagedigits(data.train.x) displays a montage of the training digits. Tip: montagedigits(x) internally uses the montage command in Matlab, which has a size option controlling the number of rows and columns of the output.

Evaluation

For various parts of the homework you will be asked to report the accuracy and the confusion matrix. The accuracy is simply the fraction of labels you got right, i.e., (1/N) ∑_{k=1}^{N} [y_k = ypred_k]. The confusion matrix C is a 10 × 10 array where C_{ij} = ∑_{k=1}^{N} [y_k = i][ypred_k = j]. The function [acc, conf] = evaluatelabels(y, ypred, display) provided in the codebase returns the accuracy and confusion matrix. If display=true, the code also displays the confusion matrix.
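To make these definitions concrete, here is a minimal sketch of the accuracy and confusion matrix computation (evaluatelabels in the codebase already does this; the variable names below are only for illustration):

    % Sketch: accuracy and 10 x 10 confusion matrix as defined above.
    % y and ypred are 1 x N vectors of digit labels 0..9.
    acc = mean(y == ypred);                 % fraction of correct predictions
    conf = zeros(10, 10);
    for k = 1:numel(y)
        % rows index the true class, columns the predicted class (offset by 1 for Matlab indexing)
        conf(y(k) + 1, ypred(k) + 1) = conf(y(k) + 1, ypred(k) + 1) + 1;
    end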

1 Multiclass to binary reductions

In the previous homework you implemented a binary classifier using the averagedperceptrontrain and preceptronpredict functions. In this part you will extend these to the multiclass setting using one-vs-one and one-vs-all reductions. The codebase contains an implementation of the above two functions. You are welcome to use your own implementation, but in case you decide to use the provided functions, here are the details. The function model = averagedperceptrontrain(x, y, param) takes as input x, y and runs the averaged perceptron training for param.maxiter iterations. The output model contains the learned weights and some other fields that you can ignore at this point. The function [y,a] = preceptronpredict(model, x) returns the predictions y and the activations a = model.w^T x. The entry code for this part is in the first half of the file runmulticlassreductions.m.

1.a Multiclass training

Implement a function model = multiclasstrain(x, y, param); that takes training data x, y and returns a multiclass model. For now you will train linear classifiers using averagedperceptrontrain (your code selects this when the parameter param.kernel.type = 'linear'); later you will extend this to use non-linear classifiers.

1. [10 points] When param.type = 'onevsall' the code should return 10 one-vs-all perceptron classifiers, one for each digit class. You may find cell arrays in Matlab useful for this part.

2. [10 points] When param.type = 'onevsone' the code should return (10 choose 2) = 45 one-vs-one perceptron classifiers, one for each pair of digit classes.

1.b Multiclass prediction

Implement a function ypred = multiclasspredict(model, x, param); that returns the class labels for the data x.

1. [5 points] When param.type = 'onevsall' the code should predict the labels using the one-vs-all prediction scheme, i.e., pick the class with the highest activation for each data point (see the sketch at the end of this section).

2. [5 points] When param.type = 'onevsone' the code should predict the labels using the one-vs-one prediction scheme, i.e., pick the class that wins the largest number of pairwise contests for each data point.

1.c Experiments

On the train set, learn one-vs-one and one-vs-all classifiers and answer the following questions:

1. [2 points] On the val set, estimate the optimal number of iterations. You may find that the accuracy saturates after a certain number of iterations. Report the optimal value you found for the onevsone/onevsall classifiers, and set param.maxiter to that value.

2. [2 points] What is the test accuracy of the onevsone/onevsall classifiers on the dataset?

3. [2 points] Show the confusion matrices on the test set for the onevsone/onevsall classifiers. Which pair of classes is the most confused for each classifier, i.e., which i and j maximize C_{ij} + C_{ji}?
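As a concrete illustration of the one-vs-all prediction scheme in part 1.b, here is a minimal sketch; it is not the required implementation, and it assumes the multiclass model is stored as a cell array of ten binary perceptron models, which is only one possible layout:

    % Sketch: one-vs-all prediction by picking the class with the highest activation.
    % Assumes model.classifiers is a 1 x 10 cell array of binary perceptrons,
    % one per digit class 0..9, and x is 784 x N.
    function ypred = onevsallpredictsketch(model, x)
        nclasses = numel(model.classifiers);
        acts = zeros(nclasses, size(x, 2));
        for c = 1:nclasses
            [~, acts(c, :)] = preceptronpredict(model.classifiers{c}, x);
        end
        [~, idx] = max(acts, [], 1);   % most confident binary classifier per column
        ypred = idx - 1;               % map indices 1..10 back to digit labels 0..9
    end

The one-vs-one case is analogous, except each of the 45 classifiers votes for one of its two classes and the class with the most votes wins.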

2 Kernelized perceptrons

In this part you will extend the perceptron training and prediction to use polynomial kernels. A polynomial kernel of degree d is defined as:

K_d^{poly}(x, z) = (a + b x^T z)^d.   (1)

Here, a ≥ 0, b > 0 and d ∈ {1, 2, ...} are hyperparameters of the kernel.

2.a Training kernelized perceptrons

[10 points] In this part you will implement a kernelized version of the averaged perceptron in the function model = kernelizedaveragedperceptrontrain(x, y, param). In class we extended the perceptron training to require only dot products, or kernel evaluations, between data points. However, it is often more efficient to precompute the kernel values between all pairs of input data points to avoid repeated kernel evaluations. Write a function K = kernelMatrix(x, z, param) that returns a matrix with K_{ij} = kernel(x_i, z_j). The kernel is specified by param.kernel.type. It is important to implement this efficiently, for example by avoiding for loops; you may find the Matlab function pdist useful for this (a vectorized sketch for the polynomial kernel appears at the end of this section). The output of the training is a set of weights α, one for each training example. However, this alone is not sufficient for prediction, since prediction requires dot products with the training data. You may find it convenient to store the training data for each classifier in the model itself, for example in a field such as model.x, and the weights α in model.a. The entry code for this part is in the second half of the file runmulticlassreductions.m.

Tip: When the kernel is linear, the classifier trained using the kernelized averaged perceptron should match the results of the averaged perceptron (up to numerical precision). With the polynomial kernel you can obtain a linear kernel by setting a = 0, b = 1 and d = 1.

2.b Predictions using kernelized perceptrons

[5 points] Implement [y,a] = kernelizedperceptronpredict(model, x, param), which takes the model and features as input and returns predictions and activations. Once again, you may first compute the kernel matrix using the K = kernelMatrix(model.x, x, param) function you implemented earlier and then multiply by model.a to compute the activations.

2.c Going multiclass

[5 points] Integrate the above two functions with the multiclasstrain and multiclasspredict functions you implemented in problem 1 so that they can use kernelized perceptrons as the binary classifier. The code should use perceptrons when param.kernel.type = 'linear' and kernelized perceptrons otherwise. The code passes the parameters of the polynomial kernel in the field param.kernel.poly.

2.d Experiments

Train a multiclass classifier using a polynomial kernel with a = 1, b = 1 and d = 4 as the kernel parameters, i.e., a degree-4 polynomial kernel. Run the kernelized perceptron for 100 iterations during training and answer the following questions.

1. [2 points] What is the test accuracy of the onevsone/onevsall classifiers on the dataset?

2. [2 points] Show the confusion matrices on the test set for the onevsone/onevsall classifiers. Which classes are the most confused?
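As an illustration of the vectorized kernel computation in part 2.a, here is a minimal sketch for the polynomial kernel using a single matrix product; the field names under param.kernel.poly are assumptions, so use whatever names the starter code actually defines:

    % Sketch: pairwise polynomial kernel values without for loops.
    % x is D x N, z is D x M; returns an N x M matrix with K(i,j) = (a + b * x_i' * z_j)^d.
    % The field names a, b, d under param.kernel.poly are assumed for illustration.
    function K = kernelmatrixsketch(x, z, param)
        p = param.kernel.poly;
        K = (p.a + p.b * (x' * z)).^p.d;   % one matrix multiply, then an elementwise power
    end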

3 Multiclass naive Bayes classifier

For this part you will train a naive Bayes classifier for multiclass prediction. Assume that for a data point x from class j, the i-th feature x^(i) is generated according to a Gaussian distribution with class-specific mean µ_{ij} and variance σ_{ij}^2, i.e.,

P(x^(i) | Y = j) = (1 / √(2π σ_{ij}^2)) exp( −(x^(i) − µ_{ij})^2 / (2 σ_{ij}^2) ).   (2)

In addition, each class is generated from a multinomial, i.e.,

P(Y = j) = θ_j.   (3)

1. [4 points] Write down the log-likelihood of the training data (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) in terms of the parameters.

2. [2 points] What is the maximum likelihood estimate of the parameter θ_j?

3. [2 points] What is the maximum likelihood estimate of the parameter µ_{ij}?

4. [2 points] What is the maximum likelihood estimate of the parameter σ_{ij}?

3.a Experiments

You will implement a naive Bayes model for multiclass digit classification. The entry code for this part is in runnaivebayes.m. Before you implement the model, there is one issue to take care of. Suppose that pixel i is always zero for some class j; then the MLE of σ_{ij} for that pixel and class will be zero. This leads to probabilities that are zero for any pixel value not equal to zero, so a little bit of noise in the pixel can make the total probability zero. One way to deal with this problem is to add a small positive constant τ to the MLE of the σ_{ij}^2 parameter, i.e.,

σ̂_{ij}^2 ← σ̂_{ij}^2 + τ.   (4)

This assigns a small non-zero probability to any noisy pixel you might see in the test data. In terms of Bayesian estimation this is a maximum a-posteriori (MAP) estimate, because τ acts as a prior on the σ parameter.

[10 points] Implement the function model = naivebayestrain(x, y, τ) that estimates the parameters of the naive Bayes model using the smoothing parameter τ (a sketch of the parameter estimation appears at the end of this section).

[5 points] Implement the function ypred = naivebayespredict(model, x) that predicts the class with the highest joint probability P(x, Y = k).

Train these models on the train set using τ = 0.01 and report accuracies on the val and test sets. Tip: Compute log probabilities to avoid numerical underflow.
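As a concrete (but not prescribed) illustration of the estimation with smoothing, here is a minimal Matlab sketch; the model field names mu, sigma2 and logtheta are assumptions, not part of the starter code:

    % Sketch: MLE of the Gaussian naive Bayes parameters with variance smoothing.
    % x is 784 x N (pixels), y is 1 x N with labels 0..9, tau is the smoothing constant.
    function model = naivebayestrainsketch(x, y, tau)
        classes = 0:9;
        D = size(x, 1);
        model.mu = zeros(D, numel(classes));
        model.sigma2 = zeros(D, numel(classes));
        model.logtheta = zeros(1, numel(classes));
        for k = 1:numel(classes)
            xk = x(:, y == classes(k));                    % examples of class k
            model.mu(:, k) = mean(xk, 2);                  % per-pixel mean
            model.sigma2(:, k) = var(xk, 1, 2) + tau;      % MLE variance plus smoothing tau
            model.logtheta(k) = log(size(xk, 2) / numel(y));
        end
    end

Prediction would then sum, for each class k, the per-pixel log Gaussian densities and log θ_k, and pick the class with the largest total, in line with the log-probability tip above.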

4 Multiclass logistic regression

We can easily extend the binary logistic regression model to handle multiclass classification. Let's assume we have K different classes, and the posterior probability for class k is given by:

P(Y = k | X = x) = exp(w_k^T x) / ∑_{i=1}^{K} exp(w_i^T x).   (5)

Our goal is to estimate the weights using gradient ascent. We will also regularize the parameters to avoid overfitting and very large weights.

1. [4 points] Write down the log conditional likelihood of the labels, L(w_1, ..., w_K), with L2 regularization on the weights. Use λ as the tradeoff parameter between L and the regularization, i.e., objective = L − λ · regularization. Show your steps.

2. [4 points] Note that there is no closed-form solution for maximizing the log conditional likelihood L(w_1, ..., w_K) with respect to w_k. However, we can still find the solution with gradient ascent using partial derivatives. Derive the expression for the i-th component of the gradient of L(w_1, ..., w_K) with respect to w_k.

3. [2 points] Beginning with initial weights of 0, write down the update rule for w_k, using η for the step size.

4. [2 points] Will the solution converge to a global maximum?

5. [20 points] Train a multiclass logistic regression model using η = 0.01 and λ = 0.01 on the training set and run batch gradient ascent for param.maxiter=100 iterations. Recall that in batch gradient ascent we sum the gradients across all training examples. In Matlab you can do all the computations efficiently using matrix multiplications; for example, on my laptop the entire training (100 iterations) takes less than a second. Report accuracy on the test set. The entry code for this part is in runmulticlasslr.m. A vectorized sketch of these computations appears after problem 5.

5 Two-layer neural network

Consider a two-layer neural network with a hidden layer of H units and K output units, one for each class. Assume that each hidden unit is connected to all the inputs and a bias feature. Use the sigmoid function as the link function:

σ(z) = 1 / (1 + exp(−z)).   (6)

The output units combine the hidden units using multiclass logistic regression as in the earlier problem.

1. [4 points] Write down the log-likelihood of the labels given all the weights, with L2 regularization on the weights. Use λ as the tradeoff parameter between L and the regularization, i.e., objective = L − λ · regularization. Show your steps.

2. [4 points] Derive the equations for the gradients of all the weights. The gradients for the second-layer weights resemble those in the earlier problem, while those for the hidden-layer weights can be obtained by backpropagation (the chain rule for gradients).

3. [2 points] Suppose you run gradient ascent: will the solution converge to a global maximum?

4. [20 points] Train a two-layer network with 100 hidden units using η = 0.001 and λ = 0.01 on the training set and run batch gradient ascent for param.maxiter=1000 iterations. Once again, you can implement all the computations using matrix multiplications; my implementation takes about 30 seconds for the entire training. Tip: you can do entry-wise matrix multiplication using the .* operator. Initialize the weights to small random values using the Matlab function randn(.,.)*0.01. Report accuracy on the test set. Tip: You can implement this part with small modifications to the multiclass LR code from the earlier part. The entry code for this part is in runmulticlassnn.m. Tip: For the previous problem and this one, make sure your objective increases after each iteration.
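To make the vectorized batch updates of problem 4 concrete, here is a minimal sketch of multiclass logistic regression trained with batch gradient ascent. It is not the required implementation: it omits the bias term for brevity, and the variable names and the (λ/2)·||W||^2 form of the penalty are assumptions. The second-layer updates of the network in problem 5 take the same form, with the hidden-unit activations in place of x:

    % Sketch: batch gradient ascent for multiclass logistic regression, fully vectorized.
    % x is D x N, y is 1 x N with labels 0..9; eta, lambda, maxiter as in the text.
    function W = multiclasslrsketch(x, y, eta, lambda, maxiter)
        [~, N] = size(x);
        K = 10;
        Y = full(sparse(y + 1, 1:N, 1, K, N));        % one-hot labels, K x N
        W = zeros(size(x, 1), K);                     % one weight vector per class
        for iter = 1:maxiter
            A = W' * x;                               % activations, K x N
            A = bsxfun(@minus, A, max(A, [], 1));     % stabilize before exponentiating
            P = exp(A);
            P = bsxfun(@rdivide, P, sum(P, 1));       % softmax probabilities, K x N
            grad = x * (Y - P)' - lambda * W;         % gradient of L - (lambda/2)*||W||^2
            W = W + eta * grad;                       % ascent step on the objective
        end
    end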

Checkpoints

To help with debugging, here are the outputs of my implementation. The neural network results might vary slightly due to randomness in the initialization. You might be able to get better performance by tuning the hyperparameters. It might be tempting to tune them on the test data, but that would be cheating! Only use the validation data for this.

>> runmulticlassreductions
OneVsAll:: Perceptron (linear):: Validation accuracy: 75.80%
OneVsAll:: Perceptron (linear):: Test accuracy: 75.70%
OneVsOne:: Perceptron (linear):: Validation accuracy: 86.40%
OneVsOne:: Perceptron (linear):: Test accuracy: 84.30%
OneVsAll:: Perceptron (poly):: Validation accuracy: 88.60%
OneVsAll:: Perceptron (poly):: Test accuracy: 87.30%
OneVsOne:: Perceptron (poly):: Validation accuracy: 88.20%
OneVsOne:: Perceptron (poly):: Test accuracy: 87.00%

>> runnaivebayes
NaiveBayes:: Validation accuracy: 75.60%
NaiveBayes:: Test accuracy: 76.00%

>> runmulticlasslr
Multiclass LR:: Validation accuracy: 85.40%
Multiclass LR:: Test accuracy: 82.70%

>> runmulticlassnn
Multiclass LR:: Validation accuracy: 87.00%
Multiclass LR:: Test accuracy: 85.20%