
Practical 6: Linear discriminant analysis and logistic classifiers

This practical looks at two different methods of fitting linear classifiers. Linear discriminant analysis is implemented in the MASS package and the logistic classifier is implemented in the nnet package.

Iris data

The iris dataset is built into R. This is the classical dataset Fisher used in his original 1936 paper on linear discriminant analysis. The dataset contains 150 observations of iris flowers. Each observation consists of measurements of the length and width of both the flower sepals and petals, followed by the species of the flower. Read in the data and perform some exploratory analysis.

library(MASS)
data(iris)          # load data
head(iris)
pairs(iris[,1:4])

LDA. Now, perform linear discriminant analysis using the lda function from the MASS package and interpret the output.

iris.lda <- lda(Species ~ ., data=iris)
iris.lda
# Call:
# lda(Species ~ ., data = iris)
#
# Prior probabilities of groups:
#     setosa versicolor  virginica
#  0.3333333  0.3333333  0.3333333
#
# Group means:
#            Sepal.Length Sepal.Width Petal.Length Petal.Width
# setosa            5.006       3.428        1.462       0.246
# versicolor        5.936       2.770        4.260       1.326
# virginica         6.588       2.974        5.552       2.026
#
# Coefficients of linear discriminants:
#                     LD1         LD2
# Sepal.Length  0.8293776  0.02410215
# Sepal.Width   1.5344731  2.16452123

# Petal.Length -2.2012117 -0.93192121
# Petal.Width  -2.8104603  2.83918785
#
# Proportion of trace:
#    LD1    LD2
# 0.9912 0.0088

The prior probabilities are the estimated group priors $\hat\pi_l$, $l = 1, \dots, 3$. The group means are the class centroids $\hat\mu_l$, $l = 1, \dots, 3$. As mentioned in the lecture, LDA for $L$ classes can be viewed as a nearest class centroid (adjusted by class priors) classification method after projecting the data onto an (at most) $(L-1)$-dimensional space. The matrix of coefficients of linear discriminants
$$
A = \begin{pmatrix}
 0.8293776 &  0.02410215 \\
 1.5344731 &  2.16452123 \\
-2.2012117 & -0.93192121 \\
-2.8104603 &  2.83918785
\end{pmatrix}
$$
is precisely that projection matrix. (Though not covered in the course, the two columns LD1 and LD2 in this case correspond to the first and second canonical direction vectors. The proportions of trace are the corresponding Rayleigh quotients, i.e. the ratio of between-class variance to within-class variance, along the two canonical directions. We see that most of the signal is captured in a single canonical direction.)

We can visualise the LDA output using the following plot command. See ?plot.lda for more details.

plot(iris.lda)

Why is the plot two-dimensional? How are the x- and y-coordinates of the two-dimensional plot computed? We can manually obtain the same plot (and colour-code the classes) as follows.

A <- iris.lda$scaling
X <- as.matrix(iris[,1:4])
plot(X %*% A, pch=20, col=iris$Species)

pred <- predict(iris.lda, newdata=iris)   # another way to obtain projected points
plot(pred$x, pch=20, col=iris$Species)    # predict also centres the data

What is the training misclassification error of LDA? What is the leave-one-out cross-validated misclassification error?
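One way to check both, as a rough sketch (the numerical answers are left for you to verify), is to compare the fitted classes with the truth for the training error, and to use lda's CV = TRUE option, which returns leave-one-out cross-validated class assignments:

train.class <- predict(iris.lda, newdata=iris)$class   # fitted classes on the training data
mean(train.class != iris$Species)                      # training misclassification error

iris.lda.cv <- lda(Species ~ ., data=iris, CV=TRUE)    # CV=TRUE gives LOO-CV predictions
mean(iris.lda.cv$class != iris$Species)                # leave-one-out CV misclassification error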

Logistic classifier. Now, let's apply logistic regression classification to the same dataset. This can be done either using the multinom function in the nnet package (i.e. treating the logistic regression classifier as a neural network with no hidden layers!) or the mlogit function in the mlogit package. (Of course, if it is a two-class classification problem, we can also use the good old glm function.) We will use the former in this practical.

library(nnet)
iris.logit <- multinom(Species ~ ., data=iris, maxit=200)
coef <- t(coef(iris.logit))
coef
#              versicolor  virginica
# (Intercept)   18.408209 -24.230061
# Sepal.Length  -6.082250  -8.547304
# Sepal.Width   -9.396625 -16.077164
# Petal.Length  16.170374  25.599633
# Petal.Width   -2.058115  16.227474

The maxit argument is set to 200 for convergence. For which covariate values will a flower be classified as virginica?

The misclassification training error of the logistic classifier is 1.3%.

sum(predict(iris.logit, newdata=iris) != iris$Species)/150
# 0.013333333
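To see the classification rule concretely: the model gives each flower a linear score for versicolor and for virginica, with setosa as the baseline class whose score is fixed at 0, and classifies to whichever score is largest; a flower is labelled virginica exactly when its virginica score exceeds both 0 and the versicolor score. A minimal sketch reproducing predict() this way from the coef matrix above:

scores <- cbind(setosa = 0, cbind(1, as.matrix(iris[,1:4])) %*% coef)   # per-class linear scores
manual.pred <- levels(iris$Species)[max.col(scores)]                    # classify to the largest score
all(manual.pred == as.character(predict(iris.logit, newdata=iris)))     # should be TRUE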

The multinom function numerically solves for the coefficients using a quasi-Newton method (similar to the Newton-Raphson method but with an approximate inverse Hessian matrix that is computationally cheaper to obtain). Neither the nnet nor the mlogit package implements stochastic gradient descent, so we write our own code to do this. Do you understand what is happening in the innermost for loop?

X <- model.matrix(Species ~ ., data=iris)
y <- model.matrix(~Species, data=iris)[,-1]    # indicator vectors for labels
labels <- levels(iris$Species)
n <- dim(X)[1]; p <- dim(X)[2]                 # num of obs and num of covariates
expit <- function(v) exp(v)/(sum(exp(v))+1)    # generalised expit for multinomial
logl <- function(beta, X, y){                  # log-likelihood of the coefficients
  sum(log(exp(rowSums(y*(X%*%beta)))/(rowSums(exp(X%*%beta))+1)))
}

beta <- matrix(0, p, 2)                        # initialise matrix of linear coefficients
nepoch <- 200
set.seed(1122); shuffle <- sample(150)         # randomly shuffle indices
for (epoch in 1:nepoch){                       # stochastic gradient descent updates
  for (i in shuffle){
    alpha <- 1/(1+epoch/10)                    # step size (learning rate)
    xi <- X[i,,drop=TRUE]; yi <- y[i,,drop=TRUE]
    prob <- expit(t(beta)%*%xi)                # predicted probs of i-th obs with current betas
    beta <- beta + alpha * xi %*% t(yi - prob) # SGD update
  }
  cat('epoch = ', epoch, 'loglik = ', logl(beta, X, y), '\n')
}

The step size for the stochastic gradient update is chosen to be $(1+e/10)^{-1}$, where $e$ is the current epoch (number of passes through the entire data). Optimisation theory in this area says that any choice of learning rates $\alpha_e$ satisfying $\alpha_e \to 0$, $\sum_{e=1}^{\infty} \alpha_e = \infty$ and $\sum_{e=1}^{\infty} \alpha_e^2 < \infty$ will ensure eventual convergence to a local maximum (which is a global maximum in this case). However, different choices of learning rate can have a huge influence on the speed of convergence.

The above code prints out the log-likelihood of the estimated coefficients after every epoch. We see that the log-likelihood has not converged at the end of 200 epochs. On the other hand, the multinom function has reached numerical convergence. Compare the estimated coefficients beta from stochastic gradient descent to the coefficients obtained by the multinom function.

beta
coef
logl(beta, X, y)
logl(coef, X, y)

They are not the same (actually, they are rather different) and the log-likelihood of the coefficients estimated by the quasi-Newton method is much higher than that from stochastic gradient descent.
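Returning to the step-size schedule for a moment, a quick check (added here, not part of the original handout) that $\alpha_e = (1+e/10)^{-1}$ satisfies all three conditions:
$$
\alpha_e = \frac{10}{10+e} \longrightarrow 0, \qquad
\sum_{e=1}^{\infty} \frac{10}{10+e} = \infty \quad\text{(harmonic-type divergence)}, \qquad
\sum_{e=1}^{\infty} \frac{100}{(10+e)^2} < \infty .
$$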

MNIST data

The MNIST (Modified National Institute of Standards and Technology) data is a database of handwritten digits. We will be using a subset of the database. The full database can be found at http://yann.lecun.com/exdb/mnist/. Load the data into R and explore it a bit.

filepath <- "http://www.statslab.cam.ac.uk/~tw389/teaching/slp18/data/"
filename <- "mnist.csv"
mnist <- read.csv(paste0(filepath, filename), header = TRUE)
mnist[1:10,1:10]
mnist$digit <- as.factor(mnist$digit)

visualise <- function(vec, ...){   # function for graphically displaying a digit
  image(matrix(as.numeric(vec), nrow=28)[,28:1], col=gray((255:0)/255), ...)
}
old_par <- par(mfrow=c(2,2))
for (i in 1:4) visualise(mnist[i,-1])
par(old_par)

Each handwritten digit is stored as a 28 × 28 pixel grayscale image. The grayscale values (0 to 255) for the 784 pixels are stored as row vectors in the dataset. We define a visualise function to graphically display a digit based on its vector of grayscale values.
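As a quick sanity check of this storage format (a small sketch, not in the original code):

length(as.numeric(mnist[1,-1]))   # 784 pixel values per image (28 * 28)
range(as.numeric(mnist[1,-1]))    # grayscale values should lie within 0 to 255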

We use the first 2/3 of the data as training data and the remaining 1/3 as test data. Moreover, many margin pixels are constantly white throughout the dataset; we exclude them from our analysis.

train <- mnist[1:4000,]
identical <- apply(train, 2, function(v){ all(v==v[1]) })
train <- train[,!identical]
test <- mnist[4001:6000,!identical]

LDA. We first fit an LDA classifier to the data. The test error is 16.9%.

mnist.lda <- lda(digit ~ ., data=train)
pred <- predict(mnist.lda, test)
sum(pred$class != test$digit)/2000

We can visualise the outcome by plotting along the first two canonical directions.

plot(pred$x, pch=20, col=as.numeric(test$digit)+1, xlim=c(-10,10), ylim=c(-10,10))
legend('topright', lty=1, col=1:10, legend=0:9)
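Beyond the overall error rate, a confusion matrix shows which digits get mixed up most often; a minimal sketch (output not reproduced here):

table(predicted = pred$class, truth = test$digit)   # rows: predicted digit, columns: true digit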

The first two canonical directions already show good separation of the 10 classes. The dim argument in predict.lda can be used to explicitly fit a reduced-rank LDA classifier. We plot how the training error and test error change as more dimensions are included.

train.err <- test.err <- rep(0, 20)
for (r in 1:20){
  train.err[r] <- sum(predict(mnist.lda, train, dim=r)$class != train$digit)/4000
  test.err[r] <- sum(predict(mnist.lda, test, dim=r)$class != test$digit)/2000
}
plot(train.err, type='l', col='orange', ylim=range(c(train.err,test.err)), ylab='error')
points(test.err, type='l', col='blue')
legend('topright', c('train err', 'test err'), col=c('orange', 'blue'), lty=1)

We see that all the useful information is essentially captured in the first ten canonical directions.

Logistic classifier. Next, we fit a logistic classifier to the MNIST data. We again start with the multinom function. The MaxNWts argument controls the maximum number of coefficients allowed in the multinomial logistic model. The test error is 22.2%.

mnist.logit <- multinom(digit ~ ., data=train, MaxNWts=100000)
sum(predict(mnist.logit, newdata=test) != test$digit)/2000

We also implement stochastic gradient descent ourselves.

X <- cbind(1, as.matrix(train[,-1])/255)
X.test <- cbind(1, as.matrix(test[,-1])/255)
y <- model.matrix(~digit-1, data=train)[,-1]   # use digit 0 as baseline

nepoch <- 20
beta <- matrix(0, 655, 9)
train.err <- test.err <- rep(0, nepoch)
set.seed(1122); shuffle <- sample(4000)        # randomly shuffle indices
for (epoch in 1:nepoch){
  for (i in shuffle){
    alpha <- 1/(1+epoch/10)
    xi <- X[i,,drop=TRUE]; yi <- y[i,,drop=TRUE]
    prob <- expit(t(beta)%*%xi)
    beta <- beta + alpha * xi %*% t(yi - prob)
  }
  # prepend a zero score for the baseline digit 0 before taking the row-wise argmax
  train.err[epoch] <- sum((0:9)[max.col(cbind(0, X%*%beta))] != train[,1])/4000
  test.err[epoch] <- sum((0:9)[max.col(cbind(0, X.test%*%beta))] != test[,1])/2000
}
test.err[nepoch]

The final test error is 12.3%.
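A remark on the prediction rule used above: the columns of beta carry scores for digits 1 to 9 only, and digit 0 is the baseline class with score identically 0, hence the cbind(0, ...) before taking the row-wise argmax. A small helper making this explicit (predict_digit is a hypothetical name, not part of the original code):

predict_digit <- function(beta, X) (0:9)[max.col(cbind(0, X %*% beta))]
mean(predict_digit(beta, X.test) != test[,1])   # should agree with test.err[nepoch]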

We kept track of the training and test errors after each epoch. Here is how they evolve.

plot(train.err, type='l', col='orange', ylim=range(c(train.err,test.err)),
     xlab='epoch', ylab='error')
points(test.err, type='l', col='blue')
legend('topright', c('train err', 'test err'), col=c('orange', 'blue'), lty=1)

Let's have a look at some of the misclassified images.

pred <- (0:9)[max.col(cbind(0, X.test%*%beta))]
err_ind <- (1:2000)[pred != test[,1]]
old_par <- par(mfrow = c(3,3))
for (i in 1:9){
  visualise(mnist[4000+err_ind[i],-1],
            main=paste0('true=', test[err_ind[i],1], ', pred=', pred[err_ind[i]]))
}
par(old_par)

If we compare the log-likelihoods of the estimators from the quasi-Newton method and from stochastic gradient descent, we find that the former still produces a higher log-likelihood. However, it is the latter that gives the better test error. Early stopping in stochastic gradient descent is acting as a form of regularisation that prevents over-fitting in this case.
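A sketch of how this comparison could be carried out (added for concreteness; logLik extracts the log-likelihood of the multinom fit, while logl is the helper defined earlier, applied to the scaled design matrix used for SGD):

logLik(mnist.logit)   # log-likelihood at the quasi-Newton solution
logl(beta, X, y)      # log-likelihood of the SGD coefficients on the training data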