Practical 6: Linear discriminant analysis and logistic classifiers

This practical looks at two different methods of fitting linear classifiers. Linear discriminant analysis is implemented in the MASS package and the logistic classifier is implemented in the nnet package.

Iris data

The iris dataset is available in R (it ships with the datasets package). This is the classical dataset Fisher used in his original 1936 paper on linear discriminant analysis. The dataset contains 150 observations of iris flowers. Each observation consists of measurements of the length and width of both the flower's sepals and petals, followed by the species of the flower. Read in the data and perform some exploratory analysis.

library(MASS)
data(iris)            # load data
head(iris)
pairs(iris[,1:4])

LDA. Now, perform linear discriminant analysis using the lda function from the MASS package and interpret the output.

iris.lda <- lda(Species~., data=iris)
iris.lda
# Call:
# lda(Species ~ ., data = iris)
#
# Prior probabilities of groups:
#     setosa versicolor  virginica
#  0.3333333  0.3333333  0.3333333
#
# Group means:
#            Sepal.Length Sepal.Width Petal.Length Petal.Width
# setosa            5.006       3.428        1.462       0.246
# versicolor        5.936       2.770        4.260       1.326
# virginica         6.588       2.974        5.552       2.026
#
# Coefficients of linear discriminants:
#                     LD1         LD2
# Sepal.Length  0.8293776  0.02410215
# Sepal.Width   1.5344731  2.16452123
# Petal.Length -2.2012117 -0.93192121
# Petal.Width  -2.8104603  2.83918785
#
# Proportion of trace:
#    LD1    LD2
# 0.9912 0.0088

The prior probabilities are the estimated group priors $\hat\pi_l$, $l = 1, \dots, 3$. The group means are the class centroids $\hat\mu_l$, $l = 1, \dots, 3$. As mentioned in the lecture, LDA for $L$ classes can be viewed as a nearest class centroid (adjusted by class priors) classification method after projecting the data onto an (at most) $(L-1)$-dimensional space. The matrix of coefficients of linear discriminants
$$
A = \begin{pmatrix}
 0.8293776 &  0.02410215 \\
 1.5344731 &  2.16452123 \\
-2.2012117 & -0.93192121 \\
-2.8104603 &  2.83918785
\end{pmatrix}
$$
is precisely that projection matrix. (Though not covered in the course, the two columns LD1 and LD2 in this case correspond to the first and second canonical direction vectors. The proportions of trace are the corresponding Rayleigh quotients, i.e. the ratios of between-class to within-class variance along the two canonical directions, normalised to sum to one. We see that most of the signal is captured in a single canonical direction.)

We can visualise the LDA output using the following plot command. See ?plot.lda for more details.

plot(iris.lda)

Why is the plot two-dimensional? How are the x- and y-coordinates of the two-dimensional plot computed?

We can manually obtain the same plot (colour-coding the classes) as follows.

A <- iris.lda$scaling
X <- as.matrix(iris[,1:4])
plot(X %*% A, pch=20, col=iris$Species)

pred <- predict(iris.lda, newdata=iris)   # another way to obtain projected points
plot(pred$x, pch=20, col=iris$Species)    # predict also centres the data
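As a sanity check, the two plots can be reconciled numerically. Below is a minimal sketch, assuming that predict.lda centres the data at the prior-weighted average of the class centroids before projecting; the objects iris.lda, A, X and pred are from the code above.

ctr <- colSums(iris.lda$prior * iris.lda$means)    # prior-weighted average of the class centroids
proj <- scale(X, center = ctr, scale = FALSE) %*% A
all.equal(unname(proj), unname(pred$x))            # TRUE if the centring assumption holds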
What is the training misclassification error of LDA? What is the leave-one-out cross-validated misclassification error?

Logistic classifier. Now, let's apply logistic regression classification to the same dataset. This can be done either using the multinom function in the nnet package (i.e. treating the logistic regression classifier as a neural network with no hidden layers!) or the mlogit function in the mlogit package. (Of course, if it is a two-class classification problem, we can also use the good old glm function.) We will use the former in this practical.

library(nnet)
iris.logit <- multinom(Species~., data=iris, maxit=200)
coef <- t(coef(iris.logit))
coef
#              versicolor  virginica
# (Intercept)   18.408209 -24.230061
# Sepal.Length  -6.082250  -8.547304
# Sepal.Width   -9.396625 -16.077164
# Petal.Length  16.170374  25.599633
# Petal.Width   -2.058115  16.227474

The maxit argument is set to 200 for convergence. For which covariate values will a flower be classified as virginica?

The misclassification training error of the logistic classifier is 1.3%.

sum(predict(iris.logit, newdata=iris) != iris$Species)/150
# 0.013333333
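A training error of 1.3% corresponds to two misclassified flowers. To see which classes they fall into, one possible follow-up (using only objects defined above) is to cross-tabulate the fitted labels against the true species:

table(predicted = predict(iris.logit, newdata=iris), true = iris$Species)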
The multinom function numerically solves for the coefficients using a quasi-Newton method (similar to the Newton-Raphson method, but with an approximate inverse Hessian matrix that is computationally cheaper to obtain). Neither the nnet nor the mlogit package implements stochastic gradient descent, so we write our own code to achieve this. Do you understand what is happening in the innermost for loop?

X <- model.matrix(Species~., data=iris)
y <- model.matrix(~Species, data=iris)[,-1]    # indicator vectors for labels
labels <- levels(iris$Species)
n <- dim(X)[1]; p <- dim(X)[2]                 # number of observations and of covariates
expit <- function(v) exp(v)/(sum(exp(v))+1)    # generalised expit for multinomial
logl <- function(beta, X, y){                  # log-likelihood of the coefficients
  sum(log(exp(rowSums(y*(X%*%beta)))/(rowSums(exp(X%*%beta))+1)))
}

beta <- matrix(0, p, 2)                        # initialise matrix of linear coefficients
nepoch <- 200
set.seed(1122); shuffle <- sample(150)         # randomly shuffle indices
for (epoch in 1:nepoch){                       # stochastic gradient descent updates
  for (i in shuffle){
    alpha <- 1/(1+epoch/10)                    # step size (learning rate)
    xi <- X[i,,drop=TRUE]; yi <- y[i,,drop=TRUE]
    prob <- expit(t(beta)%*%xi)                # predicted probs of i-th obs with current beta
    beta <- beta + alpha * xi %*% t(yi - prob) # SGD update
  }
  cat('epoch = ', epoch, 'loglik = ', logl(beta, X, y), '\n')
}

The step size for the stochastic gradient update is chosen to be $(1 + e/10)^{-1}$, where $e$ is the current epoch (number of passes through the entire data). Optimisation theory in this area says that any choice of learning rates $\alpha_e$ satisfying $\alpha_e \geq 0$, $\sum_{e=1}^{\infty} \alpha_e = \infty$ and $\sum_{e=1}^{\infty} \alpha_e^2 < \infty$ will ensure eventual convergence to a local maximum (which is a global maximum in this case). However, different choices of learning rate can have a huge influence on the speed of convergence.

The above code prints out the log-likelihood of the estimated coefficients after every epoch. We see that the log-likelihood has not converged by the end of 200 epochs. On the other hand, the multinom function has reached numerical convergence. Compare the coefficients beta estimated by stochastic gradient descent to the coefficients obtained by the multinom function.

beta
coef
logl(beta, X, y)
logl(coef, X, y)

They are not the same (in fact, they are rather different), and the log-likelihood of the coefficients estimated by the quasi-Newton method is much higher than that of the stochastic gradient descent estimate.
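Even though the two sets of coefficients differ, the fitted class labels need not. A possible check, sketched below, reuses the objects labels, X, beta and iris.logit defined above; cbind(0, ...) adds the baseline class setosa, whose linear predictor is zero.

labels.sgd <- labels[max.col(cbind(0, X %*% beta))]            # fitted labels from the SGD coefficients
labels.qn  <- as.character(predict(iris.logit, newdata=iris))  # fitted labels from multinom
mean(labels.sgd != labels.qn)                                  # proportion of flowers on which the two classifiers disagree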
MNIST data

The MNIST (Modified National Institute of Standards and Technology) data is a database of handwritten digits. We will be using a subset of the database. The full database can be found at http://yann.lecun.com/exdb/mnist/. Load the data into R and explore a bit.

filepath <- "http://www.statslab.cam.ac.uk/~tw389/teaching/slp18/data/"
filename <- "mnist.csv"
mnist <- read.csv(paste0(filepath, filename), header = TRUE)
mnist[1:10,1:10]
mnist$digit <- as.factor(mnist$digit)

visualise <- function(vec, ...){   # function for graphically displaying a digit
  image(matrix(as.numeric(vec), nrow=28)[,28:1], col=gray((255:0)/255), ...)
}

old_par <- par(mfrow=c(2,2))
for (i in 1:4) visualise(mnist[i,-1])
par(old_par)

Each handwritten digit is stored as a 28 × 28 pixel grayscale image. The grayscale values (0 to 255) of the 784 pixels are stored as row vectors in the dataset. We define a visualise function to graphically display a digit based on its vector of grayscale values.
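Before fitting any classifier, it may be worth checking the size of the subset and how the images are distributed across the ten digits. A small sketch, using only the mnist data frame loaded above:

dim(mnist)           # number of images and number of columns (digit label + 784 pixels)
table(mnist$digit)   # number of images of each digit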
We use the first 2/3 of the data as training data and the remaining 1/3 as test data. Moreover, many margin pixels are constantly white throughout the dataset; we exclude them from our analysis.

train <- mnist[1:4000,]
identical <- apply(train, 2, function(v){all(v==v[1])})
train <- train[,!identical]
test <- mnist[4001:6000,!identical]

LDA. We first fit an LDA classifier to the data. The test error is 16.9%.

mnist.lda <- lda(digit~., data=train)
pred <- predict(mnist.lda, test)
sum(pred$class != test$digit)/2000

We can visualise the outcome by plotting along the first two canonical directions.

plot(pred$x, pch=20, col=as.numeric(test$digit)+1, xlim=c(-10,10), ylim=c(-10,10))
legend('topright', lty=1, col=1:10, legend=0:9)
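The 16.9% test error can also be broken down by digit. One possible way to see which digits are confused with one another (a sketch reusing pred and test from above):

table(predicted = pred$class, true = test$digit)   # rows: predicted digit, columns: true digit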
The first two canonical directions already show good separation of the 10 classes. The dimen argument of predict.lda can be used to explicitly fit a reduced-rank LDA classifier. We plot how the training error and test error change as more dimensions are included.

train.err <- test.err <- rep(0, 20)
for (r in 1:20){
  train.err[r] <- sum(predict(mnist.lda, train, dimen=r)$class != train$digit)/4000
  test.err[r] <- sum(predict(mnist.lda, test, dimen=r)$class != test$digit)/2000
}
plot(train.err, type='l', col='orange', ylim=range(c(train.err,test.err)), ylab='error')
points(test.err, type='l', col='blue')
legend('topright', c('train err', 'test err'), col=c('orange', 'blue'), lty=1)
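The error curves flatten off quickly. One way to quantify how much signal each canonical direction carries is via the singular values that lda stores in its svd component; their normalised squares are what lda reports as the proportion of trace. A sketch, assuming mnist.lda from above:

prop <- mnist.lda$svd^2 / sum(mnist.lda$svd^2)   # proportion of trace of each canonical direction
round(cumsum(prop), 3)                           # cumulative proportion captured by the first r directions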
We see that all useful information is essentially captured in the first ten canonical directions (indeed, with 10 classes there are at most 9 canonical directions, so the errors cannot change beyond r = 9).

Logistic classifier. Next, we try to fit a logistic classifier to the MNIST data. We again start with the multinom function. The MaxNWts argument controls the maximum number of coefficients (weights) allowed in the multinomial logistic model. The test error is 22.2%.

mnist.logit <- multinom(digit~., data=train, MaxNWts=100000)
sum(predict(mnist.logit, newdata=test) != test$digit)/2000

We also implement stochastic gradient descent ourselves.

X <- cbind(1, as.matrix(train[,-1])/255)
X.test <- cbind(1, as.matrix(test[,-1])/255)
y <- model.matrix(~digit-1, data=train)[,-1]   # use digit 0 as baseline
nepoch <- 20
beta <- matrix(0, 655, 9)
train.err <- test.err <- rep(0, nepoch)
set.seed(1122); shuffle <- sample(4000)        # randomly shuffle indices
for (epoch in 1:nepoch){
  for (i in shuffle){
    alpha <- 1/(1+epoch/10)
    xi <- X[i,,drop=TRUE]; yi <- y[i,,drop=TRUE]
    prob <- expit(t(beta)%*%xi)
    beta <- beta + alpha * xi %*% t(yi - prob)
  }
  train.err[epoch] <- sum((0:9)[max.col(cbind(0, X%*%beta))] != train[,1])/4000
  test.err[epoch] <- sum((0:9)[max.col(cbind(0, X.test%*%beta))] != test[,1])/2000
}
test.err[nepoch]

The final test error is 12.3%. We kept track of the training and test errors after each
epoch. Here is how they evolve.

plot(train.err, type='l', col='orange', ylim=range(c(train.err,test.err)),
     xlab='epoch', ylab='error')
points(test.err, type='l', col='blue')
legend('topright', c('train err', 'test err'), col=c('orange', 'blue'), lty=1)

Let's have a look at some of the misclassified images.

pred <- (0:9)[max.col(cbind(0, X.test%*%beta))]
err_ind <- (1:2000)[pred != test[,1]]
old_par <- par(mfrow = c(3,3))
for (i in 1:9){
  visualise(mnist[4000+err_ind[i],-1],
            main=paste0('true=', test[err_ind[i],1], ', pred=', pred[err_ind[i]]))
}
par(old_par)
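The code above displays only the first nine misclassified test images. A quick sanity check (reusing err_ind from above) is that the total number of misclassified images matches the reported test error:

length(err_ind)   # should equal 2000 times the final test error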
If we compare the log-likelihoods of the estimators from the quasi-Newton method and from stochastic gradient descent, we find that the former still produces a higher log-likelihood. However, it is the latter that gives the better test error. Early stopping in stochastic gradient descent is acting as a form of regularisation to prevent over-fitting in this case.
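One possible way to carry out this comparison is sketched below. It reuses logl, X, y and beta from the stochastic gradient descent code; note that multinom was fitted on unscaled pixel values, so its pixel coefficients are rescaled by 255 before being evaluated on the scaled design matrix X. (The naive likelihood computation inside logl may overflow for very large linear predictors; a numerically stable version would work on the log scale.)

coef.qn <- t(coef(mnist.logit))       # quasi-Newton coefficients: intercept plus pixel coefficients, one column per non-baseline digit
coef.qn[-1,] <- coef.qn[-1,] * 255    # put the pixel coefficients on the scale of X (pixels divided by 255)
logl(beta, X, y)                      # log-likelihood of the stochastic gradient descent estimate
logl(coef.qn, X, y)                   # log-likelihood of the quasi-Newton estimate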