Discriminant analysis in R
QMMA
Emanuele Taufer
Default data

Get the Default data set:

library(ISLR)
data(Default)
head(Default)

  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559
LDA with R

The lda() function, in the MASS library, allows us to tackle classification problems with LDA. Its syntax is similar to that of the lm() and glm() functions.

library(MASS)
lda.fit=lda(default~balance+income+student,data=Default)
Output

lda.fit

Call:
lda(default ~ balance + income + student, data = Default)

Prior probabilities of groups:
    No    Yes 
0.9667 0.0333 

Group means:
      balance   income studentYes
No   803.9438 33566.17  0.2914037
Yes 1747.8217 32089.15  0.3813814

Coefficients of linear discriminants:
                     LD1
balance     2.243541e-03
income      3.367310e-06
studentYes -1.746631e-01
Output analysis

Prior probabilities of groups: estimates of the probabilities π_k, k = 1, 2 (the relative frequencies of the two groups in the data).

Group means: estimates of the conditional means of x given y. For example, those who did not default have an average balance of 803.94 (dollars), while those who defaulted have an average balance of 1747.82. Note also that, since student is a qualitative variable, the group mean of studentYes is the proportion of students: among those who defaulted, approximately 38% are students.

Coefficients of linear discriminants: the linear combination (LD1) used as the discriminant. The coefficients indicate the importance of the predictors in the classification: balance has a positive effect on LD1, larger than that of income, while being a student decreases the value of LD1. Compare the coefficients of LD1 with those obtained by logistic regression.
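To see concretely that LD1 is this linear combination of the predictors, here is a small sketch on simulated data (the data and variable names x1, x2 are ours, not the Default data): up to a constant centering shift, the scores returned by predict() equal the data matrix multiplied by the coefficient vector in fit$scaling.

```r
library(MASS)

set.seed(1)
# Simulated two-class data, a stand-in for the Default predictors
n  <- 200
g  <- factor(rep(c("No", "Yes"), each = n))
x1 <- rnorm(2 * n, mean = ifelse(g == "Yes", 2, 0))
x2 <- rnorm(2 * n, mean = ifelse(g == "Yes", 1, 0))
d  <- data.frame(g, x1, x2)

fit <- lda(g ~ x1 + x2, data = d)

# LD1 by hand: predictors times the 'Coefficients of linear discriminants'
manual <- as.matrix(d[, c("x1", "x2")]) %*% fit$scaling
auto   <- predict(fit)$x

# The two differ only by a constant shift (predict() centers the data)
range(manual - auto)
```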
Plot

The distribution of LD1 for the two classification groups: high values of LD1 correspond to a higher probability of default.

plot(lda.fit)
Prediction

The classification of training (or test) units is done with the predict() function. Its output contains several objects: we use the names() function to see what they are and, in order to analyze and use them, we put everything in a data.frame.

lda.pred=predict(lda.fit, Default)
names(lda.pred)

[1] "class"     "posterior" "x"

previsione<-as.data.frame(lda.pred)
head(previsione)

  class posterior.No posterior.Yes         LD1
1    No    0.9967765   0.003223517 -0.14953711
2    No    0.9973105   0.002689531 -0.23615933
3    No    0.9852914   0.014708600  0.57988235
4    No    0.9988157   0.001184329 -0.62801554
5    No    0.9959768   0.004023242 -0.04346935
6    No    0.9957918   0.004208244 -0.02194120
Classification table

Let's build a data.frame, Default.LDA, which contains the training data and the LDA predictions. We also build a classification table.

Default.LDA<-cbind(Default, "pred.def"=lda.pred$class, "pr.def"=round(lda.pred$posterior,4))
head(Default.LDA)

  default student   balance    income pred.def pr.def.No pr.def.Yes
1      No      No  729.5265 44361.625       No    0.9968     0.0032
2      No     Yes  817.1804 12106.135       No    0.9973     0.0027
3      No      No 1073.5492 31767.139       No    0.9853     0.0147
4      No      No  529.2506 35704.494       No    0.9988     0.0012
5      No      No  785.6559 38463.496       No    0.9960     0.0040
6      No     Yes  919.5885  7491.559       No    0.9958     0.0042

attach(Default.LDA)
addmargins(table(default,pred.def))

       pred.def
default   No  Yes   Sum
    No  9645   22  9667
    Yes  254   79   333
    Sum 9899  101 10000
Change the classification criterion

Let's build a new variable, lda.pred.2, that classifies a unit as default = Yes if the default probability estimated by the LDA model is greater than 0.2.

# Build a vector of "No"
lda.pred.2=rep("No",nrow(Default))
# Change to "Yes" the elements of that vector for which P(Yes)>0.2
lda.pred.2[Default.LDA$pr.def.Yes>0.2]="Yes"
# Build the classification table
addmargins(table(default,lda.pred.2))

       lda.pred.2
default   No  Yes   Sum
    No  9435  232  9667
    Yes  140  193   333
    Sum 9575  425 10000
Build the ROC curve

To create the ROC curve it is necessary to install the ROCR library. Steps:

1. Given a classification rule, start by creating a prediction object, which transforms the input data:

prediction.obj = prediction(predictions, labels)

predictions: the estimated probability that a unit is positive.
labels: the classification observed in the data (test or training). It should be provided as a factor, with the lower level corresponding to the negative class and the upper level to the positive class.
2. Calculate a performance indicator:

perf.obj=performance(prediction.obj, measure, x.measure)

measure: the performance measure to use, e.g.
  fpr - false positive rate
  tpr - true positive rate
  auc - area under the ROC curve
A complete list of performance measures can be found in the R help of the ROCR package.
x.measure: a second (optional) performance measure, used for the x axis.

3. To get the ROC curve, plot a perf.obj with measure="tpr" and x.measure="fpr".
Example with Default data

library(ROCR)
pred <- prediction(Default.LDA$pr.def.Yes, Default$default)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE, lwd=3)
AUC

In performance(), use measure="auc" with the option fpr.stop=0.5 to get the partial area under the ROC curve up to a false positive rate of 0.5. Here it is 0.4499787.

perf.2<-performance(pred,measure="auc",fpr.stop=0.5)
perf.2@y.values[[1]][1]

## [1] 0.4499787
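As a rough sketch of what fpr.stop does: the partial AUC is the trapezoidal area under the ROC curve restricted to FPR ≤ fpr.stop. The helper below is our base-R illustration, not ROCR's implementation, and it assumes the FPR grid contains the cut-off point.

```r
# Trapezoidal (partial) area under an ROC curve, integrating FPR only
# up to fpr.stop -- an illustrative sketch of performance(..., "auc",
# fpr.stop = 0.5); the helper name is ours, not ROCR's.
partial_auc <- function(fpr, tpr, fpr.stop = 1) {
  keep <- fpr <= fpr.stop
  x <- fpr[keep]; y <- tpr[keep]
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# A perfect classifier reaches TPR = 1 immediately, so its partial AUC
# up to FPR = 0.5 is the full rectangle of width 0.5:
partial_auc(c(0, 0, 0.5, 1), c(0, 1, 1, 1), fpr.stop = 0.5)  # 0.5
```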
LOOCV cross-validation

The CV=TRUE option in the lda() function performs a LOOCV on the data. In this case the output contains the predicted classification and the posterior probabilities (for each unit left out of the data set). To see only a part of the output, it is convenient to put the results in a data.frame.

lda.fit=lda(default~balance+income+student,data=Default,CV=TRUE)
names(lda.fit)

[1] "class"     "posterior" "terms"     "call"      "xlevels"

lda.loocv<-data.frame(lda.fit$class,lda.fit$posterior)
head(lda.loocv)

  lda.fit.class        No         Yes
1            No 0.9967755 0.003224456
2            No 0.9973093 0.002690741
3            No 0.9852851 0.014714856
4            No 0.9988154 0.001184632
5            No 0.9959757 0.004024296
6            No 0.9957890 0.004210951
Estimate of the test error rate with LOOCV

We compute a classification table (default = Yes if the default probability estimated by the LDA model is greater than 0.2) based on the results of the LOOCV.

lda.cv.pred.2=rep("No",nrow(Default))
lda.cv.pred.2[lda.loocv$Yes>0.2]="Yes"
addmargins(table(Default$default,lda.cv.pred.2))

     lda.cv.pred.2
        No  Yes   Sum
  No  9435  232  9667
  Yes  140  193   333
  Sum 9575  425 10000
QDA

Quadratic discriminant analysis can be performed using the function qda().

qda.fit<-qda(default~balance+income+student,data=Default)
qda.fit

Call:
qda(default ~ balance + income + student, data = Default)

Prior probabilities of groups:
    No    Yes 
0.9667 0.0333 

Group means:
      balance   income studentYes
No   803.9438 33566.17  0.2914037
Yes 1747.8217 32089.15  0.3813814
LOOCV for QDA

As for LDA, a LOOCV can be performed simply by adding the CV=TRUE option.

qda.fit<-qda(default~balance+income+student,data=Default,CV=TRUE)
names(qda.fit)

[1] "class"     "posterior" "terms"     "call"      "xlevels"

qda.loocv<-data.frame(qda.fit$class,qda.fit$posterior)
head(qda.loocv)

  qda.fit.class        No          Yes
1            No 0.9993994 0.0006005886
2            No 0.9994829 0.0005170994
3            No 0.9900671 0.0099329377
4            No 0.9998886 0.0001113890
5            No 0.9989423 0.0010577454
6            No 0.9986900 0.0013100168

qda.cv.pred.2=rep("No",nrow(Default))
qda.cv.pred.2[qda.loocv$Yes>0.2]="Yes"
addmargins(table(Default$default,qda.cv.pred.2))

     qda.cv.pred.2
        No  Yes   Sum
  No  9336  331  9667
  Yes  121  212   333
  Sum 9457  543 10000
LDA - QDA comparison

Using the results obtained, we compare LDA and QDA through the LOOCV estimates of the test error rates, with classification rule: default = Yes if the default probability estimated by the LDA (or QDA) model is greater than 0.2.

LDA: sensitivity 57.96% (193/333); specificity 97.60% (9435/9667)
QDA: sensitivity 63.66% (212/333); specificity 96.58% (9336/9667)
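These rates can be read off the LOOCV classification tables directly. A short base-R sketch, using the LDA counts from the table above:

```r
# Sensitivity and specificity from the LOOCV classification table for LDA
# (rows = observed default, columns = predicted default)
conf_lda <- matrix(c(9435, 232,
                      140, 193),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(obs  = c("No", "Yes"),
                                   pred = c("No", "Yes")))

sens <- conf_lda["Yes", "Yes"] / sum(conf_lda["Yes", ])  # 193/333
spec <- conf_lda["No",  "No"]  / sum(conf_lda["No",  ])  # 9435/9667

round(c(sensitivity = sens, specificity = spec), 4)
```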
Compare LDA - QDA with the ROC curve

library(ROCR)
lda.pred <- prediction(lda.loocv$Yes, Default$default)
lda.perf <- performance(lda.pred, measure = "tpr", x.measure = "fpr")
qda.pred <- prediction(qda.loocv$Yes, Default$default)
qda.perf <- performance(qda.pred, measure = "tpr", x.measure = "fpr")
plot(lda.perf, col="blue",lwd=4)
plot(qda.perf, col="red",lwd=4,add=TRUE)
legend(0.6,0.6,c('LDA','QDA'),col=c('blue','red'),lwd=3)
Comparison with logistic regression

The logistic regression model:

glm.fit<-glm(default~., data=Default, family=binomial)
summary(glm.fit)

## Call:
## glm(formula = default ~ ., family = binomial, data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4691  -0.1418  -0.0557  -0.0203   3.7383  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8
Cross-validation for logistic regression

We cross-validate the logistic regression model and compare it with LDA and QDA using the ROC curves.

The cv.glm() function of the boot library (seen in previous labs) directly provides an estimate of the test error using LOOCV or k-fold CV. To compute the ROC curve, however, we need the predicted probabilities. The code below computes the default probabilities with a LOOCV and stores them in the vector glm.loocv.

glm.loocv=vector()
#for(i in 1:nrow(Default)){
for(i in 1:10000){
  T.Def=Default[-i,]
  T.glm.fit=glm(default~.,data=T.Def,family=binomial)
  glm.loocv[i]=predict(T.glm.fit,Default[i,],type="response")
}

To build the ROC curve use:

library(ROCR)
glm.pred <- prediction(glm.loocv, Default$default)
glm.perf <- performance(glm.pred, measure = "tpr", x.measure = "fpr")
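For the error-rate estimate itself, cv.glm() accepts a custom cost function, so the 0.2 cut-off used above can be cross-validated directly. A sketch on simulated data (the data, names, and K = 10 are our illustration; on Default a full LOOCV with cv.glm() would refit the model 10000 times):

```r
library(boot)

set.seed(1)
# Small simulated binary-response data set, a stand-in for Default
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 2 * x))
d <- data.frame(y, x)

fit <- glm(y ~ x, data = d, family = binomial)

# cost(): misclassification rate with a 0.2 cut-off, matching the
# classification rule used in the slides above
cost <- function(obs, prob) mean((prob > 0.2) != obs)

# 10-fold CV estimate of the test error rate
cv.err <- cv.glm(d, fit, cost = cost, K = 10)$delta[1]
cv.err
```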
Compare LDA - QDA - LGR with ROCs

lda.pred <- prediction(lda.loocv$Yes, Default$default)
lda.perf <- performance(lda.pred, measure = "tpr", x.measure = "fpr")
qda.pred <- prediction(qda.loocv$Yes, Default$default)
qda.perf <- performance(qda.pred, measure = "tpr", x.measure = "fpr")
plot(lda.perf, col="blue",lwd=4)
plot(qda.perf, col="red",lwd=4,add=TRUE)
plot(glm.perf, col="green",lwd=4,add=TRUE)
legend(0.6,0.6,c('LDA','QDA','LGR'),col=c('blue','red','green'),lwd=3)
AUC

The AUCs for the three models, calculated as above (with fpr.stop=0.5) on the cross-validated LOOCV probabilities, are:

LDA: 0.4495681
QDA: 0.4487009
LGR: 0.4493764

In this case the three models are essentially equivalent; the choice of which model to use can be based on considerations other than predictive performance.
References

An Introduction to Statistical Learning, with Applications in R (Springer, 2013).

Tobias Sing, Oliver Sander, Niko Beerenwinkel, Thomas Lengauer. ROCR: visualizing classifier performance in R. Bioinformatics 21(20):3940-3941 (2005).