Discriminant analysis in R QMMA

1 Discriminant analysis in R QMMA Emanuele Taufer file:///c:/users/emanuele.taufer/google%20drive/2%20corsi/5%20qmma%20-%20mim/0%20labs/l4-lda-eng.html#(1) 1/26

Default data

Get the data set Default:

    library(ISLR)
    data(Default)
    head(Default)

      default student balance income
    1      No      No
    2      No     Yes
    3      No      No
    4      No      No
    5      No      No
    6      No     Yes

LDA with R

The lda() function, available in the MASS library, handles classification problems with LDA. Its syntax is similar to that of the lm() and glm() functions.

    library(MASS)
    lda.fit=lda(default~balance+income+student,data=Default)

Output

    lda.fit

    Call:
    lda(default ~ balance + income + student, data = Default)

    Prior probabilities of groups:
     No Yes

    Group means:
        balance income studentYes
    No
    Yes

    Coefficients of linear discriminants:
                  LD1
    balance       e-03
    income        e-06
    studentYes    e-01

Output analysis

Prior probabilities of groups: estimates of the probabilities π_k, k = 1, 2 (the relative frequencies of the two groups in the data).

Group means: estimates of the conditional means of each predictor x given the class y. For example, those who did not default have a lower average balance (in dollars) than those who defaulted. Note also that, because student is a qualitative variable, its group mean is a proportion: among those who defaulted, approximately 38% are students.

Coefficients of linear discriminants: the linear combination (LD1) used as discriminant. The coefficients indicate how each predictor contributes to the classification, on that predictor's own scale. balance has a positive effect on LD1, with a larger coefficient than income. Being a student decreases the value of LD1. Compare the coefficients of LD1 with those obtained in the logistic regression.
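
The three parts of the output can be reproduced by hand on a small example. The sketch below uses simulated data rather than Default; the data frame toy and the names prior_by_hand, means_by_hand and ld1_by_hand are hypothetical and chosen only for illustration. It assumes that predict() centres the predictors at the prior-weighted average of the group means before applying the coefficients, which is how MASS computes the LD1 scores.

```r
library(MASS)  # provides lda()

set.seed(1)
# Hypothetical toy data: two classes, two numeric predictors
n0 <- 80
n1 <- 20
toy <- data.frame(
  y  = factor(rep(c("No", "Yes"), c(n0, n1))),
  x1 = c(rnorm(n0, mean = 0), rnorm(n1, mean = 2)),
  x2 = c(rnorm(n0, mean = 0), rnorm(n1, mean = 1))
)

fit <- lda(y ~ x1 + x2, data = toy)

# Prior probabilities: relative frequencies of the classes
prior_by_hand <- as.vector(table(toy$y) / nrow(toy))

# Group means: per-class averages of each predictor
means_by_hand <- rbind(
  No  = colMeans(toy[toy$y == "No",  c("x1", "x2")]),
  Yes = colMeans(toy[toy$y == "Yes", c("x1", "x2")])
)

# LD1 scores: centre the predictors, then apply the coefficients
centre <- colSums(fit$prior * fit$means)
X <- as.matrix(toy[, c("x1", "x2")])
ld1_by_hand <- scale(X, center = centre, scale = FALSE) %*% fit$scaling
```

Comparing prior_by_hand with fit$prior, means_by_hand with fit$means, and ld1_by_hand with predict(fit, toy)$x shows where each part of the printed output comes from.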

Plot

The distribution of LD1 for the two classification groups: high values of LD1 give a greater probability of default.

    plot(lda.fit)

Prediction

The classification of training (or test) units can be done with the predict() function.

The output of predict() is a list of objects: we use the names() function to see what they are and, in order to analyze and use them, we put everything in a data.frame.

    lda.pred=predict(lda.fit, Default)
    names(lda.pred)

    [1] "class"     "posterior" "x"

    previsione<-as.data.frame(lda.pred)
    head(previsione)

      class posterior.No posterior.Yes LD1
    1    No
    2    No
    3    No
    4    No
    5    No
    6    No
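
As a minimal sketch of how the three components relate, the example below fits an LDA on simulated data (the toy data frame and class_by_hand are hypothetical names) and rebuilds the predicted class as the group with the largest posterior probability. With two groups this is the same as classifying as Yes when P(Yes) exceeds 0.5.

```r
library(MASS)  # provides lda()

set.seed(2)
# Hypothetical toy data: one numeric predictor, two classes
toy <- data.frame(
  y = factor(rep(c("No", "Yes"), c(150, 50))),
  x = c(rnorm(150, mean = 0), rnorm(50, mean = 1.5))
)

fit  <- lda(y ~ x, data = toy)
pred <- predict(fit, toy)

# class, posterior (one column per group) and x (the LD1 scores)
out <- as.data.frame(pred)

# The predicted class is the group with the largest posterior probability
class_by_hand <- colnames(pred$posterior)[max.col(pred$posterior)]
```

Each row of pred$posterior sums to 1, so the posterior.Yes column alone is enough to apply a different classification threshold later.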

Classification table

Let's build a data.frame, Default.LDA, which contains the training data and the LDA prediction. We also build a classification table.

    Default.LDA<-cbind(Default, "pred.def"=lda.pred$class, "pr.def"=round(lda.pred$pos
    head(Default.LDA)

      default student balance income pred.def pr.def.No pr.def.Yes
    1      No      No                      No
    2      No     Yes                      No
    3      No      No                      No
    4      No      No                      No
    5      No      No                      No
    6      No     Yes                      No

    attach(Default.LDA)
    addmargins(table(default,pred.def))

           pred.def
    default No Yes Sum
        No
        Yes
        Sum

Change the classification criterion

Let's build a new variable, lda.pred.2, that classifies units as default = Yes if the default probability estimated by the LDA model is greater than 0.2.

    # Build a vector of "No"
    lda.pred.2=rep("No",nrow(Default))
    # Change to "Yes" the elements of that vector for which P(Yes)>0.2
    lda.pred.2[Default.LDA$pr.def.Yes>0.2]="Yes"
    # Build the classification table
    addmargins(table(default,lda.pred.2))

           lda.pred.2
    default No Yes Sum
        No
        Yes
        Sum
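
The effect of lowering the threshold can be seen on a tiny example. The sketch below uses ten made-up units (truth and p_yes are hypothetical values, not taken from Default) and two hypothetical helper functions, classify() and rates(), to compare the thresholds 0.5 and 0.2.

```r
# Hypothetical observed classes and estimated probabilities of "Yes"
truth <- factor(c("No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
                levels = c("No", "Yes"))
p_yes <- c(0.05, 0.10, 0.15, 0.25, 0.30, 0.45, 0.18, 0.35, 0.60, 0.90)

# Classify as "Yes" when the estimated probability exceeds the cutoff
classify <- function(p, cutoff) {
  factor(ifelse(p > cutoff, "Yes", "No"), levels = c("No", "Yes"))
}

# Sensitivity and specificity from the classification table
rates <- function(truth, pred) {
  tab <- table(truth, pred)
  c(sensitivity = tab["Yes", "Yes"] / sum(tab["Yes", ]),
    specificity = tab["No", "No"] / sum(tab["No", ]))
}

r05 <- rates(truth, classify(p_yes, 0.5))  # sensitivity 0.50, specificity 1.00
r02 <- rates(truth, classify(p_yes, 0.2))  # sensitivity 0.75, specificity 0.50
```

Lowering the cutoff catches more of the true Yes units at the price of more false positives, which is the trade-off the ROC curve summarizes.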

Build the ROC curve

Creating the ROC curve requires the ROCR library. Steps:

1. Given a classification rule, start by creating a prediction object. The prediction() function transforms the input data into the format ROCR works with.

    prediction.obj = prediction(predictions, labels)

   predictions: the estimated probability that a unit is positive
   labels: the classification observed in the data (test or training). It should be provided as a factor, with the lower level corresponding to the negative class and the upper level to the positive class.

2. Calculate a performance indicator.

    perf.obj=performance(prediction.obj, measure, x.measure)

   measure: the performance measure to use:
     `fpr` - false positive rate
     `tpr` - true positive rate
     `auc` - area under the ROC curve
   A complete list of performance measures can be found in the R help of the ROCR package.
   x.measure: a second performance measure.

3. To get the ROC curve, plot a perf.obj computed with measure="tpr" and x.measure="fpr".
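
To make clear what these steps compute, the points of a ROC curve can also be obtained by hand in base R. This is only a sketch of the idea, not how ROCR is implemented; roc_points() is a hypothetical helper and the data are made up.

```r
# Hypothetical observed classes and estimated probabilities of "Yes"
truth <- factor(c("No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
                levels = c("No", "Yes"))
p_yes <- c(0.05, 0.10, 0.15, 0.25, 0.30, 0.45, 0.18, 0.35, 0.60, 0.90)

# For every cutoff, classify as positive when p >= cutoff and record
# the true positive rate and the false positive rate
roc_points <- function(p, truth, positive = "Yes") {
  cutoffs <- c(Inf, sort(unique(p), decreasing = TRUE))
  is_pos  <- truth == positive
  tpr <- sapply(cutoffs, function(k) mean(p[is_pos] >= k))
  fpr <- sapply(cutoffs, function(k) mean(p[!is_pos] >= k))
  data.frame(cutoff = cutoffs, fpr = fpr, tpr = tpr)
}

roc <- roc_points(p_yes, truth)
# plot(roc$fpr, roc$tpr, type = "s", xlab = "fpr", ylab = "tpr")
```

The curve starts at (0, 0), where no unit is classified as positive, and ends at (1, 1), where every unit is.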

Example with Default data

    library(ROCR)
    pred <- prediction(Default.LDA$pr.def.Yes,Default$default)
    perf <- performance(pred, measure = "tpr", x.measure = "fpr")
    plot(perf, colorize=TRUE,lwd=3)

AUC

In performance(), use measure="auc" to get the area under the ROC curve. With the option fpr.stop=0.5, ROCR returns the partial area under the curve, up to a false positive rate of 0.5.

    perf.2<-performance(pred,measure="auc",fpr.stop=0.5)

    ## [1]
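
The full AUC has a simple interpretation: it is the probability that a randomly chosen positive unit receives a higher score than a randomly chosen negative one. The sketch below computes it from ranks in base R on made-up data; auc_rank() is a hypothetical helper, not a ROCR function, and it returns the full AUC rather than a partial one.

```r
# Hypothetical observed classes and estimated probabilities of "Yes"
truth <- factor(c("No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
                levels = c("No", "Yes"))
p_yes <- c(0.05, 0.10, 0.15, 0.25, 0.30, 0.45, 0.18, 0.35, 0.60, 0.90)

# AUC via the Wilcoxon-Mann-Whitney rank statistic
auc_rank <- function(p, truth, positive = "Yes") {
  is_pos <- truth == positive
  n_pos  <- sum(is_pos)
  n_neg  <- sum(!is_pos)
  r <- rank(p)  # average ranks, so ties count one half
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_rank(p_yes, truth)  # 20 of the 24 positive-negative pairs are ordered correctly
```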

LOOCV Cross-validation

The CV = TRUE option in the lda() function runs a LOOCV on the data. In this case the output contains, for each unit left out of the data set in turn, the predicted class and the posterior probabilities.

To see only part of the output, it is convenient to put the results in a data.frame.

    lda.fit=lda(default~balance+income+student,data=Default,CV=TRUE)
    names(lda.fit)

    [1] "class"     "posterior" "terms"     "call"      "xlevels"

    lda.loocv<-data.frame(lda.fit$class,lda.fit$posterior)
    head(lda.loocv)

      lda.fit.class No Yes
    1            No
    2            No
    3            No
    4            No
    5            No
    6            No
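
What CV = TRUE does can be checked against an explicit leave-one-out loop. The sketch below uses simulated data (toy, loo_class and agreement are hypothetical names) and fixes the prior in both fits, on the assumption that otherwise the refits on n - 1 units would estimate slightly different priors than the single call with CV = TRUE.

```r
library(MASS)  # provides lda()

set.seed(3)
# Hypothetical toy data: two well separated classes
toy <- data.frame(
  y  = factor(rep(c("No", "Yes"), c(60, 40))),
  x1 = c(rnorm(60, mean = 0), rnorm(40, mean = 2.5)),
  x2 = c(rnorm(60, mean = 0), rnorm(40, mean = 2.5))
)
prior <- c(0.6, 0.4)  # same fixed prior in both approaches

# Built-in leave-one-out cross-validation
cv_fit <- lda(y ~ x1 + x2, data = toy, prior = prior, CV = TRUE)

# Explicit loop: drop unit i, refit, predict unit i
loo_class <- character(nrow(toy))
for (i in seq_len(nrow(toy))) {
  fit_i <- lda(y ~ x1 + x2, data = toy[-i, ], prior = prior)
  loo_class[i] <- as.character(predict(fit_i, toy[i, ])$class)
}

# Share of units on which the two approaches give the same class
agreement <- mean(loo_class == as.character(cv_fit$class))
```

The built-in option avoids refitting the model n times, which matters for a data set with 10000 units such as Default.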

Estimate of the test error rate with LOOCV

Based on the results of the LOOCV, we calculate a classification table (default = Yes if the probability of default estimated by the LDA model is greater than 0.2).

    lda.cv.pred.2=rep("No",nrow(Default))
    lda.cv.pred.2[lda.loocv$Yes>0.2]="Yes"
    addmargins(table(Default$default,lda.cv.pred.2))

         lda.cv.pred.2
          No Yes Sum
      No
      Yes
      Sum

QDA

Quadratic discriminant analysis can be performed using the function qda().

    qda.fit<-qda(default~balance+income+student,data=Default)
    qda.fit

    Call:
    qda(default ~ balance + income + student, data = Default)

    Prior probabilities of groups:
     No Yes

    Group means:
        balance income studentYes
    No
    Yes

LOOCV for QDA

As for LDA, a LOOCV can be performed simply by adding the CV = TRUE option.

    qda.fit<-qda(default~balance+income+student,data=Default,CV=TRUE)
    names(qda.fit)

    [1] "class"     "posterior" "terms"     "call"      "xlevels"

    qda.loocv<-data.frame(qda.fit$class,qda.fit$posterior)
    head(qda.loocv)

      qda.fit.class No Yes
    1            No
    2            No
    3            No
    4            No
    5            No
    6            No

    qda.cv.pred.2=rep("No",nrow(Default))
    qda.cv.pred.2[qda.loocv$Yes>0.2]="Yes"
    addmargins(table(Default$default,qda.cv.pred.2))

         qda.cv.pred.2
          No Yes Sum
      No
      Yes
      Sum

LDA - QDA comparison

Using the results obtained, we compare LDA and QDA through the estimates of the test error rate obtained by LOOCV, with the classification rule: default = Yes if the probability of default estimated by the LDA (or QDA) model is greater than 0.2.

                  LDA                   QDA
    Sensitivity   57.96% (193/333)      63.66% (212/333)
    Specificity   97.6% (9435/9667)     96.58% (9336/9667)
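
The percentages follow directly from the counts in the two LOOCV classification tables. The short sketch below redoes the arithmetic and also derives the overall error rate implied by the same counts; the object names are hypothetical.

```r
# Counts reported in the comparison (333 defaulters, 9667 non-defaulters)
n_yes <- 333
n_no  <- 9667

# Correctly classified units in each group
lda_tp <- 193; lda_tn <- 9435
qda_tp <- 212; qda_tn <- 9336

sens <- round(100 * c(LDA = lda_tp, QDA = qda_tp) / n_yes, 2)
spec <- round(100 * c(LDA = lda_tn, QDA = qda_tn) / n_no, 2)

# Overall error rate: misclassified units over all units, in percent
err <- round(100 * c(LDA = (n_yes - lda_tp) + (n_no - lda_tn),
                     QDA = (n_yes - qda_tp) + (n_no - qda_tn)) / (n_yes + n_no), 2)

sens  # LDA 57.96, QDA 63.66
spec  # LDA 97.60, QDA 96.58
err   # LDA  3.72, QDA  4.52
```

With this threshold QDA detects more defaulters, while LDA makes fewer errors overall because non-defaulters are by far the larger group.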

Compare LDA - QDA with the ROC curve

    library(ROCR)
    lda.pred <- prediction(lda.loocv$Yes, Default$default)
    lda.perf <- performance(lda.pred, measure = "tpr", x.measure = "fpr")
    qda.pred <- prediction(qda.loocv$Yes, Default$default)
    qda.perf <- performance(qda.pred, measure = "tpr", x.measure = "fpr")
    plot(lda.perf, col="blue",lwd=4)
    plot(qda.perf, col="red",lwd=4,add=TRUE)
    legend(0.6,0.6,c('lda','qda'),col=c('blue','red'),lwd=3)

Comparison with logistic regression

The logistic regression model:

    glm.fit<-glm(default~., data=Default, family=binomial)
    summary(glm.fit)

    ##
    ## Call:
    ## glm(formula = default ~ ., family = binomial, data = Default)
    ##
    ## Deviance Residuals:
    ##  Min 1Q Median 3Q Max
    ##
    ##
    ## Coefficients:
    ##             Estimate Std. Error z value Pr(>|z|)
    ## (Intercept) e e < 2e-16 ***
    ## studentYes  e e **
    ## balance     5.737e e < 2e-16 ***
    ## income      3.033e e
    ## ---
    ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## (Dispersion parameter for binomial family taken to be 1)
    ##
    ## Null deviance: on 9999 degrees of freedom
    ## Residual deviance: on 9996 degrees of freedom
    ## AIC:
    ##
    ## Number of Fisher Scoring iterations: 8

Cross-validation for logistic regression

We cross-validate the logistic regression (LGR) model and compare it to LDA and QDA using the ROC curves.

The cv.glm() function of the boot library (seen in previous exercises) directly provides an estimate of the test error using LOOCV or k-fold CV.

To calculate the ROC curve we need the predicted probabilities. The code below calculates the probabilities of default using a LOOCV. The probabilities are stored in the glm.loocv vector.

    # One predicted probability per unit
    glm.loocv=numeric(nrow(Default))
    for(i in 1:nrow(Default)){
      # Fit the model without unit i
      T.Def=Default[-i,]
      T.glm.fit=glm(default~.,data=T.Def,family=binomial)
      # Predict the probability of default for the excluded unit
      glm.loocv[i]=predict(T.glm.fit,Default[i,],type="response")
    }

To build the ROC curve use

    library(ROCR)
    glm.pred <- prediction(glm.loocv, Default$default)
    glm.perf <- performance(glm.pred, measure = "tpr", x.measure = "fpr")
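
The same leave-one-out loop can be tried quickly on a small simulated data set before running it on all 10000 units of Default, where it requires 10000 model fits. This is a sketch under that assumption; toy and loo_prob are hypothetical names.

```r
set.seed(4)
n <- 60
# Hypothetical toy data: one predictor, binary response generated from a logistic model
toy <- data.frame(x = rnorm(n))
toy$y <- factor(ifelse(runif(n) < plogis(-0.5 + 1.5 * toy$x), "Yes", "No"),
                levels = c("No", "Yes"))

# Leave-one-out predicted probabilities of "Yes"
loo_prob <- numeric(n)
for (i in seq_len(n)) {
  fit_i <- glm(y ~ x, data = toy[-i, ], family = binomial)
  loo_prob[i] <- predict(fit_i, newdata = toy[i, , drop = FALSE], type = "response")
}

summary(loo_prob)
```

Because the response is a factor with levels No and Yes, glm() models the probability of the second level, so loo_prob plays the same role as the Yes column of the LDA and QDA posteriors.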

Compare LDA - QDA - LGR with ROCs

    lda.pred <- prediction(lda.loocv$Yes, Default$default)
    lda.perf <- performance(lda.pred, measure = "tpr", x.measure = "fpr")
    qda.pred <- prediction(qda.loocv$Yes, Default$default)
    qda.perf <- performance(qda.pred, measure = "tpr", x.measure = "fpr")
    plot(lda.perf, col="blue",lwd=4)
    plot(qda.perf, col="red",lwd=4,add=TRUE)
    plot(glm.perf, col="green",lwd=4,add=TRUE)
    legend(0.6,0.6,c('lda','qda','lgr'),col=c('blue','red','green'),lwd=3)

AUC

The AUCs of the three models, calculated on the LOOCV cross-validated predictions, are:

    LDA:
    QDA:
    LGR:

In this case the three models are equivalent. The choice of which model to use can be based on purely personal considerations.

References

G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning, with applications in R. Springer, 2013.

Tobias Sing, Oliver Sander, Niko Beerenwinkel, Thomas Lengauer. ROCR: visualizing classifier performance in R. Bioinformatics 21(20) (2005).