Discriminant analysis in R QMMA

Emanuele Taufer

Default data

Get the data set Default:

library(ISLR)
data(Default)
head(Default)

  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559

LDA with R

The lda() function, in the MASS library, fits classification models with LDA. The syntax is similar to that of the lm() and glm() functions.

library(MASS)
lda.fit=lda(default~balance+income+student,data=Default)

Output

lda.fit

Call:
lda(default ~ balance + income + student, data = Default)

Prior probabilities of groups:
    No    Yes
0.9667 0.0333

Group means:
      balance   income studentYes
No   803.9438 33566.17  0.2914037
Yes 1747.8217 32089.15  0.3813814

Coefficients of linear discriminants:
                     LD1
balance     2.243541e-03
income      3.367310e-06
studentYes -1.746631e-01

Output analysis

Prior probabilities of groups: estimates of the probabilities π_k, k = 1, 2 (the relative frequencies of the two groups in the data).

Group means: estimates of the conditional means of the predictors within each group. For example, those who did not default have an average balance of 803.94 (dollars), while those who defaulted have an average balance of 1747.82. Note also that, because student is a qualitative variable, the percentage of students among those who defaulted is approximately 38%.

Coefficients of linear discriminants: the linear combination (LD1) used as discriminant. The coefficients indicate the importance of the predictors in the classification: balance has a positive effect on LD1, larger than that of income, while being a student decreases the value of LD1. Compare the coefficients of LD1 with those obtained in the logistic regression.
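As a quick check on how LD1 is built, the sketch below recomputes the discriminant scores from the scaling vector, assuming (as MASS does internally) that the predictors are centered at the prior-weighted average of the group means; the object names X.train, ctr and LD1.manual are illustrative and not part of the original lab.

# Recompute LD1 by hand (illustrative sketch)
X.train <- model.matrix(default ~ balance + income + student, data = Default)[, -1]
ctr <- colSums(lda.fit$prior * lda.fit$means)          # prior-weighted centroid
LD1.manual <- scale(X.train, center = ctr, scale = FALSE) %*% lda.fit$scaling
head(cbind(LD1.manual, predict(lda.fit, Default)$x))   # the two columns should coincide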


Plot

The distribution of LD1 for the two classification groups: higher values of LD1 correspond to a greater probability of default.

plot(lda.fit)

Prediction

The classification of training (or test) units can be done with the predict() function. The output of predict() contains a series of objects: we use the names() function to see what they are and, in order to analyze and use them, we put everything in a data.frame.

lda.pred=predict(lda.fit, Default)
names(lda.pred)

[1] "class"     "posterior" "x"

previsione<-as.data.frame(lda.pred)
head(previsione)

  class posterior.No posterior.Yes         LD1
1    No    0.9967765   0.003223517 -0.14953711
2    No    0.9973105   0.002689531 -0.23615933
3    No    0.9852914   0.014708600  0.57988235
4    No    0.9988157   0.001184329 -0.62801554
5    No    0.9959768   0.004023242 -0.04346935
6    No    0.9957918   0.004208244 -0.02194120

Classification table

Let's build a data.frame, Default.LDA, which contains the training data and the LDA prediction. We also build a classification table.

Default.LDA<-cbind(Default, "pred.def"=lda.pred$class, "pr.def"=round(lda.pred$posterior,4))
head(Default.LDA)

  default student   balance    income pred.def pr.def.No pr.def.Yes
1      No      No  729.5265 44361.625       No    0.9968     0.0032
2      No     Yes  817.1804 12106.135       No    0.9973     0.0027
3      No      No 1073.5492 31767.139       No    0.9853     0.0147
4      No      No  529.2506 35704.494       No    0.9988     0.0012
5      No      No  785.6559 38463.496       No    0.9960     0.0040
6      No     Yes  919.5885  7491.559       No    0.9958     0.0042

attach(Default.LDA)
addmargins(table(default,pred.def))

         pred.def
default     No   Yes   Sum
    No    9645    22  9667
    Yes    254    79   333
    Sum   9899   101 10000
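From this table we can also extract the overall training error rate; a short sketch (the object name tab is ours, not in the original):

tab <- table(Default.LDA$default, Default.LDA$pred.def)
1 - sum(diag(tab))/sum(tab)   # overall misclassification rate: (22 + 254)/10000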

Change the classification criterion

Let's build a new variable, lda.pred.2, that classifies units as default = Yes if the default probability estimated by the LDA model is greater than 0.2.

# Build a vector of "No"
lda.pred.2=rep("No",nrow(Default))
# Change to "Yes" the elements of that vector for which P(Yes)>0.2
lda.pred.2[Default.LDA$pr.def.Yes>0.2]="Yes"
# Build the classification table
addmargins(table(default,lda.pred.2))

         lda.pred.2
default     No   Yes   Sum
    No    9435   232  9667
    Yes    140   193   333
    Sum   9575   425 10000

Build the ROC curve

To create the ROC curve it is necessary to install the ROCR library. Steps:

1. Given a classification rule, start by creating a prediction object. This function is used to transform the input data.

prediction.obj = prediction(predictions, labels)

predictions: the predicted probability that a unit is positive.
labels: the classification observed on the data (test or training). It should be provided as a factor, with the lower level corresponding to the negative class and the upper level to the positive class.

2. Calculate a performance measure.

perf.obj=performance(prediction.obj, measure, x.measure)

measure: the performance measure to use, e.g.
`fpr` - false positive rate
`tpr` - true positive rate
`auc` - area under the ROC curve
A complete list of performance measures can be found in the R help of the ROCR package.
x.measure: a second performance measure (used for the horizontal axis).

3. To get the ROC curve, plot a perf.obj built with measure="tpr" and x.measure="fpr".

Example with Default data

library(ROCR)
pred <- prediction(Default.LDA$pr.def.Yes, Default$default)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE, lwd=3)

AUC

In performance() use measure="auc"; the option fpr.stop=0.5 restricts the calculation to false positive rates below 0.5, giving a partial AUC whose maximum possible value is 0.5. Here the partial AUC is 0.4499787.

perf.2<-performance(pred,measure="auc",fpr.stop=0.5)
perf.2@y.values[[1]][1]

## [1] 0.4499787
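For comparison, calling performance() without fpr.stop gives the usual AUC over the whole range of false positive rates (value not reported in the original slides):

perf.auc <- performance(pred, measure="auc")
perf.auc@y.values[[1]]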

LOOCV Cross-validation

The CV = TRUE option in the lda() function performs a leave-one-out cross-validation (LOOCV) on the data. In this case the output contains the classification produced and the posterior probabilities for each unit, each computed with that unit excluded from the data set. To see only a part of the output, it is convenient to put the results in a data.frame.

lda.fit=lda(default~balance+income+student,data=Default,CV=TRUE)
names(lda.fit)

[1] "class"     "posterior" "terms"     "call"      "xlevels"

lda.loocv<-data.frame(lda.fit$class,lda.fit$posterior)
head(lda.loocv)

  lda.fit.class        No         Yes
1            No 0.9967755 0.003224456
2            No 0.9973093 0.002690741
3            No 0.9852851 0.014714856
4            No 0.9988154 0.001184632
5            No 0.9959757 0.004024296
6            No 0.9957890 0.004210951
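A quick one-line summary of the LOOCV output (not shown in the original slides) is the misclassification rate at the default 0.5 cut-off, obtained by comparing the cross-validated classes with the observed ones:

mean(lda.fit$class != Default$default)   # LOOCV error rate with the 0.5 cut-off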

Estimate of the test error rate with LOOCV

We calculate a classification table (default = Yes if the probability of default estimated by the LDA model is greater than 0.2) based on the results of the LOOCV.

lda.cv.pred.2=rep("No",nrow(Default))
lda.cv.pred.2[lda.loocv$Yes>0.2]="Yes"
addmargins(table(Default$default,lda.cv.pred.2))

       lda.cv.pred.2
           No   Yes   Sum
  No     9435   232  9667
  Yes     140   193   333
  Sum    9575   425 10000

QDA

Quadratic discriminant analysis can be performed using the function qda().

qda.fit<-qda(default~balance+income+student,data=Default)
qda.fit

Call:
qda(default ~ balance + income + student, data = Default)

Prior probabilities of groups:
    No    Yes
0.9667 0.0333

Group means:
      balance   income studentYes
No   803.9438 33566.17  0.2914037
Yes 1747.8217 32089.15  0.3813814
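As with LDA, the predict() function can be applied to the QDA fit; a minimal sketch of the resulting training-set classification table (not part of the original slides; the object name qda.train.pred is ours):

qda.train.pred <- predict(qda.fit, Default)
addmargins(table(Default$default, qda.train.pred$class))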

LOOCV for QDA

Similarly to what we have seen for LDA, a LOOCV can be performed simply by inserting the CV = TRUE option.

qda.fit<-qda(default~balance+income+student,data=Default,CV=TRUE)
names(qda.fit)

[1] "class"     "posterior" "terms"     "call"      "xlevels"

qda.loocv<-data.frame(qda.fit$class,qda.fit$posterior)
head(qda.loocv)

  qda.fit.class        No          Yes
1            No 0.9993994 0.0006005886
2            No 0.9994829 0.0005170994
3            No 0.9900671 0.0099329377
4            No 0.9998886 0.0001113890
5            No 0.9989423 0.0010577454
6            No 0.9986900 0.0013100168

qda.cv.pred.2=rep("No",nrow(Default))
qda.cv.pred.2[qda.loocv$Yes>0.2]="Yes"
addmargins(table(Default$default,qda.cv.pred.2))

       qda.cv.pred.2
           No   Yes   Sum
  No     9336   331  9667
  Yes     121   212   333
  Sum    9457   543 10000

LDA - QDA comparison

Using the results obtained, we compare LDA and QDA through the estimates of the test error rate obtained by LOOCV, with classification rule: default = Yes if the probability of default estimated by the LDA (or QDA) model is greater than 0.2.

LDA:
Sensitivity is 57.96% (193/333)
Specificity is 97.60% (9435/9667)

QDA:
Sensitivity is 63.66% (212/333)
Specificity is 96.58% (9336/9667)
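These figures can be reproduced directly from the two LOOCV classification tables; a short sketch (the object names lda.tab and qda.tab are ours):

lda.tab <- table(Default$default, lda.cv.pred.2)
qda.tab <- table(Default$default, qda.cv.pred.2)
c(sens = lda.tab["Yes","Yes"]/sum(lda.tab["Yes",]),  # LDA: 193/333
  spec = lda.tab["No","No"]/sum(lda.tab["No",]))     # LDA: 9435/9667
c(sens = qda.tab["Yes","Yes"]/sum(qda.tab["Yes",]),  # QDA: 212/333
  spec = qda.tab["No","No"]/sum(qda.tab["No",]))     # QDA: 9336/9667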

Compare LDA - QDA with the ROC curve

library(ROCR)
lda.pred <- prediction(lda.loocv$Yes, Default$default)
lda.perf <- performance(lda.pred, measure = "tpr", x.measure = "fpr")
qda.pred <- prediction(qda.loocv$Yes, Default$default)
qda.perf <- performance(qda.pred, measure = "tpr", x.measure = "fpr")
plot(lda.perf, col="blue",lwd=4)
plot(qda.perf, col="red",lwd=4,add=TRUE)
legend(0.6,0.6,c('LDA','QDA'),col=c('blue','red'),lwd=3)

Comparison with logistic regression

The logistic regression model:

glm.fit<-glm(default~., data=Default, family=binomial)
summary(glm.fit)

## 
## Call:
## glm(formula = default ~ ., family = binomial, data = Default)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4691  -0.1418  -0.0557  -0.0203   3.7383  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
## studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
## balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
## income       3.033e-06  8.203e-06   0.370  0.71152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2920.6  on 9999  degrees of freedom
## Residual deviance: 1571.5  on 9996  degrees of freedom
## AIC: 1579.5
## 
## Number of Fisher Scoring iterations: 8

Cross-validation for logistic regression

We cross-validate the logistic regression model and compare it to LDA and QDA using the ROC curves.

The cv.glm() function of the boot library (seen in previous exercises) directly provides an estimate of the test error using LOOCV or k-fold CV. To build the ROC curve, however, we need the cross-validated predicted probabilities. The code below calculates the probabilities of default using a LOOCV; the probabilities are stored in the glm.loocv vector.

glm.loocv=vector()
#for(i in 1:nrow(Default)){
for(i in 1:10000){
  T.Def=Default[-i,]
  T.glm.fit=glm(default~.,data=T.Def,family=binomial)
  glm.loocv[i]=predict(T.glm.fit,Default[i,],type="response")
}

To build the ROC curve use:

library(ROCR)
glm.pred <- prediction(glm.loocv, Default$default)
glm.perf <- performance(glm.pred, measure = "tpr", x.measure = "fpr")


Compare LDA - QDA - LGR with ROCs

lda.pred <- prediction(lda.loocv$Yes, Default$default)
lda.perf <- performance(lda.pred, measure = "tpr", x.measure = "fpr")
qda.pred <- prediction(qda.loocv$Yes, Default$default)
qda.perf <- performance(qda.pred, measure = "tpr", x.measure = "fpr")
plot(lda.perf, col="blue",lwd=4)
plot(qda.perf, col="red",lwd=4,add=TRUE)
plot(glm.perf, col="green",lwd=4,add=TRUE)
legend(0.6,0.6,c('LDA','QDA','LGR'),col=c('blue','red','green'),lwd=3)

AUC

The partial AUCs (fpr.stop = 0.5, as above) for the three models, calculated on the LOOCV probabilities, are:

LDA: 0.4495681
QDA: 0.4487009
LGR: 0.4493764

In this case the three models are essentially equivalent. The choice of which model to use can therefore be based on considerations other than predictive performance.
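A sketch that should reproduce these three values from the prediction objects built above, using the same fpr.stop = 0.5 criterion as in the earlier AUC slide:

sapply(list(LDA = lda.pred, QDA = qda.pred, LGR = glm.pred),
       function(p) performance(p, measure="auc", fpr.stop=0.5)@y.values[[1]])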

References

G. James, D. Witten, T. Hastie, R. Tibshirani. An Introduction to Statistical Learning, with Applications in R. Springer, 2013.

Tobias Sing, Oliver Sander, Niko Beerenwinkel, Thomas Lengauer. ROCR: visualizing classifier performance in R. Bioinformatics 21(20):3940-3941 (2005).