Unit 5 Logistic Regression Practice Problems


SOLUTIONS - R Users

Source: Afifi A., Clark VA and May S. Computer Aided Multivariate Analysis, Fourth Edition. Boca Raton: Chapman and Hall, 2004.

Exercises #1-#3 use a data set provided by Afifi, Clark and May (2004). The data come from a longitudinal study of depression. The purpose of the study was to obtain estimates of the prevalence and incidence of depression and to explore its risk factors. The study variables were of several types: demographics, life events, stressors, physical health, health services utilization, medication use, lifestyle, and social support.

Before You Begin

Download the R data sets depress.rdata and illeetvilaine.rdata from the course website page.

R Preliminary (ONE TIME): Install Packages as Needed, from the console ONLY

    # install.packages("summarytools")
    # install.packages("gmodels")
    # install.packages("DescTools")
    # install.packages("lmtest")
    # install.packages("ResourceSelection")
    # install.packages("generalhoslem")
    # install.packages("epiR")
    # install.packages("pROC")

Exercises #1-3 use the data set depress.rdata. Consider the following three variables.

    Variable   Codings                               Label
    drink      1 = yes, 2 = no                       Regular Drinker
    sex        1 = male, 2 = female
    cases      0 = Normal, 1 = Case of Depression    Depressed is cesd > 16

sol_logistic R Users.docx

Preliminaries - depress.rdata

    setwd("/users/cbigelow/desktop")
    load(file="depress.rdata")
    options(scipen=1000)
    options(signif.stars=FALSE)
    depress$sex <- factor(depress$sex, levels=c(1,2), labels=c("1=Male", "2=Female"))
    depress$drink <- factor(depress$drink, levels=c(1,2), labels=c("1=Yes", "2=No"))

1. Source: Afifi A., Clark VA and May S. Computer Aided Multivariate Analysis, Fourth Edition. Boca Raton: Chapman and Hall, 2004, Problem 12.9, page 330.

Complete the following table by supplying the cell frequencies.

                            Sex
    Regular Drinker    Male    Female    Total
    Yes
    No
    Total

R Solution

    library(summarytools)
    library(DescTools)
    summarytools::ctable(depress$sex, depress$drink, prop = 'n', totals = TRUE)

    ## Cross-Tabulation
    ## Variables: sex * drink
    ## Data Frame: depress
    ##
    ##              drink   1=Yes   2=No   Total
    ##        sex
    ##     1=Male              95     16     111
    ##   2=Female             139     44     183
    ##      Total             234     60     294

    paste("Relative Odds (OR) Drinker for Males relative to Females")
    ## [1] "Relative Odds (OR) Drinker for Males relative to Females"

    DescTools::OddsRatio(depress$drink, depress$sex, method="wald", conf.level=0.95)
    ## odds ratio     lwr.ci     upr.ci
    ##

What are the odds that a man is a regular drinker? 95 / 16 = 5.938

What are the odds that a woman is a regular drinker? 139 / 44 = 3.159

What is the odds ratio? That is, what are the odds that a man, relative to a woman, is a regular drinker?

    OR = [odds for man] / [odds for woman] = 5.938 / 3.159 = 1.88

2. Repeat the tabulation that you produced for problem #1 two times, once for persons who are depressed and once for persons who are not depressed.

R Solution

    library(summarytools)
    library(DescTools)

    # PERSONS WHO ARE DEPRESSED
    subset_depressed <- subset(depress, cases==1)
    paste("Subset - DEPRESSED")
    ## [1] "Subset - DEPRESSED"

    summarytools::ctable(subset_depressed$sex, subset_depressed$drink, prop = 'n', totals = TRUE)

    ## Cross-Tabulation
    ## Variables: sex * drink
    ## Data Frame: subset_depressed
    ##
    ##              drink   1=Yes   2=No   Total
    ##        sex
    ##     1=Male               8      2      10
    ##   2=Female              33      7      40
    ##      Total              41      9      50

    paste("DEPRESSED: Relative Odds (OR) Drinker for Males relative to Females")
    ## [1] "DEPRESSED: Relative Odds (OR) Drinker for Males relative to Females"

    DescTools::OddsRatio(subset_depressed$drink, subset_depressed$sex, method="wald", conf.level=0.95)
    ## odds ratio     lwr.ci     upr.ci
    ##

    OR = [odds for man] / [odds for woman] = (8/2)/(33/7) = 0.85

    # PERSONS WHO ARE NOT DEPRESSED
    subset_normal <- subset(depress, cases==0)
    paste("Subset - NORMAL")
    ## [1] "Subset - NORMAL"

    summarytools::ctable(subset_normal$sex, subset_normal$drink, prop = 'n', totals = TRUE)

    ## Cross-Tabulation
    ## Variables: sex * drink
    ## Data Frame: subset_normal
    ##
    ##              drink   1=Yes   2=No   Total
    ##        sex
    ##     1=Male              87     14     101
    ##   2=Female             106     37     143
    ##      Total             193     51     244

    paste("NORMAL: Relative Odds (OR) Drinker for Males relative to Females")
    ## [1] "NORMAL: Relative Odds (OR) Drinker for Males relative to Females"

    DescTools::OddsRatio(subset_normal$drink, subset_normal$sex, method="wald", conf.level=0.95)
    ## odds ratio     lwr.ci     upr.ci
    ##

    OR = [odds for man] / [odds for woman] = (87/14)/(106/37) = 2.17
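The hand calculation for the overall table in Exercise 1 can also be sketched in base R, without any packages. The counts below are the ones quoted in the hand calculations (95/16 for men, 139/44 for women); the Wald interval uses the usual standard error on the log scale, which is assumed here to be what method="wald" reports.

```r
# 2x2 counts taken from the hand calculations for Exercise 1
# rows: sex (male, female); columns: regular drinker (yes, no)
tab <- matrix(c(95, 16,
                139, 44),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("male", "female"),
                              drink = c("yes", "no")))

# Odds of regular drinking within each sex, then their ratio
odds_male   <- tab["male", "yes"] / tab["male", "no"]
odds_female <- tab["female", "yes"] / tab["female", "no"]
or_hat      <- odds_male / odds_female

# Wald 95% CI: log(OR) +/- z * sqrt(sum of reciprocal cell counts)
se_log_or <- sqrt(sum(1 / tab))
ci <- exp(log(or_hat) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(c(odds_male = odds_male, odds_female = odds_female, OR = or_hat), 3)
round(ci, 3)
```

Replacing the four counts with those of a stratum (for example 8, 2, 33, 7 for depressed persons) gives the corresponding stratum-specific odds ratio.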

3. Fit a logistic regression model using these variables. Use DRINK as the dependent variable and CASES and FEMALE SEX as independent variables. Also include as an independent variable the appropriate interaction term. Is the interaction term in your model significant?

R Solution

    library(summarytools)

    # Create 0/1 indicators of regular drinker (drink01) and female gender (female01)
    # Create interaction of depression case and female gender

    # KEY: when drink=1 then drink01=1, otherwise drink01=0
    depress$drink01 <- ifelse(as.numeric(depress$drink)==1, 1, 0)
    summarytools::ctable(depress$drink, depress$drink01, prop = 'n', totals = TRUE)

    ## Cross-Tabulation
    ## Variables: drink * drink01
    ## Data Frame: depress
    ##
    ##            drink01     0     1   Total
    ##    drink
    ##    1=Yes               0   234     234
    ##     2=No              60     0      60
    ##    Total              60   234     294

    # KEY: when sex=2 then female01=1, otherwise female01=0
    depress$female01 <- ifelse(as.numeric(depress$sex)==2, 1, 0)
    summarytools::ctable(depress$sex, depress$female01, prop = 'n', totals = TRUE)

    ## Cross-Tabulation
    ## Variables: sex * female01
    ## Data Frame: depress
    ##
    ##             female01     0     1   Total
    ##        sex
    ##     1=Male             111     0     111
    ##   2=Female               0   183     183
    ##      Total             111   183     294

    # Interaction fem_case = (female01)*(cases)
    depress$fem_case <- depress$female01*depress$cases

    # Fit logistic regression model: Y=drink01, X1=cases, X2=female01, X3=interaction
    # glm(yvariable ~ X1 + X2 + ... + Xp, data=dataframename, family=binomial)
    model_q3 <- glm(drink01 ~ cases + female01 + fem_case, data=depress, family=binomial)
    summary(model_q3)

    Call:
    glm(formula = drink01 ~ cases + female01 + fem_case, family = binomial, data = depress)

    Deviance Residuals:
    Min   1Q   Median   3Q   Max

    Coefficients:
                  Estimate   Std. Error   z value   Pr(>|z|)
    (Intercept)                                     e-10 ***
    cases
    female01                                        *
    fem_case                                        (Wald statistic p-value)
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

    Null deviance:      on 293 degrees of freedom
    Residual deviance:  on 290 degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 4

    # Odds Ratios & 95% CI
    exp(cbind(OR=coef(model_q3), confint(model_q3)))

                  OR   2.5 %   97.5 %
    (Intercept)
    cases
    female01
    fem_case

    # LR Test of Interaction (Null: beta for interaction=0)
    # Illustration in some detail
    # Fit reduced model, fit full model, perform LR test
    library(lmtest)
    reduced <- glm(drink01 ~ cases + female01, data=depress, family=binomial)
    full <- glm(drink01 ~ cases + female01 + fem_case, data=depress, family=binomial)

    lrtest(reduced, full)

    Likelihood ratio test

    Model 1: drink01 ~ cases + female01
    Model 2: drink01 ~ cases + female01 + fem_case
      #Df   LogLik   Df   Chisq   Pr(>Chisq)

In this sample, there is no statistically significant evidence that the odds of regular drinking are associated with an interaction of gender and case of depression (LR test, df=1, p-value = .3493). This is close to the Wald statistic p-value = .327 that can be seen in the coefficients table.

Fitted Model:

    logit[ Pr(drinker=yes) ] = β̂0 + β̂1*cases + β̂2*female01 + β̂3*fem_case

where the estimates are those in the coefficients table and

    cases    = 1 if depressed; 0 otherwise
    female01 = 1 if female; 0 otherwise
    fem_case = (cases) * (female01)

Is the interaction term in your model significant? No. For the interaction coefficient β̂3, SE(β̂3) = 0.96 and the Wald p-value = .327.

4. Source: Kleinbaum, Kupper, Miller, and Nizam. Applied Regression Analysis and Other Multivariable Methods, Third Edition. Pacific Grove: Duxbury Press, p 683 (problem 2).

A five year follow-up study on 600 disease free subjects was carried out to assess the effect of a 0/1 exposure E on the development (or not) of a certain disease. The variables AGE (continuous) and obesity status (OBS), the latter a 0/1 variable, were determined at the start of the follow-up and were to be considered as control variables in analyzing the data.

(a) State the logit form of a logistic regression model that assesses the effect of the 0/1 exposure variable E controlling for the confounding effects of AGE and OBS and the interaction effects of AGE with E and OBS with E.

Solution:

    logit[π] = β0 + β1*E + β2*AGE + β3*OBS + β4*AGEE + β5*OBSE

I used the following notation:

    π    = Probability[disease]
    AGEE = AGE * E. This is a created variable that is the interaction of AGE with E.
    OBSE = OBS * E. Similarly, this is the interaction of OBS with E.

(b) Given the model you have for #4a, give a formula for the odds ratio for the exposure-disease relationship that controls for the confounding and interactive effects of AGE and OBS.

Solution: The solution here follows the ideas in Lecture Notes 5, Logistic Regression, pp 9-11.

                 Value of Predictor for Person who is
    Predictor    Exposed    Not Exposed
    E            1          0
    AGE          AGE1       AGE0
    OBS          OBS1       OBS0
    AGEE         AGE1       0
    OBSE         OBS1       0

Then

    OR = exp { logit[π for exposed person] - logit[π for NON exposed person] }
       = exp { [β0 + β1 + β2*AGE1 + β3*OBS1 + β4*AGE1 + β5*OBS1] - [β0 + β2*AGE0 + β3*OBS0] }
       = exp { β1 + β2*(AGE1 - AGE0) + β3*(OBS1 - OBS0) + β4*AGE1 + β5*OBS1 }

When the two persons being compared have the same AGE and the same OBS, the β2 and β3 terms drop out and OR = exp { β1 + β4*AGE + β5*OBS }.
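The formula in (b) can be checked numerically with a short R sketch. The coefficient values below are hypothetical, chosen only for illustration (the exercise supplies no estimates), and the helper names logit_pi and or_exposure are not from the original. The sketch assumes both persons share the same AGE and OBS, and evaluates at, say, AGE = 40 and OBS = 1.

```r
# Hypothetical coefficient values, for illustration only (not estimates from any data)
beta <- c(b0 = -3.0, E = 0.8, AGE = 0.03, OBS = 0.4, AGEE = 0.01, OBSE = -0.2)

# logit of the disease probability under the model stated in (a)
logit_pi <- function(E, AGE, OBS, beta) {
  unname(beta["b0"] + beta["E"] * E + beta["AGE"] * AGE + beta["OBS"] * OBS +
           beta["AGEE"] * AGE * E + beta["OBSE"] * OBS * E)
}

# Odds ratio for exposed vs. not exposed, holding AGE and OBS equal in both persons
or_exposure <- function(AGE, OBS, beta) {
  exp(logit_pi(1, AGE, OBS, beta) - logit_pi(0, AGE, OBS, beta))
}

# Difference-of-logits version
or_exposure(AGE = 40, OBS = 1, beta)

# Closed form exp(b1 + b4*AGE + b5*OBS); should give the same number
exp(unname(beta["E"] + beta["AGEE"] * 40 + beta["OBSE"] * 1))
```

Because of the interaction terms, the odds ratio is not a single number: calling or_exposure() with a different AGE or OBS returns a different value.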

(c) Now use the formula that you have for #4b to write an expression for the estimated odds ratio for the exposure-disease relationship that considers both confounding and interaction when AGE=40 and OBS=1.

Solution:

    estimated OR = exp { β̂1 + (40)β̂4 + β̂5 }

                 Value of Predictor for Person who is
    Predictor    Exposed    Not Exposed
    E            1          0
    AGE          40         40
    OBS          1          1
    AGEE         40         0
    OBSE         1          0

    OR = exp { β1 + β2*(40-40) + β3*(1-1) + β4*40 + β5*1 }
       = exp { β1 + β4*40 + β5*1 }

Exercises #5-9 use the data set illeetvilaine.rdata. Consider the following three variables.

    Variable   Codings                                                   Label in STATA
    case       1 = yes, 0 = no                                           Case status
    agegp      1=25-34, 2=35-44, 3=45-54, 4=55-64, 5=65-74, 6=75+        Age, grouped
    tobgp      1=0-9 g/day, 2=10-19, 3=20-29, 4=30+                      Tobacco use, grouped

5. Understanding the Global Test of the Fitted Model: In this exercise, you will see that the global chi square test of the current model is equivalent to a likelihood ratio (LR) test that compares the reduced model containing the intercept only with the full model, that is, the current model. In developing your answer, include as predictors in your full model indicator variables for agegp and tobgp.

Tip: How to Fit the Intercept Only Model. R Users: Fit the dependent variable to the constant 1.
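The tip and the claimed equivalence can be sketched on simulated data before turning to the real data set. Everything below (the data frame sim, the variables x and y, the seed) is made up for illustration. For a 0/1 outcome the deviance equals -2 times the log likelihood, so the three versions of the statistic should coincide.

```r
set.seed(1)                       # simulated data, for illustration only
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 0.8 * x))
sim <- data.frame(y = y, x = x)

# Intercept-only (reduced) model: regress the outcome on the constant 1
reduced <- glm(y ~ 1, data = sim, family = binomial)
# Full model containing the predictor
full <- glm(y ~ x, data = sim, family = binomial)

# LR statistic computed three ways
lr_deviance <- reduced$deviance - full$deviance                   # difference in deviances
lr_loglik   <- as.numeric(2 * (logLik(full) - logLik(reduced)))   # difference in -2 log L
lr_global   <- full$null.deviance - full$deviance                 # global test from full model alone

# Degrees of freedom = number of predictors added to the intercept
df <- reduced$df.residual - full$df.residual
p_value <- pchisq(lr_deviance, df = df, lower.tail = FALSE)

c(lr_deviance = lr_deviance, lr_loglik = lr_loglik, lr_global = lr_global,
  df = df, p_value = p_value)
```

The third version shows why the test is called global: the null deviance stored in the full model is already the deviance of the intercept-only model.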

R Solution

    # Input illeetvilaine.rdata from desktop
    setwd("/users/cbigelow/desktop/")
    load(file="illeetvilaine.rdata")
    ille <- illeetvilaine

    # Create indicators for agegp (REFERENT is age 25-34)
    ille$age_3544 <- ifelse(as.numeric(ille$agegp)==2, 1, 0)
    ille$age_4554 <- ifelse(as.numeric(ille$agegp)==3, 1, 0)
    ille$age_5564 <- ifelse(as.numeric(ille$agegp)==4, 1, 0)
    ille$age_6574 <- ifelse(as.numeric(ille$agegp)==5, 1, 0)
    ille$age_75p <- ifelse(as.numeric(ille$agegp)==6, 1, 0)

    # Create indicators for tobgp (REFERENT is tob 0-9)
    ille$tob_1019 <- ifelse(as.numeric(ille$tobgp)==2, 1, 0)
    ille$tob_2029 <- ifelse(as.numeric(ille$tobgp)==3, 1, 0)
    ille$tob_30p <- ifelse(as.numeric(ille$tobgp)==4, 1, 0)

Dear class: I followed these indicator variable creations with some cross-tabulations to check that my coding was correct (see the solution to Exercise 3 for an example). Output of these checks is not shown.

    # Fit reduced model, fit full model, perform LR test
    library(lmtest)
    reduced <- glm(case ~ 1, data=ille, family=binomial)
    full <- glm(case ~ age_3544 + age_4554 + age_5564 + age_6574 + age_75p + tob_1019 + tob_2029 + tob_30p, data=ille, family=binomial)
    lrtest(reduced, full)

    Likelihood ratio test

    Model 1: case ~ 1
    Model 2: case ~ age_3544 + age_4554 + age_5564 + age_6574 + age_75p + tob_1019 + tob_2029 + tob_30p
      #Df   LogLik   Df   Chisq   Pr(>Chisq)
                                  < 2.2e-16 ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    # By hand calculation using difference in values of deviance
    byhand_1 <- reduced$deviance - full$deviance
    byhand_1
    [1]
    pchisq(byhand_1, 8, lower.tail=FALSE)
    [1]   e-30

    # By hand calculation using difference in values of -2 log Likelihood
    byhand_2 <- -2*logLik(reduced)[1] - (-2*logLik(full)[1])
    byhand_2
    [1]
    pchisq(byhand_2, 8, lower.tail=FALSE)
    [1]   e-30

6. Illustration of Hosmer-Lemeshow Goodness of Fit Test (NULL: model is a good fit). See also the Unit 5 notes.

For this illustration, fit the model containing design variables for age group and tobacco use. Then perform the Hosmer-Lemeshow goodness of fit test.

R Solution

    library(ResourceSelection)
    library(generalhoslem)

    # Hosmer-Lemeshow GOF (Null: model is good fit)
    # Using command hoslem.test() in package ResourceSelection
    # hoslem.test(NAMEOFMODEL$y, fitted(NAMEOFMODEL), g=numberofbins)
    ResourceSelection::hoslem.test(full$y, fitted(full), g=8)

    Hosmer and Lemeshow goodness of fit (GOF) test

    data: full$y, fitted(full)
    X-squared = , df = 6, p-value =

    # Using command logitgof() in package generalhoslem
    generalhoslem::logitgof(ille$case, fitted(full), g=8, ord=FALSE)

    Hosmer and Lemeshow test (binary model)

    data: ille$case, fitted(full)
    X-squared = , df = 6, p-value =

7. Illustration of Link Test (NULL: current model is adequate; no additional predictors needed). See also the Unit 5 notes.

For the current model, perform the link test for omitted variables.

R Solution

    # R solution requires a bit of programming
    # Step 1: Obtain fitted values on the logit scale. Name what you like (eg - hat)
    hat <- predict(full)
    # Step 2: Obtain (fitted values)-squared. Name what you like (eg - hatsq)
    hatsq <- hat^2
    # Step 3: Fit model to fitted and fitted-squared. Test Null: Beta for fitted squared=0
    linktest <- summary(glm(case ~ hat + hatsq, data=ille, family=binomial))
    linktest

    Call:
    glm(formula = case ~ hat + hatsq, family = binomial, data = ille)

    Deviance Residuals:
    Min   1Q   Median   3Q   Max

    Coefficients:
                  Estimate   Std. Error   z value   Pr(>|z|)
    (Intercept)
    hat                                             e-07 ***
    hatsq

Nice. We have no statistically significant evidence of model inadequacy (Null: beta(hatsq)=0, p-value = .657).
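The three link test steps can be wrapped into a small reusable function. This is only a sketch: the function name link_test and the simulated data are assumptions made for illustration, and the helper keeps the fitted values inside a data frame rather than in the global workspace.

```r
set.seed(2)                       # simulated data, for illustration only
n <- 300
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.3 + 1.0 * x))
sim <- data.frame(y = y, x = x)
fit <- glm(y ~ x, data = sim, family = binomial)

# Hypothetical helper: link test for a fitted binomial glm
link_test <- function(model) {
  # Step 1: fitted values on the logit (linear predictor) scale
  d <- data.frame(y = model$y, hat = predict(model, type = "link"))
  # Step 2: squared fitted values
  d$hatsq <- d$hat^2
  # Step 3: refit on hat and hatsq; the row for hatsq carries the test
  refit <- glm(y ~ hat + hatsq, data = d, family = binomial)
  coef(summary(refit))["hatsq", ]
}

# Returns estimate, standard error, z value and p-value for hatsq
link_test(fit)
```

A small p-value for hatsq would suggest that the linear predictor is misspecified; a large one gives no evidence of inadequacy, as in the exercise above.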

8. Illustration of Classification Table. See also the Unit 5 notes.

For the current model, obtain the classification table.

R Solution

    library(gmodels)
    library(epiR)

    # R solution requires a bit of programming here, too.
    # Fit model.
    full <- glm(case ~ age_3544 + age_4554 + age_5564 + age_6574 + age_75p + tob_1019 + tob_2029 + tob_30p, data=ille, family=binomial)

    # Save fitted probabilities (y_fitted)
    y_fitted <- fitted(full)

    # Save predicted 0/1 events using cut-off of 0.50 (pred_50)
    pred_50 <- ifelse(y_fitted >= 0.50, 1, 0)

    # Detailed xtab using command CrossTable (package gmodels)
    classify <- as.data.frame(cbind(pred_50, case=ille$case))
    gmodels::CrossTable(classify$pred_50, classify$case, dnn=c("Predicted (cut=.50)", "Observed"), format="SAS", prop.r=FALSE, prop.c=FALSE, prop.t=FALSE, prop.chisq=FALSE, chisq=FALSE, fisher=FALSE)

    Cell Contents: N

    Total Observations in Table: 975

                           Observed
    Predicted (cut=.50)    0    1    Row Total
    0
    1
    Column Total

    # Detailed epi statistics using command epi.tests (package epiR)
    # Must recode 1=event/0=NONevent to 1=event/2=NONevent
    # And must create table for use with epi.tests
    classify$test <- ifelse((classify$pred_50)==0, 2, 1)
    classify$disease <- ifelse((classify$case)==0, 2, 1)
    class_table <- table(classify$test, classify$disease)
    epiR::epi.tests(class_table)

              Outcome +    Outcome -    Total
    Test +
    Test -
    Total

    Point estimates and 95 % CIs:
    Apparent prevalence          0.03 (0.02, 0.04)
    True prevalence              0.21 (0.18, 0.23)
    Sensitivity                  0.10 (0.06, 0.15)
    Specificity                  0.99 (0.98, 0.99)
    Positive predictive value    0.67 (0.47, 0.83)
    Negative predictive value    0.81 (0.78, 0.83)
    Positive likelihood ratio    7.75 (3.69, 16.29)
    Negative likelihood ratio    0.91 (0.87, 0.96)
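The point estimates above follow directly from the four cells of the classification table, as the base R sketch below shows. The cell counts used here are an assumption: they are not copied from the original table but inferred as a set of counts, summing to the 975 observations, that is consistent with every rounded estimate reported above.

```r
# Assumed cell counts (inferred from the rounded estimates, not from the original table)
tp <- 20    # predicted 1, observed 1
fp <- 10    # predicted 1, observed 0
fn <- 180   # predicted 0, observed 1
tn <- 765   # predicted 0, observed 0
n  <- tp + fp + fn + tn

apparent_prev <- (tp + fp) / n        # proportion predicted to be cases
true_prev     <- (tp + fn) / n        # proportion observed to be cases
sensitivity   <- tp / (tp + fn)       # Pr(predicted 1 | observed 1)
specificity   <- tn / (tn + fp)       # Pr(predicted 0 | observed 0)
ppv           <- tp / (tp + fp)       # Pr(observed 1 | predicted 1)
npv           <- tn / (tn + fn)       # Pr(observed 0 | predicted 0)
lr_pos        <- sensitivity / (1 - specificity)
lr_neg        <- (1 - sensitivity) / specificity

round(c(apparent_prev = apparent_prev, true_prev = true_prev,
        sensitivity = sensitivity, specificity = specificity,
        ppv = ppv, npv = npv, lr_pos = lr_pos, lr_neg = lr_neg), 2)
```

With a cut-off of 0.50 the model classifies very few subjects as cases, which is why specificity is high and sensitivity is low; a lower cut-off would trade one for the other.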

9. Illustration of ROC Curve. See also the Unit 5 notes.

For the current model, produce the ROC curve.

R Solution

    library(pROC)

    # roc1 <- roc(DATAFRAME$yvariable, fitted(MODELNAME))
    # plot.roc(roc1, print.auc=TRUE, legacy.axes=TRUE)
    roc1 <- roc(ille$case, fitted(full))
    pROC::plot.roc(roc1, main="Ille-et-Vilaine", print.auc=TRUE, legacy.axes=TRUE)
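The area under the ROC curve printed on the plot can also be computed by hand, since it equals the Mann-Whitney (rank sum) statistic scaled by the number of case/non-case pairs. The sketch below uses base R and simulated data; the function name auc_rank is made up for illustration, and the result should in principle match the AUC reported by pROC for the same outcome and fitted probabilities.

```r
# Hypothetical helper: AUC from ranks of the fitted probabilities
auc_rank <- function(y, score) {
  r  <- rank(score)            # mid-ranks, so ties count as one half
  n1 <- sum(y == 1)            # number of events
  n0 <- sum(y == 0)            # number of non-events
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(3)                    # simulated data, for illustration only
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.9 * x))
fit <- glm(y ~ x, family = binomial)

# AUC of the fitted model: the probability that a randomly chosen event
# has a higher fitted probability than a randomly chosen non-event
auc_rank(y, fitted(fit))
```

An AUC of 0.5 corresponds to no discrimination and 1 to perfect discrimination.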
