Statistics and Data Analysis. Use of the ROC Curve and the Bootstrap in Comparing Weighted Logistic Regression Models


David Izrael, Annabella A. Battaglia, David C. Hoaglin, and Michael P. Battaglia, Abt Associates Inc., Cambridge, MA

Abstract

In analyzing data from a survey, researchers often need to compare the effectiveness of several logistic regression models. The receiver operating characteristic curve offers one way to measure effectiveness of prediction, by calculating the area under the curve (AUC). We present a SAS macro for calculating AUC that takes the survey weights into account. For comparing logistic regression models, one needs to assess differences in AUC against the variation in the data. We demonstrate the use of the SAS SURVEYSELECT procedure to create a set of 1,000 bootstrap samples and give some background on the calculation of separate weights for each bootstrap sample. For each sample, the AUC macro is then used to calculate the AUC for each model. We show how to use the bootstrap results to assess the significance of the difference in predictive ability of the two models.

1. Introduction

In analyzing data from a survey, we needed to compare the effectiveness of several logistic regression models. The receiver operating characteristic (ROC) curve offers one way to measure effectiveness of prediction, by calculating AUC. Then, for the comparisons, we needed to assess the differences in AUC against the variation in the data. We had already developed a substantial set of bootstrap samples, and those allowed us to calculate a bootstrap standard error for the difference in AUC, without making any distributional assumptions. With this brief overview we now describe these components and then discuss their application in our study. A key ingredient is the sampling weights associated with the survey data.

The receiver operating characteristic curve is often used to describe the accuracy of tests in diagnostic medicine, as summarized in the review by Pepe (2000). Briefly, the test yields a numerical result X, such that larger values are more indicative of disease.
One can choose a threshold z and dichotomize the test by defining X ≥ z as a positive result. From subjects whose true disease status is known (both diseased and nondiseased), one obtains the false-positive rate and the false-negative rate for each value of z. The ROC curve is obtained by plotting 1 minus the false-negative rate against the false-positive rate for all possible choices of z. That is, each value of z yields a point on the curve, which includes the point (0,0) (if z is high enough, the test produces no positives) and the point (1,1) (if z is low enough, all outcomes are positive).

The area under the ROC curve provides a summary of the accuracy of the diagnostic test. As Pepe points out, the AUC can be interpreted as the probability that the test result from a randomly chosen diseased individual is more indicative of disease than that from a randomly chosen nondiseased individual. This interpretation or equivalence, discussed also by Hanley and McNeil (1982), focuses attention on the distributions of the test result (for example, the concentration of a chemical in blood) in diseased and nondiseased persons. If the two distributions are clearly separated, the probability will be close to 1; but if they are centered at the same value, the probability will be 1/2.

In the context of logistic regression we refer to event cases and non-event cases, rather than diseased and nondiseased persons. The test result is the predicted probability of an event, from the logistic regression model.

The bootstrap (Efron 1982) uses resampling to provide a basis for studying the behavior of estimates. For a simple random sample of size n, with observations $x_1, x_2, \ldots, x_n$, the main steps involve setting B (the number of bootstrap samples, usually large); using sampling with replacement to draw a bootstrap sample of n, $X_1^*, X_2^*, \ldots, X_n^*$, from the set $\{x_1, x_2, \ldots, x_n\}$ (B times, independently); and calculating the estimate, t, from each bootstrap sample to obtain $t_1^*, t_2^*, \ldots, t_B^*$.
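The bootstrap steps just listed can be sketched in a few lines. The Python fragment below is only an illustration for a simple random sample, not the survey procedure used in this paper; the function name `bootstrap_se`, the seed, and the toy data are assumptions made for the example.

```python
import random
import statistics


def bootstrap_se(data, estimator, n_boot=1000, seed=0):
    """Return (bootstrap standard error, replicate estimates).

    data      : observations x_1, ..., x_n from a simple random sample
    estimator : function that maps a list of observations to the estimate t
    n_boot    : B, the number of bootstrap samples
    """
    rng = random.Random(seed)
    n = len(data)
    replicates = []
    for _ in range(n_boot):
        # Draw n observations with replacement from the original sample.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        replicates.append(estimator(resample))
    # The sample standard deviation of the replicate estimates.
    return statistics.stdev(replicates), replicates


if __name__ == "__main__":
    x = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # toy data, not survey data
    se, reps = bootstrap_se(x, statistics.mean, n_boot=1000, seed=1)
    print(f"bootstrap SE of the mean: {se:.3f}")
```

The sketch treats every observation as equally weighted; with survey weights and a stratified design the resampling and the weights both need more care, as later sections explain.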
Analysis of the $t_b^*$ then yields information on the sampling distribution of t when the data come from the population that underlies $x_1, \ldots, x_n$. For example, the sample standard deviation of the $t_b^*$ is the bootstrap standard error of t.

When the data are a sizable sample from a survey with survey weights, both the calculation of the area under the ROC curve and application of the bootstrap require considerable special programming. Section 2 discusses the use of SAS to calculate AUC in the presence of survey weights. Section 3 comments on comparing the predictive value of logistic regression models. Section 4 sketches the basic framework for applying the bootstrap to a complex sample survey, and Section 5 illustrates the use of PROC SURVEYSELECT to create bootstrap samples. When the survey weights involve adjustments, the elements of a bootstrap sample cannot simply inherit the weights that they had in the original sample. Section 6 discusses the need to recalculate weights, so that each bootstrap sample has its own complete set of replicate weights. Section 7 then reports on the use of the bootstrap to estimate the standard error of AUC for a logistic regression model and the standard error of the difference in AUC between two such models. Finally, Section 8 adds some concluding discussion.

2. Using SAS to Estimate the Area under the ROC Curve

We often use PROC LOGISTIC to fit logistic regression models to weighted survey data. In fitting a model, PROC LOGISTIC takes the survey weights into account, but it ignores them in calculating the ingredients of the ROC curve. Those ingredients are stored in the OUTROC data set, which keeps one record for each distinct predicted probability and has the following variables (whose values correspond to using that probability as the threshold):

_POS_ - the number of correctly predicted event responses;
_NEG_ - the number of correctly predicted non-event responses;
_FALPOS_ - the number of falsely predicted event responses;
_FALNEG_ - the number of falsely predicted non-event responses;
_SENSIT_ - the sensitivity, which is the proportion of event observations that were predicted to have an event response; and
_1MSPEC_ - 1 minus specificity, which is the proportion of non-event observations that were predicted to have an event response.

In the presence of survey weights, the variables for the ROC curve are not computed correctly and look exactly the same as if there were no weights. The predicted probabilities, however, are correct. To calculate the AUC in the presence of survey weights, we wrote a macro, CALCAUC, which takes the weights into account when calculating the variables for the ROC curve. We give an overview of the macro below and consider its application, both as a stand-alone program and as a subroutine in a bootstrap procedure.

Macro CALCAUC Algorithm

The macro calculates the variables for the ROC curve and then the AUC in the presence of survey weights.
To formalize the algorithm, we incorporate weights in the definitions of the ROC curve variables described in Chapter 39, The Logistic Procedure, of SAS/STAT documentation (Version 8). Let the weighted number of individuals in a sample having a certain event be $n_1$. Let this group be denoted by $C_1$, and let the group of the remaining $n_2$ (weighted) individuals who do not have the event be denoted by $C_2$. Let $\hat{p}_i$ be an estimated probability of the event for individual i in the weighted model. $W(\cdot)$ denotes the weighted indicator function. For example, if $\hat{p}_i \ge z$, then $W(\hat{p}_i \ge z)$ is the sampling weight of individual i (and it is 0 otherwise). For each cutpoint z,

$$\mathrm{\_POS\_}(z) = \sum_{i \in C_1} W(\hat{p}_i \ge z) \tag{1}$$

$$\mathrm{\_FALPOS\_}(z) = \sum_{i \in C_2} W(\hat{p}_i \ge z) \tag{2}$$

$$\mathrm{\_SENSIT\_}(z) = \mathrm{\_POS\_}(z) / n_1 \tag{3}$$

$$\mathrm{\_1MSPEC\_}(z) = \mathrm{\_FALPOS\_}(z) / n_2 \tag{4}$$

Note that _POS_(z) is the weighted number of correctly predicted event responses, _FALPOS_(z) is the weighted number of falsely predicted event responses, _SENSIT_(z) is the weighted sensitivity of the model, and _1MSPEC_(z) is one minus the weighted specificity of the model.

Having calculated _SENSIT_ and _1MSPEC_, we use them to calculate the AUC as the sum of the areas of trapezoids. Formally, if S is a set of cutpoints joined with 0 as the initial one and 1 as the last one, the AUC can be expressed by the formula:

$$\mathrm{AUC} = 0.5 \sum_{i \in S} \left(\mathrm{\_1MSPEC\_}_{i+1} - \mathrm{\_1MSPEC\_}_{i}\right)\left(\mathrm{\_SENSIT\_}_{i+1} + \mathrm{\_SENSIT\_}_{i}\right) \tag{5}$$

Exhibit 1 presents the macro (with line numbers). We now describe its functions section by section and discuss such issues as computational efficiency and resource consumption.
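As a cross-check on formulas (1) through (5), the calculation can be sketched outside SAS. The Python function below is an illustrative assumption, not the CALCAUC macro: the name `weighted_auc` and its list arguments are made up for the example, and it loops over every cutpoint and every observation in the same brute-force way that the formulas are written.

```python
def weighted_auc(y, p_hat, w):
    """Weighted AUC by the trapezoidal rule, following formulas (1)-(5).

    y     : 1 for an event, 0 for a non-event
    p_hat : predicted probability of the event for each individual
    w     : survey weight for each individual
    """
    n1 = sum(wi for yi, wi in zip(y, w) if yi == 1)  # weighted events
    n2 = sum(wi for yi, wi in zip(y, w) if yi == 0)  # weighted non-events
    if n1 <= 0 or n2 <= 0:
        raise ValueError("both groups need positive total weight")

    # Distinct predicted probabilities serve as cutpoints, highest first.
    cutpoints = sorted(set(p_hat), reverse=True)

    auc = 0.0
    prev_fpr, prev_sens = 0.0, 0.0  # the curve starts at the point (0, 0)
    for z in cutpoints:
        pos = sum(wi for yi, pi, wi in zip(y, p_hat, w)
                  if yi == 1 and pi >= z)          # formula (1)
        falpos = sum(wi for yi, pi, wi in zip(y, p_hat, w)
                     if yi == 0 and pi >= z)       # formula (2)
        sens = pos / n1                            # formula (3)
        fpr = falpos / n2                          # formula (4)
        # One trapezoid between consecutive points on the curve, formula (5).
        auc += 0.5 * (fpr - prev_fpr) * (sens + prev_sens)
        prev_fpr, prev_sens = fpr, sens
    return auc


if __name__ == "__main__":
    y = [1, 0, 1, 0]
    p = [0.8, 0.6, 0.4, 0.2]
    print(weighted_auc(y, p, [1, 1, 1, 1]))  # equal weights
    print(weighted_auc(y, p, [1, 1, 1, 3]))  # unequal weights change the area
```

At the lowest cutpoint every individual is predicted to have the event, so the last point is always (1, 1) and no separate closing step is needed.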
Overview of the code

Lines 1-23 contain the macro's input parameters:

model represents the string of explanatory variables, all of which must be categorical (otherwise the number of distinct predicted probabilities could be very large);
depvar is a response variable, assumed to have the value of 1 for an event and 0 for a non-event;
round and acceler control efficiency of the macro and will be described below;
replica must be blank when running the macro as a stand-alone program; otherwise it must be assigned the name of a macro variable that serves as a replicate counter when calculation of AUC is done for each bootstrap replicate.

The CHECK macro and the statements that call it check that the variables in the model are present in the input data set. If not, the macro outputs names of absent variables into the LOG and stops (Lines 45 and 60).

The PROC LOGISTIC step comes next, with its statements and options. The data set specified in the OUTROC option will contain the distinct estimated probabilities (_PROB_), which will serve as the cutpoints mentioned in the Algorithm section. Also, the data set _PROBS specified in the option OUT will include all variables of the input data set, along with the predicted probability _P_HAT.

The statements inside the %IF block that tests ACCELER are optional and are intended to accelerate the computational process. Computing time is especially sensitive to the number of distinct estimated probabilities. We suggest reducing computing time by rounding the predicted probabilities (lines 76 and 85). The impact of the rounding on the precision of the calculated AUC is ordinarily minimal: we observed a difference only in the fifth decimal place. The PROC SORT step restores the original descending order of the predicted probabilities, as rounding could change the ordering, and its NODUPKEY option drops the duplicate cutpoints that rounding creates.

The DATA step that creates _OUT1 calculates the weighted variables associated with the ROC curve, _POS_ and _FALPOS_ in particular, following formulas (1) and (2). The outer DO-loop (line 92) takes each distinct predicted probability from the OUTROC data set in turn and passes it through the whole _PROBS data set, which is accessed directly in the inner DO-loop (line 98).

The DATA step that creates _S calculates _SENSIT_ and _1MSPEC_ according to formulas (3) and (4).

The final DATA step calculates the AUC by formula (5). If we are computing the AUC for each bootstrap replicate, the name of the output data set with the calculated value of AUC contains the replicate number. In this situation the name of the variable with the calculated area contains the replicate number as well (line 141).

Exhibit 1: CALCAUC Macro

      1  %macro calcauc(dsanal =,   /* INPUT DATA SET */
      2
      3         outds = c,          /* OUTPUT DATA SET WITH AUC */
      4
      5         id =,               /* ID VARIABLE */
      6
      7         weight =,           /* SURVEY WEIGHT */
      8
      9         model =,            /* ALL EXPLANATORY VAR's. */
     10                             /* MUST BE CATEGORICAL */
     11
     12         depvar =,           /* DEPENDENT VARIABLE */
     13
     14         round =.001,        /* PRECISION AT WHICH TO */
     15                             /* ROUND PRED PROBABIL */
     16
     17         replica =,          /* COUNTER OF REPLICATES */
     18                             /* WHEN BOOTSTRAP IS USED */
     19
     20         acceler = y );      /* ACCELERATE CALCULATIONS
     21                                BY ROUNDING PREDICTED
     22                                PROBABILITIES. &ROUND
     23                                MUST BE PRESENT */
     24
     25  %let control = 1;
     26
     27  %macro check;
     28
     29  %local dsid nullstr rc varnum; /* CONTROL STAYS IN CALCAUC'S SCOPE */
     30
     31  %let model=%upcase(&model);
     32  %let depvar=%upcase(&depvar);
     33  %let string=&model &depvar;
     34
     35  %let i=1;
     36  %let nullstr=;
     37
     38  %let dsid=%sysfunc(open(&dsanal));
     39
     40  %do %until(%scan(&string,&i)=&nullstr);
     41    %let varnum=%sysfunc(varnum(&dsid,%scan(&string,&i)));
     42    %if &varnum=0 %then %do;
     43      %let control=0;
     44      %put ;
     45      %put VARIABLE %scan(&string,&i) APPEARS IN THE
     46        MODEL, BUT NOT IN THE INPUT DATA SET;
     47      %put ;
     48    %end;
     49    %let i=%eval(&i+1);
     50  %end;
     51
     52  %let rc=%sysfunc(close(&dsid));
     53  %mend check;
     54
     55  %if (&replica=) or (&replica=1) %then %check;
     56
     57  %if &control = 0 %then %do;
     58    %put **** MACRO TERMINATED BECAUSE OF ERRORS
     59      ABOVE ******;
     60    %goto exit;
     61  %end;
     62  %else %do;
     63
     64  proc logistic descending data=&dsanal;
     65    weight &weight /norm;
     66    class &model;
     67    model &depvar= &model/
     68      outroc=_roc(keep=_prob_);
     69    output out=_probs predicted=_p_hat;
     70
     71
     72  %if %upcase(&acceler) = Y %then %do;
     73
     74  data _roc;
     75    set _roc;
     76    _prob_=round(_prob_, &round);
     77
     78
     79  proc sort data=_roc nodupkey;
     80    by descending _prob_;
     81
     82
     83  data _probs;
     84    set _probs;
     85    _p_hat=round(_p_hat,&round);
     86
     87
     88  %end;
     89
     90  data _out1 (keep= _pos_ _neg_ _falpos_ _falneg_);
     91
     92  do i=1 to numobroc;
     93
     94    set _roc nobs=numobroc point=i; /* CUTPOINTS FROM OUTROC= */
     95    retain _pos_ _neg_ _falpos_ _falneg_ ;
     96    _pos_=0; _neg_=0; _falpos_=0; _falneg_=0;
     97
     98    do j=1 to numobpro;
     99
    100      set _probs nobs=numobpro point=j;
    101      if _p_hat >=_prob_ then _preddep=1; else _preddep=0;
    102      if &depvar=1 and _preddep=1 then _pos_=_pos_+ &weight;
    103      else
    104      if &depvar=0 and _preddep=1 then _falpos_=_falpos_+ &weight;
    105      else
    106      if &depvar=1 and _preddep=0 then _falneg_=_falneg_+&weight;
    107      else
    108      _neg_=_neg_+&weight;
    109
    110      if j = numobpro then output;
    111
    112    end;
    113  end;
    114  stop;
    115
    116
    117  data _s;
    118    set _out1;
    119    if _n_=1 then set _out1(rename=(_pos_=_n1
    120      _falpos_=_n2)) nobs=numout point=numout;
    121
    122    _sensit_=_pos_/_n1;
    123    _1mspec_=_falpos_/_n2;
    124
    125
    126  data &outds&replica(keep=area&replica);
    127    set _s end=fin;
    128    retain _w _z area&replica 0;
    129    if _n_=1 then do; _w=0; _z=0;
    130    end;
    131    _x=_1mspec_-_w;
    132    _y=(_sensit_+_z)*0.5;
    133    _z=_sensit_;
    134    _w=_1mspec_;
    135
    136    area&replica=sum(area&replica,_x*_y);
    137
    138    if fin then output;
    139
    140
    141  proc print data=&outds&replica;
    142
    143  %end;
    144  %exit:;
    145  %mend calcauc;

3. Comparing the Predictive Value of Two Models

Many analyses involve fitting two or more logistic regression models to the same data. Then, in choosing among the models, it may be useful to compare their predictive value, via the AUC. Often one of the models is the final model or full model from a stepwise logistic regression, and the other models are subsets of the full model (e.g., the first k predictors to enter a reduced model). The difference between the AUC for the full model and the AUC for a reduced model can aid in judging whether the full model offers a real advantage over the reduced model. (This application of the difference in AUC does not require that the models be nested. It is applicable to the comparison of any two models.)

To assess the size of the difference in AUC relative to the variation in the data, we need the estimated standard error of the difference. One suitable approach is the bootstrap method, which uses replication (Wolter 1985).
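To make the quantity concrete, here is a small Python sketch (an illustration only, not part of the SAS workflow). It assumes the replicate AUCs for the two models are already available as two equal-length lists, paired by bootstrap replicate; the function name `bootstrap_se_of_difference` and the toy numbers are hypothetical.

```python
import statistics


def bootstrap_se_of_difference(auc_full, auc_reduced):
    """Standard error of the AUC difference from paired bootstrap replicates."""
    if len(auc_full) != len(auc_reduced):
        raise ValueError("replicate lists must be paired and of equal length")
    # Take the difference within each replicate so the pairing is preserved.
    diffs = [a - b for a, b in zip(auc_full, auc_reduced)]
    # Sample standard deviation of the replicate differences.
    return statistics.stdev(diffs)


if __name__ == "__main__":
    full = [0.66, 0.65, 0.67, 0.66]      # hypothetical toy values
    reduced = [0.64, 0.64, 0.65, 0.63]   # hypothetical toy values
    print(round(bootstrap_se_of_difference(full, reduced), 4))
```

Differencing within each replicate matters because both models are fit to the same bootstrap sample, so their AUCs are correlated; combining two separately computed standard errors would ignore that correlation.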
As mentioned in Section 1, the bootstrap involves drawing repeated independent samples (with replacement) from the original sample and then estimating the AUC for each model and the difference in AUC, for each of these bootstrap samples. The sample standard deviation of that difference (over the bootstrap samples) is the bootstrap estimate of its standard error.

4. The Bootstrap Method for Variance Estimation

Our data came from a stratified one-stage cluster sample of over 20,000 persons that incorporates several weighting adjustments. The sample design entails stratification of the U.S. into 78 geographic areas. Within each stratum, a random sample of households is drawn. The survey collects data on all eligible household members, making households the clusters in the sample design. Rust and Rao (1996) discuss the use of replication methods to obtain standard errors for complex survey designs. The application of the bootstrap procedure to our sample design involves drawing the bootstrap samples (replicates) within each stratum.

In connection with other analyses of the same data, we had previously constructed 1,000 bootstrap replicates, in order to obtain bounds for 95% confidence intervals directly from the distribution of the bootstrap estimates, as well as bootstrap standard errors. Thus, it was natural to use those 1,000 bootstrap replicates in estimating the standard error of the difference in AUC. The next section describes the use of PROC SURVEYSELECT to construct bootstrap replicates. Then Section 6 discusses the further steps required to produce sampling weights specific to each replicate.

5. Use of PROC SURVEYSELECT

The following statements show how the SURVEYSELECT procedure was used to draw the 1,000 bootstrap replicates.

    %let dd = ourdirectory;
    %let in_iter=1;
    %let max_iter=1000;
    %let in_smpfile=&dd..samplefile;
    %let in_nsize= geo_area_tot;

    /* CREATE A DATASET WITH SAMPLE SIZES TO BE DRAWN FROM EACH STRATUM */
    proc freq data=&in_smpfile;
      table geo_area/out=&in_nsize(rename=(count=_nsize_) drop=percent);

    /* MACRO TO DRAW 1000 BOOTSTRAP SAMPLES */
    %MACRO BOOTREP;
    %do i=&in_iter %to &max_iter;
      proc printto new print="brep_&i..lst";
      proc printto new log="brep_&i..log";
      title3 " REPLICATE =&i URS selection";
      options pageno=1;
      proc surveyselect data=&in_smpfile method=urs
           sampsize=&in_nsize out=&dd..urs_&i outhits;
        strata geo_area;
      proc freq data=&dd..urs_&i;
        tables NumberHits;
    %end;
    %MEND BOOTREP;
    %BOOTREP

To draw the 1,000 sample replicates with equal probability and with replacement, we used METHOD=URS (Unrestricted Random Sampling). SAMPSIZE = GEO_AREA_TOT identifies the SAS data set that contains _NSIZE_, the different sample sizes for the strata. The OUT=&dd..urs_&i option outputs each of the 1,000 samples into a separate permanent SAS data set. The OUTHITS option outputs a separate observation for each selection when an observation is selected more than once. The output data set contains for each observation the variable NumberHits, the number of times a household was selected into the sample in a given replicate. The STRATA statement defines the variable GEO_AREA as the stratification variable.

6. Calculation of Replicate Weights

Rust and Rao (1996) give a method for adjusting the final sampling weights to obtain bootstrap weights. They also note, however, that for the variance estimators to remain close to unbiased, the weight adjustment steps applied to the original sample should be applied to each bootstrap replicate. This is an important consideration in our sample design, given the considerable number of weight adjustments. Thus, for each bootstrap replicate we repeated all of the weight calculation steps. As a result each of the 1,000 bootstrap replicates has its own set of weights.

7. Using SAS and the Bootstrap Replicates to Estimate the Variance of the AUC

Applying the macro CALCAUC to the original sample, we calculate the AUC for the weighted models with 14 explanatory variables (full model) and 6 explanatory variables (reduced model).
We denote them by AUC14 and AUC6, respectively. In Exhibit 2 the macro ALLREPL uses the macro CALCAUC as a subroutine to refit a weighted logistic regression model and obtain the AUC for each of 1,000 bootstrap replicates.

Exhibit 2: ALLREPL Macro

    %let youranal = anal;        /* ANALYTIC FILE WITH ALL DATA */
    %let dsbswts = replwts;      /* DATA SET WITH ID AND 1,000 REPLICATE WEIGHTS */
    %let model = yourmodel;      /* STRING WITH EXPLANATORY VARIABLES */
    %let depvar = yourresponse;  /* RESPONSE VARIABLE */

    %macro allrepl (start,end);
    /* &START and &END ARE FIRST AND LAST REPLICATE TO PROCESS,
       1 AND 1,000 IN OUR EXAMPLE */
    %do v=&start %to &end ;
      /* RETRIEVE &V-th REPLICATE WHERE V-th REPLICATE WEIGHT NE ZERO */
      data _anal;
        merge &youranal (in=_1)
              &dsbswts (keep=id w&v where=(w&v ne 0) in=_2);
        by ID;
        if _2;
        wgt=w&v;
        drop w&v;

      %calcauc( dsanal = _ANAL,   /* CALCULATE AUC FOR &V-th REPLICATE */
                outds = C,
                id = ID,
                weight = WGT,
                model = &MODEL,   /* REFIT MODEL TO DATA IN &V-th REPLICATE */
                depvar = &DEPVAR, /* MACRO VARIABLES DEFINED ABOVE */
                round =.001,
                replica = &V,
                acceler = Y );
    %end;
    %mend;
    %allrepl(1,1000)

Applying the ALLREPL macro for the full model and then for the reduced model, we obtain two sets of replicate AUCs: AUCR14_i (i = 1 to 1,000), and AUCR6_i (i = 1 to 1,000), respectively. We denote the bootstrap sample AUC by AUCR to distinguish it from the one of the original sample. To estimate the bootstrap standard errors of AUC14 and AUC6, we simply apply PROC UNIVARIATE to the AUCR14_i and the AUCR6_i (i = 1 to 1,000) to obtain the standard deviations. To estimate the standard error of the difference in AUC between the two models, DIFF = AUC14 - AUC6, we apply PROC UNIVARIATE to the differences AUCR14_i - AUCR6_i (i = 1 to 1,000) to obtain the standard deviation STDDIFF. Then

    t = abs(DIFF / STDDIFF)

The corresponding significance level of a two-tailed t test is given by

    p = (1-PROBT(t, df))*2

The weighted AUC is .658 for the full model and .641 for the reduced model. The difference in weighted AUC of .017 is highly significant. Table 1 gives the results.

Table 1: AUCs and Bootstrap Standard Errors

    Number of     Area or Difference in
    Predictors    Area Under the Curve     Std. Err.   Variance   t   p
    14            .658
    6             .641
    14 vs 6       .017

8. Discussion

The area under the receiver operating characteristic curve is an important and widely used measure of the predictive ability of a logistic regression model. Most survey data files have survey weights attached. The LOGISTIC procedure does not take the weights into account in its calculation of the area under the ROC curve and therefore usually does not give the correct value. The SAS macro CALCAUC uses the survey weights in the calculation of the area under the ROC curve by summing the areas of trapezoids. Hanley and McNeil (1982) indicate that the c statistic (which PROC LOGISTIC reports in the Association of Predicted Probabilities and Observed Responses table) is equivalent to the area under the ROC curve. We have also developed a SAS macro, not discussed in this paper, that calculates a weighted version of the c statistic.

We have used the area under the curve to compare the predictive ability of two logistic regression models estimated from the same survey data file. Using bootstrap samples, it is possible to test whether the two models have the same area under the ROC curve. We provide some background on how SURVEYSELECT can be used to create bootstrap samples. It is possible, however, that the survey file being analyzed already contains bootstrap samples and bootstrap replicate weights. If not, it is wise to consult with a statistician who is familiar with the bootstrap method of variance estimation before creating bootstrap samples and bootstrap replicate weights.

References

Efron, B. (1982). The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics.

Hanley, J.A. and McNeil, B.J. (1982). The Meaning and Use of the Area under a Receiver Operating Characteristic (ROC) Curve, Radiology, 143.

Pepe, M.S. (2000). Receiver Operating Characteristic Methodology, Journal of the American Statistical Association, 95.

Rust, K.F. and Rao, J.N.K. (1996). Variance Estimation for Complex Surveys Using Replication Techniques, Statistical Methods in Medical Research, 5.

SAS Institute Inc. (1995). Logistic Regression Examples Using the SAS System, Version 6, First Edition. Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1999). SAS/STAT, Version 8, Chapter 39. Cary, NC: SAS Institute Inc.

Wolter, K.M. (1985). Introduction to Variance Estimation. New York: Springer-Verlag.

Acknowledgment

We thank Philip Primak, Principal Programmer at Genzyme Corporation, for reviewing the macros and making valuable suggestions.

Contact Information

David Izrael
Abt Associates Inc.
55 Wheeler St.
Cambridge, MA
david_izrael@abtassoc.com
