Chapter 9 Building the Regression Model I: Model Selection and Validation
Chapter 9 Building the Regression Model I: Model Selection and Validation
許湘伶, Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li)
hsuhl (NUK), LR Chap 9
9.1 Polynomial Regression Models

From previous chapters we know how to fit models and how to make inferences from them. We now turn to the selection of predictor variables for exploratory observational studies.
Overview of Model-Building Process

Strategy for building a regression model:
1. Data collection and preparation
2. Reduction of the number of explanatory variables
3. Model refinement and selection (tentative)
4. Model validation
9.2 Surgical Unit Example

A study of a particular type of liver operation. Variables:
- X1: blood clotting score
- X2: prognostic index
- X3: enzyme function test score
- X4: liver function test score
- X5: age, in years
- X6: indicator variable for gender (0 = male; 1 = female)
- X7 and X8: indicator variables for history of alcohol use
- Y: response, survival time
- Original data: 108 patients
- Preliminary study: the first 54 patients with the first four variables
Basic plots and the correlation matrix diagnose several cases as outlying; the influence of these cases should be examined.
Residual plots flag several cases as outlying and show curvature, nonconstant error variance, and some departure from normality.
To make the distribution of the error terms more nearly normal, and to see whether the same transformation also reduces the apparent curvature, the response is transformed: Y' = ln Y.
Figure: JMP scatter plot matrix and correlation matrix when the response variable is Y' (Surgical Unit Example).
All predictors are linearly associated with Y', with X3 and X4 showing the highest degree of association and X1 the lowest. There are intercorrelations among the potential predictor variables (e.g., X4 with X1, X2, X3). Basis of the analyses: enter the predictor variables in linear terms and do not include any interaction terms. Model selection then examines whether all of the potential variables are needed or whether a subset of them is adequate.
9.3 Criteria for Model Selection

From any set of P - 1 potential predictors, 2^(P-1) alternative models can be constructed; this count includes the regression model with no X variables.
Model selection procedures have been developed to identify a small group of regression models that are good according to a specified criterion. A detailed examination can then be made of a limited number of the more promising candidate models, leading to the selection of the final regression model to be employed. We focus on six criteria: R^2_p, R^2_{a,p}, C_p, AIC_p, SBC_p, and PRESS_p (SBC_p is also known as BIC_p).
Notation for model selection:
- P - 1: the number of potential X variables in the pool
- All regression models contain an intercept term β_0 (the design matrix X contains a column of 1s), so the full model has P parameters
- p - 1: the number of X variables in a given subset (p parameters in the regression function for that subset)
- n > P and 1 ≤ p ≤ P
R^2_p or SSE_p Criterion

The R^2_p criterion identifies subsets for which R^2_p is high, or equivalently for which the error sum of squares SSE_p is small:

    R^2_p = 1 - SSE_p / SSTO

SSTO is constant for all possible regression models, so R^2_p varies inversely with SSE_p, and R^2_p can never decrease as the number of X variables increases. The R^2_p criterion is therefore not intended to identify the subset that maximizes the criterion (the maximum occurs when all P - 1 potential X variables are included); rather, the intent is to find the point where adding more X variables is not worthwhile because it leads to only a very small increase in R^2_p.
R^2_{a,p} or MSE_p Criterion

To take account of the number of parameters, use R^2_{a,p}, the adjusted coefficient of multiple determination:

    R^2_{a,p} = 1 - ((n - 1)/(n - p)) * SSE_p/SSTO = 1 - MSE_p / (SSTO/(n - 1))

For a given set of Y observations, R^2_{a,p} increases if and only if MSE_p decreases. max(R^2_{a,p}) can decrease as p increases: when the increase in max R^2_p becomes small, it is not sufficient to offset the loss of an additional degree of freedom. The R^2_{a,p} criterion seeks a few subsets for which R^2_{a,p} is at or near the maximum.
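The two criteria above can be computed directly from SSE and SSTO. The deck's code is in R; as an illustration only, here is a minimal Python sketch on simulated data (not the surgical unit data), showing that R^2_p never decreases when a pure-noise predictor is added, while R^2_{a,p} adjusts for the lost degree of freedom:

```python
import numpy as np

def r2_stats(X, y):
    """Fit OLS with an intercept; return (R^2_p, R^2_{a,p})."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])      # design matrix, p columns
    p = Xd.shape[1]                            # number of parameters
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    sse = float(np.sum((y - Xd @ beta) ** 2))
    ssto = float(np.sum((y - y.mean()) ** 2))
    r2 = 1 - sse / ssto
    r2a = 1 - (n - 1) / (n - p) * sse / ssto   # adjusted for lost df
    return r2, r2a

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))                   # third column is pure noise
y = 2 + X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=30)

r2_sub, r2a_sub = r2_stats(X[:, :2], y)        # subset without the noise column
r2_full, r2a_full = r2_stats(X, y)             # all three predictors
print(r2_full >= r2_sub)                       # nested models: always True
```

The comparison at the end illustrates why R^2_p alone cannot choose a subset size: it is maximized by the full pool.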
Mallows' C_p Criterion

This criterion is concerned with the total mean squared error of the n fitted values. The error in Ŷ_i decomposes into a bias component and a random error component:

    Ŷ_i - μ_i = (E{Ŷ_i} - μ_i) + (Ŷ_i - E{Ŷ_i})

so the mean squared error for Ŷ_i is

    E{Ŷ_i - μ_i}^2 = (E{Ŷ_i} - μ_i)^2 + σ^2{Ŷ_i}

The total mean squared error for all n fitted values Ŷ_i is

    Σ_{i=1}^n [(E{Ŷ_i} - μ_i)^2 + σ^2{Ŷ_i}] = Σ_{i=1}^n (E{Ŷ_i} - μ_i)^2 + Σ_{i=1}^n σ^2{Ŷ_i}
Mallows' C_p Criterion (cont.)

Γ_p, the criterion measure, is the total mean squared error divided by σ^2:

    Γ_p = (1/σ^2) [Σ_{i=1}^n (E{Ŷ_i} - μ_i)^2 + Σ_{i=1}^n σ^2{Ŷ_i}]

C_p, the estimator of Γ_p:

    C_p = SSE_p / MSE(X_1, ..., X_{P-1}) - (n - 2p)
Why C_p is an estimator of Γ_p:

    Σ_{i=1}^n σ^2{Ŷ_i} = trace(Cov(Xb)) = p σ^2

    E{SSE_p} = Σ_{i=1}^n (E{Ŷ_i} - μ_i)^2 + (n - p) σ^2

so that

    Γ_p = (1/σ^2) [Σ_{i=1}^n (E{Ŷ_i} - μ_i)^2 + Σ_{i=1}^n σ^2{Ŷ_i}] = E{SSE_p}/σ^2 - (n - 2p)

Replacing E{SSE_p} by SSE_p and σ^2 by MSE(X_1, ..., X_{P-1}) gives

    C_p = SSE_p / MSE(X_1, ..., X_{P-1}) - (n - 2p)
When E{Ŷ_i} = μ_i (no bias with the p - 1 X variables), E{C_p} ≈ p. For the regression model containing all P - 1 X variables,

    C_P = SSE(X_1, ..., X_{P-1}) / [SSE(X_1, ..., X_{P-1})/(n - P)] - (n - 2P) = P

The C_p measure assumes that MSE(X_1, ..., X_{P-1}) is an unbiased estimator of σ^2. We would like to find models for which:
1. C_p is small
2. C_p ≈ p
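The identity C_P = P for the full model follows directly from the definition. As an illustration (Python sketch on simulated data, not the book's R code or the surgical unit data), C_p can be computed for any subset from its SSE and the full-model MSE; note how dropping a truly relevant predictor inflates C_p through the bias term:

```python
import numpy as np

def sse(X, y):
    """SSE of an OLS fit with intercept."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return float(np.sum((y - Xd @ beta) ** 2))

rng = np.random.default_rng(1)
n, Pm1 = 40, 4                        # n cases, P-1 potential predictors
X = rng.normal(size=(n, Pm1))
y = 3 + X[:, 0] + 2 * X[:, 2] + rng.normal(size=n)   # X2 (index 2) matters

P = Pm1 + 1
mse_full = sse(X, y) / (n - P)        # MSE(X1,...,X_{P-1}) estimates sigma^2

def cp(cols):
    """Mallows' C_p for the subset model using columns `cols`."""
    p = len(cols) + 1                 # parameters in the subset model
    return sse(X[:, cols], y) / mse_full - (n - 2 * p)

print(round(cp([0, 1, 2, 3]), 6))    # full model: C_P = P = 5 exactly
```

A subset that omits the strong predictor in column 2 has a large bias component, so its C_p is far above p.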
AIC_p and SBC_p Criteria

These criteria provide penalties for adding predictors:
- Akaike's information criterion: AIC_p
- Schwarz' Bayesian criterion: SBC_p (also known as the Bayesian information criterion, BIC)

    AIC_p = n ln(SSE_p) - n ln(n) + 2p        (2p is the penalty)
    SBC_p = n ln(SSE_p) - n ln(n) + (ln n) p  ((ln n) p is the penalty)

Search for models that have small values of AIC_p or SBC_p. The SBC_p criterion tends to favor more parsimonious models, since its penalty is the larger of the two whenever n ≥ 8.
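Both criteria are simple functions of SSE_p, n, and p. A minimal sketch (the numbers below are hypothetical, not from Table 9.2):

```python
import math

def aic(sse, n, p):
    # AIC_p = n ln(SSE_p) - n ln(n) + 2p
    return n * math.log(sse) - n * math.log(n) + 2 * p

def sbc(sse, n, p):
    # SBC_p (= BIC_p) = n ln(SSE_p) - n ln(n) + (ln n) p
    return n * math.log(sse) - n * math.log(n) + math.log(n) * p

# hypothetical subset fit: n = 54 cases, p = 4 parameters, SSE_p = 3.1
print(round(aic(3.1, 54, 4), 3))
print(round(sbc(3.1, 54, 4), 3))
```

Since ln(54) > 2, each extra parameter costs more under SBC_p than under AIC_p, which is why SBC_p favors smaller models.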
PRESS_p Criterion

PRESS_p, the prediction sum of squares:

    PRESS_p = Σ_{i=1}^n (Y_i - Ŷ_{i(i)})^2

It measures how well the fitted values from a subset model can predict the observed responses Y_i. Here Ŷ_{i(i)} is the fitted value for case i from a model in which case i was left out when the regression function was fitted (leave-one-out). Models with small prediction errors Y_i - Ŷ_{i(i)}, and hence small PRESS_p, are preferred. PRESS_p can be calculated without requiring n separate regression runs, each time deleting one of the n cases (Chap. 10).
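The shortcut alluded to above (developed in Chap. 10) is the leverage identity e_{i(i)} = e_i / (1 - h_ii), where h_ii is the i-th diagonal of the hat matrix. An illustrative Python sketch on simulated data, checking the shortcut against brute-force leave-one-out refits:

```python
import numpy as np

def press(X, y):
    """PRESS_p via the hat-matrix shortcut e_i / (1 - h_ii): no refits."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    H = Xd @ np.linalg.inv(Xd.T @ Xd) @ Xd.T    # hat matrix
    e = y - H @ y                               # ordinary residuals
    h = np.diag(H)                              # leverages h_ii
    return float(np.sum((e / (1 - h)) ** 2))

def press_brute(X, y):
    """n separate leave-one-out refits, for checking the shortcut."""
    n = len(y)
    total = 0.0
    for i in range(n):
        keep = [j for j in range(n) if j != i]
        Xd = np.column_stack([np.ones(n - 1), X[keep]])
        beta, *_ = np.linalg.lstsq(Xd, y[keep], rcond=None)
        yhat_i = np.concatenate([[1.0], X[i]]) @ beta
        total += (y[i] - yhat_i) ** 2
    return total

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 2))
y = 1 + X[:, 0] + rng.normal(size=25)
print(abs(press(X, y) - press_brute(X, y)) < 1e-6)  # the two agree
```

This is the same computation the R line with `lm.influence(...)$hat` performs in the Table 9.2 code later in the section.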
Example for the Different Criteria

Surgical unit example. Key R functions and arguments:
- leaps(): performs an exhaustive search for the best subsets of the variables in x
- method = c("Cp", "adjr2", "r2"): calculate C_p, adjusted R^2_{a,p}, or R^2_p
- int = TRUE: add an intercept to the model
- nbest: number of subsets of each size to report
- a for-loop procedure (for Table 9.2)
Example for the Different Criteria (cont.)

library(leaps)
ex <- read.table("ch09ta01.txt", header = FALSE)
colnames(ex) <- c("x1","x2","x3","x4","x5","x6","x7","x8","y","ypr")
attach(ex)
ex.r2 <- leaps(x = cbind(x1, x2, x3, x4, x5, x6, x7, x8), y = ypr,
               method = "r2", nbest = 6)
p <- seq(min(ex.r2$size), max(ex.r2$size))
ind <- as.data.frame(ex.r2[c(3:4)])
ind <- ind[with(ind, order(size, r2)), ]
plot(ind[, c(1:2)], ylab = expression(R^2), xlab = "p", col = "red", pch = 16)
Rp2 <- by(data = ex.r2[4], INDICES = factor(ex.r2$size), FUN = max)
lines(Rp2 ~ p, col = "blue")
R codes for Table 9.2

#### Using a for loop
ex <- read.table("ch09ta01.txt", header = FALSE)
colnames(ex) <- c("x1","x2","x3","x4","x5","x6","x7","x8","y","ypr")
attach(ex)
nn <- length(ex[, 1])
vars <- c("ypr", "x1", "x2", "x3", "x4")
N <- list(0, 1, 2, 3, 4)   # subset sizes
COMB <- sapply(N, function(m) combn(x = vars[2:5], m))
COMB2 <- list()
k <- 0
exlmf <- lm(ypr ~ x1 + x2 + x3 + x4)
SSEF <- deviance(aov(exlmf))
MSEF <- SSEF / (nn - 5)
res.table <- NULL
res.table0 <- NULL
for (i in seq(COMB)) {
  tmp <- COMB[[i]]
  for (j in seq(ncol(tmp))) {
R codes for Table 9.2 (cont.)

    k <- k + 1
    if (length(tmp) == 0) COMB2[[k]] <- formula("ypr ~ 1")
    if (length(tmp) > 0)
      COMB2[[k]] <- formula(paste("ypr", "~", paste(tmp[, j], collapse = " + ")))
    exlm0 <- lm(COMB2[[k]], data = ex)
    exlm <- summary(exlm0)
    np <- length(tmp[, 1]) + 1
    R2p <- exlm$r.squared
    R2ap <- exlm$adj.r.squared
    SSEp <- deviance(aov(exlm0))
    Cp <- SSEp / MSEF - (nn - 2 * np)
    AICp <- nn * log(SSEp) - nn * log(nn) + 2 * np
    BICp <- nn * log(SSEp) - nn * log(nn) + log(nn) * np
    PRESSp <- sum((resid(exlm0) / (1 - lm.influence(exlm0)$hat))^2)
    res.table <- rbind(res.table,
      c(noquote(paste(tmp[, j], collapse = " + ")),
        format(round(c(np, R2p, R2ap, Cp, AICp, BICp, PRESSp), 3), nsmall = 3)))
    res.table0 <- rbind(res.table0, c(np, R2p, R2ap, Cp, AICp, BICp, PRESSp))
  }
}
library(xtable)
res.table
R codes for Table 9.2 (cont.)

Table 9.2 layout (the numeric entries were lost in transcription):

No.  Variables            p   R^2_p  R^2_{a,p}  C_p  AIC_p  SBC_p  PRESS_p
 1   None                 1
 2   X1                   2
 3   X2                   2
 4   X3                   2
 5   X4                   2
 6   X1 + X2              3
 7   X1 + X3              3
 8   X1 + X4              3
 9   X2 + X3              3
10   X2 + X4              3
11   X3 + X4              3
12   X1 + X2 + X3         4
13   X1 + X2 + X4         4
14   X1 + X3 + X4         4
15   X2 + X3 + X4         4
16   X1 + X2 + X3 + X4    5
9.4 Automatic Search Procedures for Model Selection

The number of possible models, 2^(P-1), grows rapidly with the number of predictors, so evaluating all of the possible alternatives can be a daunting endeavor. Automatic computer-search procedures:
- "best" subsets algorithms
- stepwise regression
"Best" Subset Algorithms

Time-saving algorithms have been developed that identify the best subsets according to a specified criterion (e.g., C_p) without requiring the fitting of all possible subset regression models; only a small fraction of all possible models needs to be calculated. Several regression models can be identified as "good" for final consideration, depending on which criterion is used.
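For a small pool the exhaustive version of this search is easy to write down. An illustrative Python sketch on simulated data (not the leaps algorithm itself, which prunes the search tree; this brute-force loop mirrors what leaps computes): score every subset by Mallows' C_p and keep the best:

```python
import numpy as np
from itertools import combinations

def sse(cols, X, y):
    """SSE of the OLS fit with intercept on the given columns."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return float(np.sum((y - Xd @ beta) ** 2))

rng = np.random.default_rng(3)
n, k = 50, 4
X = rng.normal(size=(n, k))
y = 2 + 1.5 * X[:, 0] - X[:, 2] + rng.normal(size=n)  # columns 0 and 2 matter
mse_full = sse(range(k), X, y) / (n - k - 1)          # full-model MSE

# score all 2^k subsets by C_p, smallest first
scores = []
for m in range(k + 1):
    for cols in combinations(range(k), m):
        p = len(cols) + 1
        cp = sse(cols, X, y) / mse_full - (n - 2 * p)
        scores.append((cp, cols))
scores.sort()
print(scores[0][1])   # subset with the smallest C_p
```

With k = 4 this enumerates 16 models; the exponential growth in 2^(P-1) is exactly why pruning algorithms such as leaps matter for larger pools.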
"Best" Subset Algorithms (cont.)

The surgical unit example:
- R^2_{a,p} = 0.823: a seven- or eight-parameter model
- min(C_7), min(AIC_7), min(SBC_5), min(PRESS_5) (numeric values not recovered in transcription)
- The aim is not to identify a single best model, but to identify a small set of promising models for further study.
Selection Methods

The forward stepwise regression procedure is probably the most widely used of the automatic search methods. It develops a sequence of regression models, at each step adding or deleting an X variable (based on t or F statistics). The stepwise search procedure ends with the identification of a single regression model as "best". Different variants:
- Forward Selection; Forward Stepwise Selection
- Backward Elimination; Backward Stepwise Elimination
Forward Stepwise Selection

The forward stepwise regression search algorithm, based on t statistics and their associated P-values:
1. Fit a simple linear regression model for each of the P - 1 X variables considered for inclusion. For each, compute the t statistic for testing whether or not the slope is zero:

    t_k = b_k / s{b_k}

2. Pick the largest of the P - 1 t_k values and include the corresponding X variable in the regression model if t_k exceeds some threshold (e.g., its P-value is below the limit α_in = 0.10). If no t_k qualifies, stop the procedure.
Forward Stepwise Selection (cont.)

3. Add the second variable using partial t statistics (e.g., comparing the P-value with the limit α_in = 0.10).
4. Test whether any variable already in the model should be dropped, using partial t statistics (e.g., comparing the P-value with the limit α_out = 0.15).
5. Go to step 3 and continue the procedure until it stops.

Note that an X variable brought into the model at an early stage may be dropped subsequently if it is no longer helpful. Models chosen with small α_in are often underspecified; requiring α_in < α_out avoids cycling.
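The steps above can be sketched compactly. This is an illustrative Python implementation on simulated data, with one simplification: it uses fixed F-to-enter/F-to-remove thresholds (the classical form of stepwise regression, F = t^2) instead of P-value limits, to avoid needing an F-distribution routine. All names and thresholds here are this sketch's, not the book's:

```python
import numpy as np

def sse(cols, X, y):
    """SSE of the OLS fit with intercept on the given columns."""
    n = len(y)
    Xd = np.column_stack([np.ones(n)] + [X[:, j] for j in sorted(cols)])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return float(np.sum((y - Xd @ beta) ** 2))

def partial_F(cols, k, X, y):
    """Partial F (= t^2) for X_k given the other variables in cols | {k}."""
    full = set(cols) | {k}
    n, p = len(y), len(full) + 1
    sse_full = sse(full, X, y)
    return (sse(full - {k}, X, y) - sse_full) / (sse_full / (n - p))

def forward_stepwise(X, y, F_in=4.0, F_out=2.9):
    # F_in > F_out plays the role of alpha_in < alpha_out: avoids cycling
    model, pool = set(), set(range(X.shape[1]))
    while True:
        # add step: candidate with the largest partial F, if it clears F_in
        cand = max(pool - model,
                   key=lambda k: partial_F(model, k, X, y), default=None)
        if cand is None or partial_F(model, cand, X, y) < F_in:
            return sorted(model)
        model.add(cand)
        # drop step: remove the weakest variable if it falls below F_out
        weakest = min(model, key=lambda k: partial_F(model, k, X, y))
        if partial_F(model, weakest, X, y) < F_out:
            model.discard(weakest)

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 5))
y = 1 + 2 * X[:, 1] + 1.5 * X[:, 3] + rng.normal(size=60)
print(forward_stepwise(X, y))   # should retain the informative columns 1 and 3
```

Note how the drop step can remove a variable admitted earlier, which is the feature that distinguishes forward stepwise selection from plain forward selection.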
Forward Stepwise Selection (cont.)

Related procedures: forward selection (no deletion step); Backward Elimination, which starts with all P - 1 X variables and deletes one at a time; and Backward Stepwise Elimination. R functions:
- step(): select a formula-based model by AIC
- fastbw(): fast backward variable selection (library(rms))
- add1() or drop1(): add or drop all possible single terms to/from a model
Model Validation

Model validation is the checking of a candidate model against independent data. Three basic ways of validating a regression model:
1. Collection of new data to check the model and its predictive ability
2. Comparison of results with theoretical expectations, earlier empirical results, or simulation results
3. Use of a holdout sample to check the model and its predictive ability
Model Validation (cont.)

Collection of New Data to Check the Model

A means of measuring the actual predictive capability of the selected regression model is to use it to predict each case in the new data set and then calculate the mean of the squared prediction errors, denoted MSPR (mean squared prediction error):

    MSPR = Σ_{i=1}^n (Y_i - Ŷ_i)^2 / n

where the sum runs over the n cases in the new data set. A practical difficulty: replicating a study is often hard.
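An illustrative Python sketch on simulated data (not the surgical unit study): fit on the model-building data, then compute MSPR on freshly generated cases and compare it with the in-sample MSE, which is typically optimistic:

```python
import numpy as np

rng = np.random.default_rng(5)

def make_data(n):
    """Hypothetical data-generating process standing in for a new study."""
    X = rng.normal(size=(n, 2))
    y = 4 + 2 * X[:, 0] - X[:, 1] + rng.normal(size=n)
    return X, y

X_build, y_build = make_data(54)   # model-building data
X_new, y_new = make_data(30)       # newly collected validation data

Xd = np.column_stack([np.ones(len(y_build)), X_build])
beta, *_ = np.linalg.lstsq(Xd, y_build, rcond=None)

# MSPR: mean squared prediction error on the new cases only
pred = np.column_stack([np.ones(len(y_new)), X_new]) @ beta
mspr = float(np.mean((y_new - pred) ** 2))

# in-sample MSE for comparison (3 parameters in the fit)
mse = float(np.sum((y_build - Xd @ beta) ** 2) / (54 - 3))
print(mspr, mse)
```

If MSPR is fairly close to the in-sample MSE, the model's apparent fit was not badly overstated; a much larger MSPR signals overfitting.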
Model Validation (cont.)

Comparison with Theory, Empirical Evidence, or Simulation Results

Unfortunately, there is often little theory that can be used to validate regression models.
Model Validation (cont.)

Data Splitting

Often called cross-validation: when the data set is large enough, split the data into two sets:
- training sample: the model-building set
- prediction set: used to evaluate the reasonableness and predictive ability of the selected model

If the entire data set is not large enough for an equal split, the validation set will need to be smaller than the model-building set. Splits of the data can be made at random. Drawback: the variances of the estimated regression coefficients developed from the model-building data set will usually be larger than those that would have been obtained from a fit to the entire data set.
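A minimal Python sketch of data splitting on simulated data (the 108/54 sizes echo the surgical unit study, but the data here are artificial): split at random, validate on the holdout half, then refit on the full data for the final model, as the next slide recommends:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(108, 3))
y = 5 + X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=108)

# random split into a model-building half and a validation (holdout) half
idx = rng.permutation(108)
train, valid = idx[:54], idx[54:]

def fit(Xm, ym):
    """OLS with intercept; returns the coefficient vector."""
    Xd = np.column_stack([np.ones(len(ym)), Xm])
    beta, *_ = np.linalg.lstsq(Xd, ym, rcond=None)
    return beta

beta = fit(X[train], y[train])
pred = np.column_stack([np.ones(len(valid)), X[valid]]) @ beta
mspr = float(np.mean((y[valid] - pred) ** 2))   # holdout prediction error

# once validated, refit on the entire data set for the final model
beta_final = fit(X, y)
print(round(mspr, 3))
```

The final refit addresses the drawback noted above: coefficients estimated from only the model-building half have larger variances than those from the full data.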
Model Validation (cont.)

In any case, once the model has been validated, it is customary practice to use the entire data set for estimating the final regression model.

Comments:
- double cross-validation procedure
- K-fold cross-validation
More informationBIOL 458 BIOMETRY Lab 10 - Multiple Regression
BIOL 458 BIOMETRY Lab 0 - Multiple Regression Many problems in biology science involve the analysis of multivariate data sets. For data sets in which there is a single continuous dependent variable, but
More informationLab 07: Multiple Linear Regression: Variable Selection
Lab 07: Multiple Linear Regression: Variable Selection OBJECTIVES 1.Use PROC REG to fit multiple regression models. 2.Learn how to find the best reduced model. 3.Variable diagnostics and influential statistics
More informationNonparametric Approaches to Regression
Nonparametric Approaches to Regression In traditional nonparametric regression, we assume very little about the functional form of the mean response function. In particular, we assume the model where m(xi)
More informationUsing Machine Learning to Optimize Storage Systems
Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation
More informationExploring Econometric Model Selection Using Sensitivity Analysis
Exploring Econometric Model Selection Using Sensitivity Analysis William Becker Paolo Paruolo Andrea Saltelli Nice, 2 nd July 2013 Outline What is the problem we are addressing? Past approaches Hoover
More informationSAS Workshop. Introduction to SAS Programming. Iowa State University DAY 3 SESSION I
SAS Workshop Introduction to SAS Programming DAY 3 SESSION I Iowa State University May 10, 2016 Sample Data: Prostate Data Set Example C8 further illustrates the use of all-subset selection options in
More informationInstruction on JMP IN of Chapter 19
Instruction on JMP IN of Chapter 19 Example 19.2 (1). Download the dataset xm19-02.jmp from the website for this course and open it. (2). Go to the Analyze menu and select Fit Model. Click on "REVENUE"
More informationMultiple Linear Regression
Multiple Linear Regression Rebecca C. Steorts, Duke University STA 325, Chapter 3 ISL 1 / 49 Agenda How to extend beyond a SLR Multiple Linear Regression (MLR) Relationship Between the Response and Predictors
More informationSTAT 2607 REVIEW PROBLEMS Word problems must be answered in words of the problem.
STAT 2607 REVIEW PROBLEMS 1 REMINDER: On the final exam 1. Word problems must be answered in words of the problem. 2. "Test" means that you must carry out a formal hypothesis testing procedure with H0,
More informationCPSC 340: Machine Learning and Data Mining
CPSC 340: Machine Learning and Data Mining Feature Selection Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. Admin Assignment 3: Due Friday Midterm: Feb 14 in class
More informationModel Assessment and Selection. Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer
Model Assessment and Selection Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer 1 Model Training data Testing data Model Testing error rate Training error
More informationThe problem we have now is called variable selection or perhaps model selection. There are several objectives.
STAT-UB.0103 NOTES for Wednesday 01.APR.04 One of the clues on the library data comes through the VIF values. These VIFs tell you to what extent a predictor is linearly dependent on other predictors. We
More informationThe Truth behind PGA Tour Player Scores
The Truth behind PGA Tour Player Scores Sukhyun Sean Park, Dong Kyun Kim, Ilsung Lee May 7, 2016 Abstract The main aim of this project is to analyze the variation in a dataset that is obtained from the
More informationDATA MINING AND MACHINE LEARNING. Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane
DATA MINING AND MACHINE LEARNING Lecture 6: Data preprocessing and model selection Lecturer: Simone Scardapane Academic Year 2016/2017 Table of contents Data preprocessing Feature normalization Missing
More informationLeveling Up as a Data Scientist. ds/2014/10/level-up-ds.jpg
Model Optimization Leveling Up as a Data Scientist http://shorelinechurch.org/wp-content/uploa ds/2014/10/level-up-ds.jpg Bias and Variance Error = (expected loss of accuracy) 2 + flexibility of model
More informationTHIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010
THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL STOR 455 Midterm September 8, INSTRUCTIONS: BOTH THE EXAM AND THE BUBBLE SHEET WILL BE COLLECTED. YOU MUST PRINT YOUR NAME AND SIGN THE HONOR PLEDGE
More informationFurther Maths Notes. Common Mistakes. Read the bold words in the exam! Always check data entry. Write equations in terms of variables
Further Maths Notes Common Mistakes Read the bold words in the exam! Always check data entry Remember to interpret data with the multipliers specified (e.g. in thousands) Write equations in terms of variables
More informationOn Choosing the Right Coordinate Transformation Method
On Choosing the Right Coordinate Transformation Method Yaron A. Felus 1 and Moshe Felus 1 The Survey of Israel, Tel-Aviv Surveying Engineering Department, MI Halperin Felus Surveying and Photogrammetry
More informationPart I. Hierarchical clustering. Hierarchical Clustering. Hierarchical clustering. Produces a set of nested clusters organized as a
Week 9 Based in part on slides from textbook, slides of Susan Holmes Part I December 2, 2012 Hierarchical Clustering 1 / 1 Produces a set of nested clusters organized as a Hierarchical hierarchical clustering
More informationBootstrapping Methods
Bootstrapping Methods example of a Monte Carlo method these are one Monte Carlo statistical method some Bayesian statistical methods are Monte Carlo we can also simulate models using Monte Carlo methods
More informationUsing Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers
Using Multivariate Adaptive Regression Splines (MARS ) to enhance Generalised Linear Models. Inna Kolyshkina PriceWaterhouseCoopers Why enhance GLM? Shortcomings of the linear modelling approach. GLM being
More informationNetwork Management System Dimensioning with Performance Data. Kaisa Tuisku
Network Management System Dimensioning with Performance Data Kaisa Tuisku University of Tampere School of Information Sciences Computer Science M. Sc. thesis Supervisor: Jorma Laurikkala June 2016 i University
More informationData-Splitting Models for O3 Data
Data-Splitting Models for O3 Data Q. Yu, S. N. MacEachern and M. Peruggia Abstract Daily measurements of ozone concentration and eight covariates were recorded in 1976 in the Los Angeles basin (Breiman
More informationIntroduction to hypothesis testing
Introduction to hypothesis testing Mark Johnson Macquarie University Sydney, Australia February 27, 2017 1 / 38 Outline Introduction Hypothesis tests and confidence intervals Classical hypothesis tests
More informationLasso Regression: Regularization for feature selection
Lasso Regression: Regularization for feature selection CSE 416: Machine Learning Emily Fox University of Washington April 12, 2018 Symptom of overfitting 2 Often, overfitting associated with very large
More informationHyperparameters and Validation Sets. Sargur N. Srihari
Hyperparameters and Validation Sets Sargur N. srihari@cedar.buffalo.edu 1 Topics in Machine Learning Basics 1. Learning Algorithms 2. Capacity, Overfitting and Underfitting 3. Hyperparameters and Validation
More informationTopics in Machine Learning
Topics in Machine Learning Gilad Lerman School of Mathematics University of Minnesota Text/slides stolen from G. James, D. Witten, T. Hastie, R. Tibshirani and A. Ng Machine Learning - Motivation Arthur
More informationNetwork Traffic Measurements and Analysis
DEIB - Politecnico di Milano Fall, 2017 Sources Hastie, Tibshirani, Friedman: The Elements of Statistical Learning James, Witten, Hastie, Tibshirani: An Introduction to Statistical Learning Andrew Ng:
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationSYS 6021 Linear Statistical Models
SYS 6021 Linear Statistical Models Project 2 Spam Filters Jinghe Zhang Summary The spambase data and time indexed counts of spams and hams are studied to develop accurate spam filters. Static models are
More informationChapter 7: Dual Modeling in the Presence of Constant Variance
Chapter 7: Dual Modeling in the Presence of Constant Variance 7.A Introduction An underlying premise of regression analysis is that a given response variable changes systematically and smoothly due to
More informationNonparametric Importance Sampling for Big Data
Nonparametric Importance Sampling for Big Data Abigael C. Nachtsheim Research Training Group Spring 2018 Advisor: Dr. Stufken SCHOOL OF MATHEMATICAL AND STATISTICAL SCIENCES Motivation Goal: build a model
More informationST512. Fall Quarter, Exam 1. Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false.
ST512 Fall Quarter, 2005 Exam 1 Name: Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false. 1. (42 points) A random sample of n = 30 NBA basketball
More informationWeek 4: Simple Linear Regression III
Week 4: Simple Linear Regression III Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Goodness of
More informationRobust Linear Regression (Passing- Bablok Median-Slope)
Chapter 314 Robust Linear Regression (Passing- Bablok Median-Slope) Introduction This procedure performs robust linear regression estimation using the Passing-Bablok (1988) median-slope algorithm. Their
More informationChapter 7: Numerical Prediction
Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Chapter 7: Numerical Prediction Lecture: Prof. Dr.
More informationIQR = number. summary: largest. = 2. Upper half: Q3 =
Step by step box plot Height in centimeters of players on the 003 Women s Worldd Cup soccer team. 157 1611 163 163 164 165 165 165 168 168 168 170 170 170 171 173 173 175 180 180 Determine the 5 number
More informationPackage GWRM. R topics documented: July 31, Type Package
Type Package Package GWRM July 31, 2017 Title Generalized Waring Regression Model for Count Data Version 2.1.0.3 Date 2017-07-18 Maintainer Antonio Jose Saez-Castillo Depends R (>= 3.0.0)
More informationExample 1 of panel data : Data for 6 airlines (groups) over 15 years (time periods) Example 1
Panel data set Consists of n entities or subjects (e.g., firms and states), each of which includes T observations measured at 1 through t time period. total number of observations : nt Panel data have
More information
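The model-building strategy above — in particular, step 2 (reducing the pool of explanatory variables) — can be illustrated with an all-possible-subsets search over the four preliminary predictors. This is a hedged sketch, not the textbook's code: the data below are synthetic stand-ins for the surgical unit variables X1–X4, and adjusted R² is only one of the selection criteria (others include Mallows' Cp, AIC, and PRESS).

```python
# Sketch: all-possible-subsets selection, ranking candidate models by
# adjusted R^2. Synthetic data; assumes the "true" model uses only
# columns 0 and 2 (stand-ins for X1 and X3).
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 54                       # preliminary-study size from the example
X = rng.normal(size=(n, 4))  # synthetic stand-ins for X1..X4
y = 2.0 + 1.5 * X[:, 0] + 0.8 * X[:, 2] + rng.normal(scale=0.5, size=n)

def adj_r2(cols):
    """Fit OLS on the chosen columns (plus intercept); return adjusted R^2."""
    Z = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    sse = resid @ resid
    sst = ((y - y.mean()) ** 2).sum()
    p = Z.shape[1]            # number of estimated parameters
    return 1 - (sse / (n - p)) / (sst / (n - 1))

# Enumerate all 2^4 - 1 nonempty subsets and keep the best one.
subsets = [c for r in range(1, 5) for c in itertools.combinations(range(4), r)]
best = max(subsets, key=adj_r2)
print(best)
```

With only four candidate predictors the full enumeration (15 subsets) is cheap; for larger pools, Chapter 9's stepwise procedures trade exhaustiveness for speed. Note that adjusted R² can still admit a weak extra variable, which is why the tentative model from step 3 is checked against fresh data in step 4 (model validation).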