Bayesian model selection and diagnostics


Slide 1: Bayesian model selection and diagnostics

A typical Bayesian analysis compares a handful of models.

Example 1: Consider the spline model for the motorcycle data; how many basis functions should we use?

Example 2: Consider the bone height data; do we need a different slope for each kid? Is a linear trend sufficient, or should we also allow for a quadratic trend?

We will explore several approaches for choosing between models:
- Cross validation
- Bayes factors
- Information criteria

When the number of models is huge (e.g., linear regression with many covariates), we'll discuss stochastic search model selection. Often, rather than selecting one model, you might want to use Bayesian model averaging. In addition to selecting one model from a finite set, we will discuss goodness-of-fit diagnostics for determining whether the model you select provides an adequate fit.

Slide 2: Cross-validation

This is one of the simplest and most interpretable ways to compare models. Say there are n observations, Y = (Y_1, ..., Y_n)^T. In k-fold cross-validation, we

1. Split the data into k subsets.
2. Fit the model k times, each time to k - 1 of the k subsets.
3. Each time, make predictions for the held-out test subset.
4. This gives a predicted value for each observation, Yhat = (Yhat_1, ..., Yhat_n)^T.
5. Summarize fit with a loss function, D(Yhat, Y).

This is repeated for each model (using the same partition into subsets), and the model with the smallest loss is preferred. Predictions should be made from the posterior predictive distributions to account for all sources of uncertainty. It is also a good idea to report the coverage of prediction intervals.

Which loss function should we use? (A small sketch of common loss summaries follows below.)
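As a concrete illustration (a minimal sketch, not from the original slides), the common loss summaries can be computed from a matrix of posterior predictive draws as follows. The names cv_loss, yp, and ytest are hypothetical, with yp holding one row per MCMC iteration and one column per held-out observation.

# Hypothetical inputs: yp = draws x n.test matrix of posterior
# predictive samples; ytest = observed held-out values.
cv_loss <- function(yp, ytest){
  pmean <- apply(yp, 2, mean)                  # posterior predictive mean
  pmed  <- apply(yp, 2, median)                # posterior predictive median
  lower <- apply(yp, 2, quantile, 0.05)        # 90% prediction interval
  upper <- apply(yp, 2, quantile, 0.95)
  c(MSE = mean((ytest - pmean)^2),             # squared-error loss
    MAD = mean(abs(ytest - pmed)),             # absolute-error loss
    COV = mean(ytest > lower & ytest < upper)) # interval coverage
}

# Example with simulated draws:
set.seed(1)
yp    <- matrix(rnorm(1000*20), 1000, 20)
ytest <- rnorm(20)
cv_loss(yp, ytest)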

Slide 3: CV for the motorcycle example

################ model #################
# y       ~ N(int + x%*%beta, var = 1/taue)
# beta[j] ~ N(0, var = var.b)
# int     ~ flat
# taue    ~ Gamma(a1, b1)
#########################################

BayesSemiPar <- function(y, x, xp, iters=10000, burn=1000, update=1000,
                         a1=0.01, b1=0.01, var.b=100){

  # Setup (elided as "..." on the slide; reconstructed here)
  n <- length(y); p <- ncol(x); np <- nrow(xp)
  tXX  <- t(x)%*%x
  int  <- mean(y); beta <- rep(0, p); taue <- 1
  keep.yp <- matrix(0, iters, np)

  # START THE MCMC
  for(i in 1:iters){

    # Update parameters from their full conditionals
    taue <- rgamma(1, n/2 + a1, sum((y - int - x%*%beta)^2)/2 + b1)
    int  <- rnorm(1, mean(y - x%*%beta), 1/sqrt(n*taue))
    VVV  <- solve(taue*tXX + diag(p)/var.b)
    MMM  <- taue*t(x)%*%(y - int)
    beta <- as.vector(VVV%*%MMM + t(chol(VVV))%*%rnorm(p))

    # Make and store predictions
    keep.yp[i,] <- int + xp%*%beta + rnorm(np, 0, 1/sqrt(taue))
  }

  # (Burn-in discarding and progress updates were omitted on the slide)
  list(pred = keep.yp)
}

Code is available at reich/st740/code/motocv.r.

Slide 4: CV for the motorcycle example

library(splines)
library(MASS)

y <- mcycle$accel
t <- mcycle$times
y <- (y - mean(y))/sd(y)
t <- t/max(t)

# Split the 133 observations into K=5 subsets
set.seed(0820)
nfolds <- 5
fold   <- sample(1:nfolds, 133, replace=TRUE)

# Fit the model with 10, 20, 30, and 40 basis functions
COV <- MSE <- MAD <- VAR <- NULL
for(model in 1:4){
  B   <- ns(t, 10*model)
  OUT <- matrix(0, 133, 5)
  for(f in 1:nfolds){
    print(paste("model", model, "fold", f))
    yo <- y[fold != f]
    Bo <- B[fold != f,]
    Bp <- B[fold == f,]
    yp <- BayesSemiPar(yo, Bo, Bp)$pred

    # Summarize the posterior predictive draws for the held-out fold
    OUT[fold==f, 1] <- apply(yp, 2, mean)
    OUT[fold==f, 2] <- apply(yp, 2, median)
    OUT[fold==f, 3] <- apply(yp, 2, var)
    OUT[fold==f, 4] <- apply(yp, 2, quantile, 0.05)
    OUT[fold==f, 5] <- apply(yp, 2, quantile, 0.95)
  }
  MSE <- c(MSE, mean((y - OUT[,1])^2))
  MAD <- c(MAD, mean(abs(y - OUT[,2])))
  VAR <- c(VAR, mean(OUT[,3]))
  COV <- c(COV, mean(y > OUT[,4] & y < OUT[,5]))
}

RESULTS <- cbind(MSE, MAD, VAR, COV)
rownames(RESULTS) <- paste(10*1:4, "basis functions")
round(RESULTS, 3)

(The slide then displays the resulting table of MSE, MAD, VAR, and COV for 10, 20, 30, and 40 basis functions; the numeric entries did not survive transcription.)

Slide 5: Bayes factors

Cross-validation is very useful for many problems, but it is
- unreliable for small datasets,
- cumbersome for really large datasets, and
- not a formal testing procedure.

Bayes factors (BF) are in some ways the gold standard for model comparison. Consider a finite collection of models, M_1, ..., M_K (for example, spline models with different numbers of basis functions). Bayesians represent uncertainty about the model by putting a prior on the model itself, M in {M_1, ..., M_K}, with prior probabilities Prob(M_k). Bayes' rule then gives the posterior probability of each model:

Prob(M_k | Y) = p(Y | M_k) Prob(M_k) / [ sum_{j=1}^K p(Y | M_j) Prob(M_j) ],

where p(Y | M_j) is the marginal likelihood of the data under model M_j. (A small numerical sketch follows below.)
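As an illustration (an addition, not from the slides), the posterior model probabilities can be computed stably from the log marginal likelihoods log p(Y | M_k) using the log-sum-exp trick; the function post_model_probs and the example values are hypothetical.

# Hypothetical inputs: logml = log marginal likelihoods log p(Y|M_k)
# for K models; prior = prior model probabilities Prob(M_k).
post_model_probs <- function(logml,
                             prior = rep(1/length(logml), length(logml))){
  lp <- logml + log(prior)
  lp <- lp - max(lp)         # log-sum-exp trick for numerical stability
  exp(lp)/sum(exp(lp))       # normalize to posterior probabilities
}

# Three models with a uniform prior:
post_model_probs(logml = c(-102.3, -100.1, -101.7))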

Slide 6: Bayes factors

The Bayes factor comparing model M_2 to model M_1 is the ratio of marginal likelihoods,

BF = p(Y | M_2) / p(Y | M_1).

Example: Say Y ~ N(mu, 1), with M_1: mu = 0 and M_2: mu ~ N(0, sigma^2). The Bayes factor assuming Prob(M_1) = Prob(M_2) (derived in the handout) is

BF = p(y | M_2) / p(y | M_1) = (1 + sigma^2)^(-1/2) exp[ (y^2/2) * sigma^2/(1 + sigma^2) ].

With everything else fixed, what happens (and why) as
1. sigma -> 0
2. sigma -> infinity
3. y -> 0
4. y -> infinity

(A short numerical check appears below.)
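To build intuition for these limits, here is a short numerical check (an addition, not from the slides) that evaluates the Bayes factor formula above on a grid of sigma and y values; the function name BF is just for illustration.

# Bayes factor for M2 vs M1 in the normal-mean example above
BF <- function(y, sigma){
  s2 <- sigma^2
  (1 + s2)^(-1/2) * exp((y^2/2) * s2/(1 + s2))
}

# sigma -> 0 and sigma -> infinity with y fixed:
round(BF(y = 2, sigma = c(0.01, 1, 10, 1000)), 4)

# y -> 0 and y large with sigma fixed:
round(BF(y = c(0, 1, 2, 4), sigma = 2), 4)

Note that BF -> 1 as sigma -> 0 (M_2 collapses to M_1) and BF -> 0 as sigma -> infinity for any fixed y, so an overly diffuse prior under M_2 automatically favors the simpler model (the Lindley/Bartlett phenomenon), while BF grows without bound as |y| -> infinity.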

Slide 7: Bayes factors

The BF is related to the likelihood ratio: whereas a likelihood ratio compares the models at fixed (e.g., maximized) parameter values, the BF compares likelihoods averaged over the priors, p(Y | M_k) = Integral of p(Y | theta_k, M_k) p(theta_k | M_k) dtheta_k.

The BF is not valid with improper priors: an improper prior is defined only up to an arbitrary multiplicative constant, so the marginal likelihoods, and hence the BF, depend on those arbitrary constants.

This can be fixed by splitting the data into Y = (Y_1, Y_2) and using the posterior from the first subset, p(theta | Y_1), as the prior in the analysis of Y_2 used to compute the BF (a partial Bayes factor; a minimal sketch follows below).
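As a minimal sketch of this data-splitting idea (not from the slides), consider the normal-mean example from slide 6 with one training observation y1 and one test observation y2; partialBF is a hypothetical helper that uses the posterior of mu given y1 as the prior when computing the marginal likelihood of y2 under each model.

# Partial Bayes factor for the running example: Y_i ~ N(mu, 1),
# M1: mu = 0, M2: mu ~ N(0, sigma^2). Train on y1, test on y2.
partialBF <- function(y1, y2, sigma){
  v1 <- 1/(1 + 1/sigma^2)         # posterior variance of mu | y1 under M2
  m1 <- v1*y1                     # posterior mean of mu | y1 under M2
  dnorm(y2, m1, sqrt(1 + v1)) /   # predictive of y2 given y1 under M2
    dnorm(y2, 0, 1)               # predictive of y2 under M1 (mu = 0)
}

partialBF(y1 = 1.5, y2 = 1.8, sigma = 2)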

Slide 8: Rule of thumb

How large does the BF have to be before we have sufficient evidence for M_2? We could set this up as a decision problem. This leads to the general rule of thumb; a commonly cited version of the scale (Kass and Raftery, 1995) is:

BF from 1 to 3: not worth more than a bare mention
BF from 3 to 20: positive evidence for M_2
BF from 20 to 150: strong evidence for M_2
BF above 150: very strong evidence for M_2

Slide 9: Computing the BF

In most cases, the BF is hard or impossible to compute. It requires the marginal distribution of Y, p(Y | M), which is the quantity we've been avoiding all semester. It can be computed for linear regression.

The Bayesian information criterion (BIC) is a selection criterion based on a large-sample approximation of the BF of the model compared to the null model with no predictors. The BIC is

BIC = D(y, thetahat) + log(n) * dim(theta),

where D(y, theta) = -2 log[p(y | theta)] is the deviance and thetahat is the MLE. The model with the smallest BIC is preferred. Since the prior is asymptotically irrelevant, this is not the most attractive Bayesian criterion. (A small sketch of computing BIC in R follows.)
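As a hedged illustration (not part of the slides), the BIC formula above can be checked for a Gaussian linear model in R, where the built-in BIC() applies the same definition; here dim(theta) counts the two regression coefficients plus the error variance.

# BIC by hand vs. R's built-in, for a Gaussian linear model
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2*x + rnorm(n)

fit <- lm(y ~ x)

dev  <- -2*as.numeric(logLik(fit))   # deviance D(y, thetahat)
k    <- attr(logLik(fit), "df")      # dim(theta): 2 coefficients + sigma^2
bic1 <- dev + log(n)*k

c(by_hand = bic1, builtin = BIC(fit))  # the two should agree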
