Using the package glmBfp: a binary regression example

Daniel Sabanés Bové

3rd August 2017

This short vignette introduces the usage of the package glmBfp. For more information on the methodology, see Sabanés Bové and Held (2011); there you can also find the references for the other tools mentioned here. If you have any questions or critique concerning the package, write an email to me: daniel.sabanesbove@ifspm.uzh.ch. Many thanks in advance!

0.1 Pima Indians diabetes data

We will have a look at the Pima Indians diabetes data, which is available in the package MASS:

> library(MASS)
> pima <- rbind(Pima.tr, Pima.te)
> pima$hasDiabetes <- as.numeric(pima$type == "Yes")
> pima.nObs <- nrow(pima)

Setup

For n = 532 women of Pima Indian heritage, seven possible predictors for the presence of diabetes are recorded. We would like to investigate with a binary regression which of them are relevant, and what form the statistical association has: is it a linear effect, or rather a nonlinear effect? Here we will model possible nonlinear effects with fractional polynomials.

First, we need to decide on the prior distributions to use. We are going to use the generalised hyper-g priors for GLMs (Sabanés Bové and Held, 2011). They are automatic and are supposed to yield reasonable results. We only need to specify which hyper-prior to put on the factor g. One possible choice is the Zellner-Siow hyper-prior, which says g ~ IG(1/2, n/2):

> library(glmBfp)
> ## define the prior distributions for g which we are going to use:
> prior <- InvGammaGPrior(a=1/2,
+                         b=pima.nObs/2)
> ## (the warning can be ignored)

This corresponds to the F1 prior in Sabanés Bové and Held (2011). Another possible choice is the hyper-g/n prior. For this there is no special constructor function; instead you can directly specify the log prior density, as follows:

> ## You may also use the hyper-g/n prior:
> prior.f2 <- CustomGPrior(logDens=function(g)
+                          - log(pima.nObs) - 2 * log(1 + g / pima.nObs))
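Since the log density is specified directly here, a quick numerical check can confirm that it defines a proper prior, i.e. that exp(logDens) integrates to 1 over g > 0. This is just a sketch (the helper name f2.dens is made up here, not part of the package):

> ## sanity check (sketch): the hyper-g/n prior density should integrate to 1
> f2.dens <- function(g) exp(- log(pima.nObs) - 2 * log(1 + g / pima.nObs))
> integrate(f2.dens, lower=0, upper=Inf)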

Stochastic model search

Next, we will do a stochastic search on the (very large) model space to find good models. Here we have to decide on the model prior; in this example we use the sparse type, which was also used in the paper. We use a chainlength of 100, which is very small but enough for illustration purposes (usually one should use at least 10 000 as a rule of thumb), and save all models (in general, nModels is the number of models which are saved from all visited models). Finally, we decide that we do not want to use OpenMP acceleration (this would parallelise loops over all observations on all cores of your processor) and that we want to do the higher order correction for the Laplace approximations. In order to be able to reproduce the analysis, it is advisable to set a seed for the random number generator before starting the stochastic search.

> set.seed(102)
> time.pima <-
+     system.time(models.pima <-
+                     glmBayesMfp(type ~
+                                     bfp(npreg) +
+                                     bfp(glu) +
+                                     bfp(bp) +
+                                     bfp(skin) +
+                                     bfp(bmi) +
+                                     bfp(ped) +
+                                     bfp(age),
+                                 data=pima,
+                                 family=binomial("logit"),
+                                 priorSpecs=
+                                     list(gPrior=prior,
+                                          modelPrior="sparse"),
+                                 nModels=1e3L,
+                                 chainlength=1e2L,
+                                 method="sampling",
+                                 useOpenMP=FALSE,
+                                 higherOrderCorrection=TRUE))
Starting sampler...
0%
----------
Number of non-identifiable model proposals: 0
Number of total cached models: 9
Number of returned models: 9
> time.pima
   user  system elapsed 
  0.372   0.004   0.427 
> attr(models.pima, "numVisited")
[1] 9

We see that the search took well under a second, and 9 models were found. Now, if we want to have a table of the found models, with their posterior probability, the log marginal likelihood, the log prior probability, the powers for every covariate, and the number of times that the sampler encountered that model:

> table.pima <- as.data.frame(models.pima)
> table.pima
     posterior logMargLik   logPrior age bmi bp glu npreg ped skin
1 5.752526e-01  -265.8779 -11.849169             1            -0.5
2 4.036782e-01  -266.2321 -11.849169             2            -0.5
3 7.971175e-03  -270.1569 -11.849169             2               3
4 6.638447e-03  -268.2604 -13.928611          0  2            -0.5
5 6.134410e-03  -272.4983  -9.769728             2
6 3.250886e-04  -273.3564 -11.849169             3               2
7 2.089908e-20  -310.6396 -11.849169   2                      -0.5
8 2.131992e-23  -317.5274 -11.849169   2   3
9 2.726801e-31  -339.8609  -7.690286

Note that while frequency refers to the frequency of the models in the sampling chain, thus providing a Monte Carlo estimate of the posterior model probabilities, posterior refers to the renormalised posterior model probabilities. The latter has the advantage that ratios of posterior probabilities between any two models are exact, while the former is unbiased (but obviously has larger variance).

The estimated marginal inclusion probabilities for all covariates are also saved:

> round(attr(models.pima, "inclusionProbs"), 2)
  age   bmi    bp   glu npreg   ped  skin 
 0.00  0.00  0.01  1.00  0.00  0.00  0.99 
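To illustrate the renormalisation just described, the posterior column can be reproduced by hand from the table. This is a sketch (it assumes the column names logMargLik and logPrior as printed above):

> ## sketch: renormalise exp(logMargLik + logPrior) over the models found
> lp <- table.pima$logMargLik + table.pima$logPrior
> post <- exp(lp - max(lp)) / sum(exp(lp - max(lp)))  # numerically stable
> round(post / table.pima$posterior, 3)  # should be approximately 1 throughout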

Sampling model parameters

If we now want to look at the estimated covariate effects in the estimated MAP model, which has the configuration given in the last seven columns of table.pima, then we first need to generate parameter samples from that model:

> ## MCMC settings
> mcmcOptions <- McmcOptions(iterations=1e4L,
+                            burnin=1e3L,
+                            step=2L)
> ## get samples from the MAP model
> set.seed(634)
> mapSamples <- sampleGlm(models.pima[1L],
+                         mcmc=mcmcOptions,
+                         useOpenMP=FALSE)
Taking the linear approximation method
0%
-----------------------------------------------------------------------------------------
Finished MCMC simulation with acceptance ratio 0.868

With the function McmcOptions we have defined an S4 object of MCMC settings, comprising the number of iterations, the length of the burn-in, and the thinning step (here: save every second iteration). With these settings, (10000 - 1000)/2 = 4500 samples are saved, which matches the [1:4500] dimensions in the output below; the acceptance rate of the Metropolis-Hastings sampler was 0.87 here. Note that you can also get predictive samples for new data points via the newdata option of sampleGlm. The result mapSamples has the following structure:

> str(mapSamples)
List of 5
 $ tbf            : logi FALSE
 $ acceptanceRatio: num 0.868
 $ logMargLik     :List of 5
  ..$ estimate                      : Named num -266
  .. ..- attr(*, "names")= chr "numeratorTerms"
  ..$ standardError                 : num [1, 1] 0.00237
  ..$ numeratorTerms                : num [1:4500] 0.542 0.657 0.664 0.619 0.559 ...
  ..$ denominatorTerms              : num [1:4500] 1 1 0.989 1 1 ...
  ..$ highDensityPointLogUnPosterior: num -266
 $ coefficients   : num [1:3, 1:4500] -0.932 4.211 -5.46 -0.827 3.745 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:3] "(Intercept)" "glu^1" "skin^-0.5"
  .. ..$ : NULL
 $ samples        :Formal class 'GlmBayesMfpSamples' [package "glmBfp"] with 8 slots
  .. ..@ fitted       : num [1:532, 1:4500] -2.29 2.55 -2.11 1.66 -1.6 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:532] "1" "2" "3" "4" ...
  .. .. .. ..$ : NULL
  .. ..@ predictions  : logi[0 , 0 ]
  .. ..@ fixCoefs     :List of 1
  .. .. ..$ (Intercept): num [1, 1:4500] -0.932 -0.827 -0.932 -0.936 -1.073 ...
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr "(Intercept)"
  .. .. .. .. ..$ : NULL
  .. ..@ z            : num [1:4500] 5.32 5.44 7.19 5.08 5.57 ...
  .. ..@ bfpCurves    :List of 2
  .. .. ..$ glu : num [1:327, 1:4500] -3.05 -3.02 -3.01 -2.99 -2.96 ...
  .. .. .. ..- attr(*, "scaledGrid")= num [1:327, 1] 0.56 0.567 0.57 0.574 0.581 ...
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. ..$ : chr "glu"
  .. .. .. ..- attr(*, "whereObsVals")= int [1:532] 62 318 40 251 113 88 55 313 197 163 ...
  .. .. ..$ skin: num [1:251, 1:4500] -3.67 -3.47 -3.28 -3.25 -3.11 ...
  .. .. .. ..- attr(*, "scaledGrid")= num [1:251, 1] 0.7 0.746 0.791 0.8 0.837 ...
  .. .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. .. ..$ : NULL
  .. .. .. .. .. ..$ : chr "skin"
  .. .. .. ..- attr(*, "whereObsVals")= int [1:532] 67 83 108 115 57 63 76 28 25 95 ...
  .. ..@ ucCoefs      : list()
  .. ..@ shiftScaleMax: num [1:7, 1:4] 0 0 0 0 1 0 0 10 100 100 ...
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:7] "age" "bmi" "bp" "glu" ...
  .. .. .. ..$ : chr [1:4] "shift" "scale" "maxDegree" "cardPowerset"
  .. ..@ nSamples     : int 4500

It is a list with the acceptanceRatio of the Metropolis-Hastings proposals, an MCMC estimate of the log marginal likelihood including an associated standard error (logMargLik), the coefficients samples of the model, and an S4 object samples. This S4 object includes the fitted samples on the linear predictor scale (in our case the log odds scale), possibly predictions samples, samples of the fixed intercept (fixCoefs), samples of z = log(g), samples of the fractional polynomial curves (bfpCurves), coefficients of covariates with uncertain but fixed form (ucCoefs), the shifts and scales applied to the original covariates (shiftScaleMax), and the number of samples (nSamples). You can read more details on the results on the help page, by typing ?"GlmBayesMfpSamples-class" in R.

If we want to get posterior fitted values on the probability scale, we can use the following code:

> mapFit <- rowMeans(plogis(mapSamples$samples@fitted))
> head(mapFit)
        1         2         3         4         5         6 
0.1017081 0.8967206 0.1096366 0.7864968 0.1794264 0.1412198 
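As a quick, informal check of the in-sample fit, these fitted probabilities can be cross-tabulated against the observed outcomes. This is a sketch; the 0.5 cutoff is an arbitrary choice made here, not part of the vignette workflow:

> ## sketch: in-sample classification table at an arbitrary 0.5 cutoff
> table(observed=pima$hasDiabetes,
+       predicted=as.numeric(mapFit > 0.5))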

We can also analyse the MCMC output in greater detail by applying the functions in the coda package:

> library(coda)
> coefMcmc <- mcmc(data=t(mapSamples$coefficients),
+                  start=mcmcOptions@burnin + 1,
+                  thin=mcmcOptions@step)
> str(coefMcmc)
 mcmc [1:4500, 1:3] -0.932 -0.827 -0.932 -0.936 -1.073 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:3] "(Intercept)" "glu^1" "skin^-0.5"
 - attr(*, "mcpar")= num [1:3] 1001 9999 2
> ## standard summary table for the coefficients:
> summary(coefMcmc)

Iterations = 1001:9999
Thinning interval = 2 
Number of chains = 1 
Sample size per chain = 4500 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

               Mean     SD Naive SE Time-series SE
(Intercept) -0.9298 0.1144 0.001705       0.001812
glu^1        3.8503 0.3963 0.005907       0.006548
skin^-0.5   -4.0626 1.0093 0.015046       0.017079

2. Quantiles for each variable:

              2.5%    25%    50%     75%   97.5%
(Intercept) -1.158 -1.008 -0.929 -0.8514 -0.7113
glu^1        3.085  3.581  3.846  4.0998  4.6522
skin^-0.5   -6.053 -4.736 -4.057 -3.3781 -2.1071

> autocorr(coefMcmc)
, , (Intercept)

         (Intercept)        glu^1   skin^-0.5
Lag 0    1.000000000 -0.198545301  0.22364820
Lag 2    0.060585352 -0.021893505  0.02726208
Lag 10  -0.009556659  0.011005546  0.02048678
Lag 20  -0.008033578  0.001619819 -0.02348717
Lag 100  0.009020242 -0.025143099 -0.01664317

, , glu^1

         (Intercept)        glu^1    skin^-0.5
Lag 0   -0.198545301  1.000000000  0.033350565
Lag 2   -0.015649028  0.102578846 -0.011644669
Lag 10  -0.006802564 -0.007056528 -0.009870305
Lag 20   0.025457688 -0.016328556 -0.004131433
Lag 100  0.010054085  0.004816710  0.006559927

, , skin^-0.5

        (Intercept)         glu^1     skin^-0.5
Lag 0    0.22364820  0.0333505647  1.0000000000
Lag 2    0.05308624  0.0007788737  0.0863690518
Lag 10   0.01053958 -0.0013760574  0.0004120857
Lag 20  -0.02397061  0.0008594713 -0.0206584865
Lag 100 -0.02560623 -0.0044481064 -0.0312199771

> ## etc.
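Other standard coda functions apply in the same way; for example, effective sample sizes and highest posterior density intervals (a sketch using coda's effectiveSize and HPDinterval):

> ## sketch: further coda diagnostics for the coefficient samples
> effectiveSize(coefMcmc)
> HPDinterval(coefMcmc, prob=0.95)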

> plot(coefMcmc)

[Figure: trace plots and kernel density estimates for (Intercept), glu^1 and skin^-0.5.]

> ## samples of z:
> zMcmc <- mcmc(data=mapSamples$samples@z,
+               start=mcmcOptions@burnin + 1,
+               thin=mcmcOptions@step)
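Because z = log(g), the samples can also be transformed back to summarise g on its original scale (a small sketch):

> ## sketch: summarise g = exp(z) on the original scale
> summary(exp(as.vector(zMcmc)))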

> plot(zMcmc)
> ## etc.

[Figure: trace plot and kernel density estimate of the z samples.]

Curve estimates

Now we can use the samples to plot the estimated effects of the MAP model covariates with the plotCurveEstimate function. For example:

> plotCurveEstimate(termName="skin",
+                   samples=mapSamples$samples)

[Figure: average partial predictor g(skin) after the transform, plotted against skin.]
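The same function can be used for the other covariate in the MAP model (a sketch not shown in the original vignette):

> plotCurveEstimate(termName="glu",
+                   samples=mapSamples$samples)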

Model averaging

Model averaging works in principle like sampling from a single model, but multiple model configurations are supplied, together with their respective log posterior probabilities. For example, if we wanted to average the top three models found, we would do the following:

> set.seed(312)
> bmaSamples <-
+     sampleBma(models.pima[1:3],
+               mcmc=mcmcOptions,
+               useOpenMP=FALSE,
+               nMargLikSamples=1000)
Starting sampling...
Now at model 1...
Taking the linear approximation method
0%
-----------------------------------------------------------------------------------------
Finished MCMC simulation with acceptance ratio 0.875
Now at model 2...
Taking the linear approximation method
0%
-----------------------------------------------------------------------------------------
Finished MCMC simulation with acceptance ratio 0.864
Now at model 3...
Taking the linear approximation method
0%
-----------------------------------------------------------------------------------------
Finished MCMC simulation with acceptance ratio 0.886
> ## look at the list element names:
> names(bmaSamples)
[1] "modelData" "samples"  
> ## now we can see how close the MCMC estimates ("margLikEstimate")
> ## are to the ILA estimates ("logMargLik") of the log marginal likelihood:
> bmaSamples$modelData[, c("logMargLik", "margLikEstimate")]
  logMargLik margLikEstimate
1  -265.8779       -265.8792
2  -266.2321       -266.2333
3  -270.1569       -270.1514
> ## the "samples" list is again of class "GlmBayesMfpSamples":
> class(bmaSamples$samples)
[1] "GlmBayesMfpSamples"
attr(,"package")
[1] "glmBfp"

Internally, the models are sampled first, and then for each sampled model as many parameter samples are drawn as determined by that model's frequency in the model-average sample. The result is a list with two elements: modelData is similar to table.pima and additionally contains the BMA probability and frequency in the sample, as well as the MCMC acceptance ratios (which should be high). To the second element, samples, which is again of class GlmBayesMfpSamples, the functions presented above can again be applied (e.g. plotCurveEstimate).
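By analogy with the single-model fit above, model-averaged fitted probabilities can be obtained from the fitted slot of the BMA samples (a sketch; it assumes the fitted samples are again on the linear predictor scale, as for sampleGlm):

> ## sketch: model-averaged posterior fitted probabilities
> bmaFit <- rowMeans(plogis(bmaSamples$samples@fitted))
> head(bmaFit)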

Bibliography

D. Sabanés Bové and L. Held. Hyper-g priors for generalized linear models. Bayesian Analysis, 6(3):387-410, 2011. doi: 10.1214/11-BA615. URL http://projecteuclid.org/euclid.ba/1339616469.