8.3 simulating from the fitted model Chris Parrish July 3, 2016


Contents

speed of light (Simon Newcomb, 1882)
  simulate data, fit the model, and check the coverage of the conf intervals
  model
  fit
  post
  create the replicated data
  create fake data
  figure
roaches
  data
  model
  fit
  post
  figure
  model
  fit
  post
  switch to glm
  figure

simulating from the fitted model

reference:
- ARM chapter 8, github

    library(arm)    # for sim
    library(rstan)
    rstan_options(auto_write = TRUE)
    options(mc.cores = parallel::detectCores())
    library(ggplot2)
    library(reshape2)  # for melt

speed of light (Simon Newcomb, 1882)

algorithm for model checking: compare the distribution of the data y to the distributions of simulated ỹ.

1. data: y
2. use the data to fit the model: y with parameters β and σ; find β and σ such that y = Xβ + ε
3. use the model to generate n hypothetical values of ỹ: ỹ = Xβ + ε

4. replicate step 3 n.sims times; the result is a matrix with dims (n.sims, n)
5. compare the distribution of the real data y with the distributions of the simulated ỹ

check the model: if the distributions of the simulated ỹ do not correspond to the distribution of the original data y, then the model is suspect

simulate data, fit the model, and check the coverage of the conf intervals

    source("lightspeed.data.R", echo = TRUE)
    ##
    ## > "y" <- c(28, 26, 33, 24, 34, -44, 27, 16, 40, -2,
    ## + 29, 22, 24, 21, 25, 30, 23, 29, 31, 19, 24, 20, 36, 32, 36,
    ## + 28, 25, 21, 28, 29, 37,... [TRUNCATED]
    ##
    ## > "N" <- 66
    str(y)
    ## num [1:66]

model

lightspeed.stan

    data {
      int<lower=0> N;
      vector[N] y;
    }
    parameters {
      vector[1] beta;
      real<lower=0> sigma;
    }
    model {
      y ~ normal(beta[1], sigma);
    }

fit

    ## Model fit (lightspeed.stan)
    ## lm (y ~ 1)
    datalist. <- c("N","y")
    lightspeed.sf <- stan(file='lightspeed.stan', data=datalist., iter=, chains=)
    plot(lightspeed.sf)
    ## ci_level: 0.8 (80% intervals)
    ## outer_level: 0.95 (95% intervals)
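The five-step checking algorithm in this section can be sketched in base R without Stan. This is a minimal illustration under stated assumptions, not the author's code: `y.demo` is a synthetic vector (not the Newcomb measurements), `n.sims` is set to 1000 purely for illustration, and the draws of beta and sigma use the standard flat-prior posterior for a normal model rather than NUTS.

```r
set.seed(83)

# Step 1: data (synthetic stand-in; an assumption, not the Newcomb data)
y.demo <- c(rnorm(64, mean = 27, sd = 5), -44, -2)
n <- length(y.demo)

# Step 2: "fit" y ~ normal(beta, sigma) by drawing from the posterior
# under a flat prior on (beta, log sigma):
#   sigma^2 | y        ~ (n - 1) * s^2 / chisq(n - 1)
#   beta | sigma^2, y  ~ normal(mean(y), sigma / sqrt(n))
n.sims <- 1000  # illustrative value
s2 <- var(y.demo)
sigma.draws <- sqrt((n - 1) * s2 / rchisq(n.sims, df = n - 1))
beta.draws <- rnorm(n.sims, mean = mean(y.demo), sd = sigma.draws / sqrt(n))

# Steps 3 and 4: one replicated dataset per posterior draw
y.rep.demo <- matrix(NA_real_, nrow = n.sims, ncol = n)
for (s in 1:n.sims) {
  y.rep.demo[s, ] <- rnorm(n, beta.draws[s], sigma.draws[s])
}

# Step 5: compare a feature of the data with the same feature of the replications
min.obs <- min(y.demo)
min.rep <- apply(y.rep.demo, 1, min)
print(mean(min.rep <= min.obs))  # share of replications with a minimum as low as observed
```

A small share here would indicate that the normal model rarely reproduces a minimum as extreme as the one in the data, which is the kind of mismatch step 5 is meant to expose.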

[Figure: plot(lightspeed.sf), posterior intervals for beta[1] and sigma]

    pairs(lightspeed.sf)

[Figure: pairs plot of beta[1], sigma and lp__]

    print(lightspeed.sf)
    ## Inference for Stan model: lightspeed.
    ## chains, each with iter=; warmup=5; thin=;
    ## post-warmup draws per chain=5, total post-warmup draws=.
    ##
    ##         mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
    ## beta[1]
    ## sigma
    ## lp__
    ##         Rhat
    ## beta[1]
    ## sigma
    ## lp__
    ##
    ## Samples were drawn using NUTS(diag_e) at Fri Jul 8 :9: 6.
    ## For each parameter, n_eff is a crude measure of effective sample size,
    ## and Rhat is the potential scale reduction factor on split chains (at
    ## convergence, Rhat=1).
    ## The estimated Bayesian Fraction of Missing Information is a measure of
    ## the efficiency of the sampler with values close to 1 being ideal.
    ## For each chain, these estimates are

post

    post <- extract(lightspeed.sf)
    str(post)
    ## List of 3
    ##  $ beta : num [:, ]
    ##   ..- attr(*, "dimnames")=List of 2
    ##   .. ..$ iterations: NULL
    ##   .. ..$ : NULL
    ##  $ sigma: num [:(d)]
    ##   ..- attr(*, "dimnames")=List of 1
    ##   .. ..$ iterations: NULL
    ##  $ lp__ : num [:(d)]
    ##   ..- attr(*, "dimnames")=List of 1
    ##   .. ..$ iterations: NULL

create the replicated data

    ## Create the replicated data
    n.sims <-

create fake data

    ## Create fake data
    n <- 15
    y.rep <- array (NA, c(n.sims, n))
    for (s in 1:n.sims){
      y.rep[s,] <- rnorm (n, post$beta[s], post$sigma[s])
    }
    str(y.rep)
    ## num [:, :5]

    ## Histogram of replicated data (Figure 8.)
    y.new <- melt(y.rep)
    y.new$Var2 <- factor(y.new$Var2,
      levels=c('1','2','3','4','5','6','7','8','9','10','11','12','13','14','15'),
      labels=c('Replication #1','Replication #2','Replication #3','Replication #4',
               'Replication #5','Replication #6','Replication #7','Replication #8',
               'Replication #9','Replication #10','Replication #11','Replication #12',
               'Replication #13','Replication #14','Replication #15'))
    str(y.new)
    ## 'data.frame': 5 obs. of variables:
    ##  $ Var1 : int
    ##  $ Var2 : Factor w/ 15 levels "Replication #1",..:...
    ##  $ value: num
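The long format that melt() produces from a matrix can be mimicked in base R, which makes it easier to see what Var1 and Var2 refer to. The sketch below is an illustration on a made-up 2 x 3 matrix `m` (a hypothetical stand-in for y.rep with 2 simulations of 3 observations), assuming the matrix has no dimnames so that the indices come out as plain integers.

```r
# A tiny 2 x 3 matrix standing in for y.rep (2 simulations, 3 observations each)
m <- matrix(c(11, 21, 12, 22, 13, 23), nrow = 2)

# Long format: one row per matrix cell, in column-major order
long <- data.frame(
  Var1  = as.vector(row(m)),   # row index: which simulation
  Var2  = as.vector(col(m)),   # column index: which observation
  value = as.vector(m)
)
print(long)

# All values belonging to simulation 2
print(long$value[long$Var1 == 2])
```

Under this layout Var1 indexes the simulation draw and Var2 indexes the position within a replicated dataset, so which of the two should drive facet_wrap() depends on what each panel is meant to show.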

    p <- ggplot(y.new, aes(value)) +
      geom_histogram(colour = "seashell", fill = "wheat", binwidth=5) +
      theme_gray() +
      facet_wrap( ~ Var2, ncol=5) +
      theme(axis.title.y = element_blank(), axis.title.x=element_blank())
    print(p)

[Figure: faceted histograms of the replicated data, panels Replication #1 to Replication #15]

    ## Write a function to make histograms with specified bin widths and ranges
    Hist.preset <- function (a, width, xtitle, ytitle, maintitle){
      # dev.new()
      a.hi <- max (a, na.rm=TRUE)
      a.lo <- min (a, na.rm=TRUE)
      if (is.null(width)) width <- min (sqrt(a.hi-a.lo), 1e-5)
      bin.hi <- width*ceiling(a.hi/width)
      bin.lo <- width*floor(a.lo/width)
      frame = data.frame(x=a)
      p <- ggplot(frame,aes(x=x)) +
        geom_histogram(colour = "seashell", fill = "wheat", binwidth=width) +
        theme_gray() +
        scale_x_continuous(xtitle) +
        scale_y_continuous(ytitle) +
        labs(title=maintitle)
      print(p)
    }

    ## Run the function
    for (s in 1:20){
      Hist.preset (y.rep[s,], width=5, "", "", paste("Replication #",s,sep=""))
    }

[Figure: histogram of Replication #1]

[Figures: histograms of Replication #2 to Replication #19]

[Figure: histogram of Replication #20]

    ## Numerical test
    Test <- function (y){
      min (y)
    }
    test.rep <- rep (NA, n.sims)
    for (s in 1:n.sims){
      test.rep[s] <- Test (y.rep[s,])
    }
    str(test.rep)
    ## num [:]

figure 8.5

    ## Histogram Figure 8.5
    # dev.new()
    frame = data.frame(x = test.rep)          # T(y.rep), one value per replication
    frame.obs <- data.frame(x = Test(y))      # observed T(y); kept in its own data frame
    p <- ggplot(frame, aes(x = x)) +
      geom_histogram(colour = "seashell", fill = "wheat") +
      geom_segment(aes(x = x, y =, xend = x, yend =, color = "saddlebrown"), data = frame.obs) +
      theme_gray() +
      theme(legend.position="none") +
      labs(title="Observed T(y) and distribution of T(y.rep)")
    print(p)
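The loop that fills test.rep can also be written with apply(), and the comparison drawn in Figure 8.5 can be summarised as a tail proportion. The sketch below uses made-up numbers: `y.obs` and `y.sim` are hypothetical stand-ins for y and y.rep, and the 200 replications are drawn from a plain normal fit rather than from posterior draws.

```r
set.seed(85)

# Hypothetical stand-ins for y and y.rep (not the document's data)
y.obs <- c(25, 31, 27, 22, 29, -40, 26, 30, 24, 28)
y.sim <- matrix(rnorm(200 * length(y.obs), mean = mean(y.obs), sd = sd(y.obs)),
                nrow = 200)

Test <- function(y) min(y)

# One test statistic per row, i.e. per replicated dataset
test.sim <- apply(y.sim, 1, Test)

# Proportion of replications whose minimum is at least as small as the observed one
p.low <- mean(test.sim <= Test(y.obs))
print(p.low)
```

Because the statistic here is a minimum, the relevant tail is the lower one; for a statistic such as a proportion of zeros, the upper tail may be the natural comparison instead.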

[Figure 8.5: histogram titled "Observed T(y) and distribution of T(y.rep)"; x axis: x, y axis: count]

roaches

data

    ##############################################################################
    ## Read the cleaned data
    # All data are at
    # if bad initial values, this model fails
    # NOTE: can't find same exact data set as ARM book uses..
    roachdata <- read.csv ("roachdata.csv")
    str(roachdata)
    ## 'data.frame': 6 obs. of 6 variables:
    ##  $ X        : int
    ##  $ y        : int
    ##  $ roach    : num
    ##  $ treatment: int...
    ##  $ senior   : int...
    ##  $ exposure : num
    attach(roachdata)
    ## The following object is masked _by_ .GlobalEnv:
    ##
    ##     y

model

roaches.stan

    data {
      int<lower=0> N;
      vector[N] exposure;
      vector[N] roach;
      vector[N] senior;
      vector[N] treatment;
      int y[N];
    }
    transformed data {
      vector[N] log_expo;
      log_expo = log(exposure);
    }
    parameters {
      vector[4] beta;
    }
    model {
      y ~ poisson_log(log_expo + beta[1] + beta[2] * roach
                      + beta[3] * treatment + beta[4] * senior);
    }

fit

    datalist. <- list(N=length(roachdata$y), y=roachdata$y, roach=roachdata$roach,
                      treatment=roachdata$treatment, exposure=roachdata$exposure,
                      senior=roachdata$senior)
    roaches.sf <- stan(file='roaches.stan', data=datalist., iter=5, chains=)
    print(roaches.sf)
    ## Inference for Stan model: roaches.
    ## chains, each with iter=5; warmup=5; thin=;
    ## post-warmup draws per chain=5, total post-warmup draws=.
    ##
    ##         mean se_mean sd 2.5% 25% 50% 75%
    ## beta[1]
    ## beta[2]
    ## beta[3]
    ## beta[4]
    ## lp__
    ##         97.5% n_eff Rhat
    ## beta[1] ..9
    ## beta[2] ..7
    ## beta[3] -..
    ## beta[4]
    ## lp__
    ##
    ## Samples were drawn using NUTS(diag_e) at Fri Jul 8 :5:6 6.

    ## For each parameter, n_eff is a crude measure of effective sample size,
    ## and Rhat is the potential scale reduction factor on split chains (at
    ## convergence, Rhat=1).
    ## The estimated Bayesian Fraction of Missing Information is a measure of
    ## the efficiency of the sampler with values close to 1 being ideal.
    ## For each chain, these estimates are
    ## .9

post

    post <- extract(roaches.sf)

    ## Comparing the data to a replicated dataset
    n <- length(roachdata$y)
    X <- cbind (rep(1,n), roach, treatment, senior)
    y.hat <- exposure * exp (X %*% colMeans(post$beta))
    y.rep <- rpois (n, y.hat)
    print (mean (roachdata$y==0))
    ## []
    print (mean (y.rep==0))
    ## []

    ## Comparing the data to replicated datasets
    n.sims <-
    y.rep <- array (NA, c(n.sims, n))
    for (s in 1:n.sims){
      y.hat <- exposure * exp (X %*% post$beta[s,])
      y.rep[s,] <- rpois (n, y.hat)
    }

    # test statistic: proportion of zeros
    Test <- function (y){
      mean (y==0)
    }
    test.rep <- rep (NA, n.sims)
    for (s in 1:n.sims){
      test.rep[s] <- Test (y.rep[s,])
    }

    # p-value
    print (mean (test.rep > Test(roachdata$y)))
    ## []

figure

    ## Histogram Figure
    # dev.new()
    frame = data.frame(x = test.rep)
    frame5 = data.frame(x = Test(roachdata$y))

    p <- ggplot(frame, aes(x=x)) +
      geom_histogram(colour = "seashell", fill = "wheat") +
      geom_segment(aes(x = x, y =, xend = x, yend = 5, color = "saddlebrown"), data = frame5) +
      theme_gray() +
      theme(legend.position="none") +
      labs(title="Observed T(y) and distribution of T(y.rep)")
    print(p)
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

[Figure: histogram titled "Observed T(y) and distribution of T(y.rep)"; x axis: x, y axis: count]

T(y) = .6, but all the values of test.rep are much smaller.

    summary(test.rep)
    ## Min. 1st Qu. Median Mean 3rd Qu. Max.
    ##

model

roaches_overdispersion.stan

    data {
      int<lower=0> N;
      vector[N] exposure;
      vector[N] roach;
      vector[N] senior;
      vector[N] treatment;
      int y[N];
    }
    transformed data {
      vector[N] log_expo;
      log_expo = log(exposure);
    }
    parameters {
      vector[4] beta;
      vector[N] lambda;
      real<lower=0> tau;
    }
    transformed parameters {
      real<lower=0> sigma;
      sigma = 1.0 / sqrt(tau);
    }
    model {
      tau ~ gamma(.,.);
      for (i in 1:N) {
        lambda[i] ~ normal(0, sigma);
        y[i] ~ poisson_log(lambda[i] + log_expo[i] + beta[1] + beta[2]*roach[i]
                           + beta[3]*senior[i] + beta[4]*treatment[i]);
      }
    }

fit

    ## Checking the overdispersed model
    # NOTE: can't find same exact data set as ARM book uses..
    roaches_overdispersion.sf <- stan(file='roaches_overdispersion.stan',
                                      data=datalist., iter=, chains=)
    # print(roaches_overdispersion.sf)

post

    post <- extract(roaches_overdispersion.sf)

switch to glm

    glm. <- glm(y ~ roach + treatment + senior, data = roachdata,
                family=quasipoisson, offset=log(exposure))
    sim. <- sim(glm., n.sims)

    # replicated datasets
    y.rep <- array (NA, c(n.sims, n))
    overdisp <- summary(glm.)$dispersion

    for (s in 1:n.sims){
      y.hat <- exposure * exp (X %*% sim.@coef[s,])
      a <- y.hat/(overdisp-1)              # using R's parametrization for the
      y.rep[s,] <- rnegbin (n, y.hat, a)   # negative binomial distribution
    }
    test.rep <- rep (NA, n.sims)
    for (s in 1:n.sims){
      test.rep[s] <- Test (y.rep[s,])
    }

compare each value of test.rep with the number Test(roachdata$y)

    # p-value
    summary(test.rep)
    ## Min. 1st Qu. Median Mean 3rd Qu. Max.
    ##
    print (mean (test.rep > Test(roachdata$y)))
    ## [].68
    Test(roachdata$y)
    ## []

figure

    ## Histogram Figure
    # dev.new()
    frame = data.frame(x = test.rep)
    frame5 = data.frame(x = Test(roachdata$y))
    p5 <- ggplot(frame, aes(x=x)) +
      geom_histogram(colour = "seashell", fill = "wheat") +
      geom_segment(aes(x = x, y =, xend = x, yend =, color = "saddlebrown"), data = frame5) +
      theme_gray() +
      theme(legend.position="none") +
      labs(title="Observed T(y) and distribution of T(y.rep)")
    print(p5)
    ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
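The size parameter a used in the negative binomial replications can be checked against the quasi-Poisson variance. The short derivation below assumes the parametrization of rnegbin(n, mu, theta) in MASS, where the variance is mu + mu^2/theta; here μ stands for y.hat, ω for overdisp, and the result only makes sense when ω > 1.

```latex
\begin{aligned}
\operatorname{Var}(y) &= \mu + \frac{\mu^{2}}{\theta},
  \qquad \theta = a = \frac{\mu}{\omega - 1},\\[4pt]
\operatorname{Var}(y) &= \mu + \mu^{2}\cdot\frac{\omega - 1}{\mu}
  = \mu + \mu(\omega - 1)
  = \omega\,\mu .
\end{aligned}
```

So each replicated count has mean y.hat and variance overdisp times y.hat, which is the mean-variance relationship the quasi-Poisson fit estimates.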

[Figure: histogram titled "Observed T(y) and distribution of T(y.rep)"; x axis: x, y axis: count]
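The contrast between the Poisson and the overdispersed replications above can be tried end to end on simulated counts. The sketch below is an illustration under stated assumptions, not the author's analysis: the data (`x`, `y.cnt`) are synthetic rather than the roach data, `n.sims` is set to 500 for illustration, and coefficient uncertainty is approximated by multivariate normal draws around the glm estimates instead of Stan or sim().

```r
library(MASS)  # mvrnorm, rnegbin
set.seed(57)

# Synthetic overdispersed counts (an assumption; not the roach data)
n <- 300
x <- rnorm(n)
mu.true <- exp(1 + 0.5 * x)
y.cnt <- rnegbin(n, mu = mu.true, theta = 0.5)

Test <- function(y) mean(y == 0)   # proportion of zeros
X <- cbind(1, x)
n.sims <- 500                      # illustrative value

# Poisson fit: draw coefficients from a normal approximation, then replicate
fit.p <- glm(y.cnt ~ x, family = poisson)
beta.p <- mvrnorm(n.sims, coef(fit.p), vcov(fit.p))
test.pois <- numeric(n.sims)
for (s in 1:n.sims) {
  y.hat <- as.vector(exp(X %*% beta.p[s, ]))
  test.pois[s] <- Test(rpois(n, y.hat))
}

# Quasi-Poisson fit: same mean model, negative binomial replications
fit.q <- glm(y.cnt ~ x, family = quasipoisson)
overdisp <- summary(fit.q)$dispersion
beta.q <- mvrnorm(n.sims, coef(fit.q), vcov(fit.q))
test.nb <- numeric(n.sims)
for (s in 1:n.sims) {
  y.hat <- as.vector(exp(X %*% beta.q[s, ]))
  test.nb[s] <- Test(rnegbin(n, y.hat, y.hat / (overdisp - 1)))
}

# Observed proportion of zeros next to the average over each set of replications
print(c(observed = Test(y.cnt), poisson = mean(test.pois), negbin = mean(test.nb)))
```

With data generated this way, the Poisson replications tend to contain too few zeros, while the negative binomial replications allow many more, which mirrors the pattern the two roach checks are designed to reveal.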
