Poisson Regression and Model Checking

Similar documents
Predictive Checking. Readings GH Chapter 6-8. February 8, 2017

Regression Analysis and Linear Regression Models

Generalized Additive Models

Stat 4510/7510 Homework 4

Package citools. October 20, 2018

NCSS Statistical Software

Beta-Regression with SPSS Michael Smithson School of Psychology, The Australian National University

Statistics Lab #7 ANOVA Part 2 & ANCOVA

From logistic to binomial & Poisson models

GAMs semi-parametric GLMs. Simon Wood Mathematical Sciences, University of Bath, U.K.

Section 2.3: Simple Linear Regression: Predictions and Inference

Analysis of variance - ANOVA

Multiple Linear Regression

CSSS 510: Lab 2. Introduction to Maximum Likelihood Estimation

Modelling Proportions and Count Data

Machine Learning Techniques for Detecting Hierarchical Interactions in GLM s for Insurance Premiums

Binary Regression in S-Plus

Unit 5 Logistic Regression Practice Problems

Modelling Proportions and Count Data

DATA ANALYSIS USING HIERARCHICAL GENERALIZED LINEAR MODELS WITH R

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010

Lecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010

This is called a linear basis expansion, and h m is the mth basis function For example if X is one-dimensional: f (X) = β 0 + β 1 X + β 2 X 2, or

Discriminant analysis in R QMMA

Practice in R. 1 Sivan s practice. 2 Hetroskadasticity. January 28, (pdf version)

DATA ANALYSIS USING HIERARCHICAL GENERALIZED LINEAR MODELS WITH R

Organizing data in R. Fitting Mixed-Effects Models Using the lme4 Package in R. R packages. Accessing documentation. The Dyestuff data set

Study Guide. Module 1. Key Terms

Model Selection and Inference

Bayes Estimators & Ridge Regression

Bayesian model selection and diagnostics

Week 4: Simple Linear Regression III

The linear mixed model: modeling hierarchical and longitudinal data

Logistic Regression. (Dichotomous predicted variable) Tim Frasier

Exercises R For Simulations Columbia University EPIC 2015 (no answers)

THE UNIVERSITY OF BRITISH COLUMBIA FORESTRY 430 and 533. Time: 50 minutes 40 Marks FRST Marks FRST 533 (extra questions)

run ld50 /* Plot the onserved proportions and the fitted curve */ DATA SETR1 SET SETR1 PROB=X1/(X1+X2) /* Use this to create graphs in Windows */ gopt

Section 3.4: Diagnostics and Transformations. Jared S. Murray The University of Texas at Austin McCombs School of Business

Exercise 2.23 Villanova MAT 8406 September 7, 2015

Zero-Inflated Poisson Regression

Lecture 25: Review I

Week 9: Modeling II. Marcelo Coca Perraillon. Health Services Research Methods I HSMP University of Colorado Anschutz Medical Campus

Regression on the trees data with R

Multinomial Logit Models with R

Nina Zumel and John Mount Win-Vector LLC

Discussion Notes 3 Stepwise Regression and Model Selection

Lecture 13: Model selection and regularization

STENO Introductory R-Workshop: Loading a Data Set Tommi Suvitaival, Steno Diabetes Center June 11, 2015

Frequently Asked Questions Updated 2006 (TRIM version 3.51) PREPARING DATA & RUNNING TRIM

1 The SAS System 23:01 Friday, November 9, 2012

GLM II. Basic Modeling Strategy CAS Ratemaking and Product Management Seminar by Paul Bailey. March 10, 2015

Package addhaz. September 26, 2018

Stat 8053, Fall 2013: Additive Models

Solution to Bonus Questions

Week 5: Multiple Linear Regression II

Applied Statistics and Econometrics Lecture 6

Canopy Light: Synthesizing multiple data sources

Regression Lab 1. The data set cholesterol.txt available on your thumb drive contains the following variables:

Multiple Regression White paper

Generalized least squares (GLS) estimates of the level-2 coefficients,

Linear Model Selection and Regularization. especially usefull in high dimensions p>>100.

Section 3.2: Multiple Linear Regression II. Jared S. Murray The University of Texas at Austin McCombs School of Business

Lecture 16: High-dimensional regression, non-linear regression

Minitab 17 commands Prepared by Jeffrey S. Simonoff

9.1 Random coefficients models Constructed data Consumer preference mapping of carrots... 10

Section 2.2: Covariance, Correlation, and Least Squares

Chapitre 2 : modèle linéaire généralisé

5.5 Regression Estimation

Box-Cox Transformation for Simple Linear Regression

CDAA No. 4 - Part Two - Multiple Regression - Initial Data Screening

22s:152 Applied Linear Regression

Model selection. Peter Hoff. 560 Hierarchical modeling. Statistics, University of Washington 1/41

36-402/608 HW #1 Solutions 1/21/2010

22s:152 Applied Linear Regression

Package mcemglm. November 29, 2015

Splitting the follow-up C&H 6

Variable selection is intended to select the best subset of predictors. But why bother?

Package glmmml. R topics documented: March 25, Encoding UTF-8 Version Date Title Generalized Linear Models with Clustering

Chapter 6: Linear Model Selection and Regularization

An introduction to SPSS

Chapter 10: Extensions to the GLM

Salary 9 mo : 9 month salary for faculty member for 2004

Comparing Fitted Models with the fit.models Package

Model selection. Peter Hoff STAT 423. Applied Regression and Analysis of Variance. University of Washington /53

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY

8.3 simulating from the fitted model Chris Parrish July 3, 2016

STATS PAD USER MANUAL

Inference for Generalized Linear Mixed Models

Fathom Dynamic Data TM Version 2 Specifications

A (very) brief introduction to R

Generalized Additive Models

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018

Stat 5303 (Oehlert): Response Surfaces 1

Splines and penalized regression

book 2014/5/6 15:21 page v #3 List of figures List of tables Preface to the second edition Preface to the first edition

IQR = number. summary: largest. = 2. Upper half: Q3 =

SYS 6021 Linear Statistical Models

Stat 5100 Handout #14.a SAS: Logistic Regression

Chapter 6: DESCRIPTIVE STATISTICS

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes.

Transcription:

Poisson Regression and Model Checking Readings GH Chapter 6-8 September 27, 2017

HIV & Risk Behaviour Study The variables couples and women_alone code the intervention: control - no counselling (both 0) the couple was counselled together (couple=1, women_alone=0) only the woman was counselled (couple=0, women_alone=1) bs_hiv indicates whether the member reporting sex acts was HIV-positive at baseline (the beginning of the study) bupacts - number of unprotected sex acts reported at baseline fupacts - number of unprotected sex acts reported at the end of the study sex - factor with levels woman and man. This is the member of the couple that reports sex acts to the researcher

Data, Design & balance ## sex couples women_alone ## woman:217 Min. :0.0000 Min. :0.0000 ## man :217 1st Qu.:0.0000 1st Qu.:0.0000 ## Median :0.0000 Median :0.0000 ## Mean :0.3733 Mean :0.3364 ## 3rd Qu.:1.0000 3rd Qu.:1.0000 ## Max. :1.0000 Max. :1.0000 ## bs_hiv bupacts fupacts ## negative:337 Min. : 0.00 Min. : 0.00 ## positive: 97 1st Qu.: 5.00 1st Qu.: 0.00 ## Median : 15.00 Median : 5.00 ## Mean : 25.91 Mean : 16.49 ## 3rd Qu.: 36.00 3rd Qu.: 21.00 ## Max. :300.00 Max. :200.00

Poisson Distribution Y i λ i P(λ i ) p(y i ) = λy i i e λ i y i! y i = 0, 1,..., λ i > 0 Used for counts with no upper limit E(Y i ) = V (Y i ) = λ i How to build in covariates into the mean? λ i > 0 log(λ i ) = η i R log link

Generalized Linear Model Canonical Link function for Poisson data is the log link log(λ i ) = η i = β 0 + X 1 β 1 +... X p β p λ = exp(β 0 + X 1 β 1 +... X p β p ) Holding all other X s fixed a 1 unit change in X j λ = exp(β 0 + X 1 β 1 +... (X j + 1)β j +... X p β p ) λ = exp(β j ) exp(β 0 + X 1 β 1 +... X j β j +... X p β p ) λ = exp(β j )λ λ /λ = exp(β j ) exp(β j ) is called a relative risk (risk relative to some baseline)

Model hiv.glm = glm(fupacts ~ bs_hiv + log(bupacts +1) + sex + couples + women_alone, data=hiv, family=poisson(link="log")) ## Estimate Std. Error z value Pr(> z ) ## (Intercept) 1.10334 0.04706 23.445 < 2e-16 ** ## bs_hivpositive -0.40556 0.03543-11.445 < 2e-16 ** ## log(bupacts + 1) 0.66456 0.01217 54.596 < 2e-16 ** ## sexman -0.08181 0.02368-3.454 0.000551 ** ## couples -0.30894 0.02799-11.038 < 2e-16 ** ## women_alone -0.50952 0.03031-16.810 < 2e-16 ** ## ## (Dispersion parameter for poisson family taken to be 1) ## ## Null deviance: 13298.6 on 433 degrees of freedom ## Residual deviance: 9184.3 on 428 degrees of freedom ## AIC: 10521

Model Choice and Model Checking 2 Questions: 1. Is my Model good enough? (no alternative models in mind) 2. Which Model is best? (comparison among a subset of models) Note that 2 does not imply 1.

Over-Dispersion, Goodness of Fit or Lack of Fit? deviance is -2 log(likelihood) evaluated at the MLE of the parameters in that model ˆλ i = exp(xi T ˆβ) 2 i (y i log(ˆλ i ) ˆλ i log(y i!)) smaller is better (larger likelihood) null deviance is the deviance under the Null model, that is a model with just an intercept or λ i = λ and ˆλ = Ȳ saturated model deviance is the deviance of a model where each observation has there own unique λ i and the MLE of ˆλ i = y i, the change in deviance has a Chi-squared distribution with degrees of freedom equal to the change in number of parameters in the models.

Residual Deviance the residual deviance is the change in the deviance between the given model and the saturated model. Substituting the expressions for deviance, we have D = 2 i ( y i log(ˆλ i ) ˆλ ) i log(y i!) 2 (y i log(y i ) y i log(y i!)) i =2 ( ) y i (log(y i ) log(ˆλ i )) (y i ˆλ i )) i =2 ( y i (log(y i /ˆλ i ) (y i ˆλ ) i )) = i di This has a chi squared distribution with n p degrees of freedom. (p is the number of parameters in the linear predictor)

Test in R ## Residual deviance: 9184.3 on 428 degrees of freedom Estimate of overdispersion: Residual Deviance/ Residual df = 21.46 Overdispersion if significantly greater than 1. Formal Test pchisq(hiv.glm$deviance, hiv.glm$df.residual, lower.tail=f) ## [1] 0 The above p-value suggests that a residual deviance as large or larger than what we observed under the model in hiv.glm is highly unlikely! Suggests that the model is not adequate or lack of fit.

Diagnostics Plots Residuals 10 20 Residuals vs Fitted 343 32788 0 1 2 3 4 Std. deviance resid. 10 20 Normal Q Q 343 327 88 3 1 1 2 3 Predicted values Theoretical Quantiles Std. deviance resid. 0 3 Scale Location 343 32788 0 1 2 3 4 Std. Pearson resid. 10 30 Residuals vs Leverage 343 128 186 Cook's distance 0.5 1 1 0.5 0.00 0.04 0.08 0.12 Predicted values Leverage

Standardized Residuals 20 10 r 0 10 0 20 40 60 80 fitted

Binned Residuals library(arm) Average residual 3 1 0 1 2 3 0 10 20 30 40 50 Expected Values

Overdispersed Poisson - QuasiLikelihood hiv.glmod = glm(fupacts ~ bs_hiv + log(bupacts +1) + sex + couples + women_alone, data=hiv, family=quasipoisson(link="log")) ## (Dispersion parameter for quasipoisson family taken to b ## Estimate Std. Error t value Pr(> t ) ## (Intercept) 1.10334 0.24820 4.445 1.12e-05 ** ## bs_hivpositive -0.40556 0.18689-2.170 0.03055 * ## log(bupacts + 1) 0.66456 0.06420 10.352 < 2e-16 ** ## sexman -0.08181 0.12491-0.655 0.51283 ## couples -0.30894 0.14762-2.093 0.03695 * ## women_alone -0.50952 0.15986-3.187 0.00154 ** ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## ## Null deviance: 13298.6 on 433 degrees of freedom ## Residual deviance: 9184.3 on 428 degrees of freedom

Negative Binomial Distribution The formulation of the negative binomial distribution as a gamma mixture of Poissons can be used to model count data with overdispersion. p(y µ, θ) = Γ(y + θ) y!γ(θ) ( θ ) θ ( µ ) y θ + µ θ + µ The negative binomial distribution has two parameters: µ is the mean or expected value of the distribution µ i = exp(xi T β) a is the over dispersion parameter V (Y ) = µ + µ 2 /θ When θ the negative binomial distribution is the same as a Poisson distribution

Review of Mixtures Y λ Poi(λ) p(y λ) = λy e λ y! λ µ, θ Gamma(θ, θ/µ) p(λ µ, θ) = (θ/µ)θ Γ(θ) λθ 1 e λθ/µ p(y µ, θ) = p(y λ)p(λ θ, θ/µ)dλ = Γ(y + θ) y!γ(θ) Y µ, θ NegBin(µ, θ) ( θ ) θ ( µ ) y θ + µ θ + µ

Iterated Expectations Review expectation E[Y ] = E λ [E Y [Y λ]] variance Var[Y ] = Var λ [E Y [Y λ]] + E λ [Var Y [Y λ]] Variance(Expected Value) + Expected Value(Variance) You should be able to derive the mean and variance of the NegBin using the above expressions (HW)

Fitting the Negative Binomial Model in R library(mass) hiv.glm.nb = glm.nb(fupacts ~ bs_hiv + log(bupacts + 1) + sex + couples + women_alone, data=hiv) associated summary and plot functions available

Model Summary (subset) ## (Dispersion parameter for Negative Binomial(0.4358) fami ## Coefficients: ## Estimate Std. Error z value Pr(> z ) ## (Intercept) 1.25829 0.24261 5.186 2.14e-07 ** ## bs_hivpositive -0.51314 0.18384-2.791 0.005251 ** ## log(bupacts + 1) 0.61832 0.06470 9.557 < 2e-16 ** ## sexman 0.05974 0.14917 0.400 0.688796 ## couples -0.36679 0.18531-1.979 0.047779 * ## women_alone -0.64007 0.18901-3.386 0.000708 ** ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## ## Null deviance: 603.09 on 433 degrees of freedom ## Residual deviance: 487.97 on 428 degrees of freedom ## AIC: 2953.3

Estimates of Relative Risks RR 2.5 97.5 (Intercept) 3.52 2.21 5.69 bs_hivpositive 0.60 0.42 0.88 log(bupacts + 1) 1.86 1.64 2.10 sexman 1.06 0.79 1.43 couples 0.69 0.48 1.01 women_alone 0.53 0.36 0.77 1 = no change Values less than 1 imply decrease Values greater than 1 imply increase to obtain percent increase RR - 1 or CI - 1 and multiply by 100% to obtain percent decrease 1 - RR or 1 - CI and multiply by 100%

Interpretation The intervention had a significant impact on reducing the number of unprotected sex acts: In couples where only the woman took part in the counseling sessions, we estimated a significant decrease in unprotected sex acts of 47%; 95% CI: (23, 64) When both partners were counseled unprotected acts are expected to decrease by 31% (although p.value >.05) There is no evidence to suggest that the sex of partner who reports to the researcher has an effect on the number of unprotected acts. There is evidence to suggest that if the partner who reports is HIV positive there is a significant reduction of unprotected acts of 40%; 95% CI: ( 12, 58)

Using Simulation to Check the Model Find a test statistic (meaningful quantity) simulate 1000 replicates of Y s from the model compute the test statistics for each set of replicate data estimate distribution of test statistics from the simulations compare observed statistics to the simulated data (predictive p-value)

R Code nsim = 10000 n = nrow(hiv) X = model.matrix(hiv.glm.nb) class(hiv.glm.nb) <- "glm" # over-ride class of "glm.nb" sim.hiv.nb = sim(hiv.glm.nb, nsim) # use GLM to generate b sim.hiv.nb@sigma = rnorm(nsim, hiv.glm.nb$theta, hiv.glm.nb$se.theta) # add slot fo y.rep = array(na, c(nsim, nrow(hiv))) for (i in 1:nsim) { mu = exp(x %*% sim.hiv.nb@coef[i,]) y.rep[i,] = rnegbin(n, mu=mu, theta=sim.hiv.nb@sigma[i]) } perc_0 = apply(y.rep, 1, function(x) {mean(x == 0)}) perc_10 = apply(y.rep, 1, function(x) {mean( x > 10)})

Comparison 1200 1000 900 750 count 600 count 500 300 250 0 0 0.15 0.20 0.25 0.30 0.35 perc_0 0.25 0.30 0.35 0.40 0.45 perc_10

Confidence Intervals Observed proportion at zero is 0.29 and proportion at 0, 95% CI from simulated replicates: ## 2.5% 97.5% ## 0.19 0.30 Observed proportion > 10 is 0.36 and 95% CI from simulated replicates ## 2.5% 97.5% ## 0.29 0.39 Observed data seem to have summaries in line with simulated replicated data based on Negative Binomial model Model appears to capture these features adequately (may change with other summaries)

Energetic Student provide an interpretation of the coefficient for log(bupacts + 1) offsets are used to remove effects, so that the mean = exp(offset + Xβ). In R this is expressed by adding log(bupacts +1) as an offset is equivalent to constraining its β to be 1 does the previously fitted model support that this should be handled by an offset? Do any of the coefficient estimates change by using an offset? Which model seems to be better? What happens if you do not add 1 to bupacts? Do you think that how we handle zero values of bupacts has an impact on our predictive check for predicting at 0?