Practical 4: Mixed effect models

This practical is about how to fit (generalised) linear mixed effects models using the lme4 package. You may need to install it first (using either the install.packages command or the menu bar).

Orthodontic data

Let's first look at the data we considered in class. We start by plotting the individual growth curves of all children studied in the dataset.

filepath <- "http://www.statslab.cam.ac.uk/~tw389/teaching/slp18/data/"
filename <- "orthodont"
orth <- read.table(paste0(filepath, filename), header = T)
head(orth)

# plot data; 'alpha' sets transparency level to display overlapping points
plot(orth$age, orth$distance, pch = 20, col = rgb(0, 0, 0, alpha = 0.4))
# plot individual growth curves
for (i in 1:30) {
  ind <- orth$pid == paste0("p", i)
  points(orth$age[ind], orth$distance[ind], type = "l", col = i)
}
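Before fitting any model, it can help to quantify the between-child variation that the plot suggests. The following exploratory sketch (the object name indiv.coefs is introduced here for illustration; it assumes the orth data frame loaded above, with columns pid, age and distance) fits a separate least-squares line to each child and summarises the spread of the fitted intercepts and slopes:

```r
# Exploratory sketch: one least-squares fit per child (p1, ..., p30),
# collecting each child's intercept (at age 8) and slope.
indiv.coefs <- t(sapply(1:30, function(i) {
  child <- orth[orth$pid == paste0("p", i), ]
  coef(lm(distance ~ I(age - 8), data = child))
}))
colnames(indiv.coefs) <- c("intercept", "slope")
summary(indiv.coefs)
```

A large spread in the per-child intercepts relative to the residual scatter is the kind of pattern that motivates the random-intercept models fitted below.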

For the ith individual's observation at time j = 1, 2, 3, 4, we will denote the quantity distance by Y_ij and the individual's age by x_ij. Consider the linear model

Y_ij = β_0 + β_1 x_ij + ε_ij,  where the ε_ij are iid N(0, σ²).

We already saw that this model doesn't seem reasonable, but fit it anyway (offset the age so that 8 years is the baseline).

orth.lm1 <- lm(distance ~ I(age - 8), data = orth)
summary(orth.lm1)
plot(orth.lm1, which = c(1, 2, 3, 4))

Look at the diagnostic plots: what do you see? What is the moral of this?

Let's consider a model for linear dependence of distance upon age, but with a random intercept for the individuals; write down the algebraic form of the model, making sure you define each parameter and effect. We can fit these models with the function lmer() in lme4, but remember that the syntax is a little tricky. Start with

library(lme4)
orth.lme1 <- lmer(distance ~ I(age - 8) + (1 | pid), data = orth)
summary(orth.lme1)

Linear mixed model fit by REML ['lmerMod']
Formula: distance ~ I(age - 8) + (1 | pid)
   Data: orth

REML criterion at convergence: 347.7

Scaled residuals:
    Min      1Q  Median      3Q     Max
-2.4762 -0.5798 -0.0163  0.5046  1.9298

Random effects:
 Groups   Name        Variance Std.Dev.
 pid      (Intercept) 14.9512  3.867
 Residual              0.2725  0.522
Number of obs: 120, groups:  pid, 30

Fixed effects:
            Estimate Std. Error t value
(Intercept) 24.26333    0.71044   34.15
I(age - 8)   0.49667    0.04262   11.65

Correlation of Fixed Effects:
           (Intr)
I(age - 8) -0.090

The term (1 | pid) in the formula says that random effects should be at the level of intercepts (hence the 1) for each individual. The summary gives a lot of output; check that you can identify (i) the estimates of β_0 and β_1; (ii) the estimates of σ and τ. Notice that the fit was done by REML by default. We can get a maximum likelihood fit by adding the option REML = FALSE. Fit the model with maximum likelihood, calling the object orth.lme1b. Compare your answers, noting what is different from orth.lme1 and what is not.

orth.lme1b <- lmer(distance ~ I(age - 8) + (1 | pid), data = orth, REML = FALSE)
summary(orth.lme1b)

Compare the estimated values and standard errors of τ² and σ² in orth.lme1b to those in orth.lme1: what do you observe, and is that expected?

The ordinary linear model orth.lm1 and the mixed model orth.lme1b are nested. Assume for now that the likelihood ratio test is valid. The function logLik() returns log-likelihood values of both linear models and mixed effects models. Compute the likelihood ratio statistic; what do you conclude?

lrstat <- 2 * (logLik(orth.lme1b)[1] - logLik(orth.lm1)[1])
lrstat
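If the usual asymptotics applied, the two models would differ by a single parameter (the random-intercept variance τ²), so lrstat would be referred to a chi-squared distribution on 1 degree of freedom. A sketch of that naive calculation (assuming lrstat as computed above):

```r
# Naive p-value, treating lrstat as chi-squared with df = 1.
# This ignores the fact that the null value tau^2 = 0 lies on the
# boundary of the parameter space, which is investigated next.
pchisq(lrstat, df = 1, lower.tail = FALSE)
```

Because of the boundary problem, this p-value is not exact; the parametric bootstrap in the next part gives a way to check it.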

How valid is the assumption that the likelihood ratio statistic is approximately chi-squared distributed? Let's assume that the linear model fitted in orth.lm1 is the true model and generate an empirical distribution of the likelihood ratio test statistic.

set.seed(2018)  # so that we all get the same 'random' numbers
B <- 100
lrstatmc <- rep(0, B)
attach(orth)
for (b in 1:B) {
  distancemc <- 24.3 + (age - 8) * 0.497 + rnorm(30 * 4, sd = 0.522)
  boot.lm <- lm(distancemc ~ I(age - 8))
  boot.lme <- lmer(distancemc ~ I(age - 8) + (1 | pid), REML = FALSE)
  lrstatmc[b] <- 2 * (logLik(boot.lme)[1] - logLik(boot.lm)[1])
}
lrstatmc
sum(lrstatmc < 1e-10)

What do you observe? The synthetic data distancemc generated in the above manner are known as parametric bootstrap samples. If the MLEs for the parameters in the linear model are close to the true parameters, then for large B the empirical distribution in lrstatmc is close to the distribution of the likelihood ratio test statistic under the null hypothesis. We can therefore compute a parametric bootstrap p-value for the null hypothesis by comparing lrstat to the vector lrstatmc.

boot.pval <- sum(lrstatmc > lrstat) / B

In this case, since the likelihood ratio statistic is so large, the bootstrap p-value is essentially zero.

Next let's fit a model with both a random intercept and a random slope.

orth.lme2 <- lmer(distance ~ I(age - 8) + (I(age - 8) | pid), data = orth)
summary(orth.lme2)

Write down the algebraic form of the model being fitted and identify the coefficient estimates. Is the additional random effect in the slope included in orth.lme2 significant?
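One way to assess the extra random slope is a likelihood ratio comparison against the random-intercept model; when given two lmer fits, anova() refits both by maximum likelihood before comparing. A sketch, assuming orth.lme1 and orth.lme2 from above:

```r
# anova() refits both models by ML, then reports the likelihood
# ratio test between them (here the models differ by a random-slope
# variance and an intercept-slope covariance).
anova(orth.lme1, orth.lme2)
```

As before, the null value of the slope variance lies on the boundary of the parameter space, so the reported chi-squared p-value is only approximate (and typically conservative); a parametric bootstrap along the lines above could again be used instead.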

Word frequency data

We look at the writtenVariationLijk dataset in the languageR package. This dataset records the number of times certain words appear in Dutch newspapers. It contains the following variables:

Corpus: a factor with as levels the sampled newspapers: belang (Het Belang van Limburg), gazet (De Gazet van Antwerpen), laatnieu (Het Laatste Nieuws), limburg (De Limburger), nrc (NRC Handelsblad), stand (De Standaard), and tele (De Telegraaf).
Word: a factor with the 80 most frequent words ending in -lijk.
Count: a numeric vector with token counts in the CONDIV corpus.
Country: a factor with levels Flanders and Netherlands.
Register: a factor with levels National, Quality and Regional coding the type of newspaper.

We are interested in how word use varies across countries and newspapers. The word frequency certainly also depends on the word itself, but we are uninterested in the exact relation and would prefer to treat the by-word variation as a random effect.

library(languageR)
data(writtenVariationLijk)
head(writtenVariationLijk)
lijk.lme1 <- glmer(Count ~ Country + Register + (1 | Word),
                   data = writtenVariationLijk, family = "poisson")
summary(lijk.lme1)

The glmer function in the lme4 package fits a generalised linear mixed effects model by maximum likelihood (there is not yet a satisfactory way to generalise REML to generalised linear mixed effects models, so a REML fit is not implemented in glmer). How would you interpret the estimated random effect variance? How would you interpret the estimated coefficient for CountryNetherlands? The anova function compares two models using their likelihood ratio test statistic. Are the two tests valid?
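For example, to test whether Country matters after accounting for Register and the by-word random effect, one could fit the reduced model and compare the two fits (a sketch, assuming lijk.lme1 from above; the name lijk.lme0 is introduced here):

```r
# Reduced model: drop the Country fixed effect
lijk.lme0 <- glmer(Count ~ Register + (1 | Word),
                   data = writtenVariationLijk, family = "poisson")

# Likelihood ratio test: does Country improve the fit?
anova(lijk.lme0, lijk.lme1)
```

Here the parameter being tested is a fixed effect in the interior of the parameter space, so the boundary issue from the earlier variance-component tests does not arise, though the chi-squared approximation is still only asymptotic.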